Feed aggregator

Open Knowledge Foundation: Data to control, Data for participation – Open Data Day 2017 in Chernivtsi, Ukraine

planet code4lib - Tue, 2017-03-28 10:25

This blog is part of the event report series on International Open Data Day 2017. On Saturday 4 March, groups from around the world organised over 300 events to celebrate, promote and spread the use of open data. 44 events received additional support through the Open Knowledge International mini-grants scheme, funded by SPARC, the Open Contracting Program of Hivos, Article 19, Hewlett Foundation and the UK Foreign & Commonwealth Office. This event was supported through the mini-grants scheme under the Open contracting and tracking public money flows theme.

This blog has been reposted from http://oporacv.org/make-chernivtsi-open-again-or-open-data-day-2017/

It was a meeting of friends who care about the city. We gathered in the cosy 28/33 Home Bar, situated in the historical part of Chernivtsi (a regional centre in western Ukraine). The slightly partisan, hipster atmosphere and the hot drinks took us away from the spring cold and into the world of ideas.

The idea of open data is only just starting to be talked about in Ukraine. On the one hand, this worldwide trend is influencing the situation in our country. On the other hand, there are only a few organisations, cities and groups of people with enough capacity not only to talk about open data but also to do something with it. Unfortunately, our city is still at the beginning of this process. Nevertheless, we want to move further.

So we identified several areas we could start from. First of all, we discussed the idea that there are two different “groups of data”:

Data to control [national and local government, officials]

Data to participate and to make the city more efficient

Data to control

As the great majority of our group were NGO activists, the “control” and “watchdog” functions are important to us. The best examples of data to control that we want to work with are e-declarations, budget spending data and the voting records of local deputies.

Last year a major e-declarations portal was launched in Ukraine. All officials must fill in their declarations online, and the declarations then become open to citizens. The portal has an open API that serves data in JSON format, which is a good opportunity to build tools for automatic analysis of that data. We experimented a little with this API and decided to hold a separate meeting around the 1 April deadline to work with the declarations of deputies of the local city council. We also realised that we need to learn how to work with the JSON format.
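As a minimal illustration of what working with such an API can look like, the Python sketch below fetches and parses a JSON response; the endpoint, query parameters and field names are placeholders for this example, not the real portal's.

    import requests

    # Placeholder endpoint and parameters: substitute the real e-declarations
    # API URL and search terms before running.
    API_URL = "https://example.org/declarations/api/search"

    response = requests.get(API_URL, params={"q": "Chernivtsi city council"}, timeout=30)
    response.raise_for_status()

    data = response.json()  # parse the JSON payload into Python dicts and lists

    # Iterate over whatever list of declarations the API returns and print a
    # couple of fields; the key names here are illustrative only.
    for item in data.get("items", []):
        print(item.get("name"), item.get("position"))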

One more thing we spoke about was open contracting and budget spending data. In Ukraine two excellent instruments are currently in operation: the open contracting system ProZorro and spending.gov.ua, where all budget transactions are published. That night our company included an administrator of the investigative media project Procurement in Chernivtsi, and she shared her experience of using this information for successful investigations. We also spoke a little about how to work with the API of spending.gov.ua.

Lastly, I presented the national project Rada4You, the Ukrainian replication of the Australian “They Vote For You” project, built by Civil Network OPORA activists in 2016. The main idea is to scrape voting results from the Ukrainian parliament and use these data for analysis. For example, it is already possible to combine several votes into one policy and use this instrument for fact checking (comparing the public speeches of MPs with how they actually vote on different policies). It is also possible to analyse how similarly different MPs vote, using the project’s “Friends for voting” feature. Finally, the project shares all its data through an API in JSON format. During our meeting, we decided to work on replicating this tool at the local level and building a similar one for Chernivtsi city council. That evening a city mayor’s adviser was among us, which helped us understand whether we could rely on support from that side.

Data to participate

We spoke a lot about how open data can be used by the local authorities as an instrument for participation. To be honest, we understand that the city council lacks the technical and financial capacity to work deeply and intensively on open data. We also know that there are not many open data specialists in the city. Nevertheless, there are some spheres in which we can and must push the open data approach, so we identified these crucial directions.

Transport data. At the moment the city is going through something of a transport crisis, and the city council is working on ideas for improving the situation. We need to speak with all the stakeholders so that transport data does not end up closed off from the community. In addition, we understand that this type of data is not easy to work with, so we need to learn how to use it.

Data from the education system. We talked about how the education system accumulates a lot of data and does not share it. Such data could be used to build useful tools that help parents choose schools for their children.

Data from the health care system. In our view, this type of data should also be in focus, as Ukraine is currently undergoing a health care reform. Datasets on free-of-charge medicines, lists of hospitals and pharmacies, and medical equipment should be opened.

GIS (geographic information system). In Chernivtsi, the city council is working on GIS implementation. The situation is the same as with transport data: there is a risk that information from the GIS will be closed off from the community, so we need an advocacy campaign to make it open.

This Open Data Day meeting was possible thanks to the support of Open Knowledge International, and I hope it is only the first and not the last. We have some plans; they are not clear yet, but we are ready to make them not only clear but also real.

DuraSpace News: REGISTRATION Open for the VIVO 2017 Conference

planet code4lib - Tue, 2017-03-28 00:00

From the organizers of VIVO 2017

Registration is now open for the 2017 VIVO Conference, August 2-4.

District Dispatch: Bill to make Copyright Register a Presidential appointment “mystifying”

planet code4lib - Mon, 2017-03-27 20:27

Late last Thursday, in a relatively rare bicameral announcement, five senior members of the House and Senate Judiciary Committees endorsed legislation to transfer the power to appoint the Register of Copyrights from the Librarian of Congress to the President. The Register of Copyrights Selection and Accountability Act (H.R. 1695) was authored by House Judiciary Committee Chairman Bob Goodlatte (R-VA6). It also was cosponsored on its introduction by the Committee’s Ranking Member, John Conyers (D-MI13), and 29 other members of the House (21 Republicans and 8 Democrats). Senate supporters currently are Judiciary Committee Chairman Charles Grassley (R-IA), Ranking Member Dianne Feinstein (D-CA) and Sen. Patrick Leahy (D-VT).

Image source: http://feedyoursoul.com/2014/08/06/the-gift-of-confusion/

The bill was referred to Mr. Goodlatte’s Committee for consideration and is widely expected to be voted upon (at least in Committee, if not the full House of Representatives) prior to the upcoming spring recess beginning April 10. No parallel Senate bill yet has been introduced and the pace of H.R. 1695’s or a similar bill’s review in that chamber, as well as which committee or committees will have jurisdiction over it, is currently unclear.

In a sharply worded statement, the Library Copyright Alliance (LCA) unqualifiedly opposed the bill on multiple grounds, particularly that it would politicize the Register’s position to the public’s detriment and inevitably slow the critically needed modernization of the Copyright Office. LCA, comprised of ALA, ACRL and ARL, also called the bill “mystifying” given that – if passed – Congress would voluntarily give up its power to appoint its own copyright advisor to the President to whom the bill also grants the power to fire the appointee at any time (despite the bill also confusingly specifying a 10-year renewable term of office for the Register)! Further, while the Senate would at least retain the power to confirm the nominee, the House would no longer have any influence on the selection process.

LCA’s statement was quoted at length by the widely read Beltway publications Washington Internet Daily (behind a paywall) and Broadcasting & Cable. ALA and its LCA partners will be monitoring H.R. 1695’s progress in the House and Senate closely.

The post Bill to make Copyright Register a Presidential appointment “mystifying” appeared first on District Dispatch.

Islandora: Meet Your Developer: Jonathan Green

planet code4lib - Mon, 2017-03-27 13:07

It has been a while since we had an entry for Meet Your Developer, but there is no better person to re-launch the series than longtime contributor Jonathan Green. The architect of Tuque and a part of the Islandora community since 2010, Jonathan is a Committer on both the 7.x and CLAW branches of Islandora. He returns to the community as a DevOps Engineer at LYRASIS, after a hiatus in another industry. Here's what Jon had to say:

Please tell us a little about yourself. What do you do when you’re not at work?

When I’m not at work I’m often still tinkering with computers in one way or another. I’ve always been interested in hacking on both hardware and software. Recently I’ve been playing with the Rust programming language and machine learning.

My other hobby is brewing beer, building things to brew beer and being an amateur beer snob. Recently I converted an old refrigerator into a keg fridge for my homebrew. Right now, I have a stir plate going in my kitchen to grow some yeast for a brew on Saturday.

How long have you been working with Islandora? How did you get started?

After moving back to PEI in late 2009, I started working at discoverygarden, Inc. in January 2010 and quickly started hacking on the 6.x version of Islandora. Spent a few years at DGI, working on the 6.x and 7.x versions of Islandora. In my time at DGI I was involved in building the 7.x version of Islandora and wrote the initial version of the Tuque library.

Then I took a bit of a detour in my Islandora experience and spent a couple years working on embedded software for the marine industry, primarily control systems for power distribution on oil rigs. After a few years floating around on oil rigs in Korea and the Gulf of Mexico, I joined LYRASIS as a contract developer and have been getting back into Islandora development.

Sum up your area of expertise in three words:

All the things.

What are you working on right now?

Right now, I’m focusing primarily on two things. I am working on updating and improving the LYRASIS hosting environment for Islandora 7.x; we are continually rolling out improvements for our Islandora hosting clients so they can use the latest and greatest Islandora features.

The most exciting thing I’ve been working on is Islandora CLAW. LYRASIS has been generous enough to donate a day or two of my time every week to the CLAW project, so I’ve been jumping into that stack and trying to help with development of the MVP. Recently I committed the CLAW equivalent of the D6 and D7 Drupal filter. This time we are using JSON Web Tokens to provide authentication against external services like Fedora. I’m very excited about CLAW and I feel privileged to be involved in its development.
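(As a generic illustration of the JSON Web Token pattern Jon mentions, and not CLAW's actual code, a token can be issued and verified along these lines with the PyJWT library; the secret and claims below are invented for the example.)

    import time
    import jwt  # PyJWT

    SECRET = "shared-secret-between-issuer-and-service"  # assumed HMAC shared key

    # Issue a short-lived token asserting who the user is and what roles they hold.
    token = jwt.encode(
        {"sub": "islandora-user", "roles": ["submitter"], "exp": int(time.time()) + 300},
        SECRET,
        algorithm="HS256",
    )

    # An external service verifies the token before honouring the request.
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    print(claims["sub"], claims["roles"])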

What contribution to Islandora are you most proud of?

I’m really proud of the work I did at DGI while developing the architecture for Islandora 7.x. It was a huge team effort when moving from Islandora 6.x to Islandora 7.x, and I was a very small part of it, but it’s been great to see how the initial small kernel of Islandora 7.x has grown into an amazing collection of modules and features, and to see the open source community grow around the Islandora project.

What new feature or improvement would you most like to see?

Usability and focus on user experience, especially new user experience. I think that we could do a much better job making the software work as one would expect out of the box.

What’s the one tool/software/resource you cannot live without?

There are so many fundamental pieces of open source software that I couldn’t develop as efficiently without, it’s hard to name just one. I spend my days standing on the shoulders of open source giants.

If you could leave the community with one message from reading this interview, what would it be?

Jump in and contribute, an open source community like Islandora depends on its members. Breaking things is the best way to learn.

Jonathan Green - LYRASIS

Open Knowledge Foundation: Open Data came to Paraguay to stay – Open Data Day 2017

planet code4lib - Mon, 2017-03-27 12:47

This blog is part of the event report series on International Open Data Day 2017. On Saturday 4 March, groups from around the world organised over 300 events to celebrate, promote and spread the use of open data. 44 events received additional support through the Open Knowledge International mini-grants scheme, funded by SPARC, the Open Contracting Program of Hivos, Article 19, Hewlett Foundation and the UK Foreign & Commonwealth Office. This event was supported through the mini-grants scheme under the Follow Public Money theme.

The original post was published on the Girolabs blog in Spanish and was translated by Oscar Montiel.

Open Data came to Paraguay to stay! This was proven at this year’s edition of Open Data Day, which took place at the Loffice Terrace.

 

The event brought together more than 40 people and friends of the data community, from Civil Society and Government. The event was organized by Girolabs, TEDIC, CIRD and SENATICs. We started with a small snack to warm up and then we continued with lightning talks and discussions.

 

Maps and Open Data

 

The first to speak were José Gonzalez and the creators of the TOPA app. José introduced the OpenStreetMap (OSM) Paraguay community and explained how everyone can contribute to open data and to this big repository of geodata. Then came the TOPA team, who presented an app based on OSM that creates a Traffic and Transport Information System: an integral, sustainable, collaborative platform where users, transit drivers and government offices work together to gather and process information about mobility, traffic and transport in real time, from crowdsourced data and their own data.

 

Open Data and Civil Society

 

In this edition of #ODD17, civil society generated and presented relevant data for better transparency and control by the citizens.

 

This was the case for CIRD, who presented their project Ñañomoiru, which seeks to improve, and make more transparent, the services provided by the Secretaría de Acción Social (Social Action Secretariat). This office’s goal is to improve the quality of life of people living in poverty and vulnerability by providing easy access to nourishment, health and education, increasing access to these basic services and strengthening relations to reduce intergenerational poverty. In another CIRD project, called “A quienes Elegimos”, they released their authority-monitoring tools and a dataset of municipal authorities that they have gathered.

Techo presented their project Relevamiento de Asentamientos Informales (RAP) and their platform mapadeasentamientos.org.py, which makes data about living conditions in settlements in Asunción available as open data.

Gabriela Gaona told us about her experience working on many apps based on open data and how citizens can request information through http://informacionpublica.paraguay.gov.py/.

 

Where’s our money?

 

One of the main subjects of the night was government data about public money. Federico Sosa, the director of Open Government in Paraguay, showed the progress of the government’s open data.

 

Right now, everybody can monitor what is done with public money. All the Tax Office data on budgeting, public debt and spending is available on its portal.

Let’s request data!

 

To end the night, SENATICs used Open Data Day as a platform to launch the Ideathon InnovandoPY challenge, in the presence of Leticia Romero and Minister David Campos. The challenge asks citizens to help government, companies and civil society organizations work out which data should be made available. It is open until March 31, 2017, and SENATICs will provide mentorship to show participants how to open data.

This was a relaxed event, but one full of information, debate and sharing between people committed to transparency, innovation and citizen participation. We also gave people stickers and t-shirts from the event. We want to thank the Open Data Day organizers for their support and for making Paraguay visible on the map of open data communities.

Library of Congress: The Signal: Centralized Digital Accessioning at Yale University

planet code4lib - Mon, 2017-03-27 12:33

This is a guest post from Alice Prael, Digital Accessioning Archivist for Yale Special Collections at the Beinecke Rare Book & Manuscript Library at Yale University.

Photo by Alice Prael

As digital storage technology progresses, many archivists are left with boxes of obsolete storage media, such as floppy disks and ZIP disks.  These physical storage media plague archives that struggle to find the time and technology to access and rescue the content trapped in the metal and plastic. The Digital Accessioning Service was created to fix this problem across Yale University Libraries and Museums for special collections units that house unique and rare materials and require specialized preservation and supervised access. Nine of Yale’s special collections units participate in the Digital Accessioning Service.

The goal of the Service is to centralize the capture of content off physical media and eliminate the backlog of born-digital material that has not yet been captured for preservation. Until now, this work was completed in an ad hoc fashion (often in response to researcher requests), which has led to a large backlog of disks that may have been described and separated from the collection but have never been fully processed. By centralizing digital accessioning, Yale Libraries leverages its hardware, software and expertise to make the Service available to special collections units that may lack the resources to capture and preserve born digital collections.

The Di Bonaventura Digital Archaeology and Preservation Lab, shared by the Beinecke Rare Book and Manuscript Library and the YUL Preservation Department, hosts the Digital Accessioning workstations. There are two custom-built computers created to  capture content from storage media such as floppy disks, CDs and hard drives. One non-networked computer is used to scan media for viruses prior to capturing the content (it is disconnected from the network so that viruses cannot get loose and spread throughout the network). Another computer has specialized software to scan the content for private information as well as for other in-depth processing tasks. These machines form the technological base of the Digital Accessioning Service.

The Service is mainly staffed by me (with guidance from Gabby Redwine, Beinecke’s Digital Archivist) and the Born Digital Archives Working Group, which is made up of practitioners from across Yale University Libraries and Museums. The Service also employs student assistants to help with disk imaging and data entry.

Before Drafting Policies and Procedures
Before we drafted policies and procedures, the Digital Archivist and I met with the participating special collection units and talked with each unit about the collections they hold and their expectations for future born-digital acquisitions. (It’s important that the Service be able to provide services for all the major media types within our collections.) We completed an informal environmental scan prior to the creation of the Service to determine what media types the Service should be ready for and how much storage would be necessary to preserve all the captured content. Once the challenges began to take shape, I consulted with the Born Digital Archives Working Group and began building workflows and testing tools.

The Service uses a variety of software and hardware tools, including Kryoflux, Forensic Toolkit, IsoBuster and the BitCurator environment. More details about our usage of these tools are available in the Disk Imaging and Content Capture Manual on our Digital Accessioning Service Libguide. I tested the workflow with dummy media, mostly using software-installation disks. In an effort to stay as transparent as possible to special collections units and the larger digital-archives community, I published much of the Service’s documentation — including workflows, manuals and frequently asked questions — on the Born Digital Archives Working Group Libguide.

The main steps of the workflow are (a generic scripted sketch of the final packaging step follows the list):

  1. Complete a submission form (done by the special collections unit) and deliver media securely to the Lab
  2. Confirm that the boxes of media that arrived match the content described by the special collections unit
  3. Photograph the disks
  4. Scan the disks for viruses
  5. Connect to writeblockers (which block archivists from making any changes — accidentally or deliberately — to the original disk) and attempt to create an exact copy, called a disk image, of the content
  6. If disk-image creation fails, attempt to transfer files off storage media
  7. Scan captured content for personally identifiable information
  8. Package all captured content, photographs and associated metadata files for ingest into the preservation system.
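Step 8 in practice relies on Preservica's packaging tools (described further below), but as a generic sketch of what “package content with checksums and metadata” can look like, a script along these lines would walk a capture directory and write a simple checksum manifest; the paths and manifest format are illustrative assumptions, not the Service's actual workflow.

    import hashlib
    import json
    from pathlib import Path

    def sha256(path: Path, chunk_size: int = 1 << 20) -> str:
        """Return the SHA-256 hex digest of a file, read in chunks."""
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def package(capture_dir: str) -> None:
        """Write a manifest.json listing every file in the capture and its checksum."""
        root = Path(capture_dir)
        manifest = {
            str(path.relative_to(root)): sha256(path)
            for path in sorted(root.rglob("*"))
            if path.is_file()
        }
        (root / "manifest.json").write_text(json.dumps(manifest, indent=2))

    # Example: package("disk_0001_capture")  # hypothetical capture directory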

Some record creators use every inch of their labels, leaving little room for archivists to apply their own naming conventions. Photo by Alice Prael.

In creating the Service, I encountered some unexpected challenges, many of which I documented on the Saving Digital Stuff blog. One challenge was determining a standard method for labeling the storage media. It is important that media is labeled in order to correctly identify content and ensure that the description is permanently associated with the storage media. Each special collections unit labels storage media prior to submission to the Service. We had challenges in labeling media that were already covered with text from the original record creator. We also faced difficulties labeling fragile media such as CDs and DVDs. Another challenge was the need for different tools for handling Compact Disks-Digital Audio, or CD-DAs, which have a higher error rate than CDs that contain other data. The Service ultimately decided to use Exact Audio Copy, a software tool created for capturing content from CD-DAs.

The Digital Accessioning Service is only one piece of a larger digital preservation and processing environment. The Service requires that special collections units provide a minimum level of description via spreadsheets that get imported into ArchivesSpace, the archival description management system adopted at Yale University Libraries. However not all of the special collection units have fully implemented ArchivesSpace yet. By using the spreadsheets as an intermediate step, the Service can accommodate all special collections units’ needs regardless of their current stage of ArchivesSpace implementation.

Once the Service’s disk processing is complete, the disk image, photographs, log files and other associated files get moved into the Library’s digital-preservation system, Preservica. Yale University Libraries’ implementation of Preservica is integrated with ArchivesSpace descriptions, which will aid future archivists in locating digital material described in our finding aids. Content from each disk is ingested into Preservica and listed as a digital object in ArchivesSpace, associated with the item-level description for the disk.

After Drafting Policies and Procedures
After drafting and revising the policies and procedures in collaboration with the Born Digital Archives Working Group, the Digital Archivist and I returned to the special collections units to make sure that our workflows would be sufficient for their materials.

One concern was regarding the immediate ingest of material into Preservica. Since many special collections units do not have the hardware to preview disks prior to submission for accessioning, the files themselves have not yet been appraised to determine their archival value. Once content is ingested for preservation, deletion is possible but much more onerous. For special collections units that require appraisal post-accessioning, the Service decided to use the SIP Creator tool, developed by Preservica to package content and maintain the integrity of the files, then move the packaged content onto a shared network storage folder. Special collections units may then access and appraise their content prior to ingest for long-term preservation.

The focus of the Service at this point is to address the significant backlog of material that has been acquired but not yet captured for preservation. The Service is currently funded as a two-year project. As we approach the eight-month mark, we are using this time to determine the ongoing needs for special collections units at Yale. I hope that, as the backlog is diminished, the existence of the Service will aid in future born-digital collection development. Some special collections units have noted that in the past they were hesitant to accept certain donated material because they could not ensure the capture and preservation of the content. By removing this barrier, I hope that donors, curators and archivists across Yale University will be more comfortable working with born-digital material.

DuraSpace News: June DSpace User Group Meeting in Geneva

planet code4lib - Mon, 2017-03-27 00:00

From Atmire  

Heverlee, Belgium: DuraSpace and Atmire invite you to take part in a free DSpace User Group Meeting on June 20th, prior to the OAI10 conference in Geneva.

DuraSpace News: The Spanish Institute of Oceanography Repository Launches New Service

planet code4lib - Mon, 2017-03-27 00:00

From Emilio Lorenzo, Arvo Consulting

The Spanish Institute of Oceanography is a public body devoted to multidisciplinary research into oceanographic and marine sciences in order to attain scientific knowledge of the oceans and sustainability of fishery resources and the marine environment.

District Dispatch: We’re only as good as our members are engaged

planet code4lib - Fri, 2017-03-24 20:29


This week at the Association of College and Research Libraries (ACRL) 2017 Conference, academic librarians and information professionals convened around the emerging issues challenging higher education due to federal funding cuts and new regulations.

On Thursday morning, ALA and ACRL jointly hosted a Postcards to Lawmakers town hall, during which member leaders Emily Drabinski, coordinator of library instruction at Long Island University in Brooklyn, and Clay Williams, deputy chief librarian at Hunter College in Manhattan, along with our very own Lisa Lindle, grassroots communications specialist at the ALA Washington Office, offered advice and encouragement to those seeking to advocate effectively for libraries in the face of drastic cuts. The panel also explained how to sign up for and use ALA’s Legislative Action Center and collected questions from the audience. American Libraries magazine covered their talk here.

At Friday morning’s Academic Libraries and New Federal Regulations session, Corey Williams, a federal lobbyist at the National Education Association (and formerly an ALA lobbyist in the Washington Office), again urged members to step up to the plate. Corey made two illustrative points: ALA has 3 lobbyists for our nearly 60,000 members, and one person is only one voice. In other words: lobbyists are only as good as our members are engaged.

Advocacy is akin to a muscle; you can flex it once, by sending a tweet or an email. But we are one mile into a marathon and advocacy is a muscle that needs to be exercised constantly. Both town halls offered some immediate steps you can take in this next leg of the race.

Do Your Reps
• Did you write a postcard? Great. Now tweet a picture of that postcard to your representatives with the line: “No cuts to funding for museums and libraries. #SaveIMLS”

Short Sprints
• Sign up for ALA’s Legislative Action Center.
• Then, prepare a talking point about why IMLS is important to your community and share it with a friend or patron so you can customize your emails to Congress.

Intervals
• Invite your representatives to visit your library (ProTip: Work with your organization’s government relations office to coordinate.)
• Attend a constituent coffee when your reps are home during the weeks of April 10 and April 17 (Note: this period also happens to be National Library Week. If that time is not possible, other times are good too, whenever the Member is at home.)
• Think about who you can partner or form a coalition with in your community.
• Pair your data (e.g., how much LSTA funding you receive) with anecdotes (e.g., how that money made a transformative difference to your patrons).

In response to other questions that came up, here are two more helpful references:
• Here’s what the National Archives and Records Administration says about irradiated mail
• Here’s where you can look up your representative’s social media handle

The post We’re only as good as our members are engaged appeared first on District Dispatch.

District Dispatch: Look Back, Move Forward: Library Services and Technology Act

planet code4lib - Fri, 2017-03-24 18:01

Thank you to everyone for sharing your #SaveIMLS stories. Please keep them coming – more than 7,700 tweets so far (nearly double our count since last Thursday). As we prepare for the appropriations process, here’s a look back at how ALA Council resolved to support the Library Services and Technology Act in June 1995.

As we move forward into the “Dear Appropriator” letters, be sure to sign up for our Legislative Action Center today.

The post Look Back, Move Forward: Library Services and Technology Act appeared first on District Dispatch.

FOSS4Lib Recent Releases: Evergreen - 2.12.0

planet code4lib - Fri, 2017-03-24 13:36
Package: Evergreen
Release Date: Wednesday, March 22, 2017

Last updated March 24, 2017. Created by gmcharlt on March 24, 2017.

With this release, we strongly encourage the community to start using the new web client on a trial basis in production. All current Evergreen functionality is available in the web client with the exception of serials and offline circulation. The web client is scheduled to be available for full production use with the September 3.0 release.
Other notable new features and enhancements for 2.12 include:

Ed Summers: Teaching Networks

planet code4lib - Fri, 2017-03-24 04:00

Yesterday I had the good fortune to speak with Miriam Posner, Scott Weingart and Thomas Padilla about their experiences teaching digital humanities students about network visualization, analysis and representation. This started as an off the cuff tweet about teaching Gephi, which led to an appointment to chat, and then to a really wonderful broader discussion about approaches to teaching networks:

(???) (???) have either of you taught a DH class about how to use Gephi? Or do you know of someone else who has?

— Ed Summers ((???)) March 10, 2017

Scott suggested that other folks who teach this stuff in a digital humanities context might be interested as well so we decided to record it, and share it online (see below).

The conversation includes some discussion of tools (such as Gephi, Cytoscape, NodeXL, Google Fusion Tables, DataBasic, R) but also some really neat exercises for learning about networks with yarn, balls, short stories and more.

A particularly fun part of the discussion focuses on approaches to teaching graph measurement and analytics, as well as humanistic approaches to graph visualization that emphasize discovery and generative creativity.
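For anyone curious what the measurement side looks like in code, here is a tiny illustrative sketch (not from the conversation itself) using the Python networkx library on a toy network:

    import networkx as nx

    # A toy network: edges might represent characters appearing in the same scene.
    G = nx.Graph()
    G.add_edges_from([
        ("Anna", "Boris"), ("Anna", "Clara"), ("Boris", "Clara"),
        ("Clara", "Dmitri"), ("Dmitri", "Elena"),
    ])

    # Degree centrality: who has the most direct connections.
    print(nx.degree_centrality(G))

    # Betweenness centrality: who sits on the most shortest paths between others.
    # Clara and Dmitri bridge the two parts of this toy network.
    print(nx.betweenness_centrality(G))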

During the email exchange that led up to our chat Miriam, Scott and Thomas shared some of their materials which you may find useful in your own teaching/learning:

I’m going to be doing some hands-on exercises about social media, networks and big data in Matt Kirschenbaum’s Digital Studies class this Spring – and I was really grateful for Miriam, Scott and Thomas’ willingness to share their experiences with me.

Anyhow, here’s the video! If you want to get to the good stuff, skip to 8:40, where I stop selfishly talking about the classes we’re teaching at MITH.

PS. this post was brought to you by the letter B since (as you will see) Thomas thinks that blogs are sooooo late 2000s :-) I suspect he is right, but I’m clearly still tightly clutching to my vast media empire.

Eric Hellman: Reader Privacy for Research Journals is Getting Worse

planet code4lib - Thu, 2017-03-23 17:22
Ever hear of Grapeshot, Eloqua, Moat, Hubspot, Krux, or Sizmek? Probably not. Maybe you've heard of Doubleclick, AppNexus, Adsense or Addthis? Certainly you've heard of Google, which owns Doubleclick and Adsense. If you read scientific journal articles on publisher websites, these companies that you've never heard of will track and log your reading habits and try to figure out how to get you to click on ads, not just at the publisher websites but also at websites like Breitbart.com and the Huffington Post.

Two years ago I surveyed the websites of 20 of the top research journals and found that 16 of the top 20 journals placed trackers from ad networks on their web sites. Only the journals from the American Physical Society (2 of the 20) supported secure (HTTPS) connections, and even now APS does not default to being secure.

I'm working on an article about advertising in online library content, so I decided to revisit the 20 journals to see if there had been any improvement. Over half the traffic on the internet now uses secure connections, so I expected to see some movement. One of the 20 journals, Quarterly Journal of Economics, now defaults to a secure connection, significantly improving privacy for its readers. Let's have a big round of applause for Oxford University Press! Yay.

So that's the good news. The bad news is that reader privacy at most of the journals I looked at got worse. Science, which could be loaded securely 2 years ago, has reverted to insecure connections. The two Annual Reviews journals I looked at, which were among the few that did not expose users to advertising network tracking, now have trackers for AddThis and Doubleclick. The New England Journal of Medicine, which deployed the most intense reader tracking of the 20, is now even more intense, with 19 trackers on a web page that had "only" 14 trackers two years ago. A page from Elsevier's Cell went from 9 to 16 trackers.

Despite the backwardness of most journal websites, there are a few signs of hope. Some of the big journal platforms have begun to implement HTTPS. Springer Link defaults to HTTPS, and Elsevier's Science Direct is delivering some of its content with secure connections. Both of them place trackers for advertising networks, so if you want to read a journal article securely and privately, your best bet is still to use Tor.
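A rough way to repeat this kind of check yourself is to fetch a journal's landing page, see whether plain HTTP redirects to HTTPS, and scan the HTML for third-party hosts. The Python sketch below is only approximate (most trackers are injected by scripts at runtime, so a static scan undercounts them), and the URL is a placeholder rather than any particular journal.

    from urllib.parse import urlparse
    import re
    import requests

    def check_journal(url: str) -> None:
        # Follow redirects from the plain-HTTP URL and see where we end up.
        response = requests.get(url, timeout=30)
        print("final URL:", response.url)
        print("served over HTTPS:", urlparse(response.url).scheme == "https")

        # Crude scan of the static HTML for hosts other than the journal's own.
        own_host = urlparse(response.url).netloc
        hosts = set(re.findall(r'https?://([^/"\']+)', response.text))
        print("third-party hosts:", sorted(h for h in hosts if own_host not in h))

    # check_journal("http://www.example-journal.org/")  # placeholder URL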

David Rosenthal: Threats to stored data

planet code4lib - Thu, 2017-03-23 15:32
Recently there's been a lively series of exchanges on the pasig-discuss mail list, sparked by an inquiry from Jeanne Kramer-Smyth of the World Bank as to any additional risks posed by media such as disks that did encryption or compression. It morphed into discussion of the "how many copies" question and related issues. Below the fold, my reflections on the discussion.

The initial question was pretty clearly based on a misunderstanding of the way self-encrypting disk drives (SED) and hardware compression in tape drives work. Quoting the Wikipedia article Hardware-based full disk encryption:
“The drive except for bootup authentication operates just like any drive with no degradation in performance.”

The encrypted data is never visible outside the drive, and the same is true for the compressed data on tape. So as far as systems using them are concerned, whether the drive encrypts or not is irrelevant. Unlike disk, tape capacities are quoted assuming compression is enabled. If your data is already compressed, you likely get no benefit from the drive's compression.

SED have one additional failure mode over regular drives; they support a crypto erase command which renders the data inaccessible. The effect as far as the data is concerned is the same as a major head crash. Archival systems that fail if a head crashes are useless, so they must be designed to survive total loss of the data on a drive. There is thus no reason not to use self-encrypting drives, and many reasons why one might want to.

But note that their use does not mean there is no reason for the system to encrypt the data sent to the drive. Depending on your threat model, encrypting data at rest may be a good idea. Depending on the media to do it for you, and thus not knowing whether or how it is being done, may not be an adequate threat mitigation.

Then the discussion broadened but, as usual, it was confusing because it was about protecting data from loss, but not based on explicit statements about what the threats to the data were, other than bit-rot.

There was some discussion of the "how many copies do we need to be safe?" question. Several people pointed to research that constructed models to answer this question. I responded:
Models claiming to estimate loss probability from replication factor, whether true replication or erasure coding, are wildly optimistic and should be treated with great suspicion. There are three reasons:
  • The models are built on models of underlying failures. The data on which these failure models are typically based are (a) based on manufacturers' reliability claims, and (b) ignore failures upstream of the media. Much research shows that actual failures in the field are (a) vastly more likely than manufacturers' claims, and (b) more likely to be caused by system components other than the media.
  • The models almost always assume that the failures are un-correlated, because modeling correlated failures is much more difficult, and requires much more data than un-correlated failures. In practice it has been known for decades that failures in storage systems are significantly correlated. Correlations among failures greatly raise the probability of data loss.
  • The models ignore almost all the important threats, since they are hard to quantify and highly correlated. Examples include operator error, internal or external attack, and natural disaster.
For replicated systems, three replicas is the absolute minimum IF your threat model excludes all external or internal attacks. Otherwise four (see Byzantine Fault Tolerance).

For (k of n) erasure coded systems the absolute minimum is three sites arranged so that k shards can be obtained from any two sites. This is because shards in a single site are subject to correlated failures (e.g. earthquake).

This is a question I've blogged about in 2016 and 2011 and 2010, when I concluded:
  • The number of copies needed cannot be discussed except in the context of a specific threat model.
  • The important threats are not amenable to quantitative modeling.
  • Defense against the important threats requires many more copies than against the simple threats, to allow for the "anonymity of crowds".
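As a back-of-the-envelope illustration of why the simple models look so optimistic, consider the sketch below: with independent, identically distributed annual failure probabilities, the chance of losing every copy appears vanishingly small, but a single correlated event (operator error, attack, fire) dominates the result no matter how many replicas you hold. The numbers are invented for the illustration; nothing here models any real system.

    # Naive replication model: n copies, each failing independently in a given
    # year with probability p; data is lost only if all copies fail.
    # This is the optimistic calculation being criticised, not an endorsement.
    def naive_annual_loss_probability(p: float, n: int) -> float:
        return p ** n

    for n in (2, 3, 4):
        print(n, "copies:", naive_annual_loss_probability(0.02, n))

    # Add one correlated event that destroys every copy with probability q per
    # year; it swamps the replication term however large n becomes.
    def with_correlated_event(p: float, n: int, q: float) -> float:
        return q + (1 - q) * p ** n

    print("3 copies plus a 1-in-1000 correlated event:",
          with_correlated_event(0.02, 3, 0.001))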
In the discussion Matthew Addis of Arkivum made some excellent points, and pointed to two interesting reports:
  • A report from the PrestoPrime project. He wrote:
    There’s some examples of the effects that bit-flips and other data corruptions have on compressed AV content in a report from the PrestoPRIME project. There’s some links in there to work by Heydegger and others, e.g. impact of bit errors on JPEG2000. The report mainly covers AV, but there are some references in there about other compressed file formats, e.g. work by CERN on problems opening zips after bit-errors. See page 57 onwards.
  • A report from the EU's DAVID project. He wrote:
    This was followed up by work in the DAVID project that did a more extensive survey of how AV content gets corrupted in practice within big AV archives. Note that bit-errors from storage, a.k.a bit rot was not a significant issue, well not compared with all the other problems!
Matthew wrote the 2010 PrestoPrime report, building on, among others, Heydegger's 2008 and 2009 work on the effects of flipping bits in compressed files (both links are paywalled, but the 2008 paper is available via the Wayback Machine). The 2013 DAVID report concluded:
It was acknowledged that some rare cases or corruptions might have been explained by the occurrence of bit rot, but the importance and the risk of this phenomenon was at the present time much lower than any other possible causes of content losses.

On the other hand, they were clear that:
Human errors are a major cause of concern. It can be argued that most of the other categories may also be caused by human errors (e.g. poor code, incomplete checking...), but we will concentrate here on direct human errors. In any complex system, operators have to be in charge. They have to perform essential tasks, maintaining the system in operation, checking that resources are sufficient to face unexpected conditions, and recovering the problems that can arise. However vigilant an operator is, he will always make errors, usually without consequence, but sometimes for the worst. The list is virtually endless, but one can cite:
  • Removing more files than wanted
  • Removing files in the wrong folder
  • Pulling out from a RAID a working disk instead of the faulty one
  • Copying and editing a configuration file, not changing all the necessary parameters
  • Editing a configuration file into a bad one, having no backup
  • Corrupting a database
  • Dropping a data tape / a hard disk drive
  • Introducing an adjustment with unexpected consequences
  • Replacing a correct file or setup from a wrong backup.
Such errors have the potential for affecting durably the performances of a system, and are not always reversible. In addition, the risk of error is increased by the stress introduced by urgency, e.g. when trying to make some room in storage facilities approaching saturation, or introducing further errors when trying to recover using backup copies.

We agree, and have been saying so since at least 2005. And the evidence keeps rolling in. For example, on January 31st Gitlab.com suffered a major data loss. Simon Sharwood at The Register wrote:
Source-code hub GitLab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. ... Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated.

Commendably, Gitlab made a Google Doc public with a lot of detail about the problem and their efforts to mitigate it:
  1. LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
  2. Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
    1. SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
  3. Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
  4. The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
  5. The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
    1. SH: We learned later the staging DB refresh works by taking a snapshot of the gitlab_replicator directory, prunes the replication configuration, and starts up a separate PostgreSQL server.
  6. Our backups to S3 apparently don’t work either: the bucket is empty
  7. We don’t have solid alerting/paging for when backups fails, we are seeing this in the dev host too now.
So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place. => we're now restoring a backup from 6 hours ago that worked

The operator error revealed the kind of confusion and gradual decay of infrastructure processes that is common when procedures are used only to recover from failures, not as a routine. Backups that are not routinely restored are unlikely to work when you need them. The take-away is that any time you reach for the backups, you're likely already in big enough trouble that your backups can't fix it. I was taught this lesson in the 70s. The early Unix dump command failed to check the return value from the write() call. If you forgot to write-enable the tape by inserting the write ring, the dump would appear to succeed, the tape would look like it was spinning, but no data would be written to the backup tape.

Fault injection should be, but rarely is, practiced at all levels of the system. The results of not doing so are shown by UW Madison's work injecting faults into file systems and distributed storage. My blog posts on this topic include Injecting Faults in Distributed Storage, More bad news on storage reliability, and Forcing Frequent Failures.
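As a trivially small example of the idea, one can inject a single-bit error into a copy of a stored file and confirm that a recorded SHA-256 fixity value catches it. Real fault-injection work targets file systems, RAID layers and distributed stores, but even a test this small exercises the detection path. The file names below are hypothetical.

    import hashlib
    import random
    import shutil

    def sha256_of(path: str) -> str:
        with open(path, "rb") as handle:
            return hashlib.sha256(handle.read()).hexdigest()

    def flip_one_bit(path: str) -> None:
        # Inject a single-bit error at a random offset in the file.
        with open(path, "r+b") as handle:
            data = bytearray(handle.read())
            offset = random.randrange(len(data))
            data[offset] ^= 1 << random.randrange(8)
            handle.seek(0)
            handle.write(data)

    original = "archive_copy.bin"          # hypothetical stored object
    scratch = "archive_copy.scratch.bin"   # work on a copy, never the original

    expected = sha256_of(original)         # fixity value recorded at ingest
    shutil.copyfile(original, scratch)
    flip_one_bit(scratch)

    assert sha256_of(scratch) != expected, "fixity check missed the bit flip"
    print("single-bit corruption detected by the fixity check")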

Update: much as I love Kyoto, as a retiree I can't afford to attend iPRES2017. Apparently, there's a panel being proposed on the "bare minimum" for digital preservation. If I were on this panel I'd be saying something like the following.

We know the shape of the graph of loss probability against cost: it starts at one at zero cost and is an S-curve that gets to zero at infinite cost. Unfortunately, because the major threats to stored data are not amenable to quantitative modeling (see above), and technologies differ in their cost-effectiveness, we cannot actually plot the graph. So there are no hard-and-fast answers.

The real debate here is how to distinguish between "digital storage" and "digital preservation". We do have a hard-and-fast answer for this. There are three levels of certification: the Data Seal of Approval (DSA), NESTOR's DIN 31644, and TRAC/ISO 16363. If you can't even pass DSA, then what you're doing can't be called digital preservation.

Especially in the current difficult funding situation, it is important NOT to give the impression that we can "preserve" digital information with ever-decreasing resources, because then what we will get is ever-decreasing resources. There will always be someone willing to claim that they can do the job cheaper, and their short-cuts won't be exposed until it's too late. That's why certification is important.

We need to be able to say "I'm sorry, but preserving this stuff costs this much. Less money, no preservation, just storage.".

Open Knowledge Foundation: The Global Open Data Index – an update and the road ahead

planet code4lib - Thu, 2017-03-23 14:00

The Global Open Data Index is a civil society collaborative effort to track the state of open government data around the world. The survey is designed to assess the openness of specific government datasets according to the Open Definition. Through this initiative, we want to provide a civil society audit of how governments actually publish data, with input and review from citizens and organisations. This post describes our future timeline for the project.

 

Here at Open Knowledge International, we see the Global Open Data Index (aka GODI) as a community effort. Without community contributions and feedback there is no index. This is why it is important for us to keep the community involved in the Index as much as we can (see our active forum!). However, in the last couple of months a lot has been going on with GODI. In fact, so much has been happening that we neglected our duty to report back to our community. So, based on your feedback, here is what is going on with GODI 2016:

 

New Project Management

Katelyn Rogers, who managed the project until January 2017, is now leading the School of Data program. I have stepped in to manage the Index until its launch this year. I am an old GODI veteran, having been its research and community lead in 2014 and 2015, so this is a natural fit for me and the project. I am doing this alongside my work as the International Community Coordinator and Capacity team lead, but fear not: GODI is a priority!

 

This change in project management allowed us to take some time and modify the way we manage the project internally. We moved all of our current and past tasks (code, content and research) to the public GitHub account. You can see our progress on the project here: https://github.com/okfn/opendatasurvey/milestones

 

Project timeline

Now that the handover is done, it is easier for us to decide on the road forward for GODI (in coordination with colleagues at the World Wide Web Foundation, which publishes the Open Data Barometer). We are happy to share with you the future timeline and approach of the Index:

  • Finalising review: In the last 6 weeks, we have been reviewing the different Index categories for 94 places. Like last year, we took the thematic reviewer approach, in which each reviewer checked all the countries under one category. We finished the review by March 20th, and we are now running quality assurance on the reviewed submissions, mainly looking for false positives: datasets that were marked as complying with the Open Definition but do not.

 

  • Building the GODI site: This year we paid a lot of attention to the development of our methodology and changed the survey site to reflect it and allow easy customization (see Brook’s blog). We are now finalising the results site so it will offer an even better user experience than in past years.
  • Launch! The critical piece of information that many of you wanted: we will launch the Index on May 2nd, 2017! And what a launch it is going to be!
    Last year we gave governments and civil society a three-week period to review and suggest corrections to our assessment on the survey app before we published the permanent Index results. This was not obvious to many, and we got many requests for corrections or clarifications after publishing the final GODI.
    This year, we will publish the Index results, and data publishers and civil society will then have the opportunity to contest the results publicly through our forum for 30 days. We will follow the discussions to decide whether we should change any results. The GODI team believes that if we aspire to be a tool not only for measuring but also for learning about open data publication, we need to allow civil society and government to engage with the results in the open. We already see great engagement from some governments in the GODI review process (see Mexico and Australia), and we would like to take this one step further, making the Index a tool that can help improve open data publication around the world.
  • Report: After finalising the Index results, we will publish a report on our learnings from GODI 2016. This is the first time we will write a report on the Global Open Data Index findings, and we hope it will help us not only to create a better GODI in the future but also to promote the publication of better datasets.

 

Have any questions? Want to know more about the upcoming GODI? Have ideas for improvements? Start a topic in the forum: https://discuss.okfn.org/c/open-data-index/global-open-data-index-2016

 

Open Knowledge Foundation: Open data day 2017 in Uganda: Open contracting, a key to inclusive development

planet code4lib - Thu, 2017-03-23 13:56

This blog is part of the event report series on International Open Data Day 2017. On Saturday 4 March, groups from around the world organised over 300 events to celebrate, promote and spread the use of open data. 44 events received additional support through the Open Knowledge International mini-grants scheme, funded by SPARC, the Open Contracting Program of Hivos, Article 19, Hewlett Foundation and the UK Foreign & Commonwealth Office. This event was supported through the mini-grants scheme under the Open contracting and tracking public money flows theme.

On Friday 3rd March 2017, the Anti-Corruption Coalition Uganda (ACCU) commemorated the International Open Data Day 2017 with a meetup of 37 people from Civil Society Organizations (CSOs), development partners, the private sector and the general public. The goal of this meetup was to inform Ugandan citizens, media and government agencies on the importance of open data in improving public service delivery.

Process  

The process started with an overview of open data since the concept seemed to be new to most participants. Ms. Joy Namunoga, Advocacy Officer at ACCU, highlighted the benefits of open data, including value for money for citizens and taxpayers, knowing governments transactions, holding leaders accountable, constructive criticism to influence policy, boosting transparency, reducing corruption and increasing social accountability.

Against this background, participants observed that only 19% of people in Uganda have access to the internet, hence the need to embrace the media as a third party to interpret data and take the information closer to citizens. Participants noted that while Uganda has an enabling policy framework for information sharing, the Access to Information Act and its regulations require information to be paid for (about $6), yet the majority of Ugandans live on less than $2 a day. This financial requirement denies a share of Ugandans their right to know. It was also noted that CSOs and government agencies likewise do not make all of their information available on their websites, which further underscores the point.

Issues discussed

Open contracting

Mr. Agaba Marlon, Communications Manager at ACCU, took participants through the process of open contracting as highlighted below:

Figure 1: Open contracting process

He showcased ACCU’s Open Contracting platform commonly known as USER (Uganda System for Electronic open data Records), implemented in partnership with Kampala Capital City Authority (KCCA), a government agency, and funded by the United Nations Development Programme. This platform created a lively conversation amongst the participants, and the following issues were generated to strengthen open contracting in Uganda:

  • Popularizing open data and contracting in Uganda by all stakeholders.
  • Mapping people and agencies in the open contracting space in Uganda to draw lines on how to complement each other.
  • Lobbying and convincing government institutions to embrace the open contracting data standards.
  • Stakeholders like civil society should be open before making the government open up.
  • Simplification of Uganda’s procurement laws for easier understanding by citizens.
  • Faming and shaming the best and worst contractors, as well as advocating for penalties for those who flout the rules.
  • Initiating and strengthening information portals, both offline and online media.
Bringing new energy and people to the open data movement

Mr. Micheal Katagaya, an open data activist, chaired this session. Suggestions for bringing new energy to the open data movement included renegotiating open data membership with the government, co-opting celebrities (especially musicians) to advocate for open data, simplifying data and packaging it in user-friendly formats, and linking data to problem-solving principles. Also, thematic days like International Women’s Day, youth day or AIDS day could be used to spread a message on open data, and local languages could be used to localise the space for Ugandans to embrace open data. Finally, it was seen as important to understand audiences and package messages accordingly, and to identify open data champions and ambassadors.

Sharing open data with citizens who lack internet access

This session was chaired by Ms. Pheona Wamayi, an independent media personality. Participants agreed that civil society and government agencies should strengthen community interfaces between government and citizens, because these enable citizens to know about government operations. ACCU was encouraged to use its active membership across Uganda to tap into the information flow and disseminate it to citizens. Other suggestions included:

  • Weekly radio programs on open data and open contracting should be held. Programs should be well branded to suit the intended audiences.
  • Simplified advocacy materials, such as leaflets and posters, should be produced to inform community members about open data. Community notice boards could be used to disseminate information on open data.
  • Civil society and government should liaise with telecom companies to provide citizens with internet access.
  • Edutainment through music and forum theatre should be used to reach citizens with open data messages.

Way forward

Ms. Ephrance Nakiyingi, Environmental Governance Officer at ACCU, took participants through the action planning process. The following points were suggested as key steps for stakeholders to pursue:

  • Consider offline strategies like SMS to share data with citizens
  • Design  massive open data campaigns to bring new energy to the movement
  • Develop a multi-media strategy based on consumer behaviour
  • Creating synergies between different open data initiatives
  • Embrace open data communication
  • Map out other actors in the open data fraternity
  • In-house efforts to share information/stakeholder openness

pinboard: Twitter

planet code4lib - Thu, 2017-03-23 12:45
Have not read the full report but based on the abstract seems useful to those involved in the #code4lib incorporati…

Terry Reese: MarcEdit and Alma Integration: Working with holdings data

planet code4lib - Thu, 2017-03-23 11:52

Ok Alma folks,

I’ve been thinking about a way to integrate holdings editing into the Alma integration work with MarcEdit.  Alma handles holdings via MFHDs, but honestly, the process for getting to holdings data seems a little quirky to me.  Let me explain.  When working with bibliographic data, the workflow to extract records for editing and then update them looks like the following:

 Search/Edit

  1. Records are queried via Z39.50 or SRU
  2. Data can be extracted directly to MarcEdit for editing

 

Create/Update

  1. Data is saved, and then turned into MARCXML
  2. If the record has an ID, I have to query a specific API to retrieve specific data that will be part of the bib object
  3. Data is assembled in MARCXML, and then updated or created.

 

Essentially, an update or create takes 2 API calls.

For holdings, it’s a much different animal.

Search/Edit:

  1. Search via Z39.50/SRU
  2. Query the Bib API to retrieve the holdings link
  3. Query the holdings link api to retrieve a list of holding ids
  4. Query each holdings record API individually to retrieve a holdings object
  5. Convert the holdings object to MARCXML and then into a form editable in the MarcEditor
    1. As part of this process, I have to embed the bib_id and holding_id into the record (I’m using a 999 field) so that I can do the update

 

For Update/Create

  1. Convert the data to MARCXML
  2. Extract the ids and reassemble the records
  3. Post via the update or create API

 

Extracting the data for edit is a real pain.  I’m not sure why so many calls are necessary to pull the data.
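To make the chain concrete, the retrieval side looks roughly like the sketch below, assuming the standard Ex Libris Alma REST paths for bibs and holdings and a placeholder API key. This is only an illustration of the call pattern, not MarcEdit’s actual code, and the JSON field names may differ slightly in your environment.

    import requests

    BASE = "https://api-na.hosted.exlibrisgroup.com/almaws/v1"  # region-specific host
    API_KEY = "YOUR_ALMA_API_KEY"                               # placeholder
    PARAMS = {"apikey": API_KEY, "format": "json"}

    def holdings_for_bib(mms_id: str) -> list:
        """One call to list the holdings attached to a bib, then one call per holding."""
        listing = requests.get(f"{BASE}/bibs/{mms_id}/holdings", params=PARAMS, timeout=30)
        listing.raise_for_status()
        holdings = []
        for stub in listing.json().get("holding", []):
            detail = requests.get(
                f"{BASE}/bibs/{mms_id}/holdings/{stub['holding_id']}",
                params=PARAMS, timeout=30,
            )
            detail.raise_for_status()
            holdings.append(detail.json())  # full holding object, MARC data embedded
        return holdings

    # for h in holdings_for_bib("991234567890123"):  # hypothetical MMS ID
    #     print(h.get("holding_id"))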

 Anyway – Let me give you an idea of the process I’m setting up.

First – you query the data:

A couple of things to note: to pull holdings, you have to click on the download all holdings link, or right-click on the item you want to download. Or, select the items you want to download and then press CTRL+H.

When you select the option, the program will ask whether you want it to create a new holdings record if one doesn’t exist.

 

The program will then either download all the associated holdings records or create a new one.

A couple of things I want you to notice about these records.  There is a 999 field added, and you’ll notice that I’ve created this in MarcEdit.  Here’s the problem: I need to retain the bib number to attach the holdings record to (it’s not in the holdings object), and I need the holdings record number (again, not in the holdings object).  This is a required field in MarcEdit’s process.  I can tell whether a holdings item is new or updated by the presence or absence of the $d.
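Outside of MarcEdit, the same new-versus-updated test is easy to script. The sketch below uses pymarc and simply mirrors the $d rule described above; the file name is hypothetical.

    from pymarc import MARCReader

    # Read holdings records saved from the editor and sort them into
    # "update" (999 $d present, i.e. an existing holding id) and "create".
    with open("holdings.mrc", "rb") as handle:   # hypothetical file name
        for record in MARCReader(handle):
            control = record.get_fields("999")
            if not control:
                continue  # no control field; flag for review instead of guessing
            if control[0].get_subfields("d"):
                action = "update existing holding"
            else:
                action = "create new holding"
            print(control[0], "->", action)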

 

Anyway, this is the process that I’ve come up with, and it seems to work.  I’ve got a lot of debugging code to remove because I was having some trouble with the Alma API responses and needed to see what was happening underneath.  If you are an Alma user, I’d be curious whether this process looks like it will work for you.  As I say, I have some cleanup left to do before anyone can use this, but I think that I’m getting close.

 

–tr
