Feed aggregator

Alf Eaton, Alf: Styling and theming React Components

planet code4lib - Tue, 2017-08-15 09:56

When styling applications with global CSS stylesheets, generic class names become specialised over time (e.g. section_heading__active), and eventually cause problems when a style change in one part of the app accidentally affects usage of the same class name in another part of the app.

Importantly, when authoring app components, you don’t know (or want to know) what CSS classes are being used in the rest of the app.

When authoring components in JSX it’s tempting to just write inline styles, but these can be slow and make the element code more difficult to read.

There are some standardised solutions to this - scoped styles in Shadow DOM - but lack of browser support means that JS polyfills are required.

Another alternative is CSS Modules: when a CSS file is imported into a JS file, the class names in the CSS file are replaced with globally unique names and are provided as an object*.

A solution: CSS in JS

Define the styles inside the component and inject them with a Higher Order Component, generating unique class names at the same time.

In order for components to be themeable, there needs to be a way for theme variables to be available when writing styles*. A theme provider HOC makes the theme object available via React’s context, so the variables can be used in styles.
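The mechanism does not depend on any one library. As a rough, language-neutral illustration (written in Python rather than JSX, and not the API of styled-components, react-jss or any other library), the sketch below shows the two moves described above: a styles function receives the theme object, and each local class name is mapped to a generated unique name. The names `inject_styles` and `button_styles` and the theme keys are hypothetical.

```python
import hashlib


def inject_styles(component_name, styles_fn, theme):
    """Return (classes, css): unique class names plus the CSS text to inject.

    styles_fn is a hypothetical callable that takes the theme and returns
    {local_class_name: {css_property: value}}.
    """
    styles = styles_fn(theme)
    classes = {}
    rules = []
    for local_name, declarations in styles.items():
        # Hash the component name, local name and declarations so the
        # generated name is stable for the same input and unlikely to collide.
        key = repr((component_name, local_name, sorted(declarations.items())))
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:6]
        unique_name = f"{component_name}-{local_name}-{digest}"
        classes[local_name] = unique_name
        body = " ".join(f"{prop}: {value};" for prop, value in declarations.items())
        rules.append(f".{unique_name} {{ {body} }}")
    return classes, "\n".join(rules)


def button_styles(theme):
    # Theme variables are used as plain values inside the style objects.
    return {
        "root": {"color": theme["primary"], "padding": theme["spacing"]},
        "active": {"font-weight": "bold"},
    }


theme = {"primary": "#0366d6", "spacing": "8px"}
classes, css = inject_styles("Button", button_styles, theme)
print(classes["root"])
print(css)
```

The component only ever refers to `classes["root"]`, so it never needs to know which class names exist elsewhere in the app.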

There are a few leading contenders:

  • styled-components: write styles in CSS, interpolate theme variables, injects a className string containing a single class name.
  • react-jss: write styles in JSS (objects), use theme variables as values, injects a classes object containing multiple class names.
  • styled-jsx: write CSS in <style> tags, scoped to the parent component.
  • material-ui: uses react-jss and adds a name to each style object, allowing any class to be targeted for overriding by the theme.

Further reading

* CSS Modules aren’t currently supported by react-scripts. SASS is one way to define and use global variables in CSS, but is also not currently supported by react-scripts.

Ed Summers: UTR

planet code4lib - Tue, 2017-08-15 04:00

I’ve always intended to use this blog as more of a place for rough working notes as well as somewhat more fully formed writing. So in that spirit here are some rough notes for some digging into a collection of tweets that used the #unitetheright hashtag. Specifically I’ll describe a way of determining what tweets have been deleted.

Note: be careful with deleted Twitter data. Specifically be careful about how you publish it on the web. Users delete content for lots of reasons. Republishing deleted data on the web could be seen as a form of Doxing. I’m documenting this procedure for identifying deleted tweets because it can provide insight into how particularly toxic information is traveling on the web. Please use discretion in how you put this data to use.

So I started by building a dataset of #unitetheright data using twarc:

twarc search '#unitetheright' > tweets.json

I waited two days and then was able to gather some information about the tweets that were deleted. I was also interested in what content and websites people were linking to in their tweets because of the implications this has for web archives. Here are some basic stats about the dataset:

number of tweets: 200,113

collected at: 2017-08-13 11:46:05 EDT

date range: 2017-08-04 11:44:12 - 2017-08-13 15:45:39 UTC

tweets deleted: 16,492 (8.2%)

Top 10 Domains in Tweeted URLs (counts in rank order): 518, 91, 83, 47, 32, 22, 17, 16, 15, 15

Top 25 Tweeted URLs, after unshortening (counts in rank order): 1460, 929, 613, 384, 351, 338, 244, 242, 223, 208, 202, 189, 187, 184, 167, 143, 127, 123, 107, 100, 99, 90, 87, 81, 80

Deletes

So how do you get a sense of what has been deleted from your data? While it might make sense to write a program to do this eventually, I find it can be useful to work in a more exploratory way on the command line first, and then when I’ve got a good workflow I can put that into a program. I guess if I were a real data scientist I would be doing this in R or a Jupyter notebook at least. But I still enjoy working at the command line, so here are the steps I took to identify tweets that had been deleted from the original dataset:

First I extracted and sorted the tweet identifiers into a separate file using jq:

jq -r '.id_str' tweets.json | sort -n > ids.csv

Then I hydrated those ids with twarc. If the tweet has been deleted since it was first collected it cannot be hydrated:

twarc hydrate ids.csv > hydrated.json

I extracted these hydrated ids:

jq -r .id_str hydrated.json | sort -n > ids-hydrated.csv

Then I used diff to compare the pre- and post-hydration ids, and used a little bit of Perl to strip off the diff formatting, which results in a file of tweet ids that have been deleted.

diff ids.csv ids-hydrated.csv | perl -ne 'if (/< (\d+)/) {print "$1\n";}' > ids-deleted.csv

Since we have the data that was deleted we can now build a file of just deleted tweets. Maybe there’s a fancy way to do this on the command line but I found it easiest to write a little bit of Python to do it:
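Something along these lines would do it. This is a minimal sketch rather than the exact script used for the post: it assumes tweets.json holds one JSON tweet per line (as twarc writes it) and ids-deleted.csv holds one tweet id per line, and the function name `write_deleted` is made up for illustration.

```python
import json
import os


def write_deleted(tweets_path, deleted_ids_path, out_path):
    """Copy tweets whose id appears in deleted_ids_path to out_path."""
    # Load the deleted ids into a set for fast membership tests.
    with open(deleted_ids_path) as ids_file:
        deleted_ids = {line.strip() for line in ids_file if line.strip()}

    count = 0
    with open(tweets_path) as tweets_file, open(out_path, "w") as out_file:
        for line in tweets_file:
            if not line.strip():
                continue
            tweet = json.loads(line)
            if tweet["id_str"] in deleted_ids:
                # Write the original line so the tweet JSON is unchanged.
                out_file.write(line if line.endswith("\n") else line + "\n")
                count += 1
    return count


if __name__ == "__main__":
    # Only run against the real files when they are present.
    if os.path.exists("tweets.json") and os.path.exists("ids-deleted.csv"):
        print(write_deleted("tweets.json", "ids-deleted.csv", "delete.json"))
```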

After you run it you should have a file delete.json. You might want to convert it to CSV with something like twarc’s utility to inspect in a spreadsheet program.

Calling these tweets deleted is a bit of a misnomer. A user could have deleted their tweet, deleted their account, protected their account, or Twitter could have decided to suspend the user’s account. Also, the user could have done none of these things and simply retweeted a user who had done one of these things. Untangling what happened is left for another blog post. To be continued…

District Dispatch: ALA celebrates 10 years of Google Policy Fellows

planet code4lib - Mon, 2017-08-14 19:50

ALA celebrates the 10th anniversary of the Google Policy Fellowship Program.

Last Friday, we said goodbye to our 2017 Google Policy Fellow Alisa Holahan. The week before her departure, she and OITP hosted a lunch and discussion for this year’s cohort of Google Policy Fellows.

Similar to the six Policy Fellows lunches we have hosted in the past (in 2016, 2015, 2014, 2013, 2012 and 2011), the gathering was an opportunity for the Fellows to explore the intersection of information technology policy and libraries. Fellows from various policy organizations including the Center for Democracy and Technology, the National Hispanic Media Coalition, and Public Knowledge attended to learn more about ALA’s role in shaping technology policy and addressing library needs.

Alan Inouye, Marijke Visser and Carrie Russell shared a brief overview of their roles and the focus at OITP and I represented the intersection between OITP and the OGR. After introductions, the conversation turned to a series of questions: How does the Ready to Code initiative support workforce innovation? How does the Washington Office set priorities? How do we decide our portfolios of work? The informal question-and-answer format generated an interesting exchange around libraries’ roles and interests in technology and innovation.

Most notably, this summer’s lunch marked the 10th anniversary of the Google Policy Fellow Program, of which ALA is a founding host organization. Since 2008, we have encouraged master’s and doctoral students in library and information studies or related areas with an interest in national public policy to apply and have now amassed a decade of alumni, including:

As the expanding role of libraries of all types evolves, the need for information professionals with Washington experience and savvy will continue to grow. The Washington Office is privileged to have hosted ten early-career professionals and to provide the means for them to obtain direct experience in national policy making.

The post ALA celebrates 10 years of Google Policy Fellows appeared first on District Dispatch.

Islandora: Islandora 7.x Committers Calls are on a new platform

planet code4lib - Mon, 2017-08-14 18:27

Ever wanted to come to the bi-weekly Islandora Committers Call but don't like using Skype? Good news! We have moved the call to the FreeConferenceCallHD line used by the weekly Islandora CLAW Calls and many of our Interest Groups. You can join by phone or using the web interface in your browser, and join us in the #islandora IRC channel on Freenode for sharing links and text comments (there's a web version for that as well, if you don't want to run an IRC client).

How to join:

Lucidworks: Using A Query Classifier To Dynamically Boost Solr Ranking

planet code4lib - Mon, 2017-08-14 18:14

As we count down to the annual Lucene/Solr Revolution conference in Las Vegas next month, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Target’s Howard Wan’s talk, “Using A Query Classifier To Dynamically Boost Solr Ranking”.

About 40% of our queries are ambiguous, which can result in products from many categories. For example, the query “red apple” can match the following products: a red apple ipod (electronic category), red apple fruit (fresh produce), red apple iphone case (accessories). It is desirable to have a classifier to instruct Solr to boost items from the desired category. In addition, for a search engine with a small index, a good percentage of the queries may have little or no results. Is it possible to use the classifier to solve both problems? This talk discusses a classifier built from behavior data which can dynamically re-classify the query to solve both problems.
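As a rough illustration of the boosting half of that idea (not the implementation from the talk), the sketch below turns a classifier's category probabilities into Solr `bq` boost queries for the edismax parser. The `CLASSIFIER` table, the `category` field name, and the thresholds are all assumptions made for the example.

```python
from urllib.parse import urlencode

# Hypothetical classifier output: query -> [(category, probability), ...].
# A real classifier would be learned from behavior data; this is hard-coded.
CLASSIFIER = {
    "red apple": [("fresh produce", 0.6), ("electronics", 0.3), ("accessories", 0.1)],
}


def boost_params(query, min_probability=0.2, max_boost=10.0):
    """Build Solr edismax parameters that boost the likely categories."""
    params = [("defType", "edismax"), ("q", query)]
    for category, probability in CLASSIFIER.get(query, []):
        if probability < min_probability:
            continue
        # Scale the boost by the classifier's confidence in the category.
        boost = round(probability * max_boost, 2)
        params.append(("bq", f'category:"{category}"^{boost}'))
    return params


params = boost_params("red apple")
print(urlencode(params))
```

Queries the classifier has never seen simply get no `bq` parameters, so they fall back to ordinary ranking.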


Join us at Lucene/Solr Revolution 2017, the biggest open source conference dedicated to Apache Lucene/Solr on September 12-15, 2017 in Las Vegas, Nevada. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Using A Query Classifier To Dynamically Boost Solr Ranking appeared first on Lucidworks.

Evergreen ILS: Evergreen 3.0 development update #14: results of the second Feedback Fest

planet code4lib - Mon, 2017-08-14 16:43

Display at the circulation desk of the Lucius Beebe Memorial Library in Wakefield, MA. Photo courtesy Jeff Klapes.

Since the previous update, another 61 patches have made their way into Evergreen’s master branch. Last week was the second Feedback Fest for the 3.0 release cycle, so let’s talk about some numbers.

A total of 52 bugs either had a pull request when the Feedback Fest started or got one during the course of the week. Of those, 23 resulted in patches getting merged. Some merges of particular note include:

  • A new database view for more efficiently retrieving lists of current and aged loans (bug 1695007)
  • Teaching Evergreen about the time zone that a library belongs to, to help format due date and due displays (bug 1705524)
  • Teaching the web staff client how to print spine and item labels (bug 1704873)
  • Adding two more popularity metrics (bug 1688096)
  • Several improvements to Evergreen’s SIP server
  • Numerous fixes to the internationalization infrastructure; the net effect is that more strings are available to translators

The fix for bug 1480432 warrants a particular shout-out, as it clarifies a subtle detail of the staff permissions system. It’s possible for a staff member to receive a given permission at more than one depth. For example, a staff member could have VIEW_USER permission at the branch level because of their primary profile, but could also have it at system level because of a secondary permission group. Prior to the bugfix, the actual depth that was applied would depend on hidden details of exactly when and how the permission was granted. As a result of the bugfix, there’s now a clear rule: if a staff account is granted a given permission multiple times, the depth applied is the one that grants the permission as broadly as possible. Many thanks to Michele Morgan for the bugfix and to Cesar Velez for writing unit tests.
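The new rule is easy to state in code. The following is only an illustrative sketch, not Evergreen's actual implementation, and it assumes that a smaller depth number means a broader scope (for example, 1 for system and 2 for branch).

```python
def effective_depth(grants, permission):
    """Return the broadest depth at which `permission` is granted, or None.

    grants is a list of (permission_code, depth) pairs. ASSUMPTION: a smaller
    depth number means a broader scope (e.g. 1 = system, 2 = branch).
    """
    depths = [depth for code, depth in grants if code == permission]
    return min(depths) if depths else None


grants = [
    ("VIEW_USER", 2),  # branch level, from the primary profile
    ("VIEW_USER", 1),  # system level, from a secondary permission group
]
print(effective_depth(grants, "VIEW_USER"))  # 1
```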

Back to the numbers: three of the feedback fest bugs were tested and signed off, but not yet merged. Four of them got a pull request during the fest, while nine received comments of various sorts. In total, 75% of the feedback fest bugs received substantive feedback, which matches the results of the first Feedback Fest.
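For anyone checking the arithmetic, the 75% figure follows from the counts above, assuming the four categories do not overlap:

```python
total = 52            # bugs with a pull request at some point during the fest
merged = 23
signed_off = 3        # tested and signed off, but not yet merged
new_pull_request = 4  # got a pull request during the fest
commented = 9

with_feedback = merged + signed_off + new_pull_request + commented
share = with_feedback / total
print(with_feedback, f"{share:.0%}")  # 39 75%
```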

I would like to thank the following people who participated in Feedback Fest #2:

  • Galen Charlton
  • Jeff Davis
  • Bill Erickson
  • Jason Etheridge
  • Rogan Hamby
  • Blake Henderson
  • Kyle Huckins
  • Kathy Lussier
  • Terran McCanna
  • Michele Morgan
  • Andrea Neiman
  • Dan Pearl
  • Mike Rylander
  • Dan Scott
  • Chris Sharp
  • Ben Shum
  • Cesar Velez

It should be noted that several people also reported or commented on bugs that didn’t have a pull request.

As a reminder, the deadline for feature slush is this Friday, 18 August. Feature slush means that new features meant for inclusion in 3.0 should have a pull request on their bugs by end of business on the 18th (although as release manager and as somebody who hopes to be viewing the solar eclipse on the 21st, there is some wiggle room with that deadline). There will not be similar wiggle room for the feature freeze deadline of 1 September.

There are some large pull requests awaiting review, including the web staff serials module and the copy alert and suppression matrix, so please keep testing, y’all!

Duck trivia

Ducks can help streamline work at the circulation desk. The image in today’s post is courtesy of Jeff Klapes of the Lucius Beebe Memorial Library in Wakefield, MA, who says, “We have a monthly duck display at our circulation desk to divert the attention of the little ones while their parents check things out. It’s been wildly successful.”


Updates on the progress to Evergreen 3.0 will be published every Friday until general release of 3.0.0. If you have material to contribute to the updates, please get them to Galen Charlton by Thursday morning.

District Dispatch: Make a nomination for the 2018 National Medal for Museum and Library Service

planet code4lib - Mon, 2017-08-14 15:00

The Institute of Museum and Library Services is now accepting nominations for the 2018 National Medal for Museum and Library Service awards. Anyone — an employee, a board member, a member of the public, or an elected official — can nominate an institution. To be considered, the institution must complete and return a nomination form by October 2, 2017.

In 2017, libraries from Iowa, California, South Carolina, Minnesota and Maine were selected to receive this high honor. In 2018, IMLS is particularly interested in libraries with programs that build community cohesion and serve as catalysts for positive community change, including programs that provide services for veterans and military families, at-risk children and families, the un- and under-employed and youth confronting barriers to STEM-related employment.

The ten winning institutions will be honored at a ceremony in D.C. and are invited to host a two-day visit from StoryCorps to record community member stories. You can hear some of these moving impact stories, dating back to 2009, here.

Institutions interested in being considered should read the nomination form carefully and contact the designated program contacts with questions. The library-specific program contact for the National Medal for Museum and Library Service is Laura McKenzie, who can be reached at (202) 653-4644 or by email at

As we noted with the announcement of the National Leadership Grants for Libraries and the Laura Bush 21st Century Librarian programs (deadline September 1!), an increase in nominations for the National Medal would send a signal to our Members of Congress that libraries are vital institutions in communities across the country. So, don’t delay: write your nomination today and celebrate the library workers who make our country a better place to live.

The post Make a nomination for the 2018 National Medal for Museum and Library Service appeared first on District Dispatch.

District Dispatch: Voting now open! Bring ALA to SxSW

planet code4lib - Fri, 2017-08-11 13:22

For a third year, ALA is planning for Austin’s annual South by Southwest (SXSW) festival. As in years past, we need your help to bring our programs to the SXSW stage. Public voting counts for 30 percent of SXSW’s decision to pick a panel, so please join us in voting for these two ALA programs.

YALSA Past President Linda Braun and OITP Fellow Mega Subramaniam have partnered with IMLS and Google to put on a panel called “Ready to Code: Libraries Supporting CS Education.” Here’s the description:

In the last decade, libraries have transformed, from the traditional book provider to become the community anchor where the next generation technology innovations take place. Drawing from initiatives such as the Libraries Ready to Code project and IMLS grants, this session provides perspectives from thought leaders in industry, government, universities, and libraries on the role libraries play in our national CS education ecosystem and work together with communities to support youth success. You can view the video here.

The Office for Diversity, Literacy and Outreach Services and the Office for Intellectual Freedom are partnering to offer a workshop entitled “Free Speech or Hate Speech?” Here is the quick summary:

The Supreme Court agrees with the rock group, The Slants, that their name is protected under the first amendment. An increase in uses of hate speech in the United States has sparked a new fire in the debate: Is hate speech free speech? Is it a hate crime? The lines can be blurry. We will explore the history of intellectual freedom challenges and how to respond to traumatic interactions involving hate speech that are not seen as “crimes.” See the video here.

As you might remember, in 2016, ALA and Benetech collaborated on a session about leveraging 3D printers to create new learning opportunities for students with disabilities. And, in 2015, OITP partnered with D.C. Public Library and MapStory to present an interactive panel about the ways that libraries foster entrepreneurship and creativity.

Become a registered voter in the Panel Picker process by signing up for an account and get your votes in before Friday, August 25. (Also, be sure to keyword search “library” in the Panelpicker – there are over 30 related programs!)

You will have the opportunity to “Vote Up” or “Vote Down” on all session ideas (votes will be kept private) and add comments to each page. We encourage you to use this commenting feature to show support and even engage with the voting community.

The post Voting now open! Bring ALA to SxSW appeared first on District Dispatch.

FOSS4Lib Recent Releases: veraPDF - 1.8.1

planet code4lib - Fri, 2017-08-11 12:48

Last updated August 11, 2017. Created by Peter Murray on August 11, 2017.

Package: veraPDF
Release Date: Wednesday, August 9, 2017

Harvard Library Innovation Lab: Git physical

planet code4lib - Thu, 2017-08-10 20:40

This is a guest blog post by our summer fellow Miglena Minkova.

Last week at LIL, I had the pleasure of running a pilot of git physical, the first part of a series of workshops aimed at introducing git to artists and designers through creative challenges. In this workshop I focused on covering the basics: three-tree architecture, simple git workflow, and commands (add, commit, push). These lessons were fairly standard but contained a twist: The whole thing was completely analogue!

The participants, a diverse group of fellows and interns, engaged in a simplified version control exercise. Each participant was tasked with designing a postcard about their summer at LIL. Following basic git workflow, they took their designs from the working directory, through the staging index, to the version database, and to the remote repository where they displayed them. In the process they “pushed” five versions of their postcard design, each accompanied by a commit note. Working in this way allowed them to experience the workflow in a familiar setting and learn the basics in an interactive and social environment. By the end of the workshop everyone had ideas on how to implement git in their work and was eager to learn more.
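For readers who think in code, the postcard exercise can be mirrored by a toy model. This is only a sketch of the workflow's shape and not how git stores anything; the class `ToyRepo` and its attribute names are invented for the illustration.

```python
class ToyRepo:
    """Toy model of the workflow: working directory -> index -> repository -> remote."""

    def __init__(self):
        self.working = {}   # working directory: name -> content
        self.index = {}     # staging index: snapshot of added files
        self.commits = []   # local version database: (note, snapshot) pairs
        self.remote = []    # remote repository: commits that were pushed

    def add(self, name):
        # Stage the current content of one file.
        self.index[name] = self.working[name]

    def commit(self, note):
        # Record a snapshot of the index together with a commit note.
        self.commits.append((note, dict(self.index)))

    def push(self):
        # Copy the commits the remote does not have yet.
        self.remote.extend(self.commits[len(self.remote):])


repo = ToyRepo()
for version in range(1, 6):
    repo.working["postcard"] = f"design v{version}"
    repo.add("postcard")
    repo.commit(f"postcard version {version}")
    repo.push()

print(len(repo.remote))  # 5
print(repo.remote[-1])
```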

Timelapse gif by Doyung Lee

Not to mention some top-notch artwork was created.

The workshop was followed by a short debriefing session and Q&A.

Check GitHub for more info.

Alongside this overview, I want to share some of the thinking that went behind the scenes.

Starting with some background. Artists and designers perform version control in their work but in a much different way than developers do with git. They often use error-prone strategies to track document changes such as saving files in multiple places using obscure file naming conventions, working in large master files, or relying on in-built software features. At best these strategies result in inconsistencies, duplication and heavy use of disk storage, and at worst, irreversible mistakes, loss of work, and multiple conflicting documents. Despite experiencing some of the same problems as developers, artists and designers are largely unfamiliar with git (exceptions exist).

The impetus for teaching artists and designers git was my personal experience with it. I had not been formally introduced to the concept of version control or git through my studies or my work. I discovered git during the final year of my MLIS degree when I worked with an artist to create a modular open source digital edition of an artist’s book. This project helped me see git as a ubiquitous tool with versatile application across multiple contexts and practices, the common denominator of which is making, editing, and sharing digital documents.

I realized that I was faced with a challenge: How do I get artists and designers excited about learning git?

I used my experience as a design educated digital librarian to create relatable content and tailor delivery to the specific characteristics of the audience: highly visual, creative, and non-technical.

Why create another git workshop? There are, after all, plenty of good quality learning resources out there and I have no intention of reinventing the wheel or competing with existing learning resources. However, I have noticed some gaps that I wanted to address through my workshop.

First of all, I wanted to focus on accessibility and have everyone start on equal ground with no prior knowledge or technical skills required. Even the simplest beginner level tutorials and training materials rely heavily on technology and the CLI (Command Line Interface) as a way of introducing new concepts. Notoriously intimidating for non-technical folk, the CLI seems inevitable given the fact that git is a command line tool. The inherent expectation of using technology to teach git means that people need to learn the architecture, terminology, workflow, commands, and the CLI all at the same time. This seems ambitious and a tad unrealistic for an audience of artists and designers.

I decided to put the technology on hold and combine several pedagogies to leverage learning: active learning, learning through doing, and project-based learning. To contextualize the topic, I embedded elements of the practice of artists and designers by including an open ended creative challenge to serve as a trigger and an end goal. I toyed with different creative challenges using deconstruction, generative design, and surrealist techniques. However this seemed to steer away from the main goal of the workshop. It also made it challenging to narrow down the scope, especially as I realized that no single workflow can embrace the diversity of creative practices. At the end, I chose to focus on versioning a combination of image and text in a single document. This helped to define the learning objectives, and cover only one functionality: the basic git workflow.

I considered it important to introduce concepts gradually in a familiar setting using analogue means to visualize black-box concepts and processes. I wanted to employ abstraction to present the git workflow in a tangible, easily digestible, and memorable way. To achieve this the physical environment and set up was crucial for the delivery of the learning objectives.

In terms of designing the workspace, I assigned and labelled different areas of the space to represent the components of git’s architecture. I made use of directional arrows to illustrate the workflow sequence alongside the commands that needed to be executed and used a “remote” as a way of displaying each version on a timeline. Low-tech or no-tech solutions such as carbon paper were used to make multiple copies. It took several experiments to get the sketchpad layering right, especially as I did not want to introduce manual redundancies that do little justice to git.

Thinking over the audience interaction, I had considered role play and collaboration. However these modes did not enable each participant to go through the whole workflow and fell short of addressing the learning objectives. Instead I provided each participant with initial instructions to guide them through the basic git workflow and repeat it over and over again using their own design work. The workshop was followed by a debriefing which articulated the specific benefits for artists and designers, outlined use cases depending on the type of work they produce, and featured some existing examples of artwork done using git. This was to emphasize that the workshop did not offer a one-size-fits-all solution, but rather a tool that artists and designers can experiment with and adopt in many different ways in their work.

I want to thank Becky and Casey for their editing work.

Going forward, I am planning to develop a series of workshops introducing other git functionality such as basic merging and branching, diff-ing, and more, and tag a lab exercise to each of them. By providing multiple ways of processing the same information I am hoping that participants will successfully connect the workshop experience and git practice.

Terry Reese: MarcEdit 7 Z39.50/SRU Client Wireframes

planet code4lib - Thu, 2017-08-10 17:37

One of the appalling discoveries when taking a closer look at the MarcEdit 6 codebase was the presence of 3(!) Z39.50 clients, all using slightly different codebases. This happened because of the ILS integration, the direct Z39.50 database editing, and the actual Z39.50 client. In the Mac version, these clients are all the same thing – so I wanted to emulate that approach in the Windows/Linux version. And as a plus, maybe I would stop (or reduce) my utter disdain at having to support Z39.50 generally, within any library program that I work with.

* Sidebar – I really, really, really can’t stand working with Z39.50.  SRU is a fine replacement for the protocol, and yet, over the 10-15 years that it’s been available, SRU remains a fringe protocol.  That tells me two things:

  1. Library vendors generally have rejected this as a protocol and there are some good reasons for this…most vendors that support it (and I’m thinking specifically about ExLibris) use a custom profile.  This is a pain in the ass because the custom profile requires code to handle foreign namespaces.  This wouldn’t be a problem if this only happened occasionally, but it happens all the time.  Every SRU implementation works best if you use their custom profiles.  I think what made Z39.50 work is the well-defined set of Bib-1 attributes.  The flexibility in SRU is a good thing, but I also think it’s why very few people support it, and fewer understand how it actually works.
  2. That SRU is a poor solution to begin with.  Hey, just like OAI-PMH, we created library standards to work on the web.  If we had it to do over again, we’d do it differently.  We should probably do it differently at this point…because supporting SRU in software is basically just checking a box.  People have heard about it, they ask for it, but pretty much no one uses it.

By consolidating the Z39.50 client code, I’m able to clean out a lot of old code, and better yet, actually focus on a few improvements (which has been hard because I make improvements in the main client, but forget to port them everywhere else).  The main improvements that I’ll be applying have to do with searching multiple databases.  Single search has always allowed users to select up to 5 databases to query.  I may remove that limit.  It’s kind of an arbitrary one.  However, I’ll also be adding this functionality to the batch search.  When doing multiple database searches in batch, users will have an option to take all records, the first record found, or potentially (I haven’t worked this one out), records based on order of database preference.
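To make the three batch options concrete, here is a small sketch of how the selection might behave. It is written in Python purely for illustration and is not MarcEdit code; the mode names and the reading of "database preference" (take everything from the highest-ranked database that returned results) are assumptions, since that option is still undecided.

```python
def select_records(results, mode, preference=None):
    """Pick records from a multi-database search.

    results maps a database name to the list of records it returned, in the
    order the databases were queried. mode is one of "all", "first", or
    "preferred" (hypothetical names for the three options described above).
    """
    if mode == "all":
        return [record for records in results.values() for record in records]
    if mode == "first":
        # First record found, in query order.
        for records in results.values():
            if records:
                return [records[0]]
        return []
    if mode == "preferred":
        # Records from the highest-ranked database that returned anything.
        for name in preference or []:
            if results.get(name):
                return list(results[name])
        return []
    raise ValueError(f"unknown mode: {mode}")


results = {"db_a": [], "db_b": ["b1", "b2"], "db_c": ["c1"]}
print(select_records(results, "all"))                          # ['b1', 'b2', 'c1']
print(select_records(results, "first"))                        # ['b1']
print(select_records(results, "preferred", ["db_c", "db_b"]))  # ['c1']
```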


Main Window:

Z39.50 Database Settings:

SRU Settings:

There will be a preferences panel as well (haven’t created it yet), but this is where you will set proxy information and notes related to batch preferences.  You will no longer need to set title field or limits, as the limits are moving to the search screen (this has always needed to be variable) and the title field data is being pulled from preferences already set in the program preferences.

One of the benefits of making these changes is that it folds the Z39.50/SRU client into the main MarcEdit application (rather than a program that was shelled to), which allows me to leverage the same accessibility platform that has been developed for the rest of the application. It also highlights one of the other changes happening in MarcEdit 7. MarcEdit 6- is a collection of about 7 or 8 individual executables. This makes sense in some cases, less sense in others. I’m evaluating all the stand-alone programs, and if I replicate their functionality in the main program, it means that while having these as separate programs might initially have been a good thing, the structure of the application has changed, and so the code (both external and internal) needs to be re-evaluated and put in one spot. In the application, this has meant that in some cases, like the Z39.50 client, the code will move into MarcEdit proper (rather than being a separate program called mebatch.exe), and for SQL interactions, it will mean that I’ll create a single shared library (rather than replicating code between three different component parts: the SQL explorer, the ILS integration, and the local database query tooling).

Questions, let me know.


FOSS4Lib Recent Releases: veraPDF - 1.8.1

planet code4lib - Thu, 2017-08-10 12:04

Last updated August 10, 2017. Created by Peter Murray on August 10, 2017.

Package: veraPDF
Release Date: Wednesday, August 9, 2017

Open Knowledge Foundation: An approach to building open databases

planet code4lib - Thu, 2017-08-10 07:30

This post has been co-authored by Adam Kariv, Vitor Baptista, and Paul Walsh.

Open Knowledge International (OKI) recently coordinated a two-day work sprint as a way to touch base with partners in the Open Data for Tax Justice project. Our initial writeup of the sprint can be found here.

Phase I of the project ended in February 2017 with the publication of What Do They Pay?, a white paper that outlines the need for a public database on the tax contributions and economic activities of multinational companies.

The overarching goal of the sprint was to start some work towards such a database, by replicating data collection processes we’ve used in other projects, and to provide a space for domain expert partners to potentially use this data for some exploratory investigative work. We had limited time, a limited budget, and we are pleased with the discussions and ideas that came out of the sprint.

One attendee, Tim Davies, criticised the approach we took in the technical stream of the sprint. The problem with the criticism is the extrapolation of one stream of activity during a two-day event to posit an entire approach to a project. We think exploration and prototyping should be part of any healthy project, and that is exactly what we did with our technical work in the two-day sprint.

Reflecting on the discussion presents a good opportunity here to look more generally at how we, as an organisation, bring technical capacity to projects such as Open Data for Tax Justice. Of course, we often bring much more than technical capacity to a project, and Open Data for Tax Justice is no different in that regard, being mostly a research project to date.

In particular, we’ll take a look at the technical approach we used for the two-day sprint. While this is not the only approach towards technical projects we employ at OKI, it has proven useful on projects driven by the creation of new databases.

An approach

Almost all projects that OKI either leads on, or participates in, have multiple partners. OKI generally participates in one of three capacities (sometimes, all three):

  • Technical design and implementation of open data platforms and apps.
  • Research and thought leadership on openness and data.
  • Dissemination and facilitating participation, often by bringing the “open data community” to interact with domain specific actors.

Only the first capacity is strictly technical, but each capacity does, more often than not, touch on technical issues around open data.

Some projects have an important component around the creation of new databases targeting a particular domain. Open Data for Tax Justice is one such project, as are OpenTrials, and the Subsidy Stories project, which itself is a part of OpenSpending.

While most projects have partners, usually domain experts, it does not mean that collaboration is consistent or equally distributed over the project life cycle. There are many reasons for this to be the case, such as the strengths and weaknesses of our team and those of our partners, priorities identified in the field, and, of course, project scope and funding.

With this as the backdrop for projects we engage in generally, we’ll focus for the rest of this post on aspects when we bring technical capacity to a project. As a team (the Product Team at OKI), we are currently iterating on an approach in such projects, based on the following concepts:

  • Replication and reuse
  • Data provenance and reproducibility
  • Centralise data, decentralise views
  • Data wrangling before data standards

While not applicable to all projects, we’ve found this approach useful when contributing to projects that involve building a database to, ultimately, unlock the potential to use data towards social change.

Replication and reuse

We highly value the replication of processes and the reuse of tooling across projects. Replication and reuse enable us to reduce technical costs, focus more on the domain at hand, and share knowledge on common patterns across open data projects. In terms of technical capacity, the Product Team is becoming quite effective at this, with a strong body of processes and tooling ready for use.

This also means that each project enables us to iterate on such processes and tooling, integrating new learnings. Many of these learnings come from interactions with partners and users, and others come from working with data.

In the recent Open Data for Tax Justice sprint, we invited various partners to share experiences working in this field and try a prototype we built to extract data from country-by-country reports to a central database. It was developed in about a week, thanks to the reuse of processes and tools from other projects and contexts.

When our partners started looking into this database, they had questions that could only be answered by looking back to the original reports. They needed to check the footnotes and other context around the data, which weren’t available in the database yet. We’ve encountered similar use cases in both OpenSpending and OpenTrials, so we can build upon these experiences to iterate towards a reusable solution for the Open Data for Tax Justice project.

By doing this enough times in different contexts, we’re able to solve common issues quickly, freeing more time to focus on the unique challenges each project brings.

Data provenance and reproducibility

We think that data provenance, and reproducibility of views on data, is absolutely essential to building databases with a long and useful future.

What exactly is data provenance? A useful definition from Wikipedia is “… (d)ata provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins”. Depending on the way provenance is implemented in a project, it can also be a powerful tool for reproducibility of the data.

Most work around open data at present does not consider data provenance and reproducibility as an essential aspect of working with open data. We think this is to the detriment of the ecosystem’s broader goals of seeing open data drive social change: the credible use of data from projects with no provenance or reproducibility built in to the creation of databases is significantly diminished in our “post truth” era.

Our current approach builds data provenance and reproducibility right into the heart of building a database. There is a clear, documented record of every action performed on data, from the extraction of source data, through to normalisation processes, and right to the creation of records in a database. The connection between source data and processed data is not lost, and, importantly, the entire data pipeline can be reproduced by others.
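To make the idea of a documented record more concrete, here is a minimal sketch in Python. It is not the tooling we use in our projects; the function names, the sample rows, and the source file name are all hypothetical, chosen only to illustrate one way of logging every action performed on data so that a pipeline can be re-run and checked by others.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows):
    """Return a stable SHA-256 digest of a list of JSON-serialisable rows."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_pipeline(source_name, rows, steps):
    """Apply each step in order and log what happened to the data.

    `steps` is a list of (description, function) pairs; each function
    takes a list of rows and returns a new list of rows.
    """
    log = [{"action": "extract", "source": source_name,
            "output_sha256": fingerprint(rows)}]
    for description, step in steps:
        before = fingerprint(rows)
        rows = step(rows)
        log.append({
            "action": description,
            "input_sha256": before,
            "output_sha256": fingerprint(rows),
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        })
    return rows, log


def strip_whitespace(rows):
    """Normalisation step: trim surrounding whitespace from string values."""
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]


def drop_empty_revenue(rows):
    """Normalisation step: discard rows that carry no revenue figure."""
    return [r for r in rows if r.get("revenue") not in ("", None)]


# Hypothetical rows standing in for data extracted from a source report.
raw = [
    {"company": " Example Corp ", "country": "GB", "revenue": "120"},
    {"company": "Example Corp", "country": "FR", "revenue": ""},
]
steps = [
    ("strip whitespace", strip_whitespace),
    ("drop rows without revenue", drop_empty_revenue),
]
clean, provenance = run_pipeline("hypothetical-report.pdf", raw, steps)
```

Each log entry links the digest of its input to the digest of the previous step's output, so anyone who re-runs the same steps on the same source can confirm that they arrive at the same data.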

We acknowledge that a clear constraint of this approach, in its current form, is that it is necessarily more technical than, say, ad hoc extraction and manipulation with spreadsheets and other consumer tools used in manual data extraction processes. However, as such approaches make data provenance and reproducibility harder, because there is no history of the changes made, or where the data comes from, we are willing to accept this more technical approach and iterate on ways to reduce technical barriers.

We hope to see more actors in the open data ecosystem integrating provenance and reproducibility right into their data work. Without doing so, we greatly reduce the ability for open data to be used in an investigative capacity, and likewise, we diminish the possibility of using the outputs of open data projects in the wider establishment of facts about the world. Recent work on beneficial ownership data takes a step in this direction, leveraging the PROV-DM standard to declare data provenance facts.

Centralise data, decentralise views

In OpenSpending, OpenTrials, and our initial exploratory work on Open Data for Tax Justice, there is an overarching theme to how we have approached data work, user stories and use cases, and co-design with domain experts: “centralise data, decentralise views”.

Building a central database for open data in a given domain affords ways of interacting with such data that are extremely difficult, or impossible, when the data is deliberately kept decentralised. Centralised databases make investigative work that uses the data easier, and allow for the discovery, for example, of patterns across entities and time that can be very hard to discover if data is decentralised.

Additionally, with a strong approach to data provenance and reproducibility in place, the complete replication of a centralised database is relatively easy, and very much encouraged. This somewhat mitigates a major concern with centralised databases, namely that they imply some type of “vendor lock-in”.

Views on data are better when decentralised. By “views on data” we refer to visualisations, apps, websites – any user-facing presentation of data. While having data centralised potentially enables richer views, data almost always needs to be presented with additional context, localised, framed in a particular narrative, or otherwise presented in unique ways that will never be best served from a central point.

Further, decentralised usage of data provides a feedback mechanism for iteration on the central database. For example, providing commonly used contextual data, establishing clear use cases for enrichment and reconciliation of measures and dimensions in the data, and so on.
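As a rough illustration of “centralise data, decentralise views”, the sketch below keeps one central table and lets two independent consumers derive their own presentation from it. The table layout, the sample figures, and the function names are hypothetical and are not taken from any of the projects mentioned here; SQLite is used only because it ships with Python.

```python
import sqlite3

# One central store; the schema and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (company TEXT, country TEXT, year INTEGER, tax_paid REAL)"
)
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?, ?)",
    [
        ("Example Corp", "GB", 2015, 10.0),
        ("Example Corp", "FR", 2015, 4.0),
        ("Sample Ltd", "GB", 2015, 6.0),
        ("Sample Ltd", "GB", 2016, 8.0),
    ],
)


def tax_by_country(conn, year):
    """View for a country-focused site: total tax paid per country in a year."""
    rows = conn.execute(
        "SELECT country, SUM(tax_paid) FROM reports "
        "WHERE year = ? GROUP BY country ORDER BY country",
        (year,),
    )
    return dict(rows.fetchall())


def company_timeline(conn, company):
    """View for a company profile page: tax paid per year by one company."""
    rows = conn.execute(
        "SELECT year, SUM(tax_paid) FROM reports "
        "WHERE company = ? GROUP BY year ORDER BY year",
        (company,),
    )
    return rows.fetchall()
```

Both views answer different questions from the same records, and a cross-entity question (such as totals per country) only becomes a single query because the data sits in one place.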

Data wrangling before data standards

As a team, we are interested in, engage with, and also author, open data standards. However, we are very wary of efforts to establish a data standard before working with large amounts of data that such a standard is supposed to represent.

Data standards that are developed too early are bound to make untested assumptions about the world they seek to formalise (the data itself). There is a dilemma here of describing the world “as it is”, or, “as we would like it to be”. No doubt, a “standards first” approach is valid in some situations, often, it seems, in the realm of policy. We do not consider such an approach flawed, but rather, one with its own pros and cons.

We prefer to work with data, right from extraction and processing, through to user interaction, before working towards public standards, specifications, or any other type of formalisation of the data for a given domain.

Our process generally follows this pattern:

  • Get to know available data and establish (with domain experts) initial use cases.
  • Attempt to map what we do not know (e.g.: data that is not yet publicly accessible), as this clearly impacts both usage of the data, and formalisation of a standard.
  • Start data work by prescribing the absolute minimum data specification to use the data (i.e.: meet some or all of the identified use cases).
  • Implement data infrastructure that makes it simple to ingest large amounts of data, and also to keep the data specification reactive to change.
  • Integrate data from a wide variety of sources, and, with partners and users, work on ways to improve participation / contribution of data.
  • Repeat the above steps towards a fairly stable specification for the data.
  • Consider extracting this specification into a data standard.
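The middle steps of this pattern, a minimal specification that stays reactive to change, can be sketched roughly as follows. This is an assumption-laden illustration rather than our actual tooling: the field names, `MINIMAL_SPEC`, `validate`, and `extend_spec` are all hypothetical.

```python
# The smallest specification that still serves the first use cases.
MINIMAL_SPEC = {"company": str, "country": str, "year": int}


def validate(row, spec):
    """Return a list of problems; an empty list means the row meets the spec.

    Fields that are not named in the spec are tolerated, so new data can be
    ingested before anyone decides whether it belongs in the specification.
    """
    problems = []
    for field, expected_type in spec.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


def extend_spec(spec, **new_fields):
    """Return a new spec with extra required fields, leaving the old one intact."""
    return {**spec, **new_fields}


row = {"company": "Example Corp", "country": "GB", "year": 2015}

# A later iteration, after working with more data, adds a revenue field.
spec_v2 = extend_spec(MINIMAL_SPEC, revenue=float)
```

Here `validate(row, MINIMAL_SPEC)` reports no problems, while `validate(row, spec_v2)` reports the missing revenue field. Keeping each iteration of the specification as its own small object makes it easy to see which rows would need attention before a change is adopted.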

Throughout this entire process, there is a constant feedback loop with domain expert partners, as well as a range of users interested in the data.


We want to be very clear that we do not think that the above approach is the only way to work towards a database in a data-driven project.

Design (project design, technical design, interactive design, and so on) emerges from context. Design is also a sequence of choices, and each choice has an opportunity cost based on various constraints that are present in any activity.

In projects we engage in around open databases, technology is a means to other, social ends. Collaboration around data is generally facilitated by technology, but we do not think the technological basis for this collaboration should be limited to existing consumer-facing tools, especially if such tools have hidden costs on the path to other important goals, like data provenance and reproducibility. Better tools and processes for collaboration will only emerge over time if we allow exploration and experimentation.

We think it is important to understand general approaches to working with open data, and how they may manifest within a single project, or across a range of projects. Project work is not static, and definitely not reducible to snapshots of activity within a wider project life cycle.

Certain approaches emphasise different ends. We’ve tried above to highlight some pros and cons of our approach, especially around data provenance and reproducibility, and data standards.

In closing, we’d like to invite others interested in approaches to building open databases to engage in a broader discussion around these themes, as well as a discussion around short term and long term goals of such projects. From our perspective, we think there could be a great deal of value for the ecosystem around open data generally – CSOs, NGOs, governments, domain experts, funders – via a proactive discussion or series of posts with a multitude of voices. Join the discussion here if this is of interest to you.

Library Tech Talk (U of Michigan): Problems with Authority

planet code4lib - Thu, 2017-08-10 00:00

MARC Authority records can be used to create a map of the Federal Government that will help with collection development and analysis. Unfortunately, MARC is not designed for this purpose, so we have to find ways to work around the MARC format's limitations.

District Dispatch: The Copyright Office belongs in the Library of Congress

planet code4lib - Wed, 2017-08-09 23:10

In “Lessons From History: The Copyright Office Belongs in the Library of Congress,” a new report from the American Library Association (ALA), Google Policy Fellow Alisa Holahan compellingly documents that Congress repeatedly has considered the best locus for the U.S. Copyright Office (CO) and consistently reaffirmed that the Library of Congress (Library) is its most effective and efficient home.

The U.S. Copyright Office is located in the James Madison Memorial Building of the Library of Congress in Washington, D.C. Photo credit: The Architect of the Capitol

Prompted by persistent legislative and other proposals to remove the CO from the Library in both the current and most recent Congresses, Holahan’s analysis comprehensively reviews the history of the locus of copyright activities from 1870 to the present day. In addition to providing a longer historical perspective, the Report finds that Congress has examined this issue at roughly 20-year intervals, declining to separate the CO and Library each time.

Notable developments occurred, for example, in the deliberations leading to the Copyright Act of 1976. In particular, an argument was made that the CO performs executive branch functions, and thus its placement in the legislative branch is unconstitutional. The 1976 Act left the U.S. Copyright Office in the Library. Moreover, in 1978, the U.S. Court of Appeals for the Fourth Circuit in Eltra Corp. v. Ringer directly addressed this constitutionality question. It found no constitutional problem with the CO’s and Library’s co-location because the Copyright Office operates under the direction of the Librarian of Congress, an appointee of the president.

Holahan also notes another challenge via the Omnibus Patent Act of 1996, which proposed that copyright, patent and trademark activities be consolidated under a single government corporation. This Act was opposed by then-Register of Copyrights Marybeth Peters and then-Librarian of Congress James Billington, along with an array of stakeholders that included the American Society of Composers, Authors and Publishers (ASCAP), the American Society of Journalists and Authors, and the library, book publishing and scholarly communities. This legislation was not enacted, thereby leaving the placement of the Copyright Office unchanged.

The neutral question that launched this research was to identify anything of relevance in the historical record regarding the placement of the Copyright Office. ALA recommends Holahan’s research (refer to her full report for additional historical milestones and further details) to anyone contemplating whether the Register of Copyrights should be appointed by the President or whether the Copyright Office should be relocated from the Library.

In a nutshell, these questions have been asked and answered the same way many times already: “it ain’t broke, so don’t fix it.” Holahan’s research and report will inform ALA’s continuing lobbying and policy advocacy on these questions as we work to protect and enhance copyright’s role in promoting the creation and dissemination of knowledge for all.

The post The Copyright Office belongs in the Library of Congress appeared first on District Dispatch.

LITA: Jobs in Information Technology: August 9, 2017

planet code4lib - Wed, 2017-08-09 19:36

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Oregon State University Libraries and Press, Library Technician 3, Corvallis, OR

New York University Division of Libraries, Supervisor, Metadata Production & Management, New York, NY

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Lucidworks: Customizing Ranking Models in Solr to Improve Relevance for Enterprise Search

planet code4lib - Wed, 2017-08-09 17:38

As we count down to the annual Lucene/Solr Revolution conference in Las Vegas next month, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Salesforce’s Ammar Haris & Joe Zeimen’s talk, “Customizing Ranking Models in Solr to Improve Relevance for Enterprise Search”.

Solr provides a suite of built-in capabilities for tuning a wide variety of relevance-related parameters. Index-time and/or query-time boosts, along with function queries, are a good way to tweak these parameters and improve the ranking of search results. In the enterprise space, however, given the diversity of customers and documents, there is a much greater need for control over the ranking models and for the ability to run multiple custom ranking models.

This talk discusses the motivation behind creating an L2 ranker and the use of Solr Search Component for running different types of ranking models at Salesforce.

Join us at Lucene/Solr Revolution 2017, the biggest open source conference dedicated to Apache Lucene/Solr on September 12-15, 2017 in Las Vegas, Nevada. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Customizing Ranking Models in Solr to Improve Relevance for Enterprise Search appeared first on Lucidworks.

HangingTogether: The Transformation of Academic Library Collecting

planet code4lib - Wed, 2017-08-09 15:12


In October 2016, I was privileged to attend a seminal event, The Transformation of Academic Library Collecting: A Symposium Inspired by Dan C. Hazen, along with colleagues Lorcan Dempsey and Constance Malpas who were speaking. This occasion brought together a group of eminent library leaders, research collections specialists and scholars at Norton’s Woods Conference Center in Cambridge, MA, to commemorate the career of Dan Hazen (1947–2015) and reflect upon the transformation of academic library collections. Hazen was a towering figure in the world of research collections management and was personally known to many attendees; his impact on the profession of academic librarianship and the shape of research collections is widely recognized and continues to shape practice and policy in major research libraries.

Sarah Thomas (Vice President for the Harvard Library and University Librarian & Roy E. Larsen Librarian for the Faculty of Arts and Sciences) and other colleagues had done a remarkable job not only selecting speakers but designing an event that allowed for discussion and reflection. We felt that the event needed to be documented in some way, and were pleased that Sarah endorsed this idea. The resulting publication, The Transformation of Academic Library Collecting: A Synthesis of the Harvard Library’s Hazen Memorial Symposium, is now freely available from our website.

Drawing from presentations and audience discussions at the symposium, this publication examines some central themes important to a broader conversation about the future of academic library collections, in particular, collective collections and the reimagination of what have traditionally been called “special” and archival collections (now referred to as unique and distinctive collections). The publication also includes a foreword about Dan Hazen and his work by Sarah Thomas.

The Transformation of Academic Library Collecting: A Synthesis of the Harvard Library’s Hazen Memorial Symposium is not only a tribute to Hazen’s impact on the academic library community, but also a primer on where academic library collections could be headed in the future. We hope you will read, share, and use this as a basis for continuing an important conversation.

FOSS4Lib Upcoming Events: VIVO Camp 2017, Duke Univ

planet code4lib - Wed, 2017-08-09 15:02
Date: Thursday, November 9, 2017 - 08:30 to Saturday, November 11, 2017 - 12:00
Supports: Vivo

Last updated August 9, 2017. Created by Peter Murray on August 9, 2017.

VIVO Camp registration information

