Feed aggregator

Terry Reese: MarcEdit 6.3 Updates (all versions)

planet code4lib - Sat, 2017-08-19 16:19

I spent some time this week working on a few updates for MarcEdit 6.3.  Full change log below (for all versions).


* Bug Fix: MarcEditor: When processing data with right to left characters, the embedded markers were getting flagged by the validator.
* Bug Fix: MarcEditor: When processing data with right to left characters, I’ve heard that there have been some occasions when the markers are making it into the binary files (they shouldn’t).  I can’t recreate it, but I’ve strengthened the filters to make sure that these markers are removed when the mnemonic file format is saved.
* Bug Fix: Linked data tool:  When creating VIAF entries in the $0, the subfield code can be dropped.  This was missed because VIAF should no longer be added to the $0, so I assumed this was no longer a valid use case.  However, local practice in some places is overriding best practice.  This has been fixed.

A note on the MarcEditor changes.  The processing of right to left characters is something I was aware of with regard to the validator – but in all my testing and unit tests, the data was always filtered prior to compiling the data.  These markers that are inserted are for display, as noted here:  However, on the pymarc list, there was apparently an instance where these markers slipped through.  The conversation can be found here:!topic/pymarc/5zxuOh0fVuc.  I posted a long response on the list, but I think it’s being held in moderation (I’m a new member of the list).  Generally, here’s what I found: I can’t recreate it, but I have updated the code to ensure that this shouldn’t happen.  Once a mnemonic file is saved (and that happens prior to compiling), these markers are removed from the file.  If you find this isn’t the case, let me know.  I can add the filter down into the MARCEngine level, but I’d rather not, as there are cases where these values may be present (legally)…this is why the filtering happens in the Editor, where it can assess their use and, if the markers are present already, determine if they are used correctly.
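For readers who want to check their own files, here is a minimal Python sketch of this kind of filter. It is not MarcEdit's code. It assumes the display markers are the standard Unicode bidirectional formatting characters (U+200E, U+200F, U+202A through U+202E, U+2066 through U+2069); the post does not say exactly which characters MarcEdit inserts, and the function name is hypothetical.

```python
import re

# Assumption: the "markers" are Unicode bidirectional formatting characters.
# LRM/RLM (200E/200F), embeddings and overrides (202A-202E), isolates (2066-2069).
BIDI_MARKS = re.compile("[\u200e\u200f\u202a-\u202e\u2066-\u2069]")


def strip_bidi_marks(text: str) -> str:
    """Return text with bidirectional formatting characters removed."""
    return BIDI_MARKS.sub("", text)


def has_bidi_marks(text: str) -> bool:
    """Report whether any such marker is present (useful before compiling)."""
    return BIDI_MARKS.search(text) is not None
```

Because these characters can legitimately appear in some data, a blanket strip like this belongs at a point where their use can be reviewed, not at the lowest level of a processing engine.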

Downloads can be picked up through the automated update tool, or via


Karen G. Schneider: Neutrality is anything but

planet code4lib - Sat, 2017-08-19 15:22

“We watch people dragged away and sucker-punched at rallies as they clumsily try to be an early-warning system for what they fear lies ahead.” — Unwittingly prophetic me, March, 2016.

Sheet cake photo by Flickr user Glane23. CC by 2.0

Sometime after last November, I realized something very strange was happening with my clothes. My slacks had suddenly shrunk, even if I hadn’t washed them. After months of struggling to keep myself buttoned into my clothes, I gave up and purchased slacks and jeans one size larger. I call them my T***p Pants.

This post is about two things. It is about the lessons librarians are learning in this frightening era about the nuances and qualifications shadowing our deepest core values–an era so scary that quite a few of us, as Tina Fey observed, have acquired T***p Pants. And it’s also some advice, take it or leave it, on how to “be” in this era.

I suspect many librarians have had the same thoughts I have been sharing with a close circle of colleagues. Most librarians take pride in our commitment to free speech. We see ourselves as open to all viewpoints. But in today’s new normal, we have seen that even we have limits.

This week, the ACRL Board of Directors put out a statement condemning the violence in Charlottesville. That was the easy part. The Board then stated, “ACRL is unwavering in its long-standing commitment to free exchange of different viewpoints, but what happened in Charlottesville was not that; instead, it was terrorism masquerading as free expression.”

You can look at what happened in Charlottesville and say there was violence “from many sides,” some of it committed by “very fine people” who just happen to be Nazis surrounded by their own private militia of heavily-armed white nationalists. Or you can look at Charlottesville and see terrorism masquerading as free expression, where triumphant hordes descended upon a small university town under the guise of protecting some lame-ass statue of an American traitor, erected sixty years after the end of the Civil War, not coincidentally during a very busy era for the Klan. Decent people know the real reason the Nazis were in Charlottesville: to tell us they are empowered and emboldened by our highest elected leader.

There is no middle ground. You can’t look at Charlottesville and see everyday people innocently exercising First Amendment rights.

As I and many others have argued for some time now, libraries are not neutral.  Barbara Fister argues, “we stand for both intellectual freedom and against bigotry and hate, which means some freedoms are not countenanced.” She goes on to observe, “we don’t have all the answers, but some answers are wrong.”

It goes without saying that if some answers are wrong, so are some actions. In these extraordinary times, I found myself for the first time ever thinking the ACLU had gone too far; that there is a difference between an unpopular stand, and a stand that is morally unjustifiable. So I was relieved when the national ACLU concurred with its three Northern California chapters that “if white supremacists march into our towns armed to the teeth and with the intent to harm people, they are not engaging in activity protected by the United States Constitution. The First Amendment should never be used as a shield or sword to justify violence.”

But I was also sad, because once again, our innocence has been punctured and our values qualified. Every asterisk we put after “free speech” is painful. It may be necessary and important pain, but it is painful all the same. Many librarians are big-hearted people who like to think that our doors are open to everyone and that all viewpoints are welcome, and that enough good ideas, applied frequently, will change people. And that is actually very true, in many cases, and if I didn’t think it was true I would conclude I was in the wrong profession.

But we can’t change people who don’t want to be changed. Listen to this edition of The Daily, a podcast from the New York Times, where American fascists plan their activities. These are not people who are open to reason. As David Lankes wrote, “there are times when a community must face the fact that parts of that community are simply antithetical to the ultimate mission of a library.”

We urgently need to be as one voice as a profession around these issues. I was around for–was part of–the “filtering wars” of the 1990s, when libraries grappled with the implications of the Internet bringing all kinds of content into libraries, which also challenged our core values. When you’re hand-selecting the materials you share with your users, you can pretend you’re open to all points of view. The Internet challenged that pretense, and we struggled and fought, and were sometimes divided by opportunistic outsiders.

We are fortunate to have strong ALA leadership this year. The ALA Board and President came up swinging on Tuesday with an excellent presser that stated unequivocally that “the vile and racist actions and messages of the white supremacist and neo-Nazi groups in Charlottesville are in stark opposition to the ALA’s core values,” a statement that (in the tradition of ensuring chapters speak first) followed a strong statement from our Virginia state association.  ARL also chimed in with a stemwinder of a statement.  I’m sure we’ll see more.

But ALA’s statement also describes the mammoth horns of the library dilemma. As I wrote to colleagues, “My problem is I want to say I believe in free speech and yet every cell in my body resists the idea that we publicly support white supremacy by giving it space in our meeting rooms.” If you are in a library institution that has very little likelihood of exposure to this or similar crises, the answers can seem easy, and our work appears done. But for more vulnerable libraries, it is crucial that we are ready to speak with one voice, and that we be there for those libraries when they need us. How we get there is the big question.

I opened this post with an anecdote about my T***p pants, and I’ll wrap it up with a concern. It is so easy on social media to leap in to condemn, criticize, and pick apart ideas. Take this white guy, in an Internet rag, the week after the election, chastising people for not doing enough.  You know what’s not enough? Sitting on Twitter bitching about other people not doing enough. This week, Siva Vaidhyanathan posted a spirited defense of a Tina Fey skit where she addressed the stress and anxiety of these political times.  Siva is in the center of the storm, which gives him the authority to state an opinion about a sketch about Charlottesville. I thought Fey’s skit was insightful on many fronts. It addressed the humming anxiety women have felt since last November (if not earlier). It was–repeatedly–slyly critical of inaction: “love is love, Colin.” It even had a Ru Paul joke. A lot of people thought it was funny, but then the usual critics came out to call it naive, racist, un-funny, un-woke, advocating passivity, whatever.

We are in volatile times, and there are provocateurs from outside, but also from inside. Think. Breathe. Step away from the keyboard. Take a walk. Get to know the mute button in Twitter and the unfollow feature in Facebook. Pull yourself together and think about what you’re reading, and what you’re planning to say. Interrogate your thinking, your motives, your reactions.

I’ve read posts by librarians deriding their peers for creating subject guides on Charlottesville, saying instead we should be punching Nazis. Get a grip. First off, in real life, that scenario is unlikely to transpire. You, buried in that back cubicle in that library department, behind three layers of doors, are not encountering a Nazi any time soon, and if you did, I recommend fleeing, because that wackdoodle is likely accompanied by a trigger-happy militiaman carrying a loaded gun. (There is an entire discussion to be had about whether violence to violence is the politically astute response, but that’s for another day.) Second, most librarians understand that their everyday responses to what is going on in the world are not in and of themselves going to defeat the rise of fascism in America. But we are information specialists and it’s totally wonderful and cool to respond to our modern crisis with information, and we need to be supportive and not go immediately into how we are all failing the world. Give people a positive framework for more action, not scoldings for not doing enough.

In any volatile situation, we need to slow the eff down and ask how we’re being manipulated and to what end; that is a lesson the ACLU just learned the hard way. My colleague Michael Stephens is known for saying, “speak with a human voice.” I love his advice, and I would add, make it the best human voice you have. We need one another, more than we know.



Eric Lease Morgan: Freebo@ND and library catalogues

planet code4lib - Fri, 2017-08-18 19:06

Freebo@ND is a collection of early English book-like materials as well as a set of services provided against them. In order to use & understand items in the collection, some sort of finding tool — such as a catalogue — is all but required. Freebo@ND supports the more modern full text index which has become the current best practice finding tool, but Freebo@ND also offers a set of much more traditional library tools. This blog posting describes how & why the use of these more traditional tools can be beneficial to the reader/researcher. In the end, we will learn that “What is old is new again.”

An abbreviated history

A long time ago, in a galaxy far far away, library catalogues were simply accession lists. As new items were brought into the collection, new entries were appended to the list. Each item would then be given an identifier, and the item would be put into storage. It could then be very easily located. Search/browse the list, identify item(s) of interest, note identifier(s), retrieve item(s), and done.

As collections grew, the simple accession list proved to be not scalable because it was increasingly difficult to browse the growing list. Thus indexes were periodically created. These indexes were essentially lists of authors, titles, or topics/subjects, and each item on the list was associated with a title and/or a location code. The use of the index was then very similar to the use of the accession list. Search/browse the index, identify item(s) of interest, note location code(s), retrieve item(s), and done. While these indexes were extremely functional, they were difficult to maintain. As new items became a part of the collection it was impossible to insert them into the middle of the printed index(es). Consequently, the printed indexes were rarely up-to-date.

To overcome the limitations of the printed index(es), someone decided not to manifest them as books, but rather as cabinets (drawers) of cards — the venerable card catalogue. Using this technology, it was trivial to add new items to the index. Type up cards describing items, and insert cards into the appropriate drawer(s). Readers could then search/browse the drawers, identify item(s) of interest, note location code(s), retrieve item(s), and done.

It should be noted that these cards were formally… formatted. Specifically, they included “cross-references” enabling the reader to literally “hyperlink” around the card catalogue to find & identify additional items of interest. On the downside, these cross-references (and therefore the hyperlinks) were limited by design to three to five in number. If more than three to five cross-references were included, then the massive numbers of cards generated would quickly outpace the space allotted to the cabinets. After all, these cabinets came to dominate (and stereotype) libraries and librarianship. They occupied hundreds, if not thousands, of square feet, and whole departments of people (cataloguers) were employed to keep them maintained.

With the advent of computers, the catalogue cards became digitally manifested. Initially the digital manifestations were used to transmit bibliographic data from the Library of Congress to libraries who would print cards from the data. Eventually, the digital manifestations were used to create digital indexes, which eventually became the online catalogues of today. Thus, the discovery process continues. Search/browse the online catalogue. Identify items of interest. Note location code(s). Retrieve item(s). Done. But, for the most part, these catalogues do not meet reader expectations because the content of the indexes is merely bibliographic metadata (authors, titles, subjects, etc.) when advances in full text indexing have proven to be more effective. Alas, libraries simply do not have the full text of the books in their collections, and consequently libraries are not able to provide full text indexing services. †

What is old is new again

The catalogues representing the content of Freebo@ND are perfect examples of the history of catalogues as outlined above.

For all intents & purposes, Freebo@ND is YAVotSTC (“Yet Another Version of the Short-Title Catalogue”). In 1926 Pollard & Redgrave compiled an index of early English books entitled A Short-title Catalogue of books printed in England, Scotland, & Ireland and of English books printed abroad, 1475-1640. This catalogue became known as the “English short-title catalogue” or ESTC. [1] The catalogue’s purpose is succinctly stated on page xi:

The aim of this catalogue is to give abridged entries of all ‘English’ books, printed before the close of the year 1640, copies of which exist at the British Museum, the Bodleian, the Cambridge University Library, and the Henry E. Huntington Library, California, supplemented by additions from nearly one hundred and fifty other collections.

The 600-page book is essentially an author index beginning with the likes of George Abbot and ending with Ulrich Zwingli. Shakespeare begins on page 517, goes on for four pages, and includes STC (accession) numbers 22273 through 22366. And the catalogue functions very much like the catalogues of old. Articulate an author of interest. Look up the author in the index. Browse the listings found there. Note the abbreviation of libraries holding an item of interest. Visit library, and ultimately, look at the book.

The STC has a history and relatives, some of which is documented in a book entitled The English Short-Title Catalogue: Past, present, future and dating from 1998. [2] I was interested in two of the newer relatives of the Catalogue:

  1. English short title catalogue on CD-ROM 1473-1800 – This is an IBM DOS-based package supposedly enabling the researcher/scholar to search & browse the Catalogue’s bibliographic data, but I was unable to give the package a test drive since I did not have ready access to a DOS-based computer. [3] From the bibliographic description’s notes: “This catalogue on CD-ROM contains more than 25,000 of the total 36,000 records of titles in English in the British Library for the period 1473-1640. It also includes 105,000 records for the period 1641-1700, together with the most recent version of the ESTC file, approximately 312,000 records.”
  2. English Short Title Catalogue [as a website] – After collecting & indexing the “digital manifestations” describing items in the Catalogue, a Web-accessible version of the catalogue is available from the British Library. [4] From the about page: “The ‘English Short Title Catalogue’ (ESTC) began as the ‘Eighteenth Century Short Title Catalogue’, with a conference jointly sponsored by the American Society for Eighteenth-Century Studies and the British Library, held in London in June 1976. The aim of the original project was to create a machine-readable union catalogue of books, pamphlets and other ephemeral material printed in English-speaking countries from 1701 to 1800.” [5]

As outlined above, Freebo@ND is a collection of early English book-like materials as well as a set of services provided against them. The source data originates from the Text Creation Partnership, and it is manifested as a set of TEI/XML files with full/rich metadata as well as the mark up of every single word in every single document. To date, there are only 15,000 items in Freebo@ND, but when the project is complete, Freebo@ND ought to contain close to 60,000 items dating from 1460 to 1699. Given this data, Freebo@ND sports an online, full text index of the works collected to date. This online interface is field searchable and free text searchable, and it provides a facet browse interface. [6]

But wait! There’s more!! (And this is the point.) Because the complete bibliographic data is available from the original data, it has been possible to create printed catalogs/indexes akin to the catalogs/indexes of old. These catalogs/indexes are available for downloading, and they include:

  • main catalog – a list of everything ordered by “accession” number; use this file in conjunction with your software’s find function to search & browse the collection [7]
  • author index – a list of all the authors in the collection & pointers to their locations in the repository; use this to learn who wrote what & how often [8]
  • title index – a list of all the works in the collection ordered by title; this file is good for “known item searches” [9]
  • date index – like the author index, this file lists all the years of item publication and pointers to where those items can be found; use this to see what was published when [10]
  • subject index – a list of all the Library Of Congress subject headings used in the collection, and their associated items; use this file to learn the “aboutness” of the collection as a whole [11]

These catalogs/indexes are very useful. It is really easy to load them into your favorite text editor and to peruse them for items of interest. They are even more useful if they are printed! Using these catalogues/indexes it is very quick & easy to see how prolific any author was, how many items were published in a given year, and what the published items were about. The library profession’s current tools do not really support such functions. Moreover, and unlike the cool (“kewl”) online interfaces alluded to above, these printed catalogs are easily updated, duplicated, shareable, and if bound can stand the test of time. Let’s see any online catalog last more than a decade and be so inexpensively produced.
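To make the idea of an author index concrete, here is a small Python sketch of how one might be derived from a list of bibliographic records. It is an illustration only, not the code used to build the Freebo@ND files; the record fields ("id", "author") and the sample identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_author_index(records: Iterable[dict]) -> Dict[str, List[str]]:
    """Map each author to the identifiers of their items, sorted by author."""
    index = defaultdict(list)
    for record in records:
        index[record["author"]].append(record["id"])
    return {author: index[author] for author in sorted(index)}


def format_index(index: Dict[str, List[str]]) -> str:
    """Render the index as plain text: author, item count, then pointers."""
    lines = []
    for author, ids in index.items():
        lines.append(f"{author} ({len(ids)})")
        lines.extend(f"    {item_id}" for item_id in ids)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical records; real ones would come from the TEI/XML metadata.
    sample = [
        {"id": "X0002", "author": "Zwingli, Ulrich"},
        {"id": "X0001", "author": "Abbot, George"},
        {"id": "X0003", "author": "Abbot, George"},
    ]
    print(format_index(build_author_index(sample)))
```

The resulting plain text file can be searched with any editor's find function, or printed, and the count beside each name shows at a glance how prolific an author was.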

“What is old is new again.”


† Actually, even if libraries were to have the full text of their collection readily available, the venerable library catalogues would probably not be able to use the extra content. This is because the digital manifestations of the bibliographic data cannot be more than 100,000 characters long, and the existing online systems are not designed for full text indexing. To say the least, the inclusion of full text indexing in library catalogues would be revolutionary in scope, and it would also be the beginning of the end of traditional library cataloguing as we know it.

[1] Short-title catalogue or ESTC –
[2] Past, present, future –
[3] STC on CD-ROM –
[4] ESTC as website –
[5] ESTC about page –
[6] Freebo@ND search interface –
[7] main catalog –
[8] author index –
[9] title index –
[10] date index –
[11] subject index –

District Dispatch: Victory near in 20-year fight to provide public with CRS reports

planet code4lib - Fri, 2017-08-18 14:00

After nearly 20 years of advocacy by ALA, Congress has recently taken significant steps toward permanently assuring free public access to reports by the Congressional Research Service (CRS). Taxpayers fund these reports but generally have not been able to read them. ALA welcomes these moves to ensure the public can use these valuable aids to understanding public policy issues.

What are CRS Reports?
CRS is an agency, housed within the Library of Congress, that prepares public policy research for members of Congress. All members of Congress and their staffs have immediate access to these reports on topics ranging from avocado growing to zinc mining.

Political insiders know that these reports, produced by the nonpartisan expert staff at CRS, are excellent sources of information about nearly every conceivable public policy topic. But CRS reports have not been routinely published, and so they have only been accessible to those with a connection on Capitol Hill or through an unofficial third-party source.

ALA’s Calls for Public Access
ALA has long called for public access to CRS reports. ALA’s Council adopted a resolution on the topic in 1998, shortly before Sen. John McCain (R-Ariz.) and then-Rep. Chris Shays (R-Conn.) introduced the first legislation to post CRS reports online for public access. We have continued to advocate on the issue over the years, most recently by supporting the latest iteration of that legislation, the Equal Access to Congressional Research Service Reports Act.

What’s New
Both House and Senate appropriators have recently approved language to provide public access to CRS reports. Because appropriations are needed to fund the government, these are considered must-pass bills.

In the Senate, S. 1648 includes the language of the Equal Access to CRS Reports Act. In the House, similar provisions were included in H. Rept. 115-199: the report accompanying H.R. 3162 (which in turn was compiled into H.R. 3219).

What’s Next
Four key steps remain before we and our allies can declare victory in our nearly 20-year effort to provide public access to CRS reports:

  1. The House and Senate have to reconcile the (relatively minor) differences between their language on this issue;
  2. The provision has to survive any attempts to weaken or remove the language on the floor of the House or Senate when a reconciled bill or Report is considered;
  3. Both houses of Congress have to pass an identical bill; and
  4. The President has to sign it.

These are significant “ifs.” But, because these appropriations bills are necessary to keep the government open, there’s a real chance it will get done. Until then, ALA will continue to speak up for the public’s right to access this useful information.

The post Victory near in 20-year fight to provide public with CRS reports appeared first on District Dispatch.

FOSS4Lib Updated Packages: Spark OAI Harvester

planet code4lib - Fri, 2017-08-18 13:11

Last updated August 18, 2017. Created by Peter Murray on August 18, 2017.

The DPLA is launching an open-source tool for fast, large-scale data harvests from OAI repositories. The tool uses a Spark distributed processing engine to speed up and scale up the harvesting operation, and to perform complex analysis of the harvested data. It is helping us improve our internal workflows and provide better service to our hubs.  The Spark OAI Harvester is freely available and we hope that others working with interoperable cultural heritage or science data will find uses for it in their own projects.

Package Type: Metadata Manipulation
License: MIT License
Development Status: In Development
Operating System: Linux
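For readers unfamiliar with OAI harvesting, the protocol underneath (OAI-PMH) pages through a repository with resumption tokens. The Python sketch below shows only that paging logic on a single machine. It is not the Spark OAI Harvester's API and does not use Spark; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def first_request(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the URL of the initial ListRecords request."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def next_request(base_url: str, response_xml: str) -> Optional[str]:
    """Build the follow-up request from a response, or None when finished.

    A non-empty resumptionToken means more records remain; the token is
    sent back on its own with the verb.
    """
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    query = urlencode({"verb": "ListRecords", "resumptionToken": token.text.strip()})
    return f"{base_url}?{query}"
```

A harvester fetches the first URL, stores the records, and keeps calling `next_request` on each response until it returns None.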

Open Knowledge Foundation: Open Data Conference in Switzerland

planet code4lib - Fri, 2017-08-18 09:56

This year’s Open Data Conference in Switzerland was all about Open Smart Cities, Open Tourism & Transport Data, Open Science & Open Food Data. We learnt how Open Data can be a catalyst of digital transformation and a crucial factor for advancing data quality. We got insights into the role of open data in the daily work of journalists and learnt how open data portals make an important contribution to enabling Switzerland to remain a leader and innovator in the digital world.

Over 200 people attended the conference. Its program was composed of 8 keynotes and parallel afternoon tracks with a total of 18 workshops. A highlight of the conference was the visit of Pavel Richter, CEO of Open Knowledge International. Pavel emphasized that the purpose of Open Knowledge International lies in empowering civil society organisations to use open data to improve people’s lives, for instance by collaborating with human rights institutions. He cited a recent key argument for open data: “I can take it and put it somewhere else, in a safer place […] it works as a concept, because the data is not lost, it can be secured and re-used”. The entire Q&A with Pavel Richter and Barnaby Skinner in English is available here:

Another highlight was the closing keynote, held by the president of the École Polytechnique Fédérale de Lausanne, who spoke about “The role of ‘open’ in digital Switzerland” and emphasized that public access to scientific data should be the norm so that the rest of the world can also profit from it. His entire talk is available in English here:

Furthermore we curated the following material for you:

Ed Summers: Delete Forensics

planet code4lib - Fri, 2017-08-18 04:00

TL;DR Deleted tweets in a #unitetheright dataset seem to largely be the result of Twitter proactively suspending accounts. Surprisingly, a number of previously identified deletes now appear to be available, which suggests users are temporarily taking their accounts private. Image and video URLs from protected, suspended and deleted accounts/tweets appear to still be available. The same appears to be true of Facebook.

Note: Data Artist Erin Gallagher provided lots of feedback and ideas for what follows in this post. Follow her on Medium to see more of her work, and details about this dataset shortly.

In my last post I jotted down some notes about how to identify deleted Twitter data using the technique of hydration. But, as I said near the end, calling these tweets deletes obscures what actually happened to the tweet. A delete implies that a user has decided to delete their tweet. Certainly this can happen, but the reality is a bit more complicated. Here are the scenarios I can think of (please get in touch if you can think of more):

  1. The user could have decided to protect their account, or take it private. This will result in all their tweets becoming unavailable except to those users who are approved followers of the account.
  2. The user could have decided to delete their account, which has the effect of deleting all of their tweets.
  3. The user account could have been suspended by Twitter because it was identified as a source of spam or abuse of some kind.
  4. If the tweet is not itself a retweet the user could have simply decided to delete the individual tweet.
  5. If the tweet is a retweet then 1, 2, 3 or 4 may have happened to the original tweet.
  6. If the tweet is a retweet and none of 1-4 hold then the user deleted their retweet. The original tweet still exists, but it is no longer marked as retweeted by the given user.

I know, this is like an IRS form from hell right? So how could we check these things programmatically? Let’s take a look at them one by one.

  1. If an account has been protected you can go to the user’s Twitter profile on the web and look for the text “This account’s Tweets are protected.” in the HTML.
  2. If the account has been completely deleted you can go to the user’s Twitter profile on the web and you will get a HTTP 404 Not Found error.
  3. If the account has been suspended, attempting to fetch the user’s Twitter profile on the web will result in a HTTP 302 Found response that redirects to
  4. If the tweet is not a retweet and fetching the tweet on the web results in a HTTP 404 Not Found then the individual tweet has been deleted.
  5. If the tweet is a retweet and one of 1, 2, 3 or 4 happened to the original tweet then that’s why it is no longer available.
  6. If the tweet is a retweet and the original tweet is still available on the web then the user has decided to delete their retweet, or unretweet (I really hope that doesn’t become a word).
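The checks above can be sketched in Python roughly as follows. This is a simplified illustration, not the script in the twarc repository: the tweet dictionaries, the `retweet_of` key, and every label other than TWEET_OK are hypothetical, and the `fetch` function is passed in so that the logic can be tried without touching the network. Since the redirect target for suspended accounts is not given above, the sketch treats any 302 on a profile page as a suspension.

```python
from typing import Callable, NamedTuple, Optional


class Response(NamedTuple):
    status: int
    body: str = ""


Fetch = Callable[[str], Response]  # must NOT follow redirects


def check_user(screen_name: str, fetch: Fetch) -> Optional[str]:
    """Checks 1-3: look at the user's profile page on the web."""
    resp = fetch(f"https://twitter.com/{screen_name}")
    if resp.status == 404:
        return "USER_DELETED"
    if resp.status == 302:  # assumption: any redirect means suspended
        return "USER_SUSPENDED"
    if resp.status == 200 and "Tweets are protected" in resp.body:
        return "USER_PROTECTED"
    return None


def classify(tweet: dict, fetch: Fetch) -> str:
    """Return a label saying why a tweet is unavailable (or TWEET_OK)."""
    user_state = check_user(tweet["screen_name"], fetch)
    if user_state:
        return user_state
    original = tweet.get("retweet_of")  # hypothetical key: the original tweet
    if original is not None:
        original_state = classify(original, fetch)
        if original_state == "TWEET_OK":
            return "RETWEET_DELETED"  # check 6: the user unretweeted
        return "ORIGINAL_" + original_state  # check 5
    url = f"https://twitter.com/{tweet['screen_name']}/status/{tweet['id']}"
    if fetch(url).status == 404:
        return "TWEET_DELETED"  # check 4
    return "TWEET_OK"
```

With the requests library, a suitable `fetch` could wrap `requests.get(url, allow_redirects=False)` so that the 302 is visible instead of being followed.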

With this byzantine logic in hand it’s possible to write a program to do automated lookups on the live web, with some caching to prevent looking up the same information more than once. It is a bit slow because I added a sleep so as not to go at it too hard. The script also identifies itself with a link to the program on GitHub in the User-Agent string. I added this program to the utility scripts in the twarc repository.

So I ran it on the #unitetheright deletes I identified previously and here’s what it found:
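The caching and the sleep can be sketched as a wrapper around whatever function does the actual HTTP request. This is a minimal illustration with hypothetical names, not the code in the twarc utility script.

```python
import time
from typing import Any, Callable, Dict


def make_polite_fetch(
    raw_fetch: Callable[[str], Any],
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[str], Any]:
    """Wrap raw_fetch so each URL is looked up once, with a pause between lookups."""
    cache: Dict[str, Any] = {}

    def fetch(url: str) -> Any:
        if url in cache:
            return cache[url]  # already seen: no request, no pause
        sleep(delay)  # throttle only real lookups
        cache[url] = raw_fetch(url)
        return cache[url]

    return fetch
```

Caching matters here because many tweets in a dataset come from the same handful of accounts, so the same profile page would otherwise be requested over and over. A real `raw_fetch` would also set a descriptive User-Agent header on each request.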


I think it’s interesting to see that, at least with this dataset, the majority of the deletes were a result of Twitter proactively suspending users because of a tweet that had been retweeted a lot. Perhaps this is the result of Twitter monitoring other users flagging the user’s tweets as abusive or harmful, or blocking the user entirely. I think it speaks well of Twitter’s attempts to try to make their platform a more healthy environment. But of course we don’t know how many accounts ought to have been suspended, so we’re only seeing part of the story–the content that Twitter actually made efforts to address. But they appear to be trying, which is good to see.

Another stat that struck me as odd was the number of tweets that were actually available on the web (TWEET_OK). These are tweets that appeared to be unavailable three days ago when I hydrated my dataset. So in the past three days 980 tweets that appeared to be unavailable have reappeared. Since there’s no trash can on Twitter (you can’t undelete a tweet) that means that the creators of these tweets must have protected their account, and then flipped it back to public. I guess it’s also possible that Twitter suspended them, and then reversed their action. I’ve heard from other people who will protect their account when a tweet goes viral to protect themselves from abuse and unwanted attention, and then turn it back to public again when the storm passes. I think this could be evidence of that happening.

One unexpected thing that I noticed in the process of digging through the results is that even after an account has been suspended it appears that media URLs associated with their tweets still resolve. For example the polNewsForever account was suspended but their profile image still resolves. In fact videos and images that polNewsForever has posted also still seem to resolve. The same is true of actual deletes. I’m not going to reference the images and videos here because they were suspended for a reason. So you will need to take my word for it…or run an experiment yourself…

FWIW, a quick test on Facebook shows that it works the same way. I created a public post with an image, copied the URL for the image, deleted the post, and the image URL still worked. Maybe the content expires in their CDNs at some point? It would be weird if it just lived there forever like a dead neuron. I guess this highlights why it’s important to limit the distribution of the JSON data that contain these URLs.

Since the avatar URLs are still available it’s possible to go through the suspended accounts and look at their avatar images. Here’s what I found:

Notice the pattern? They aren’t eggheads, but pretty close. Another interesting thing to note is that 52% of the suspended accounts were created August 11, 2017 or after (the date of the march). So a significant number of the suspensions look like Twitter trying to curb traffic created by bots.

Open Knowledge Foundation: Open Data Handbook now available in the Nepali Language

planet code4lib - Thu, 2017-08-17 11:18

On 7 August 2017 Open Knowledge Nepal launched the first version of the Nepali Open Data Handbook – an introductory guidebook used by governments and civil society organizations around the world as an introduction and blueprint for open data projects. The book was launched by Mr. Krishna Hari Baskota, Chief Information Commissioner of the National Information Commission, Dr. Nama Raj Budhathoki, Executive Director of Kathmandu Living Labs, and Mr. Nikesh Balami, CEO of Open Knowledge Nepal, at an event organized at Moods Lounge, Bluebird Mall, Kathmandu. Around 50 people working in different sectors of open data attended the launch program.

The Open Data Handbook has been translated into more than 18 languages including Chinese, French, German, Indonesian, Italian, Japanese, Portuguese, Russian, and Spanish. It is now also available in Nepali. At the event, a hard copy version of the Open Data Handbook was launched, which included the content from the Global Open Data Handbook, licensing terms from the Open Definition, some notable Open Data projects of Nepal, and the process of requesting information from the Nepal government using the Right To Information Act.

Open Knowledge Nepal believes the Nepali version of the Open Data Handbook will work as a perfect resource for government and civil society organizations (CSOs) to expand their understanding of open data and, ultimately, reap its benefits.

Speaking at the event, Mr. Nikesh Balami, CEO of Open Knowledge Nepal, said, “I believe that this Nepali version of the Open Data Handbook will help government policymakers, leaders, and citizens understand open data in their native language. It will also be a useful resource for CSOs to use for their own open data awareness programs, as well as data journalists who rely on data openness to report on local stories.” He thanked the volunteers who contributed to the translation, feedback, and review of the Handbook.

Mr. Krishna Hari Baskota, Chief Information Commissioner stressed the need for people in government to understand the value of open data. He also remarked that while the Nepal government is already a treasure trove of data, there is a need for more data to be created and made open. He highlighted the journey traveled by the Nepal Government in the path of open data and motivated youths to join the momentum.

Dr. Nama Raj Budhathoki, Executive Director of Kathmandu Living Labs said, “There should be an equal balance between supply and demand side of data and it’s a perfect time for Nepal to shift from Creation to Use”. Dr. Budhathoki shared his experiences of creating open data with OpenStreetMap and household surveys, and acknowledged the need for use of open data.

Open Knowledge Nepal envisions the impact of the Open Data Handbook to be mainly around the four different themes of open data: improving government, empowering citizens, creating opportunity, and solving public problems. To achieve impact within these different themes, solely having a good supply of data is not enough. We also need to ensure that the demand side is strong by increasing innovation, engagement, and reusability of published data. This Handbook will make it easier for government officials and the citizens of Nepal to learn more about open data in their native language. In doing so, it will help create a balanced environment between the supply and demand side of data, which in the long run will help promote and institutionalize transparency, accountability and citizen engagement in Nepal.

Lucidworks: Fusion and JavaScript: Shared Scripts, Utility Functions and Unit Tests

planet code4lib - Wed, 2017-08-16 21:08

Lucidworks Fusion uses a data pipeline paradigm for both data ingestion (Index Pipelines) and for search (Query Pipelines).  A Pipeline consists of one or more ordered Pipeline Stages.  Each Stage takes input from the previous Stage and provides input to the following Stage. In the Index Pipeline case, the input is a document to be transformed prior to indexing in Apache Solr.

In the Query Pipelines case, the first stages manipulate a Query Request. A middle stage submits the request to Solr and the following stages can be used to manipulate the Query Response.

The out-of-the-box stages included in Lucidworks Fusion let the user perform many common tasks such as field mapping for an Index Pipeline or specialized Facet queries for the Query Pipeline.  However, as described in a previous article, many projects have specialized needs in which the flexibility of the JavaScript stage is needed.

The code snippets in this article have been simplified and shortened for convenience.  The full examples can be downloaded from my GitHub repo

Taking JavaScript to the Next Level with Shared Scripts, Utility Functions and Unit Tests

Throwing a few scripts into a pipeline to perform some customized lookups or parsing logic is all well and good, but sophisticated ingestion strategies could benefit from some more advanced techniques.

  • Reduce maintenance problems by reusing oft-needed utilities and functions.  Some of the advanced features of the Nashorn JavaScript engine largely eliminate the need to copy/paste code into multiple Pipelines.  Keeping a single copy reduces code maintenance problems.
  • Use a modern IDE for editing.  The code editor in Fusion is functional but it provides little help with code completion, syntax highlighting, identifying typos, illuminating global variables, or generally speeding development.
  • Use Unit Tests to help reduce bugs and ensure the health of a deployment.

Reusing Scripts

Lucidworks Fusion uses the standard Nashorn JavaScript engine which ships with Java 8.  The load() command, combined with an Immediately Invoked Function Expression (IIFE) allows a small pipeline script to load another script.  This allows common functionality to be shared across pipelines.  Here’s an example:

    var loadLibrary = function(url){
        var lib = null;
        try{
            logger.info('\n\n*********\n*Try to library load from: ' + url);
            lib = load(url);// jshint ignore:line
            logger.info('\n\n**********\n* The library loaded from: ' + url);
        }catch(e){
            logger.error('\n\n******\n* The script at ' + url + ' is missing or invalid\n' + e.message);
        }
        return lib;
    };

Get Help From an IDE

Any sort of JavaScript function or objects could be contained in the utilLib.js as shown above.  Below is a simple example of a library containing two handy functions.
Explanatory notes:

  • The wrapping structure i.e. (function(){…}).call(this); makes up the IIFE structure used to encapsulate the  util object.  While this is not strictly necessary, it provides a syntax easily understood by the IntelliJ IDE.
  • The globals comment at the top, as well as the jshint comment at the bottom, are hints to the JSHint code validation engine used in the IDE.  These suppress error conditions resulting from the Nashorn load() functionality and global variables set by the Java environment which invokes the JavaScript Pipeline Stage.
  • The IDE will have underlined potentially illegal code in red. The result is an opportunity to fix typos without having to repeatedly test-load the script and hunt thru a log file only to find a cryptic error message from the Nashorn engine.  Also, note the use of the “use strict” directive.  This tells JSHint to also look for things like the inadvertent declaration of global variables.

    /* globals Java,arguments*/
    (function(){
        "use strict";
        var util = {};

        util.isJavaType = function(obj){
            return (obj && typeof obj.getClass === 'function' &&
                typeof obj.notify === 'function' &&
                typeof obj.hashCode === 'function');
        };

        /**
         * For Java objects, return the short name,
         * e.g. 'String' for a java.lang.String
         *
         * JavaScript objects, usually use lower case.
         * e.g. 'string' for a JavaScript String
         */
        util.getTypeOf = function getTypeOf(obj){
            var typ = 'unknown';
            //test for java objects
            if (util.isJavaType(obj)){
                typ = obj.getClass().getSimpleName();
            } else if (obj === null){
                typ = 'null';
            } else if (typeof obj === 'undefined'){
                typ = 'undefined';
            } else if (typeof obj === 'string'){
                typ = 'string';
            } else if (Array.isArray(obj)){
                // typeof [] is 'object', so arrays are tested explicitly;
                // otherwise dates and plain objects would also match here
                typ = 'array';
            } else if (Object.prototype.toString.call(obj) === '[object Date]'){
                typ = 'date';
            } else {
                typ = obj ? typeof obj : typ;
            }
            return typ;
        };

        //return util to make it publicly accessible
        return util;
    }).call(this); // jshint ignore: line

Overview of Utility Functions

Here is a summary description of some of the utility functions included in utilLib.js.

index.concatMV(doc, fieldName, delim) Return a delimited String containing all values for a given field. If the names field contains values for 'James', 'Jim', 'Jamie', and 'Jim', calling index.concatMV(doc, 'names', ', ') would return "James, Jim, Jamie", with the repeated 'Jim' dropped.

index.getFieldNames(doc, pattern) Return an array of field names in doc which match the pattern regular expression.

index.trimField(doc, fieldName) Normalize whitespace in all values of the field specified.  Leading and trailing whitespace is removed and redundant whitespace within values is replaced with a single space.

util.concat(varargs) Here varargs can be one or more arguments of String or String[].  They will all be concatenated into a single String and returned.

util.dateToISOString(date) Convert a Java Date or JavaScript Date into an ISO 8601 formatted String.

util.dedup(arr) Remove redundant elements in an array.

util.decrypt(toDecrypt) Decrypt an AES encrypted String.

util.encrypt(toEncrypt) Encrypt a string with AES encryption.

util.getFusionConfigProperties() Read in the default Fusion config/ file and return it as a Java Properties object.

util.isoStringToDate(dateString) Convert an ISO 8601 formatted String into a Java Date.

util.queryHttp2Json(url) Perform an HTTP GET on a URL and parse the response into JSON.

util.stripTags(markupString) Remove markup tags from an HTML or XML string.

util.truncateString(text, len, useWordBoundary) Truncate text to a length of len.  If useWordBoundary is true break on the word boundary just before len.
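To make the descriptions above more concrete, here is a minimal sketch of how three of these helpers might behave. These are hypothetical stand-ins written for illustration, not the implementations shipped in utilLib.js; the `sketch` object name is an assumption, and the code sticks to ES5 syntax so that it should also run under Nashorn.

```javascript
// Hypothetical stand-ins for three of the utilities described above.
// They are NOT the implementations from utilLib.js.
var sketch = {};

// Remove redundant elements, keeping the first occurrence of each.
sketch.dedup = function (arr) {
  var out = [];
  for (var i = 0; i < arr.length; i++) {
    if (out.indexOf(arr[i]) === -1) {
      out.push(arr[i]);
    }
  }
  return out;
};

// Truncate text to at most len characters. When useWordBoundary is true,
// cut at the last space inside that window so a word is not split.
sketch.truncateString = function (text, len, useWordBoundary) {
  if (text.length <= len) {
    return text;
  }
  var cut = text.substring(0, len);
  if (useWordBoundary) {
    var lastSpace = cut.lastIndexOf(' ');
    if (lastSpace > 0) {
      cut = cut.substring(0, lastSpace);
    }
  }
  return cut;
};

// Remove anything that looks like a markup tag. A regular expression is
// only a rough approximation and is not a safe HTML sanitizer.
sketch.stripTags = function (markupString) {
  return markupString.replace(/<[^>]*>/g, '');
};
```

For example, `sketch.dedup(['a', 'b', 'c', 'a', 'b'])` yields `['a', 'b', 'c']`, and `sketch.truncateString('The quick brown fox', 12, true)` yields `'The quick'`.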

Testing the Code

Automated unit testing of Fusion stages can be complicated.  Unit testing shared utility functions intended for use in Fusion stages is even more difficult.  A full test harness is beyond the scope of this blog, but the essentials can be accomplished with the command-line curl utility or a REST client like Postman.

  • Start with a well-known state in the form of a pre-made PipelineDocument. To see an example of the needed JSON, look at what is produced by the Logging Stage which comes with Fusion.
  • POST the PipelineDocument to Fusion using the Index Pipelines API.  You will need to pass an ID and Collection name as parameters, as well as the trailing “/index” path, in order to invoke the pipeline.
  • The POST operation should return the document as modified by the pipeline.  Inspect it and signal Pass or Fail events as needed.

Unit tests can also be performed manually by running the Pipeline within Fusion.  This could be part of a Workbench simulation or an actual Ingestion/Query operation.  The utilLib.js contains a rudimentary test harness for executing tests and comparing the results to an expected String value.  The results of tests are written to the connections.log or api.log and are also pushed into the Stage’s context map in the _runtime_test_results element as shown below.  The first test shows util.dedup being called with 'a', 'b', 'c', 'a', 'b' and the results not containing the duplicates. Other common tests are also performed.  For complete details see the index.runTests() function in utilLib.js.
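The idea of a rudimentary harness that compares each result to an expected String can be sketched as follows. This is an assumption-laden illustration rather than the index.runTests() function from utilLib.js: the `runTests` name, the shape of the test objects, and the shape of the result records are all hypothetical, and the sketch simply returns the results instead of writing them to a log or to the Stage’s context map.

```javascript
// Hypothetical harness: run each named test and compare String(result)
// to the expected string. Test and result shapes are illustrative only.
function runTests(tests) {
  var results = [];
  tests.forEach(function (t) {
    var actual;
    try {
      actual = String(t.run());
    } catch (e) {
      // A thrown error is recorded as a failed test rather than aborting the run.
      actual = 'ERROR: ' + e.message;
    }
    results.push({
      name: t.name,
      expected: t.expected,
      actual: actual,
      passed: actual === t.expected
    });
  });
  return results;
}

// Usage: one test that passes and one that deliberately fails.
var results = runTests([
  { name: 'join', run: function () { return ['a', 'b'].join('-'); }, expected: 'a-b' },
  { name: 'sum', run: function () { return 1 + 1; }, expected: '3' }
]);
```

Comparing on the String form keeps the harness simple, at the cost of not distinguishing, say, the number 2 from the string '2'.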


This article demonstrates how to load shareable JavaScript into Fusion’s Pipeline Stages so that common functions can be shared across pipelines.  It also contains several handy utility functions which can be used as-is or as building blocks in more complex data manipulations.  Additionally, ways to avoid common pitfalls such as JavaScript syntax typos and unintended global variables were shown.  Finally, a Pipeline Simulation was run and the sample unit-test results were shown.


Special thanks to Carlos Valcarcel and Robert Lucarini of Lucidworks as well as Patrick Hoeffel and Matt Kuiper at Polaris Alpha for their help and sample scripts.

The post Fusion and JavaScript: Shared Scripts, Utility Functions and Unit Tests appeared first on Lucidworks.

Eric Hellman: PubMed Lets Google Track User Searches

planet code4lib - Wed, 2017-08-16 19:56
CT scan of a Mesothelioma patient. CC BY-SA by Frank Gaillard

If you search on Google for "Best Mesothelioma Lawyer" and then click on one of the ads, Google can earn as much as a thousand dollars for your click. In general, Google can make a lot of money if it knows you're the type of user who's interested in rare types of cancer. So you might be surprised that Google gets to know everything you search for when you use PubMed, the search engine offered by the National Center for Biotechnology Information (NCBI), a service of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). Our tax dollars work really hard and return a lot of value at NCBI, but I was surprised to discover Google's advertising business is getting first crack at that value!

You may find this hard to believe, but you shouldn't take my word for it. Go and read the NLM Privacy Policy, in particular the section on "Demographic and Interest Data":
On some portions of our website we have enabled Google Analytics and other third-party software (listed below), to provide aggregate demographic and interest data of our visitors. This information cannot be used to identify you as an individual. While these tools are used by some websites to serve advertisements, NLM only uses them to measure demographic data. NLM has no control over advertisements served on other websites.

DoubleClick: NLM uses DoubleClick to understand the characteristics and demographics of the people who visit NLM sites. Only NLM staff conducts analyses on the aggregated data from DoubleClick. No personally identifiable information is collected by DoubleClick from NLM websites. The DoubleClick Privacy Policy is available at

You can opt-out of receiving DoubleClick advertising at

I will try to explain what this means and correct some of the misinformation it contains.

DoubleClick is Google's display advertising business. DoubleClick tracks users across websites using "cookies" to collect "demographic and interest information" about users. DoubleClick uses this information to improve its ad targeting. So for example, if a user's web browsing behavior suggests an interest in rare types of cancer, DoubleClick might show the user an ad about mesothelioma. All of this activity is fully disclosed in the DoubleClick Privacy Policy, which approximately 0% of PubMed's users have actually read. Despite what the NLM Privacy Policy says, you can't opt-out of receiving DoubleClick Advertising, you can only opt out of DoubleClick Ad Targeting. So instead of Mesothelioma ads, you'd probably be offered deals at

It's interesting to note that before February 21 of this year, there was no mention of DoubleClick in the privacy policy (see the previous policy). Despite the date, there's no reason to think that the new privacy policy is related to the change in administrations, as NIH Director Francis Collins was retained in his position by President Trump. More likely it's related to new leadership at NLM. In August of 2016, Dr. Patricia Flatley Brennan became NLM director. Dr. Brennan, a registered nurse and an engineer, has emphasized the role of data in the Library's mission. In an interview with the Washington Post, Brennan noted:
In the 21st century we’re moving into data as the basis. Instead of an experiment simply answering a question, it also generates a data set. We don’t have to repeat experiments to get more out of the data. This idea of moving from experiments to data has a lot of implications for the library of the future. Which is why I am not a librarian.

The "demographic and interest data" used by NLM is based on individual click data collected by Google Analytics. As I've previously written, Google Analytics only tracks users across websites if the site-per-site tracker IDs can be connected to a global tracker ID like the ones used by DoubleClick. What NLM is allowing Google to do is to connect the Google Analytics user data to the DoubleClick user data. So Google's advertising business gets to use all the Google Analytics data, and the Analytics data provided to NLM can include all the DoubleClick "demographic and interest" data.

What information does Google receive when you do a search on PubMed?
For every click or search, Google's servers receive:
  • your search term and result page URL
  • your DoubleClick user tracking ID
  • your referring page URL
  • your IP address
  • your browser software and operating system
While "only NLM staff conducts analyses on the aggregated data from DoubleClick", the DoubleClick tracking platform analyzes the unaggregated data from PubMed. And while it's true that "the demographic and interest data" of PubMed visitors cannot be used to identify them as individuals, the data collected by the Google trackers can trivially be used to identify as individuals any PubMed users who have Google accounts. Last year, Google changed its privacy policy to allow it to associate users' personal information with activity on sites like PubMed.

"Depending on your account settings, your activity on other sites and apps may be associated with your personal information in order to improve Google’s services and the ads delivered by Google."

So the bottom line is that Google's stated policies allow Google to associate a user's activity on PubMed with their personal information. We don't know if Google makes use of PubMed activity or if the data is saved at all, but NLM's privacy policy is misleading at best on this fact.

Does this matter? I have written that commercial medical journals deploy intense advertising trackers on their websites, far in excess of what NLM is doing. "Everybody" does it. And  we know that agencies of the US government spend billions of dollars sifting through web browsing data looking for terrorists, so why should NLM be any different? So what if Google gets a peek at PubMed user activity - they see such a huge amount of user data that PubMed is probably not even noticeable.

Google has done some interesting things with search data. For example, the "Google Flu Trends" and "Google Dengue Trends" projects studied patterns of searches for illness-related terms. Google could use the PubMed searches for similar investigations into health provider searches.

The puzzling thing about NLM's data surrender is the paltry benefit it returns. While Google gets un-aggregated, personally identifiable data, all NLM gets is some demographic and interest data about their users. Does NLM really want to better know the age, gender, and education level of PubMed users??? Turning on the privacy features of Google Analytics (i.e. NOT turning on DoubleClick) has a minimal impact on the usefulness of the usage data it provides.

Lines need to be drawn somewhere. If Google gets to use PubMed click data for its advertising, what comes next? Will researchers be examined as terror suspects if they read about nerve toxins or anthrax? Or perhaps inquiries into abortifacients or gender-related hormone therapies will become politically suspect. Perhaps someone will want a list of people looking for literature on genetically modified crops, or gun deaths, or vaccines? Libraries should not be going there.

So let's draw the line at advertising trackers in PubMed. PubMed is not something owned by a publishing company; PubMed belongs to all of us. PubMed has been a technology leader worthy of emulation by libraries around the world. They should be setting an example. If you agree with me that NLM should stop letting Google track PubMed users, let Dr. Brennan know (politely, of course).

  1. You may wonder if the US government has a policy about using third party services like Google Analytics and DoubleClick. Yes, there is a policy, and NLM appears to be pretty much in compliance with that policy.
  2. You might also wonder if Google has a special agreement for use of its services on US government websites. It does, but that agreement doesn't amend privacy policies. And yes, the person signing that policy for Google subsequently became the third CTO of the United States.
  3.  I recently presented a webinar which covered the basics of advertising in digital libraries in the National Network of Libraries of Medicine [NNLM] "Kernal of Knowledge" series.
  4. (8/16) Yes, this blog is served by Google. So if you start getting ads for privacy plug-ins...
  5. (8/16) is a tool you can use to see what goes on under the cover when you search on PubMed. Tip from Gary Price.

LITA: Jobs in Information Technology: August 16, 2017

planet code4lib - Wed, 2017-08-16 18:42

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

University of North Florida-Thomas G. Carpenter Library, Online Learning Librarian, Jacksonville, FL

Miami University Libraries, Web Services Librarian, Oxford, OH

John M. Pfau Library, CSU, San Bernardino, Information Technology Librarian, San Bernardino, CA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Dan Cohen: Age of Asymmetries

planet code4lib - Wed, 2017-08-16 17:56

Cory Doctorow’s 2008 novel Little Brother traces the fight between hacker teens and an overactive surveillance state emboldened by a terrorist attack in San Francisco. The novel details in great depth the digital tools of the hackers, especially the asymmetry of contemporary cryptography. Simply put, today’s encryption is based on mathematical functions that are really easy in one direction—multiplying two prime numbers to get a large number—and really hard in the opposite direction—figuring out the two prime numbers that were multiplied together to get that large number.
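The asymmetry described above can be felt even at toy scale. The sketch below is only an illustration under stated assumptions: the function names are made up for this example, the two primes are small enough to fit in an ordinary JavaScript number, and trial division is the most naive way to factor. Real cryptography uses primes hundreds of digits long and attackers use far better algorithms, but the gap between the two directions remains.

```javascript
// Easy direction: a single multiplication.
function multiply(p, q) {
  return p * q;
}

// Hard direction: find the two factors of n by trial division,
// counting how many candidate divisors had to be tried.
function factorSemiprime(n) {
  if (n % 2 === 0) {
    return { factors: [2, n / 2], steps: 1 };
  }
  var steps = 0;
  for (var d = 3; d * d <= n; d += 2) {
    steps += 1;
    if (n % d === 0) {
      return { factors: [d, n / d], steps: steps };
    }
  }
  // No divisor found: n itself is prime.
  return { factors: [n, 1], steps: steps };
}

// Two small primes chosen for illustration (the 10,000th and 100,000th primes).
var p = 104729;
var q = 1299709;
var n = multiply(p, q);          // one step
var result = factorSemiprime(n); // tens of thousands of steps
```

Each extra digit in the primes costs almost nothing on the multiplying side, while the work on the factoring side grows much faster.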

Doctorow’s speculative future also contains asymmetries that are more familiar to us. Terrorist attacks are, alas, all too easy to perpetrate and hard to prevent. On the internet, it is easy to be loud and to troll and to disseminate hate, and hard to counteract those forces and to more quietly forge bonds.

The mathematics of cryptography are immutable. There will always be an asymmetry between that which is easy and that which is hard. It is how we address the addressable asymmetries of our age, how we rebalance the unbalanced, that will determine what our future actually looks like.

Lucidworks: PlayStation and Lucene: Indexing 1M Docs per Second on 18 Servers

planet code4lib - Wed, 2017-08-16 17:19

As we count down to the annual Lucene/Solr Revolution conference in Las Vegas next month, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Sony Interactive Entertainment’s Alexander Filipchik’s talk, “PlayStation and Lucene: Indexing 1M Docs per Second on 18 Servers”.

PlayStation 4 is not just a gaming console. The PlayStation Network is a system that handles more than 70 million active users, and in order to create an awesome gaming experience it has to support personalized search at scale. The system that provides this personalized experience indexes up to 1M documents per second using Lucene and uses only 18 mid-sized Amazon instances.  This talk covers how the PlayStation team personalizes search for their users at scale with Lucene.


Join us at Lucene/Solr Revolution 2017, the biggest open source conference dedicated to Apache Lucene/Solr on September 12-15, 2017 in Las Vegas, Nevada. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post PlayStation and Lucene: Indexing 1M Docs per Second on 18 Servers appeared first on Lucidworks.

District Dispatch: Message from Reshma Saujani, founder of Girls Who Code

planet code4lib - Wed, 2017-08-16 15:00

Reshma Saujani, the founder and CEO of the national non-profit organization Girls Who Code, has taught computing skills to and inspired more than 10,000 girls across America. At the opening general session of the 2017 ALA Annual Conference this past June, Reshma spoke about Girls Who Code, how they are working to teach 100,000 girls to code by the end of 2018, and the organization’s many intersections with libraries.

Reshma is motivated to make sure that libraries – especially those who are interested in developing coding resources and programs – know about her free resources. As you will read in her message below, she invites ALA members and advocates to join the Girls Who Code movement.

To request a free Girls Who Code Starter Kit, including tips for leaders, giveaways and more, email:

I’m Reshma Saujani, the CEO & Founder of Girls Who Code, the national nonprofit working to close the gender gap in tech.

Computing skills are the most sought-after in the US job market, but girls across the US are being left behind. Today, less than a quarter of computing jobs are held by women, and that number is declining.

First off, I am not a coder. My background is as a lawyer and politician. In 2010, I was the first South Asian-American woman to run for Congress. When I was running for office, I spent a lot of time visiting schools. That’s when I noticed something. In every computer lab, I saw dozens of boys learning to code and training to be tech innovators. But there were barely any girls! This didn’t seem right to me. I did some research and learned that by 2020, there will be 1.4 million open jobs in computing, but fewer than 1 in 5 computer science graduates are women. With women making up almost half of our work force, it’s imperative for our economy that we’re preparing our girls for the future of work.

I decided I was going to teach girls to code and close the gender gap in tech. What started as an experiment with 20 girls in a New York City classroom has grown to a movement of 40,000 middle and high school girls across the states.

In 2017, we’re expanding our movement with the launch of a 13-book series as an invitation for girls everywhere to learn to code and change the world. These books include explanations of computer science concepts using real life examples; relatable characters and profiles of women in tech. It’s one of the first times that the story of computer science has been told through so many girls’ voices. We’re doing this because literary representation matters; one of the best ways to spark girls’ interest is to share stories of girls who look like them. When you teach girls to code, they become change agents and can build apps, programs, and movements to help tackle our country’s toughest problems.

With these books and our Clubs Program, Girls Who Code is seeking to teach 100,000 girls to code by the end of 2018. Clubs are free after-school programs for girls to use computer science to impact their community and join our sisterhood of supportive peers and role models. Clubs are led by Facilitators, who can be librarians, teachers, computer scientists, parents or volunteers from any background or field. Many Facilitators have no technical experience and learn to code alongside their Club members.

We hope you’ll join our movement by bringing these books and a Club to your library.

The post Message from Reshma Saujani, founder of Girls Who Code appeared first on District Dispatch.

Alf Eaton, Alf: ES6 export/import

planet code4lib - Wed, 2017-08-16 14:21
Without default export

utils.js (exporting)

    export const foo = () => 'ooo'
    export const bar = () => 'xxx'

    // or, equivalently:
    const foo = () => 'ooo'
    const bar = () => 'xxx'
    export { foo, bar }

app.js (importing)

    import { foo, bar } from './utils'
    foo()

    // or import everything as a namespace:
    import * as utils from './utils'

index.js (re-exporting)

    export * from './utils'

With default export

utils.js (exporting)

    export const bar = () => 'xxx'
    export default () => 'ooo'

app.js (importing)

    import foo, { bar } from './utils'

index.js (re-exporting)

    export { default as foo, bar } from './utils'

    // or:
    export { default } from './utils' // re-export only the default export

Open Knowledge Foundation: OpenSpending platform update

planet code4lib - Wed, 2017-08-16 10:11

OpenSpending is a free, open and global platform to search, visualise, and analyse fiscal data in the public sphere. This week, we soft launched an updated technical platform, with a newly designed landing page. Until now dubbed “OpenSpending Next”, this is a completely new iteration on the previous version of OpenSpending, which has been in use since 2011.

Landing page at


At the core of the updated platform is Fiscal Data Package. This is an open specification for describing and modelling fiscal data, and has been developed in collaboration with GIFT. Fiscal Data Package affords a flexible approach to standardising fiscal data, minimising constraints on publishers and source data via a modelling concept, and enabling progressive enhancement of data description over time. We discuss it in more detail below.

From today:

  • Publishers can get started publishing fiscal data with the interactive Packager, and explore the possibilities of the platform’s rich API, advanced visualisations, and options for integration.
  • Hackers can work on a modern stack designed to liberate fiscal data for good! Start with the docs, chat with us, or just start hacking.
  • Civil society can access a powerful suite of visualisation and analysis tools, running on top of a huge database of open fiscal data. Discover facts, generate insights, and develop stories. Talk with us to get started.

All the work that went into this new version of OpenSpending was only made possible by our funders along the way. We want to thank Hewlett, Adessium, GIFT, and the consortium for helping fund this work.

As this is now completely public, replacing the old OpenSpending platform, we do expect some bugs and issues. If you see anything, please help us by opening a ticket on our issue tracker.


The updated platform has been designed primarily around the concept of centralised data, decentralised views: we aim to create a large, and comprehensive, database of fiscal data, and provide various ways to access that data for others to build localised, context-specific applications on top. The major features of relevance to this approach are described below.

Fiscal Data Package

As mentioned above, Fiscal Data Package affords a flexible approach to standardising fiscal data. Fiscal Data Package is not a prescriptive standard, and imposes no strict requirements on source data files.

Instead, users “map” source data columns to “fiscal concepts”, such as amount, date, functional classification, and so on, so that systems that implement Fiscal Data Package can process a wide variety of sources without requiring change to the source data formats directly.

A minimal Fiscal Data Package only requires mapping an amount and a date concept. There are a range of additional concepts that make fiscal data usable and useful, and we encourage the mapping of these, but do not require them for a valid package.

Based on this general approach to specifying fiscal data with Fiscal Data Package, the updated OpenSpending likewise imposes no strict requirements on naming of columns, or the presence of columns, in the source data. Instead, users (of the graphical user interface, and also of the application programming interfaces) can provide any source data, and iteratively create a model on top of that data that declares the fiscal measures and dimensions.
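The mapping idea can be sketched in a few lines of Python. This is a minimal illustration, not the Fiscal Data Package specification itself: the layout of the `mapping` dictionary, the source column names, and the `validate_mapping` and `apply_mapping` helpers are all hypothetical. Only the rule that a minimal package maps an amount and a date comes from the description above.

```python
# Hypothetical sketch: not the Fiscal Data Package schema or any OpenSpending API.
REQUIRED_CONCEPTS = {"amount", "date"}  # per the text: the minimum a valid package maps


def validate_mapping(mapping):
    """Return the sorted list of required concepts that the mapping omits."""
    return sorted(REQUIRED_CONCEPTS - set(mapping.values()))


def apply_mapping(row, mapping):
    """Re-key one source row by fiscal concept, leaving unmapped columns out."""
    return {concept: row[column] for column, concept in mapping.items()}


# Source column names (invented) stay untouched; the model sits on top of them.
mapping = {
    "Importe": "amount",
    "Fecha": "date",
    "Funcion": "functional_classification",
}
row = {"Importe": "1200.50", "Fecha": "2017-01-31", "Funcion": "Education", "Notas": "n/a"}

missing = validate_mapping(mapping)      # empty list: both required concepts are mapped
modelled = apply_mapping(row, mapping)   # the row, keyed by concept instead of column
```

Because the source file is never renamed or restructured, a publisher could start with only the two required concepts and add further ones to the mapping later, which is the progressive enhancement described above.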

GUIs

Packager

The Packager is the user-facing app for modelling source data into Fiscal Data Packages. Using the Packager, users first get structural and schematic validation of the source files, ensuring that data entering the platform is well formed. They can then model the fiscal concepts in the file in order to publish the data. After the initial modelling, users can also remodel their data sources, taking a progressive enhancement approach to improving the data added to the platform.


The Explorer is the user-facing app for exploration and discovery of data available on the platform.


The Viewer is the user-facing app for building visualisations around a dataset, with a range of options for presentation and for embedding views into third-party websites.


The DataMine is a custom query interface powered by Re:dash for deep investigative work over the database. We’ve included the DataMine as part of the suite of applications as it has proved incredibly useful when working in conjunction with data journalists and domain experts, and also for doing quick prototype views on the data, without the limits of API access, as one can use SQL directly.

APIs

Datastore

The Datastore is a flat file datastore with source data stored in Fiscal Data Packages, providing direct access to the raw data. All other databases are built from this raw data storage, providing us with a clear mechanism for progressively enhancing the database as a whole, as well as building on this to provide such features directly to users.

Analytics and Search

The Analytics API provides a rich query interface for datasets, and the search API provides exploration and discovery capabilities across the entire database. At present, search only goes over metadata, but we have plans to iterate towards full search over all fiscal data lines.

Data Importers

Data Importers are based on a generic data pipelining framework developed at Open Knowledge International called Data Package Pipelines. Data Importers enable us to do automated ETL to get new data into OpenSpending, including the ability to update data from the source at specified intervals.

We see Data Importers as key functionality of the updated platform, allowing OpenSpending to grow well beyond the one thousand plus datasets that have been uploaded manually over the last five or so years, towards tens of thousands of datasets. A great example of how we’ve put Data Importers to use is in the EU Structural Funds data that is part of the Subsidy Stories project.
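As a rough picture of what automated ETL means here, the following Python sketch chains extract, transform, and load steps as generators. It is an assumption-laden illustration and does not use the Data Package Pipelines API: the function names, the sample CSV, and the in-memory list standing in for a database are all invented for this example.

```python
import csv
import io

# Hypothetical sketch of an ETL flow; this is NOT the Data Package Pipelines API.
SOURCE_CSV = "year,amount\n2015,1 000\n2016,2 500\n"  # invented sample data


def extract(text):
    """Extract: yield each CSV record as a dict."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Transform: strip thousands separators and convert values to numbers."""
    for row in rows:
        yield {"year": int(row["year"]), "amount": float(row["amount"].replace(" ", ""))}


def load(rows, store):
    """Load: append rows to an in-memory list standing in for a database."""
    store.extend(rows)
    return store


database = load(transform(extract(SOURCE_CSV)), [])
```

Since each step only consumes and yields rows, the same chain could be re-run on a schedule against a refreshed source, which is the kind of periodic update the text describes.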


It is slightly misleading to announce the launch today, when we’ve in fact been using and iterating on OpenSpending Next for almost 2 years. Some highlights from that process that have led to the platform we have today are as follows.

with Adessium

Adessium provided Open Knowledge International with funding towards fiscal transparency in Europe, which enabled us to build out significant parts of the technical platform, commission work with J++ on Agricultural Subsidies, and engage in a productive collaboration with Open Knowledge Germany on what became, which even led to another initiative from Open Knowledge Germany called The Story Hunt.

This work directly contributed to the technical platform by providing an excellent use case for the processing of a large, messy amount of source data into a normalised database for analysis, and doing so while maintaining data provenance and the reproducibility of the process. There is much to do in streamlining this workflow, but the benefits, in terms of new use cases for the data, are extensive.

We are particularly excited by this work, and the potential to continue in this direction, by building out a deep, open database as a potential tool for investigation and telling stories with data.

via Horizon 2020

As part of the consortium, we were able to both build out parts of the technical platform, and have a live use case for the modularity of the general architecture we followed. A number of components from the core OpenSpending platform have been deployed into the platform with little to no modification, and the analytical API from OpenSpending was directly ported to run on top of a triple store implementation of the data model.

An excellent outcome of this project has been the close and fruitful work with both Open Knowledge Germany and Open Knowledge Greece on technical, community, and journalistic opportunities around OpenSpending, and we plan to continue such collaborations in the future.

Work on Fiscal Data Package with GIFT

Over three phases of work since 2015 (the third phase is currently running), we’ve been developing Fiscal Data Package as a specification to publish fiscal data against. Over this time, we’ve done extensive testing of the specification against a wide variety of data in the wild, and we are iterating towards a v1 release of the specification later this year.

We’ve also been piloting the specification, and OpenSpending, with national governments. This has enabled extensive testing of both the manual modelling of data to the specification using the OpenSpending Packager, and automated ETL of data into the platform using the Data Package Pipelines framework.

This work has provided the opportunity for direct use by governments of a platform we initially designed with civil society and civic tech actors in mind. We’ve identified difficulties and opportunities in this arena at both the implementation and the specification level, and we look forward to continuing this work and solving use cases for users inside government.


Many people have been involved in building the updated technical platform. Work started back in 2014 with an initial architectural vision articulated by our peers Tryggvi Björgvinsson and Rufus Pollock. The initial vision was adapted and iterated on by Adam Kariv (Technical Lead) and Sam Smith (UI/X), with Levko Kravets, Vitor Baptista, and Paul Walsh. We reused and enhanced code from Friedrich Lindenberg. Lazaros Ioannidis and Steve Bennett made important contributions to the code and the specification respectively. Diana Krebs, Cecile Le Guen, Vitoria Vlad and Anna Alberts have all contributed with project management, and feature and design input.

What’s next?

There is always more work to do. In terms of technical work, we have a long list of enhancements.
However, while the work we’ve done in recent years has been very collaborative with our specific partners, and always directed towards identified use cases and user stories in the partnerships we’ve been engaged in, it has not, in general, been community facing. In fact, a noted lack of community engagement goes back to before we started on the new platform we are launching today. This has to change, and it will be an important focus moving forward. Please drop by our forum with any feedback, questions, and comments.

Open Knowledge Foundation: Using the Global Open Data Index to strengthen open data policies: Best practices from Mexico

planet code4lib - Wed, 2017-08-16 09:00

This is a blog post coauthored with Enrique Zapata, of the Mexican National Digital Strategy.

As part of the last Global Open Data Index (GODI), Open Knowledge International (OKI) decided to have a dialogue phase, where we invited individuals, CSOs, and national governments to exchange different points of view, share knowledge about the data, and understand data publication in a more useful way.

In this process, we had a number of valuable exchanges that we tried to capture in our report about the state of open government data in 2017, as well as the records in the forum. Additionally, we decided to highlight the dialogue process between the government and civil society in Mexico and their results towards improving data publication in the executive authority, as well as funding to expand this work to other authorities and improve the GODI process. Here is what we learned from the Mexican dialogue:

The submission process

During this stage, GODI tries to directly evaluate how easy it is to find datasets and how good their quality is in general. To achieve this, civil society and government actors discussed how best to submit and agreed to submit together, based on the actual data availability.

Besides creating an open space to discuss open data in Mexico and agreeing on a joint submission process, this exercise showed some room for improvement in the characteristics that GODI measured in 2016:

  • Open licenses: In Mexico and many other countries, the licenses are linked to datasets through open data platforms. This showed some discrepancies with the sources referenced by the reviewers since the data could be found in different sites where the license application was not clear.
  • Data findability: Most of the requested datasets assessed in GODI are the responsibility of the federal government and are available in Nevertheless, the titles to identify the datasets are based on technical regulation needs, which makes it difficult for data users to easily reach the data.
  • Differences of government levels and authorities: GODI assesses national governments but some of these datasets – such as land rights or national laws – are in the hands of other authorities or local governments. This means that some datasets can’t be published by the federal government, since they are not in its jurisdiction and it can’t make publication of these data mandatory.

Open dialogue and the review process

During the review stage, the Open Data Office of the National Digital Strategy took the feedback into account and worked on some of these issues. They convened a new session with civil society, including representatives from the Open Data Charter and OKI, in order to:

  • Agree on the state of the data in Mexico according to GODI characteristics;
  • Show the updates and publication of data requested by GODI;
  • Discuss paths to publish data that is not the responsibility of the federal government;
  • Converse about how they could continue to strengthen the Mexican Open Data Policy.


The results

As a result of this dialogue, we agreed on six actions that could be implemented internationally, beyond the Mexican context, both by governments with centralised open data repositories and by those that don’t centralise their data, as well as a way to improve the GODI methodology:

  1. Open dialogue during the GODI process: Mexico was the first country to develop a structured dialogue to agree with open data experts from civil society about submissions to GODI. The Mexican government will seek to replicate this process in future evaluations and include new groups to promote open data use in the country. OKI will take this experience into account to improve the GODI processes in the future.
  2. Open licenses by default: The Mexican government is reviewing and modifying their regulations to implement the terms of Libre Uso MX for every website, platform and online tool of the national government. This is an example of good practice which OKI have highlighted in our ongoing Open Licensing research.
  3. “GODI” data group in CKAN: Most data repositories allow users to create thematic groups. In the case of GODI, the Mexican government created the “Global Open Data Index” group in This will allow users to access these datasets based on their specific needs.
  4. Create a link between government built visualization tools and The visualisations and reference tools tend to be the first point of contact for citizens. For this reason, the Mexican government will have new regulations in their upcoming Open Data Policy so that any new development includes visible links to the open data they use.
  5. Multiple access points for data: In August 2018, the Mexican government will launch a new section on to provide non-technical users easy access to valuable data. These data, called ‘Infraestructura de Datos Abiertos MX’, will be divided into five categories that are easy to explore and understand.
  6. Common language for data sets: Government naming conventions aren’t the easiest to understand and can make it difficult to access data. The Mexican government has agreed to change the names to use more colloquial language, which can help data findability and promote use of the data. In case this is not possible with some datasets, the government will go for an option similar to the one established in point 5.

We hope these changes will be useful for data users as well as other governments who are looking to improve their publication policies. Got any other ideas? Share them with us on Twitter by messaging @OKFN or send us an email to


LITA: #NoFilter: Creating Compelling Visual Content for Social Media

planet code4lib - Tue, 2017-08-15 13:55

The #NoFilter series has as its focus the numerous challenges that surround social media and its use in the library. In previous posts, I discussed sources for content inspiration as well as tips for content planning. This entry will concentrate on creating compelling visual content for your library’s social media.

A strong visual component for a social media post is imperative for capturing the attention of users and bringing them into dialogue with the library and forming the relationships that are key to institutional social media success.  Social media is not a one-way self-promotional tool for a library, but rather an interactive space allowing a library to engage meaningfully with users, cultivate their support and kindle their enthusiasm for the library’s work. Quality visual content in a social media post has the potential to spur conversations with/among users who in turn share the library’s content with ever-wider audiences.

Below are three tips for generating compelling visual content for your library’s social media posts:

Recipe card created using Canva for a Honey Bread Recipe from 1915. The card was shared on the Othmer Library’s Pinterest and Tumblr where it elicited numerous user responses.

  1. Craft an aesthetically pleasing design using Canva, the user-friendly web-based graphic design service. Utilizing one of Canva’s social media templates makes the creation process all the more efficient. If graphic design is not your forte, you can complete Canva’s Design Essentials tutorials, which cover everything from fonts and colors to images and backgrounds.
  2. Assemble an infographic to display information in a captivating way. Canva makes this easy with its infographic template. PowerPoint can also be used for this purpose. The Penn Libraries provide excellent instructions for this process on its Infographics Guide.
  3. Bring a static photo or illustration to life with an animated GIF (short for Graphics Interchange Format). At the Othmer Library of Chemical History, we employ Photoshop Elements to create GIFs of images in our rare books and archives. Does using Photoshop seem intimidating to you? The Smithsonian Libraries offer some useful tips and tricks in their 2014 blog post: Library Hacks: Creating Animated GIFs. My colleague also created a handy step-by-step guide for GIF-making: Animated GIF Basics.


What types of visual content do you share on your library’s social media? Do you have any tips for creating compelling visuals? Share them in the comments below!

Eric Lease Morgan: How to do text mining in 69 words

planet code4lib - Tue, 2017-08-15 13:38

Doing just about any type of text mining is a matter of: 0) articulating a research question, 1) acquiring a corpus, 2) cleaning the corpus, 3) coercing the corpus into a data structure one’s software can understand, 4) counting & tabulating characteristics of the corpus, and 5) evaluating the results of Step #4. Everybody wants to do Step #4 & #5, but the initial steps usually take more time than desired.
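Steps 1 through 5 can be sketched with nothing but the Python standard library. This is only one possible reading of the steps, under stated assumptions: the one-sentence corpus is invented, word frequency is chosen as the characteristic to count, and Step 0 (the research question) is assumed to be "which words occur most often?".

```python
import re
from collections import Counter

# Step 1: acquire a corpus (a tiny invented string stands in for real files).
corpus = "The whale! The white whale; Ahab hunted the whale."

# Step 2: clean the corpus: lower-case it and blank out anything but letters and whitespace.
cleaned = re.sub(r"[^a-z\s]", " ", corpus.lower())

# Step 3: coerce the corpus into a data structure (here, a list of word tokens).
tokens = cleaned.split()

# Step 4: count and tabulate characteristics (here, word frequencies).
frequencies = Counter(tokens)

# Step 5: evaluate the results, for example by looking at the most common words.
top_three = frequencies.most_common(3)
```

Even in this toy version, Steps 1 to 3 take as many lines as Steps 4 and 5, and with a real corpus (mixed encodings, OCR errors, markup) they tend to dominate.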

District Dispatch: FCC extends Net Neutrality public comment period to August 30

planet code4lib - Tue, 2017-08-15 12:00

On Friday, the FCC announced it would extend the public comment period on its proposal to roll back a 2015 order protecting net neutrality for an additional two weeks. This phase of the process is supposed to allow for “replies” to arguments raised by other commenters.

With close to 20 million comments in the public record so far, any additional time is useful. It’s worth noting, however, that many advocates have called for the FCC to release the consumer complaints received since the 2015 Open Internet Order went into effect and all documents related to the ombudsperson’s interactions with internet users. The comment extension, while welcome, does not address the fact that the FCC has yet to make public more than 40,000 net neutrality complaints that could provide direct and relevant evidence in response to numerous questions that the FCC poses in this proceeding.

The extra time means more opportunities for the library community to engage. Even if you have already submitted comments, you can do so again “on reply.” Here are a few easy strategies:

  • Submit a comment amplifying the library and higher education principles for an open internet.
  • You can cite to specific examples or arguments in the initial comments submitted by ALA and allies earlier in the proceeding.
  • Thousands of librarians and library staff from across the country have filed comments on their own or via the ALA’s action alert. Members of the library community called on the FCC to keep the current net neutrality rules and shared their worries that an internet with “slow lanes” would hurt libraries and the communities they serve. The comments below offer a few examples and may help with your comments:
    • The New Jersey Library Association submits: “Abandoning net neutrality in favor of an unregulated environment where some content is prioritized over other content removes opportunities for entrepreneurs, students and citizens to learn, grow and participate in their government. It will further enhance the digital divide and severely inhibit the ability of our nation’s libraries to serve those on both sides of that divide.”
    • “If net neutrality is to be abolished, then our critical online services could be restricted to ‘slow lanes’ unless we pay a premium,” wrote John, a public library employee in Georgia. “These include our job and career gateway, language learning software, grant finding, medical information, ebooks, and test preparation guides, such as for the GED and ASVAB. Ending net neutrality would hurt the people who need equal access the most. These people use our career gateway to find jobs, our grant finder to support their businesses and nonprofits, and use our test aids to earn their GED or get into the military. If we were forced to pay a premium to access these resources, it will limit our ability to fund our other programs and services.”
    • Catherine, a reference librarian at a major university in Oregon writes, “I [have] learned that imaginative online searching is an invaluable research tool for personal, professional, and scholarly interests. Yes, going online can be fun, but the internet must not be considered a plaything. Access must not be restricted or limited by corporate packaging.”
    • Hampton, a chief executive officer of a public library system in Maryland, wrote about all the functions and services of the modern library dependent on reliable, unfettered internet access: “In our library, we offer downloadable eBooks, eMagazines, and eAudiobooks as well as numerous databases providing courses through, language learning through Rosetta Stone, 365-days-a-year tutoring for kindergarten through adult with BrainFuse, and many more resources online. We have public computers with internet access as well as free WiFi in our fifteen libraries extending Internet access to thousands of customers who bring their tablets and smartphones to the library. We work with customers to help them in the health care marketplace, with applications for Social Security and jobs, and every conceivable use of the internet. Obviously, being relegated to lower priority internet access would leave our customers in a very difficult position.”
    • Others wrote with concerns about the need for access to information for democracy to thrive, like Carrie, an information professional from Michigan: “The internet is not merely a tool for media consumption, but is also a means of free expression, a resource for education, and most importantly, an implement of democracy. I will not mince words: Allowing corporations to manipulate the flow of information on the internet is not the way forward. An end to net neutrality would hurt businesses large and small, inhibit the free flow of speech online, and allow telecommunications corporations to unjustly interfere with market forces.”

Stay tuned via the District Dispatch and American Libraries blog posts.

The post FCC extends Net Neutrality public comment period to August 30 appeared first on District Dispatch.

