Feed aggregator

LITA: New and Upcoming Titles in the LITA Guide Series

planet code4lib - Wed, 2017-04-12 15:00

Here are five exciting recent and upcoming titles on library technology. The LITA Guide Series books from Rowman and Littlefield contain practical, up-to-date, how-to information and are usually under 100 pages. Proposals for new titles can be submitted to the Acquisitions editor using this link.

LITA members receive a 20% discount on all the titles. To get that discount, use promotion code RLLITA20 when ordering from the Rowman and Littlefield LITA Guide Series web site.

Here are the current new LITA Guide Series titles:


Using Social Media to Build Library Communities: A LITA Guide
Edited by Scott W.H. Young and Doralyn Rossman (September 2017)

Managing Library Technology: A LITA Guide
Carson Block (August 2017)

The LITA Leadership Guide: The Librarian as Entrepreneur, Leader, and Technologist
Edited by Carl Antonucci and Sharon Clapp (May 2017)

Protecting Patron Privacy: A LITA Guide
Edited by Bobbi Newman and Bonnie Tijerina (May 2017)

Managing the Digital You: Where and How to Keep and Organize Your Digital Life
Melody Condron (February 2017)

LITA publications help to fulfill its mission to educate, serve and reach out to its members, other ALA members and divisions, and the entire library and information community through its publications, programs and other activities designed to promote, develop, and aid in the implementation of library and information technology.

David Rosenthal: Identifiers: A Double-Edged Sword

planet code4lib - Wed, 2017-04-12 15:00
This is the last of my posts from CNI's Spring 2017 Membership Meeting. Predecessors are Researcher Privacy, Research Access for the 21st Century, and The Orphans of Scholarship.

Geoff Bilder's Open Persistent Identifier Infrastructures: The Key to Scaling Mandate Auditing and Assessment Exercises was ostensibly a report on the need for and progress in bringing together the many disparate identifier systems for organizations in order to facilitate auditing and assessment processes. It was actually an insightful rant about how these processes were corrupting the research ecosystem. Below the fold, I summarize Geoff's argument (I hope Geoff will correct me if I misrepresent him) and rant back.

The non-rant part of Geoff's talk started from the premise that researchers and their institutions are increasingly subjected by funders and governments to assessments, such as the UK's Research Excellence Framework, and mandates, such as the Wellcome Trust's open access mandate. Compliance with the mandates has been generally poor.

Assessing how poor, and assessing the excellence of research, both require an ample supply of high-quality metadata, which in principle Crossref is in a good position to supply. To assess research productivity, three main types of identifier are needed: content, contributor, and organization. Geoff used this three-legged stool image to show that:
The rant part was not about what identifiers are, but about what they are used for. It took off from Geoff's question as to whether the audience thought that the pressure-cooker of all these assessments was likely to lead to greater creativity.

I have a great counter-example. The physicist G. I. Taylor (my great-uncle) started in 1909 with the classic experiment which showed that interference fringes were still observed at such low intensity that only a single photon at a time was in flight. The following year at age 23 he was elected a Fellow of Trinity College, and apart from a few years teaching, he was able to pursue research undisturbed by any assessment for the next 6 decades. Despite this absence of pressure, he was one of the 20th century's most productive scientists, with four huge volumes of collected papers over a 60-year career.

Papers/year (linear)

Since the assessments are all based on counting the number of peer-reviewed publications meeting certain criteria, one result has been gradually accelerating exponential growth in the number of peer-reviewed publications. But it is clear that More Is Not Better, in which I wrote:
The Economist's Incentive Malus, ... is based on The natural selection of bad science by Paul E. Smaldino and Richard McElreath, which starts:
Poor research design and data analysis encourage false-positive findings. Such poor methods persist despite perennial calls for improvement, suggesting that they result from something more than just misunderstanding. The persistence of poor methods results partly from incentives that favour them, leading to the natural selection of bad science. This dynamic requires no conscious strategizing—no deliberate cheating nor loafing—by scientists, only that publication is a principal factor for career advancement....
The Economist reports Smaldino and McElreath's conclusion is bleak:
that when the ability to publish copiously in journals determines a lab’s success, then “top-performing laboratories will always be those who are able to cut corners”—and that is regardless of the supposedly corrective process of replication.

Papers/year (log-linear)

Only two things have interrupted this explosion of publishing: wars and depressions. Geoff and I are both concerned that recent political events in several of the leading research countries will lead to significant cuts in public funding for research, and thus increase the pressure in the cooker. Research suggests that this will lead to higher retraction rates and more misconduct, further eroding the waning credibility of science. As Arthur Caplan (of the Division of Medical Ethics at NYU's Langone Medical Center) put it:
The time for a serious, sustained international effort to halt publication pollution is now. Otherwise scientists and physicians will not have to argue about any issue—no one will believe them anyway.(see also John Michael Greer).

Post-PhD science career tracks

In 2010 the Royal Society issued a report on research with much valuable information. Alas, it is more relevant today than it was then, because the trends it identified have continued unabated. Geoff took from this report a graphic that illustrates how insanely competitive academia is as a career path. It shows that over half the newly minted Ph.D.s leave science immediately. Only one in five make research a career, and less than one in two hundred make professor. Geoff is concerned that Ph.D. programs are too focused on the one and not enough on the other one hundred and ninety-nine, and I agree. My friend Doug Kalish has built a business in retirement addressing this issue.

My Ph.D. was in Mechanical Engineering, so I'm not a scientist in the sense the Royal Society uses. I was a post-doc (at Edinburgh University) and then research staff (at Carnegie-Mellon University) before moving to industry (initially at Sun Microsystems) and eventually back to academia (at Stanford). I've published quite a bit both from academia and from industry but I was never in the publish or perish rat-race. I was always assessed on the usefulness of the stuff I built; the pressures in engineering are different.

Research funding flows

My take on the graph above is a bit different from Geoff's. I start from another graphic from the Royal Society report, showing the flow of funds in UK research and development, which includes much engineering (or its equivalent in the life sciences). Note that "private and industrial research" consumes nearly twice the funding of "university" and "public research institutions" combined. So one would expect about 2/3 of the Ph.D.s to be outside the universities and public research institutions. The split in the graphic Geoff used is 4/5, but one would expect that including engineering would lead to more retention of Ph.D.s. It is easier to fund engineering research in Universities than science because it has more immediate industrial application.

Besides, Ph.D.s leaving academia for industry is a good thing. Most of the "engineers" I worked with at my three successful Silicon Valley startups had Ph.D.s in physics, mathematics and computer science, not engineering. My Mech. Eng. Ph.D. was an outlier. Silicon Valley would not exist but for Ph.D.s leaving research to create products in industry.

FOSS4Lib Upcoming Events: JHOVE Online Hack Day Spring 2017

planet code4lib - Wed, 2017-04-12 14:47
Date: Wednesday, April 26, 2017 - 09:00 to 17:00
Supports: JHOVE

Last updated April 12, 2017. Created by Peter Murray on April 12, 2017.

JHOVE Online Hack Day Spring 2017 details and registration link.

Terry Reese: MarcEdit Updates

planet code4lib - Wed, 2017-04-12 13:59

I’ve been working on a few updates the past couple weeks, the most time consuming of these being updates related to Alma.  Here’s the full list:


6.2.501
* Enhancement: OpenRefine Import/Export Formatting Updates [tested on 2.7 rc 1 & 2]
* Enhancement: MarcEditor -- Right-to-Left and Left-to-Right language improvements when reading in the Editor [see: for more info]

6.2.500
* Enhancement: ILS Integrations (Alma): Alma Holdings Editing is now available.
* Enhancement: Validator Updates - added some code to tighten the duplication reporting and added better responsiveness when running [i.e., statuses, etc.])
* Enhancement/Bug Fix: MarcEngine Updates: Specifically, in the XML space, I was running into some Chinese characters that were not being recognized as UTF8. This can happen for characters on the fringes of a characterset or have alternative characters. I made some edits that ensured that any character checking would fall through to a secondary process.
* Enhancement: Task Management -- I cleaned up a few odds and ends to make managing a bit easier. This includes exposing error messages when they popup, enabling multiple task deletion, etc.
* Enhancement: Task Management Preferences -- This came up where MarcEdit was having trouble identifying a file path as valid. I'm going to check these paths as soon as they are selected and report if they are valid right away. This way -- if there is a problem, you know right away.


2.3.5
* Enhancement: ILS Integrations (Alma): Alma Holdings Editing is now available.
* Enhancement: Validator Updates - added some code to tighten the duplication reporting and added better responsiveness when running [i.e., statuses, etc.])
* Enhancement/Bug Fix: MarcEngine Updates: Specifically, in the XML space, I was running into some Chinese characters that were not being recognized as UTF8. This can happen for characters on the fringes of a characterset or have alternative characters. I made some edits that ensured that any character checking would fall through to a secondary process.
* Enhancement: Task Management -- I cleaned up a few odds and ends to make managing a bit easier. This includes exposing error messages when they popup, enabling multiple task deletion, etc.
* Enhancement: Task Management Preferences -- This came up where MarcEdit was having trouble identifying a file path as valid. I'm going to check these paths as soon as they are selected and report if they are valid right away. This way -- if there is a problem, you know right away.
* Enhancement: OpenRefine Import/Export Formatting Updates [tested on 2.7 rc 1 & 2]
* Enhancement: MarcEditor -- Right-to-Left and Left-to-Right language improvements when reading in the Editor [see: for more info]

I believe that there will need to be a couple of additional updates around the Alma work, particularly when creating new holdings, as I’m not seeing the 001/004 pairs added to the records, but this is the start of that work. Also, I did some work around right-to-left, left-to-right rendering when working with mixed languages. See:

Questions, let me know.



Terry Reese: MarcEditor Changes and Right-to-left displays

planet code4lib - Wed, 2017-04-12 13:52

So, this was a tough one. MarcEdit has a right-to-left data entry mode that was created primarily for users who are creating bibliographic records primarily in a right-to-left language. But what happens when you are mixing data in a record from left-to-right languages like English and right-to-left languages like Hebrew? Well, in the display, odd things happen. This is because of what the operating system does when rendering the data. The operating system assumes certain data belongs to the right-to-left string, and then moves data in a way that it thinks it should render. Here’s an example:

In this example, the $a$0 are displayed side-by-side, but this is just a display issue. Underneath, the data is really correct. If you compiled this data or loaded it into an ILS, the data would parse correctly (though how it displayed would be up to the ILS’s support of the language). This is confusing, but unfortunately it is one of the challenges of working with records in a notepad-like environment.

Now, there is a solution that can solve the display problem. There are two Unicode characters, 0x200E and 0x200F: the left-to-right mark and the right-to-left mark. These can be embedded in the display to render characters more appropriately. They only show up in the display (i.e., they are added when reading into the display) and are not preserved in the MARC record. They help to alleviate some of these problems.


The way that this works: when the program identifies that it’s working with UTF8 data, the program will screen the text for characters that have a byte indicating that they should be rendered RTL. The program will then embed an RTL marker at the beginning of the string and an LTR marker at the end of the string. This gives the operating system instructions as to how to render the data, and I believe helps to solve this issue.
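The marker approach described above can be sketched in a few lines of Python. This is only an illustration, not MarcEdit's code: the function names are hypothetical, the check uses the Unicode bidirectional class of each character rather than the byte-level test mentioned above, and it assumes the markers are applied to one string at a time.

```python
import unicodedata

LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK
RLM = "\u200f"  # U+200F RIGHT-TO-LEFT MARK

# Bidirectional classes of strongly right-to-left characters
# (R covers Hebrew, AL covers Arabic letters).
RTL_CLASSES = {"R", "AL"}


def contains_rtl(text: str) -> bool:
    """Return True if any character in text is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def mark_for_display(text: str) -> str:
    """Hypothetical helper: wrap text that contains RTL characters with an
    RTL mark at the start and an LTR mark at the end, for display only."""
    if contains_rtl(text):
        return RLM + text + LRM
    return text


def strip_display_marks(text: str) -> str:
    """Remove the display-only marks so they never reach the saved record."""
    return text.replace(LRM, "").replace(RLM, "")
```

Calling `strip_display_marks` before saving mirrors the point that the markers live only in the display and are not preserved in the MARC record.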




Open Knowledge Foundation: Launch: promotes transparency within the educational system in Germany

planet code4lib - Wed, 2017-04-12 08:41

This blog was written by Moritz Neujeffski, School of Data Germany team.

School of Data Germany, a project by Open Knowledge Foundation Deutschland, helps non-profit organisations, civil rights defenders and activists to understand and use data and technology effectively to increase their impact on societal challenges. Profound knowledge in processing data allows individuals and organisations to critically reflect and to influence public debates with evidence-based arguments. The project is the outcome of our first partnership with BildungsCent eV. Together we explored the programs schools in Germany offer students besides general lessons and advocated for a transparent German education system. While we definitely learned a lot about the school system in Germany, we provided specially tailored workshops for BildungsCent eV. We addressed how to clean, analyse and visualise data and what pitfalls to look out for in digital projects.

Education is more than school lessons. Character and drive often develop outside the classroom. Public information on schools in Germany is sparse and not often available in a structured and organised format. Together with BildungsCent eV., we investigated the availability and access of data on schools in Germany.

The focus of our investigation: How is data on schools best communicated to the public? How does that affect the potential of schools to be important social hubs?

Findings of our analysis:

Parents, students, teachers, politicians, and civil society organisations benefit from the enhanced information on the German school system that the project provides. School of Data Germany and BildungsCent eV. campaigned for more transparency in the educational sector and promoted dialogues between stakeholders in educational policy. We also provided an overview of more than 30,000 schools of general education in Germany.

The interactive map makes it possible to search for and filter according to specific school types. The educational sector differs among the 16 German federal states. We gathered information on the development of each individual school system, public spending within the educational sector, and the employment situation of teachers for each state.

Moreover, 3,000 profiles for schools in Berlin and Saxony were set up, containing their mission statements, the number of students and teachers per school, study groups, and cooperations between schools and actors from civil society, public departments, the private sector and other relevant stakeholders. All the data used in the project is available as open data on our website.

Our aim is to facilitate the use of educational data by journalists, politicians, scientists, the civic tech community, and stakeholders of educational policy.

Concluding remarks on school activities & cooperations in Berlin and Saxony
  • 413 out of 800 general education schools in Berlin communicate their activities to the Ministry of Education, Youth and Family.
  • On average, they provide eight activities in at least four areas such as environment, literature, handcraft, and technology besides regular lessons.
  • In Saxony, 1206 out of 1500 schools of general education report to the statistical office.
  • In total, they offer 11,600 activities. On average, this amounts to ten activities in five different areas per school.
  • Sporting activities are most prominent in both federal states. Partners from civil society and public affairs are the most common among schools in both states.

Schools promote the well-being and development of children and adolescents through diverse projects, partners, and activities. They are an important component of the livelihood and learning environment of students and provide an important perspective on society.

To establish a holistic picture of the German school system and to increase transparency and the ability to compare federal states on educational matters, data has to be better collected, organised, and structured at the state level. Administrations especially need to improve their performance in order to foster an effective debate on the German school system.


OCLC Dev Network: Upcoming Backward Incompatible Changes to WMS APIs

planet code4lib - Tue, 2017-04-11 19:00

OCLC will be installing an update to several WMS APIs on July 9, 2017 which contains backward-incompatible changes.

District Dispatch: Panels announced for National Library Legislative Day 2017

planet code4lib - Tue, 2017-04-11 18:59

There are just 20 days until National Library Legislative Day, and the speaker lineup is our best yet! You’ve likely already heard that Hina Shamshi from the ACLU will be joining us as our keynote speaker. Now check out some of the other panels we have planned:

The Political Dance

  • Jennifer Manley
    Managing Director, Prime Strategies NYC
  • Christian Zabriskie
    Executive Director, Urban Librarians Unite; Branch Administrator, Yonkers Public Library

At times government relations feels like a complicated tango filled with intricate footwork and precise timing. This conversation between political activist Christian Zabriskie and Government Relations and Communications Consultant Jennifer Manley will cover a huge range of topics including navigating the new abnormal in Washington, being unafraid to play the game, and how to leverage the press and social channels for your government relations efforts. Buckle up, it’s gonna be a fast talking roller coaster of wonky fun.

Speaking Truth to Power (and Actually Being Heard!)

  • Brian Jones – Partner, Black Rock Group
  • Tina Pelkey – Senior Vice President, Black Rock Group

William Carlos Williams was a poet, not a lobbyist, but he was on to something when he said: “It is not what you say that matters but the manner in which you say it; there lies the secret of the ages.” Well, we’re not sure about that secret to the ages part, but we guarantee that speaking “truth to power” is a whole lot easier and ultimately more successful when you speak Power’s language. Learn how to – and how not to – make libraries’ best case when you “hit the Hill” on May 2nd and after you get home.

Libraries Ready to Code

  • Marijke Visser – Associate Director, Office for Information Technology Policy
  • Other speakers TBD

Come to this program to learn about the great promise of coding in libraries. Programs in libraries bring opportunity to youth to learn about and develop skills not only in coding, but also in the broader computational thinking behind coding. For advocacy, the story of library-based coding programs positions libraries as key institutions to prepare youth to consider and pursue STEM and many other careers based on computing and tech.

Democracy dies in darkness: helping editorial boards shed light on issues facing your community

  • Molly Roberts – Digital Producer for Opinions, The Washington Post
  • Gina Millsap – Chief Executive Officer, Topeka & Shawnee County Public Library (KS)

The Washington Post’s new motto echoes a truth librarians live by: an informed citizenry is necessary for democracy to thrive. What does that mean for the collective opinion voice of a major news outlet? How can library professionals help shed light on community issues for editorial boards? Learn how editorial boards take positions and why librarians need to be at the discussion table.

Interested in taking part in National Library Legislative Day, but unable to come to D.C. yourself? Register to participate digitally, and sign up for our Thunderclap.


David Rosenthal: The Orphans of Scholarship

planet code4lib - Tue, 2017-04-11 15:00
This is the third of my posts from CNI's Spring 2017 Membership Meeting. Predecessors are Researcher Privacy and Research Access for the 21st Century.

Herbert Van de Sompel, Michael Nelson and Martin Klein's To the Rescue of the Orphans of Scholarly Communication reported on an important Mellon-funded project to investigate how all the parts of a research effort that appear on the Web other than the eventual article might be collected for preservation using Web archiving technologies. Below the fold, a summary of the 67-slide deck and some commentary.

The authors show that sites such as GitHub, wikis, WordPress, etc. are commonly used to record artifacts of research, but that these sites do not serve as archives. Further, the artifacts on these sites are poorly preserved by general Web archives. Instead, they investigate the prospects for providing institutions with tools they can use to capture their own researchers' Web artifacts. They divide the problem of ingesting these artifacts for preservation into four steps.

First, discovering a researcher's Web identities. This is difficult, because the fragmented nature of the research infrastructure leads to researchers having accounts, and thus identities, at many different Web sites (ORCID, GitHub, ResearchGate, ...). There's no consistent way to discover and link them. They discuss two approaches:
  • EgoSystem, developed at LANL, takes biographical information about an individual and uses heuristics to find Web identities for them in a set of target Web sites such as Twitter and LinkedIn.
  • SourceMining ORCID for identities. Ideally, researchers would have ORCID IDs and their ORCID profiles would point to their research-related Web identities. Alas, ORCID's coverage outside the sciences, and outside the US and UK, is poor, and there is no standard for the information included in ORCID profiles.
Second, discovering artifacts per Web identity. This is easier. Once you have a researcher's Web identities, conventional Web searching and page analysis techniques can harvest artifact links quite effectively. However, there is potentially a serious problem of over-collection. For example, which of the images in a researcher's Flickr account are research-related as opposed to vacation-related?

Third, determining the Web boundary per artifact. This is the domain of Signposting, which I wrote about here. The issues are very similar to those in Web Infrastructure to Support e-Journal Preservation (and More) by Herbert, Michael and myself.

Fourth, capturing artifacts in the artifact's Web boundary. After mentioning the legal uncertainties caused by widely varying license terms among the sites hosting research artifacts, a significant barrier in practice, they show that different capture tools vary widely in their ability to collect usable Mementos of artifacts from the various sites. Building on Not all mementos are created equal: measuring the impact of missing resources, they describe a system for automatically scoring the quality of Mementos. This may be applicable to the LOCKSS harvest ingest pipeline; the team hopes to evaluate it soon.

The authors end on the question of how the authenticity of Mementos can be established. What does it mean for a Memento to be authentic? In an ideal world it would be that it was the same as the content of the Web site. But, even setting aside the difficulties of collection, in the real world this isn't possible. Web pages are different every time they are visited; the Platonic ideal of "the content of the Web site" doesn't exist.

The authors mean by "authentic" that the content obtained from the archive by a later reader is the same as was originally ingested by the archive; it hasn't been replaced by some evil-doer during its stay in the archive. They propose to verify this via a separate service recording the hashes of Mementos obtained by the archive at the time of ingest, perhaps even a blockchain of them.
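A minimal sketch of that idea, assuming SHA-256 as the hash and an in-memory dictionary standing in for the separate service (the class and method names are hypothetical and are not taken from the talk):

```python
import hashlib


class HashRegistry:
    """Hypothetical stand-in for a separate service that records the hashes
    of Mementos at the time the archive ingests them."""

    def __init__(self) -> None:
        self._digests: dict[str, str] = {}

    def record(self, memento_id: str, content: bytes) -> str:
        """Store the digest at ingest time; the first recording wins, so a
        later caller cannot silently overwrite it."""
        digest = hashlib.sha256(content).hexdigest()
        return self._digests.setdefault(memento_id, digest)

    def verify(self, memento_id: str, content: bytes) -> bool:
        """Check that content later served by the archive matches the digest
        recorded at ingest. Unknown Mementos fail verification."""
        expected = self._digests.get(memento_id)
        if expected is None:
            return False
        return hashlib.sha256(content).hexdigest() == expected
```

The check only works on the unmodified bytes of the Memento, and it can only show that the content has not changed since ingest; it says nothing about whether the ingested content was right in the first place.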

There are a number of tricky issues here. First, it must be possible to request an archive to deliver an unmodified Memento, as opposed to the modified ones that the Wayback Machine (for example) delivers, with links re-written, the content framed, etc.  Then there are the problems associated with any system that relies on stored hashes for long-term integrity. Then there is the possibility that the evil-doer was doing evil during the original ingestion, so that the hash stored in the separate service is of the replacement, not the content obtained from the Web site.

LITA: #NoFilter: Social Media Planning for the Library

planet code4lib - Tue, 2017-04-11 14:19

The #NoFilter series explores some of the challenges and concerns that accompany a library’s use of social media. In my January 2017 post, I discussed the importance of generating thoughtful and diverse social media content in order to attract users and stimulate discussion of the library’s materials and services.

Part and parcel of the content generation process is planning. Wouldn’t it be great if social media wasn’t something the library had to think about in depth? If all of the content for various platforms could just be created on the fly, a content generation process seamlessly integrated into every staff member’s workflow? It’s a beautiful idea and it does happen this way at times. For example, you are walking through your library and you come across some stunning afternoon light pouring through a window. You take out your phone, snap a picture, and share it on Instagram or another platform. Done!

Photo taken while shelving in the Othmer Library’s Reading Room, Philadelphia, PA

However, the reality is that there are time constraints on library staff. Social media is often just one more task heaped onto a staff member’s already full plate. Spontaneous social media content isn’t always possible. To ensure that social media is carried out in a meaningful way and on a regular basis, a balance must be struck between it and the other requirements of one’s position. Hence the need for planning not only the topics of posts, but also who is responsible for such posts.

In my library (the Othmer Library of Chemical History), social media planning takes the following form: our team of seven meets once a month  to discuss content for the coming month. This meeting generally takes 30 minutes, on rare occasions an hour. We come to the table with historical dates (for us, it’s mostly dates pertaining to the history of science field), famous birthdays, fun days such as Record Store Day, and holidays. We also discuss campaigns such as #ColorOurCollections as well as general themes like Women’s History Month (March) or National Cookie Month (October). We discuss what we have in our collections that relates to these days and themes. Team members then volunteer to create content for particular days. We keep track of all these elements (content ideas, post-meeting brainstorming about these ideas, and those responsible for creating posts) using Trello, an online project management tool. I will delve into all of the details of our Trello boards in a future post.

As a result, we are able to produce social media content consistently and in a way that isn’t taxing on staff.

Less stress through planning = Happy staff who are enthusiastic about contributing to the library’s social media efforts = Fun and varied content for users to engage with online.

What does social media planning look like in your library? Share your experience in the comments below!

Open Knowledge Foundation: Frictionless Data Case Study:

planet code4lib - Tue, 2017-04-11 11:18

Open Knowledge International is working on the Frictionless Data project to remove the friction in working with data. We are doing this by developing a set of tools, standards, and best practices for publishing data. The heart of Frictionless Data is the Data Package standard, a containerization format for any kind of data based on existing practices for publishing open-source software.

We’re curious to learn about some of the common issues users face when working with data. In our Case Study series, we are highlighting projects and organisations who are working with the Frictionless Data specifications and tooling in interesting and innovative ways. For this case study, we interviewed Bryon Jacob of More case studies can be found at

How do you use the Frictionless Data specs and what advantages did you find in using the Data Package approach?

We deal with a great diversity of data, both in terms of content and in terms of source format – most people working with data are emailing each other spreadsheets or CSVs, and not formally defining schema or semantics for what’s contained in these data files.

When ingests tabular data, we “virtualize” the tables away from their source format, and build layers of type and semantic information on top of the raw data. What this allows us to do is to produce a clean Tabular Data Package[^Package] for any dataset, whether the input is CSV files, Excel Spreadsheets, JSON data, SQLite Database files – any format that we know how to extract tabular information from – we can present it as cleaned-up CSV data with a datapackage.json that describes the schema and metadata of the contents.
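As a rough illustration of what a datapackage.json for cleaned-up CSV data might contain, here is a small Python sketch. It is not the interviewee's implementation: the function names and the type inference are hypothetical and deliberately crude, and the descriptor keys (`name`, `profile`, `resources`, `path`, `schema`, `fields`) follow our reading of the Tabular Data Package specification, which should be checked before relying on them.

```python
import csv
import io
import json


def infer_type(values):
    """Crude, hypothetical type inference: integer, then number, else string."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if values and all_parse(int):
        return "integer"
    if values and all_parse(float):
        return "number"
    return "string"


def describe_csv(name, path, csv_text):
    """Build a Tabular Data Package style descriptor for one CSV resource."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fieldnames = reader.fieldnames or []
    fields = [
        {"name": column, "type": infer_type([row[column] for row in rows])}
        for column in fieldnames
    ]
    return {
        "name": name,
        "profile": "tabular-data-package",
        "resources": [
            {
                "name": name,
                "path": path,
                "profile": "tabular-data-resource",
                "schema": {"fields": fields},
            }
        ],
    }


def to_datapackage_json(descriptor):
    """Serialize the descriptor as the text of a datapackage.json file."""
    return json.dumps(descriptor, indent=2)
```

Writing the output of `to_datapackage_json` next to the CSV file at `path` gives a package whose schema and metadata travel with the data.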

What else would you like to see developed?

Graph data packages, or “Universal Data Packages” that can encapsulate both tabular and graph data. It would be great to be able to present tabular and graph data in the same package and develop tools that know how to use these things together.

To elaborate on this, it makes a lot of sense to normalize tabular data down to clean, well-formed CSVs. For data that is more graph-like, it would also make sense to normalize it to a standard format. RDF is a well-established and standardized format, with many serialized forms that could be used interchangeably (RDF XML, Turtle, N-Triples, or JSON-LD, for example). The metadata in the datapackage.json would be extremely minimal, since the schema for RDF data is encoded into the data file itself. It might be helpful to use the datapackage.json descriptor to catalog the standard taxonomies and ontologies that were in use; for example, it would be useful to know if a file contained SKOS vocabularies or OWL classes.

What are the next things you are going to be working on yourself?

We want to continue to enrich the metadata we include in Tabular Data Packages exported from, and we’re looking into using datapackage.json as an import format as well as export.

How do the Frictionless Data specifications compare to existing proprietary and nonproprietary specifications for the kind of data you work with?

works with lots of data across many domains – what’s great about the Frictionless Data specs is that it’s a lightweight content standard that can be a starting point for building domain-specific content standards – it really helps with the “first mile” of standardising data and making it interoperable.

What do you think are some other potential use cases?

In a certain sense, a Tabular Data Package is sort of like an open-source, cross-platform, accessible replacement for spreadsheets that can act as a “binder” for several related tables of data. I could easily imagine web or desktop-based tools that look and function much like a traditional spreadsheet, but use Data Packages as their serialization format.

Who else do you think we should speak to?

Data science IDE (Integrated Development Environment) producers – RStudio, Rodeo (Python), Anaconda, Jupyter – anything that operates on Data Frames as a fundamental object type should provide first-class tool and API support for Tabular Data Packages.

What should the reader do after reading this Case Study?

To read more about Data Package integration at, read our post: Try This: Frictionless Sign up, and start playing with data.


Have a question or comment? Let us know in the forum topic for this case study.

DuraSpace News: Bethany Seeger–Connecting Around Fedora Migration and Mapping

planet code4lib - Tue, 2017-04-11 00:00

The Fedora repository project relies on many individuals and institutions to make the project successful. We are grateful for their commitment and will showcase their contributions in a series of community profiles aimed at recognizing our contributors’ achievements, and introducing them to the rest of the community.

DuraSpace News: JOIN Fedora at ELAG2017 Athens

planet code4lib - Tue, 2017-04-11 00:00

Austin, TX – The 41st European Library Automation Group (ELAG) Systems Seminar will be held at the National Technical University of Athens, in Athens, Greece, from June 6 to 9, 2017. If you will be traveling to ELAG2017, please join us at the Fedora Bootcamp, “Automating Digital Curation Workflows with Fedora,” on June 6.

John Miedema: I Tried to Walk Away from Lila but Good Ideas are Persistent

planet code4lib - Mon, 2017-04-10 23:57

Remember Lila? Did you think I had abandoned her? If you did not follow my earlier blog you might be a little confused. Lila is not a live person. Lila was a conceptual design for a “cognitive writing technology,” natural language processing software to aid with reading and writing. It was a complex and consuming project. I tried to walk away from Lila but good ideas are persistent. Below you see a screenshot of a more basic project, a tool for analyzing individual After Reading essays and comparing them to the whole work.

The user interface is comparable to Voyant Tools by Stéfan Sinclair & Geoffrey Rockwell. Lila 0.1 has unique functions:

  1. On a Home screen a user gets to enter an essay. Lila 0.1 is intended to accept the text of individual essays created by me for After Reading. An Analyze button begins the natural language processing that results in the screen above. The text is displayed, highlighting one paragraph at a time as the user scrolls down.
  2. The button set provides four functions. The Home button is for navigation back to the Home screen. The Save button allows the user to save an essay with analytics to a database to build an essay set or corpus. The Documents button navigates to a screen for managing the database. The Settings button navigates to a screen that can adjust configurations for the analytics.
  3. The graph shows the output of natural language processing and analytics for a “Feeling” metric, an aggregate measure based on sentiment, emotion and perhaps other measures. The light blue shows the variance in Feeling across paragraphs. The dark blue straight line shows the aggregate value for the document. The user can see how Feeling varies across paragraphs and in comparison to the whole essay. Another view will allow for comparison of single essays to the corpus.
  4. The user can choose one of several available metrics to be displayed on the graph.
    • Count. The straight count of words.
    • Frequency. The frequency of words.
    • Concreteness. The imagery and memorability of words. A personal favourite.
    • Complexity. Ambiguity or polysemy, i.e., words with multiple meanings. Synonymy or antonymy. A measure of the readability of the text. Complexity can also be measured for sentences, e.g., number of conjunctions, and for paragraphs, e.g., number of sentences.
    • Hyponymy. A measure of the abstraction of words.
    • Metaphor. I am evaluating algorithms that identify metaphors.
    • Form. Various measures are available to measure text quality, e.g., repetition.
    • Readability by grade level.
    • Thematic presence can be measured by dictionary tagging of selected words related to the work’s theme.
  5. All metrics are associated with individual words. Numeric values will be listed for a subset of the words.
  6. Topic Cloud. A representation of topics in an essay will be shown.
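
The per-paragraph versus whole-document comparison in the graph can be sketched with the simplest metric in the list, Count. This is only an illustration, not Lila's implementation: the function name is hypothetical, paragraphs are assumed to be separated by blank lines, and a plain word count stands in for the richer metrics.

```shell
# Hypothetical helper: prints "paragraph-number word-count" for each
# paragraph, then the mean across paragraphs as the document-level value.
paragraph_word_counts() {
  # RS = "" puts awk in paragraph mode: blank lines separate records,
  # and NF is then the number of whitespace-separated words in each one.
  awk 'BEGIN { RS = "" }
       { print NR, NF; total += NF }
       END { if (NR) printf "mean %.1f\n", total / NR }' "$1"
}
```

Run as `paragraph_word_counts essay.txt`, the per-paragraph lines correspond to the varying light blue series and the final mean to the straight dark blue line.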

The intention is to help a writer evaluate the literary quality of an essay and compare it to the corpus. A little bit like spell-check and grammar-check, but packed with literary smarts. Where it is helpful to be conscious of conformity and variance, e.g., author voice, Lila can help. It is a modest step in the direction of an artificial intelligence project that will emerge in time. Perhaps one day Lila will live.

District Dispatch: Copyright First Responders webinar now available

planet code4lib - Mon, 2017-04-10 18:35

Celebrating National Library Week by introducing our new fair use coasters! Each one describes one of the factors of fair use. (We think copyright education should be fun.) Collect all four at ALA’s Annual conference this summer.

If you missed last week’s CopyTalk “Copyright First Responders” webinar, it’s alright – we have an archived copy!

Kyle Courtney of Harvard University’s Office for Scholarly Communication talked about the development of a decentralized model of copyright expertise in an academic setting: the Copyright First Responders (CFR) program. We know that copyright guidance is needed more now than ever before, and it is impossible for one lone copyright specialist or scholarly communications librarian to reach every academic department. The CFR program starts with a subject specialist and then adds on copyright expertise through a rigorous training model developed by Kyle. After taking the course, the subject specialist is ready to address the more basic queries of their department faculty. The more difficult questions are forwarded on to the more experienced level of CFRs and then, if necessary, on to Kyle himself.

Hey, why shouldn’t every librarian have a bit of merriment with copyright? Listen to Kyle’s engaging talk about CFR. It may take off soon across the United States. One important lesson: make it fun!

The post Copyright First Responders webinar now available appeared first on District Dispatch.

District Dispatch: Congress is in recess, make it count

planet code4lib - Mon, 2017-04-10 16:57

National Library Week is the perfect time to make sure that your congressional representative in the House and both U.S. senators know you want them to fight for full federal library funding for fiscal year 2018. They are now home for two full weeks for their spring recess, so you have ample opportunity to make that point loudly, clearly and in as many places as you can.

2017 Congressional Calendar (Source: The Hill)

Right now is prime time to Fight for Libraries! and against the President’s proposal to eliminate IMLS and efforts in Congress to slash virtually all federal library funding.

First, don’t worry about intruding on your representative’s and senators’ schedule. Congress may be in “recess,” but these breaks from Washington are officially known as “district work periods,” so their days (and nights) are filled with meetings with constituents like you, as well as visits to schools, companies and – yes – potentially libraries back home.

Second, get on their schedules. Call their office nearest you (here’s a handy directory) and ask to meet with your member of Congress and Senators (or their senior staff) during the work period so you – and perhaps three or four other library supporters or patrons (for example, a business owner, social worker, clergy person, soccer mom or dad or any other fans of libraries) – can ask them to oppose eliminating IMLS and support full funding for library programs like LSTA and Innovative Approaches to Literacy in FY 2018. You can find all the background info you need and talking points at Fight for Libraries!

Third, make some noise. Odds are your members of Congress will be hosting at least one Town Hall meeting during the recess. Go! Tell them: 1) how important federal funding is to your local library (an example of how LSTA money gets used would be ideal, but not essential); and 2) that you want them to oppose eliminating IMLS and any cuts in the already very modest funding libraries receive from the federal government. (States get a total of just over $150 million annually under LSTA and IAL receives just $27 million, half of which is dedicated to school libraries.)

Fourth, and really importantly, if you run a library system or library branch, contact your members’ local offices and invite your Representative and both Senators to visit your library, where you can show them first-hand the incredible things that a 21st century library does for their constituents. Even if that means you can’t deliver any messages that specifically relate to legislation or library funding while you’re “on duty,” it will be enormously valuable to inform your representative’s and senators’ understanding of what a library is and does and how vital their local libraries are to hundreds of thousands of voters in their communities. Hosting a visit and giving a tour is not lobbying and isn’t barred by any laws anywhere.

Finally, whatever contacts you arrange with your members of Congress and their staffs, remember to email them afterwards with a reminder of what you asked for or discussed and, most importantly, to thank them for their time and support. Civility isn’t dead and will help ensure that your efforts pay off in the end.

That’s all there is to it. Drop us a line at the ALA Office of Government Relations if you need any help or to let us know how your meeting or library visit went.

The post Congress is in recess, make it count appeared first on District Dispatch.

LITA: Help us improve LITA’s virtual teams!

planet code4lib - Mon, 2017-04-10 16:16

LITA’s Emerging Leaders team is embarking on a project to give you the tools you need to make the most of your committees and interest groups. But before we can do that we need your help!

We are working to review the online tools and best practices currently in use, and make recommendations which will serve to improve collaboration between Committee/Interest Group chairs and members. Please take a few minutes to complete our survey.

If you have any questions, be sure to indicate them in the survey, or contact LITA at

Thanks in advance!

Emerging Leaders Project Team D

  • Jessica Bennett, Missouri State University
  • Bri Furcron, State of Arizona Library
  • Catie Sahadath, University of Ottawa
  • Jennifer Shimada, Relay Graduate School of Education
  • Kyle Willis, OCLC

Jonathan Rochkind: One way to remove local merged tracking branches

planet code4lib - Mon, 2017-04-10 15:51

My git workflow involves creating a lot of git feature branches, as remote tracking branches on origin. They eventually get merged and deleted (via GitHub PR), but I still have dozens of them lying around.

Via googling, getting StackOverflow answers, and sort of mushing some stuff I don’t totally understand together, here’s one way to deal with it: create an alias, git-prune-tracking. In your ~/.bash_profile:

alias git-prune-tracking='git branch --merged | grep -v "*" | grep -v "master" | xargs git branch -d; git remote prune origin'

And periodically run git-prune-tracking from a git project dir.

I must admit I do not completely understand what this is doing, and there might be a better way. But it seems to work. Anyone have a better way, one where they understand what it’s doing? I’m kinda surprised this isn’t built into the git client somehow.
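
For anyone else puzzling over the one-liner, here is the same pipeline unpacked into a commented shell function. This is a sketch, not a tested replacement for the alias: the function name is made up, and the one deliberate change is a `while read` loop in place of `xargs`, so that nothing runs when there are no branches to delete.

```shell
# Hypothetical function name; same steps as the alias, one per line.
git_prune_tracking() {
  # 1. List local branches already merged into the currently checked-out branch.
  git branch --merged |
    # 2. Drop the current branch, which git marks with a leading "*".
    grep -v '^\*' |
    # 3. Drop any branch whose name contains "master" (assumes that is the main branch).
    grep -v 'master' |
    # 4. Delete each remaining branch; -d refuses to delete unmerged work.
    while read -r branch; do
      git branch -d "$branch"
    done
  # 5. Remove stale origin/* remote-tracking refs whose branches are gone on the remote.
  git remote prune origin
}
```

Because "merged" is judged relative to the branch that is checked out, this is best run from master. The last step is also available as part of a fetch, via `git fetch --prune`.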

Filed under: General

David Rosenthal: Research Access for the 21st Century

planet code4lib - Mon, 2017-04-10 15:00
This is the second of my posts from CNI's Spring 2017 Membership Meeting. The first is Researcher Privacy.

Resource Access for the 21st Century, RA21 Update: Pilots Advance to Improve Authentication and Authorization for Content by Elsevier's Chris Shillum and Ann Gabriel reported on the effort by the oligopoly publishers to replace IP address authorization with Shibboleth. Below the fold, some commentary.

RA21 is presented as primarily a way to improve the user experience, and secondarily as a way of making life simpler for the customers (libraries). But in reality it is an attempt to cut off the supply of content to Sci-Hub. As such, it got a fairly rough reception, for three main reasons:
  • In an open access world, there's no need for authorization. Thus this is yet more of the publishers' efforts to co-opt librarians into being "personal shoppers moonlighting as border guards" as Barbara Fister puts it. As someone who has been involved in implementing Shibboleth and connecting to institutions' identity infrastructure, I can testify that the switch to Shibboleth might in the long run make librarians' lives easier, but between now and the long run there stands a whole lot of work. Since it is intended to protect their bottom lines, the publishers should pay for this work. But instead they are apparently seeking grant funding for their pilot program, which is pretty cheeky. Maintaining their bottom line is not exactly in the public's, or the funding agencies', interest.
  • The analysis of the user experience problem on which the PR for this effort is based is flawed, because it is publisher-centric. Sure, Shibboleth could potentially reduce the burden on the off-campus user of logging in to many different publisher Web sites. But if that is the problem, there are much simpler solutions to hand that libraries, rather than publishers, can implement. Simply proxy everything, as Sam Kome (see here) reported the Claremont Colleges do successfully, or use VPNs (which would have the additional benefit of making off-campus users much safer). But, as studies of the use of Sci-Hub show, the real problem is the existence of the many different publisher Web sites, not the need to log into them. What readers want is a single portal providing access to the entire academic literature, so they only have to learn one user interface. Yet another example of the power of increasing returns to scale in the Web world.
  • Even if in an ideal world the use of Shibboleth could completely prevent the use of compromised credentials to supply sites such as Sci-Hub, which in the real world it can't, doing so is in no-one's interest. The presence of copies on these sites is not a problem for readers, whether or not they use those copies. The presence of copies on those sites is in the librarian's interests, as they may exert downward pressure on publisher prices. If copies elsewhere were really a serious problem, ResearchGate's 100M copies, about half of which are apparently copyright violations, would be twice as big a threat as Sci-Hub. None of those copyright violations are the result of compromised credentials, so Shibboleth implementation wouldn't cut them off. Publishers seem content to live with ResearchGate.

