Planet Code4Lib - http://planet.code4lib.org

Access Conference: Opening Keynote – Dr. Kimberly Christen

Wed, 2017-06-21 15:41

We are excited to announce that Dr. Kimberly Christen will be the opening keynote speaker for Access 2017. Join us for her aptly titled talk, “The Trouble with Access.”

Dr. Kimberly Christen is the Director of the Digital Technology and Culture Program and the co-Director of the Center for Digital Scholarship and Curation at Washington State University.

She is the founder of Mukurtu CMS, an open source community archive platform designed to meet the needs of Indigenous communities; the co-Director of the Sustainable Heritage Network, a global community providing educational resources for stewarding digital heritage; and co-Director of the Local Contexts initiative, an educational platform to support the management of intellectual property, specifically using Traditional Knowledge Labels.

More of her work can be found on her website, www.kimchristen.com, and you can follow her on Twitter at @mukurtu.

Open Knowledge Foundation: Always Already Computational Reflections

Wed, 2017-06-21 14:07

Always Already Computational is a project bringing together a variety of different perspectives to develop “a strategic approach to developing, describing, providing access to, and encouraging reuse of collections that support computationally-driven research and teaching” in subject areas relating to library and museum collections.  This post is adapted from my Position Statement for the initial workshop.  You can find out more about the project at https://collectionsasdata.github.io.

Earlier this year, I spent two and a half days at the beautiful University of California, Santa Barbara, at a workshop speaking with librarians, developers, and museum and library collection managers about data.  Attendees at this workshop represented a variety of respected cultural institutions including the New York Public Library, the British Library, the Internet Archive, and others.

Our task was to build a collective sense of what it means to treat library and museum “collections”—the (increasingly) digitized catalogs of their holdings—as data for analysis, art, research, and other forms of re-use.  We gathered use cases and user stories in order to start the conversation on how to best publish collections for these purposes.  Look for further outputs on the project website: https://collectionsasdata.github.io .  For the moment, here are my thoughts on the experience and how it relates to work at Open Knowledge International, specifically, Frictionless Data.

Always Already Computational

Open Access to (meta)Data

The event organizers—Thomas Padilla (University of California Santa Barbara), Laurie Allen (University of Pennsylvania), Stewart Varner (University of Pennsylvania), Sarah Potvin (Texas A&M University), Elizabeth Russey Roke (Emory University), Hannah Frost (Stanford University)—took an expansive view of who should attend.  I was honored and excited to join, but decidedly new to Digital Humanities (DH) and related fields.  The event served as an excellent introduction, and I now understand DH to be a set of approaches toward interrogating recorded history and culture with the power of our current tools for data analysis, visualization, and machine learning.  As part of the Frictionless Data project at Open Knowledge International, we are building apps, libraries, and specifications that support the basic transport and description of datasets to aid in this kind of data-driven discovery.  We are trialling this approach across a variety of fields, and are interested to determine the extent to which it can improve research using library and museum collection data.

What is library and museum collection data?  Libraries and museums hold physical objects which are often (although not always) shared for public view on the stacks or during exhibits.  Access to information (metadata) about these objects—and the sort of cultural and historical research dependent on such access—has naturally been somewhat technologically, geographically, and temporally restricted.  Digitizing the detailed catalogues of the objects libraries and museums hold surely lowered the overhead of day-to-day administration of these objects, but also provided a secondary public benefit: sharing this same metadata on the web with a permissive license allows a greater variety of users in the public—researchers, students of history, and others—to freely interrogate our cultural heritage in a manner they choose.  

There are many different ways to share data on the web, of course, but they are not all equal.  A low impact, open, standards-based set of approaches to sharing collections data that incorporates a diversity of potential use cases is necessary.  To answer this need, many museums are currently publishing their collection data online, with permissive licensing, through GitHub: The Tate Galleries in the UK, Cooper Hewitt, Smithsonian Design Museum and The Metropolitan Museum of Art in New York have all released their collection data in CSV (and JSON) format on this popular platform normally used for sharing code.  See A Nerd’s Guide To The 2,229 Paintings At MoMA and An Excavation Of One Of The World’s Greatest Art Collections both published by FiveThirtyEight for examples of the kind of exploratory research enabled by sharing museum collection data in bulk, in a straightforward, user-friendly way.  What exactly did they do, and what else may be needed?

Packages of Museum Data

Our current funding from the Sloan Foundation enables us to focus on this researcher use case for consuming data.  Across fields, the research process is often messy, and researchers, even if they are asking the right questions, possess a varying level of skill in working with datasets to answer them.  As I wrote in my position statement:

Such data, released on the Internet under open licenses, can provide an opportunity for researchers to create a new lens onto our cultural and artistic history by sparking imaginative re-use and analysis.  For organizations like museums and libraries that serve the public interest, it is important that data are provided in ways that enable the maximum number of users to easily process it.  Unfortunately, there are not always clear standards for publishing such data, and the diversity of publishing options can cause unnecessary overhead when researchers are not trained in data access/cleaning techniques.

My experience at this event, and some research beforehand, suggested that there is a spectrum of data release approaches, ranging from a basic data “dump” as conducted by the museums referenced above to more advanced, though higher investment, approaches such as publishing data as an online service with a public “API” (Application Programming Interface).  A public API can provide a consistent interface to collection metadata, as well as the ability to request only the needed records, but comes at the cost of having the nature of the analysis somewhat preordained by its design.  In contrast, with the data dump approach, an entire dataset, or a coherent chunk of it, can be easier for some users to access and load directly into a tool like R (see this UK Government Digital Service post on the challenges of each approach) without needing advanced programming.  As a format for this bulk download, CSV is the best choice, as MoMA reflected when releasing their collection data online:

CSV is not just the easiest way to start but probably the most accessible format for a broad audience of researchers, artists, and designers.  

This, of course, comes at the cost of a less consistent interface for the data, especially in the case of the notoriously underspecified CSV format.  The README file will typically go into some narrative detail about how best to use the dataset, including some expected “gotchas” (e.g. “this UTF-8 file may not work well with Excel on a Mac”).  It might also list the columns in a tabular data file stored in the dataset, along with the expected types and formats for values in each column (e.g. the date_acquired column should, hopefully, contain dates in one or another international format).  This information is critical for actually using the data.  The automated export process that generates the public collection dataset from the museum’s internal database may try to ensure that the data matches expectations, but bugs exist, and small errors may go unnoticed in the process.

The Data Package descriptor (described in detail on our specifications site), used in conjunction with Data Package-aware tooling, is meant to somewhat restore the consistent interface provided by an API by embedding this “schema” information with the data.  This allows the user or the publisher to check that the data conforms to expectations without requiring modification of the data itself: a “packaged” CSV can still be loaded into Excel as-is (though without the benefit of the type checking enabled by the Data Package descriptor).  The Carnegie Museum of Art, in releasing its collection data, followed the examples set by the Tate, the Met, MoMA, and Cooper Hewitt as described above, but opted to also include a Data Package descriptor file to help facilitate online validation of the dataset through tools such as Good Tables.  As tools come online for editing, validating, and transforming Data Packages, users of this dataset should be able to benefit from those, too: http://frictionlessdata.io/tools/.
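To make that concrete, here is a minimal sketch of a descriptor, written from Python for convenience; the result is simply a datapackage.json file published next to the CSV. The file name (collection.csv) and the field names and types are illustrative assumptions, not the actual CMOA, Tate, or MoMA schema.

import json

# A minimal, illustrative Data Package descriptor for a hypothetical collection.csv;
# a real descriptor lists the actual columns the museum publishes.
descriptor = {
    "name": "example-museum-collection",
    "resources": [{
        "name": "collection",
        "path": "collection.csv",
        "schema": {
            "fields": [
                {"name": "object_id", "type": "integer"},
                {"name": "title", "type": "string"},
                {"name": "date_acquired", "type": "date"},
            ]
        }
    }]
}

# Write the descriptor alongside the CSV so Data Package-aware tools
# (e.g. Good Tables) can validate the data against the declared schema.
with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)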

We are a partner in the Always Already Computational: Collections as Data project, and as part of this work, we are working with the Carnegie Museum of Art to provide a more detailed look at the process that went into the creation of the CMOA dataset, as well as to sketch potential ways in which the Data Package might help enable re-use of this data.  In the meantime, check out our other case studies on the use of Data Package in fields as diverse as ecology, cell migration, and energy data:

http://frictionlessdata.io/case-studies/

Also, pay your local museum or library a visit.

Library Tech Talk (U of Michigan): Software contributions reduce our debt

Wed, 2017-06-21 00:00

Contributing to software projects can be harder and more time consuming than coding customized solutions. But over the long term, writing generalized solutions that can be used and contributed to by developers from around the world reduces our dependence on ourselves and our organizational resources, thus drastically reducing our technical debt.

DuraSpace News: WATCH New Hyku, Hyrax, and the Hydra-in-a-Box Project Demo

Wed, 2017-06-21 00:00

From Michael J. Giarlo, Technical Manager, Hydra-in-a-Box Project, Software Architect, on behalf of the Hyku tech team

Stanford, CA  Here's the latest demo of advances made on Hyku, Hyrax, and the Hydra-in-a-Box project.

DuraSpace News: Visit HykuDirect.org

Wed, 2017-06-21 00:00

DuraSpace is pleased to announce that the new HykuDirect web site is up and running, and ready to field inquiries about the exciting new hosted service currently in development: http://hykudirect.org.

• The site features Hyku background information, a complete list of key features, a timeline that lays out the steps towards availability of a full-production service, and a contact form.

LITA: Timothy Cole Wins 2017 LITA/OCLC Kilgour Research Award

Tue, 2017-06-20 22:20

Timothy Cole, Head of the Mathematics Library and Professor of Library and Information Science at the University of Illinois Urbana-Champaign, has been selected as the recipient of the 2017 Frederick G. Kilgour Award for Research in Library and Information Technology, sponsored by OCLC and the Library and Information Technology Association (LITA). Professor Cole also holds appointments in the Center for Informatics Research in Science and Scholarship (CIRSS) and the University Library.

The Kilgour Award is given for research relevant to the development of information technologies, especially work which shows promise of having a positive and substantive impact on any aspect(s) of the publication, storage, retrieval and dissemination of information, or the processes by which information and data is manipulated and managed. The winner receives $2,000, a citation, and travel expenses to attend the LITA Awards Ceremony & President’s Program at the 2017 ALA Annual Conference in Chicago (IL).

Over the past 20 years, Professor Cole’s research in digital libraries, metadata design and sharing, and interoperable linked data frameworks has significantly enhanced discovery and access of scholarly content, embodying the spirit of this prestigious Award. His extensive publication record includes research papers, books, and conference publications, and he has earned more than $11 million in research grants during his career.

The Award Committee also noted Professor Cole’s significant contributions to major professional organizations including the World Wide Web Consortium (W3C), Digital Library Federation, and Open Archives Initiative, all of which help set the standards in metadata and linked data practices that influence everyday processes in libraries. We believe his continuing work on Linked Open Data will further improve how information is discovered and accessed. With all of Professor Cole’s research and service contributions, the Committee unanimously found him to be the ideal candidate to receive the 2017 Frederick G. Kilgour Award.

When notified he had been selected, Professor Cole said, “I am honored and very pleased to accept this Award. Fred Kilgour’s recognition more than 50 years ago of the ways that computers and computer networks could improve both library services and workflow efficiencies was remarkably prescient, and his longevity and consistent success in this dynamic field was truly amazing. Many talented librarians have built on his legacy, and over the course of my career, I have found the opportunity to meet, learn from, and work with many of these individuals, including several prior Kilgour awardees, truly rewarding. I have been especially fortunate in my opportunities and colleagues at Illinois — notably (to name but three) Bill Mischo, Myung-Ja Han, and Muriel Foulonneau — as well as in my collaborations with other colleagues across the globe. It is these collaborations that account in large measure for the modest successes I have enjoyed. I am humbled by and most appreciative of the Award Committee for giving me this opportunity to join the ranks of Kilgour awardees.”

Members of the 2017 Kilgour Award Committee are: Tabatha Farney (Chair), Ellen Bahr, Matthew Carruthers, Zebulin Evelhoch, Bohyun Kim, Colby Riggs, and Roy Tennant (OCLC Liaison).

Thank you to OCLC for sponsoring this award.

Library of Congress: The Signal: Hack-to-Learn at the Library of Congress

Tue, 2017-06-20 20:52

When hosting workshops, such as Software Carpentry, or events, such as Collections As Data, our National Digital Initiatives team made a discovery—there is an appetite among librarians for hands-on computational experience. That’s why we created an inclusive hackathon, or a “hack-to-learn,” taking advantage of the skills librarians already have and pairing them with programmers to mine digital collections.

Hack-to-Learn took place on May 16-17 in partnership with George Mason and George Washington University Libraries. Over the two days, 61 attendees used low- or no-cost computational tools to explore four library collections as data sets. You can see the full schedule here.

Day two of the workshop took place at George Washington University Libraries. Here, George Oberle III, History Librarian at George Mason University, gives a Carto tutorial. Photo by Justin Littman, event organizer.

The Data Sets

The meat of this event was our ability to provide library collections as data to explore, and with concerted effort we were able to make a diverse set available and accessible.

In the spring, the Library of Congress released 25 million of its MARC records for free bulk download. Some have already been working with the data – Ben Schmidt was able to join us on day one to present his visual hacking history of MARC cataloging and Matt Miller made a list of 9 million unique titles. We thought these cataloging records would also be a great collection for hack-to-learn attendees because the format is well-structured and familiar for librarians.

The Eleanor Roosevelt Papers Project at George Washington University shared its “My Day” collection – Roosevelt’s daily syndicated newspaper column and the closest thing we have to her diary. George Washington University Libraries contributed their Tumblr End of Term Archive: text and metadata from 72 federal Tumblr blogs harvested as part of the End of Term Archive project.

Topic modelling in MALLET with the Eleanor Roosevelt “My Day” collection. MALLET generates a list of topics from a corpus and keywords composing those topics. An attendee suggested it would be a useful method for generating research topics for students (and we agree!).

As excitement for hack-to-learn grew, the Smithsonian joined the fun by providing their Phyllis Diller Gag file. Donated to the Smithsonian American History Museum, the gag file is a physical card catalog containing 52,000 typewritten joke cards the comedian organized by subject. The Smithsonian Transcription Center put these joke cards online, and they were transcribed by the public in just a few weeks. Our event was the first time these transcriptions were used.

Gephi network analysis visualization of the Phyllis Diller Gag file. The circles (or nodes) represent joke authors and their relationship to each other based on their joke subjects.

To encourage immediate access to the data and tools, we spent a significant amount of time readying these four data sets so that ready-to-load versions were available. For the MARC records to be usable in the mapping tool Carto, for example, Wendy Mann, Head of George Mason University Data Services, had to reduce the size of the set, convert the 1,000-row files to CSV using MarcEdit, map the MARC fields as column headings, create load files for MARC fields in each file, and then mass edit column names in OpenRefine so that each field name began with a letter rather than a number (a Carto requirement).
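That last renaming step can also be scripted. The sketch below is a minimal Python version under assumed file names (marc_records.csv as the MarcEdit export, marc_records_carto.csv as the output); it simply prefixes any header that starts with a digit so the result meets the Carto requirement.

import csv

# Hypothetical file names; the input is a CSV exported from MarcEdit
# whose column headers are MARC tags (e.g. "245", "100", "650").
with open('marc_records.csv', newline='', encoding='utf-8') as infile:
    rows = list(csv.reader(infile))

# Prefix headers that start with a digit so every column name begins with a letter
header = ['f' + name if name[:1].isdigit() else name for name in rows[0]]

with open('marc_records_carto.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(rows[1:])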

We also wanted to be transparent about this work so attendees could re-create these workflows after hack-to-learn. We bundled the data sets in their multiple versions of readiness, README files, a list of resources, a list of brainstorming ideas of what possible questions to ask of the data, and install directions for the different tools all in a folder that was available for attendees a week before the event. We invited attendees to join a Slack channel to ask questions or report errors before and during the event, and opened day one with a series of lightning talks about the data sets from content and technical experts.

What Was Learned

Participants were largely librarians, faculty or students from our three partner organizations. 12 seats were opened to the public and quickly filled by librarians, faculty or students from universities or cultural heritage institutions. Based on our registration survey, the majority of participants trended towards little or no experience. Almost half reported experience with OpenRefine, while 44.8% reported having never used any of the tools before. 49.3% wanted to learn about “all” methodologies (data cleaning, text mining, network analysis, etc.), and 46.3% reported interest in specifically text mining.

31.3% of hack-to-learn registrants were curious about computational research and wanted an introduction, and 28.4% were familiar with some tools but not all. 14.9% thought it sounded fun!

Twenty-one attendees responded to our post-event survey. Participants confirmed that collections as data work felt less “intimidating” and the tools more “approachable.” Respondents reported a recognition of untapped potential in their data sets and requested more events of this kind.

“I was able to get results using all the tools, so in a sense everything worked well. Pretty sure my ‘success’ was related to the scale of task I set for myself; I viewed the work time as time for exploring the tools, rather than finishing something.”

Many appreciated the event’s diversity: the diversity of data sets and tools, the mixture of subject matter and technical experts, and the mix between instructional and problem-solving time.

“The tools and datasets were all well-selected and gave a good overview of how they can be used. It was the right mix of easy to difficult. Easy enough to give us confidence and challenging enough to push our skills.”

The Phyllis Diller team works with OpenRefine at Hack-to-Learn, May 17, 2017. Photo by Shawn Miller.

When asked what could be improved, many felt that identifying what task to do or question to ask of the data set was difficult, and attendees often underestimated the data preparation step. We received suggestions such as adding guided exercises with the tools before independent work and more time for digging deeper into a particular methodology or research question.

“It was at first overwhelming but ultimately hugely beneficial to have multiple tools and multiple data sets to choose from. All this complexity allowed me to think more broadly about how I might use the tools, and having data sets with different characteristics allowed for more experimentation.”

Most importantly, attendees identified what still needed to be learned. Insights from the event related to the limitations of the tools. For example, attendees recognized GUI interfaces were accessible and useful for surface-level investigation of a data set, but command-line knowledge was needed for deeper investigation or in some cases, working with a larger data set. Several participants in the post-event survey showed interest in learning Python as a result.

Recognizing what they didn’t know was not discouraging. In fact, one point we heard from multiple attendees was the desire for more hack-to-learn events.

“If someone were to host occasional half-day or drop-in hack-a-thons with these or other data sets, I would like to try again. I especially appreciate that you were welcoming of people like me without a lot of programming experience … Your explicit invitation to people with *all* levels of experience was the difference between me actually doing this and not doing it.”

We’d like to send a big thank you again to our partners at George Washington and George Mason University Libraries, and to the Smithsonian American History Museum and Smithsonian Transcription Center for your time and resources to make Hack-to-Learn a success! We encourage anyone reading this to consider doing one at your library, and if you do, let us know so we can share it on The Signal!

 

 

LITA: Learn about Contextual Inquiry, after ALA Annual

Tue, 2017-06-20 18:42

Sign up today for 

Contextual Inquiry: Using Ethnographic Research to Impact your Library UX

This new LITA web course begins July 6, 2017, shortly after ALA Annual. Use the excitement generated by the conference to further explore new avenues to increase your user engagement. The contextual inquiry research methodology helps to better understand the intents and motivations behind user behavior. The approach involves in-depth, participant-led sessions where users take on the role of educator, teaching the researcher by walking them through tasks in the physical environment in which they typically perform them.

Instructors: Rachel Vacek, Head of Design & Discovery, University of Michigan Library; and Deirdre Costello, Director, UX Research, EBSCO Information Services
July 6 – August 10, 2017
Register here; courses are listed by date and you need to log in

In this session, learn what’s needed to conduct a Contextual Inquiry and how to analyze the ethnographic data once collected. We’ll talk about getting stakeholders on board, the IRB (Institutional Review Board) process, and scalability for different-sized library teams. We’ll cover how to synthesize and visualize your findings as sequence models and affinity diagrams that directly inform the development of personas and common task flows. Finally, learn how this process can help guide your design and content strategy efforts while constructing a rich picture of the user experience.

View details and Register here.

This is a blended format web course

The course will be delivered as separate live webinar lectures, one per week. You do not have to attend the live lectures in order to participate. The webinars will be recorded for later viewing.

Check the LITA Online Learning web page for additional upcoming LITA continuing education offerings.

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty, mbeatty@ala.org

Karen Coyle: Pray for Peace

Tue, 2017-06-20 17:56
This is a piece I wrote on March 22, 2003, two days after the beginning of the second Gulf war. I just found it in an old folder, and sadly have to say that things have gotten worse than I feared. I also note an unfortunate use of terms like "peasant" and "primitive" but I leave those as a recognition of my state of mind/information. Pray for peace.

Saturday, March 22, 2003
Gulf War II
The propaganda machine is in high gear, at war against the truth. The bombardments are constant and calculated. This has been planned carefully over time.
The propaganda box sits in every home showing footage that it claims is of a distant war. We citizens, of course, have no way to independently verify that, but then most citizens are quite happy to accept it at face value.
We see peaceful streets by day in a lovely, prosperous and modern city. The night shots show explosions happening at a safe distance. What is the magical spot from which all of this is being observed?
Later we see pictures of damaged buildings, but they are all empty, as are the streets. There are no people involved, and no blood. It is the USA vs. architecture, as if the city of Bagdad itself is our enemy.
The numbers of casualties, all of them ours, all of them military, are so small that each one has an individual name. We see photos of them in dress uniform. The families state that they are proud. For each one of these there is the story from home: the heavily made-up wife who just gave birth to twins and is trying to smile for the camera, the child who has graduated from school, the community that has rallied to help re-paint a home or repair a fence.
More people are dying on the highways across the USA each day than in this war, according to our news. Of course, even more are dying around the world of AIDS or lung cancer, and we aren't seeing their pictures or helping their families. At least not according to the television news.
The programming is designed like a curriculum with problems and solutions. As we begin bombing the networks show a segment in which experts explain the difference between the previous Gulf War's bombs and those used today. Although we were assured during the previous war that our bombs were all accurately hitting their targets,  word got out afterward that in fact the accuracy had been dismally low. Today's experts explain that the bombs being used today are far superior to those used previously, and that when we are told this time that they are hitting their targets it is true, because today's bombs really are accurate.
As we enter and capture the first impoverished, primitive village, a famous reporter is shown interviewing Iraqi women living in the USA who enthusiastically assure us that the Iraqi people will welcome the American liberators with open arms. The newspapers report Iraqis running into the streets shouting "Peace to all." No one suggests that the phrase might be a plea for mercy by an unarmed peasant facing a soldier wearing enough weaponry to raze the entire village in an eye blink.
Reporters riding with US troops are able to phone home over satellite connections and show us grainy pictures of heavily laden convoys in the Iraqi desert. Like the proverbial beasts of burden, the trucks are barely visible under their packages of goods, food and shelter. What they are bringing to the trade table is different from the silks and spices that once traveled these roads, but they are carrying luxury goods beyond the ken of many of Iraq's people: high tech sensor devices, protective clothing against all kinds of dangers, vital medical supplies and, perhaps even more important, enough food and water to feed an army. In a country that feeds itself only because of international aid -- aid that has been withdrawn as the US troops arrive -- the trucks are like self-contained units of American wealth motoring past.
I feel sullied watching any of this, or reading newspapers. It's an insult to be treated like a mindless human unit being prepared for the post-war political fall-out. I can't even think about the fact that many people in this country are believing every word of it. I can't let myself think that the propaganda war machine will win.
Pray for peace.

David Rosenthal: Analysis of Sci-Hub Downloads

Tue, 2017-06-20 17:36
Bastian Greshake has a post at the LSE's Impact of Social Sciences blog based on his F1000Research paper Looking into Pandora's Box. In them he reports on an analysis combining two datasets released by Alexandra Elbakyan:
  • A 2016 dataset of 28M downloads from Sci-Hub between September 2015 and February 2016.
  • A 2017 dataset of 62M DOIs to whose content Sci-Hub claims to be able to provide access.
Below the fold, some extracts and commentary.

Greshake's procedure was as follows:
  • Obtain the bibliographic metadata (publisher, journal, year) for each of the 62M Sci-Hub DOIs from CrossRef. About 76% of the 62M could be resolved.
  • Match the DOI of each of the 28M downloads with the corresponding metadata. Metadata for about 77% of the downloads could be obtained.
  • Count the number of downloads for each of the ~47M/62M DOIs that had metadata.
From the download count and the metadata, Greshake was able to draw many interesting graphs.
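As a rough illustration of the join-and-count step (not Greshake's actual code), the pandas sketch below assumes the two releases have been flattened into hypothetical CSV files, each with a doi column, the metadata file also carrying publisher, journal, and year.

import pandas as pd

# Hypothetical file names; the downloads come from the 2016 Sci-Hub release
# and the metadata from resolving the 62M DOIs against CrossRef.
downloads = pd.read_csv('scihub_downloads.csv', usecols=['doi'])
metadata = pd.read_csv('crossref_metadata.csv')  # columns: doi, publisher, journal, year

# Keep only downloads whose DOI resolved to CrossRef metadata (~77% in the paper)
merged = downloads.merge(metadata, on='doi', how='inner')

# Downloads per publisher, the kind of per-publisher count discussed below
downloads_per_publisher = merged.groupby('publisher').size().sort_values(ascending=False)
print(downloads_per_publisher.head(10))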
Greshake makes some interesting observations. The distributions are heavily skewed towards articles from major publishers:
Looking at the data on a publisher level, there are ~1,700 different publishers, with ~1,000 having at least a single paper downloaded. Both corpus and downloaded publications are heavily skewed towards a set of few publishers, with the 9 most abundant publishers having published ~70% of the complete corpus and ~80% of all downloads respectively.
And from a small number of major journals:
The complete released database contains ~177,000 journals, with ~60% of these having at least a single paper downloaded. ... <10% of the journals being responsible for >50% of the total content in Sci-Hub. The skew for the downloaded content is even more extreme, with <1% of all journals getting over 50% of all downloads.
The download skew towards the major publishers is not simply caused by their greater representation in the corpus:
982 publishers differed significantly from the expected download numbers, with 201 publishers having more downloads than expected and 781 being underrepresented. Interestingly, while some big publishers like Elsevier and Springer Nature come in amongst the overly downloaded publishers, many of the large publishers, like Wiley-Blackwell and the Institute of Electrical and Electronics Engineers (IEEE), are being downloaded less than expected given their portfolio.
There may be significant use from industry as opposed to academia:
12 of the 20 most downloaded journals can broadly be classified as being within the subject area of chemistry. This is an effect that has also been seen in a prior study looking at the downloads done from Sci-Hub in the United States. In addition, publishers with a focus on chemistry and engineering are also amongst the most highly accessed and overrepresented. ... it's noteworthy that both disciplines have a traditionally high number of graduates who go into industry.
But Greshake's implication seems to be contradicted by the data. Among the least downloaded publishers are ACM, SPIE, AMA, IOP, APS, BMJ, IEEE and Wiley-Blackwell, all of whom cover fields that are oriented more toward practitioners than academics. The disparity may be due more to the practitioner-oriented fields being less dominated by publishers hostile to open access. For example, computing (ACM, IEEE) and physics (IOP, APS) are heavy users of arxiv.org, eliminating the need for Sci-Hub. Whereas chemistry is dominated by ACS and RSC, both notably unenthusiastic about open access, and the third and fourth most downloaded publishers.

The corpus and even more the downloads are heavily skewed towards recent articles:
over 95% of the publications listed in Sci-Hub were published after 1950 ... Over 95% of all downloads fall into publications done after 1982, with ~35% of the downloaded publications being less than 2 years old at the time they are being accessed
There is a:
bleak picture when it came to the diversity of actors in the academic publishing space, with around 1/3 of all articles downloaded being published through Elsevier. The analysis presented here puts this into perspective with the whole space of academic publishing available through Sci-Hub, in which Elsevier is also the dominant force with ~24% of the whole corpus. The general picture of a few publishers dominating the market, with around 50% of all publications being published through only 3 companies, is even more pronounced at the usage level compared to the complete corpus, ... Only 11% of all publishers, amongst them already dominating companies, are downloaded more often than expected, while publications of 45% of all publishers are significantly less downloaded.
Greshake concludes:
the Sci-Hub data shows that the academic publishing field is even more of an oligopoly in terms of actual usage when compared to the amount of literature published.
This is not unexpected; the major publishers' dominance is based on bundling large numbers of low-quality (and thus low-usage) journals with a small number of must-have (and thus high-usage) journals in "big deals".

Brown University Library Digital Technologies Projects: Testing HTTP calls in Python

Tue, 2017-06-20 16:32

Many applications make calls to external services, or other services that are part of the application. Testing those HTTP calls can be challenging, but there are some different options available in Python.

Mocking

One option for testing your HTTP calls is to mock out your function that makes the HTTP call. This way, your function doesn’t make the HTTP call, since it’s replaced by a mock function that just returns whatever you want it to.

Here’s an example of mocking out your HTTP call:

import requests

class SomeClass:
    def __init__(self):
        self.data = self._fetch_data()

    def _fetch_data(self):
        r = requests.get('https://repository.library.brown.edu/api/collections/')
        return r.json()

    def get_collection_ids(self):
        return [c['id'] for c in self.data['collections']]

from unittest.mock import patch

MOCK_DATA = {'collections': [{'id': 1}, {'id': 2}]}

# Patch _fetch_data so no real HTTP request is made; it just returns MOCK_DATA.
with patch.object(SomeClass, '_fetch_data', return_value=MOCK_DATA) as mock_method:
    thing = SomeClass()
    assert thing.get_collection_ids() == [1, 2]

Another mocking option is the responses package. Responses mocks out the requests library specifically, so if you’re using requests, you can tell the responses package what you want each requests call to return.

Here’s an example using the responses package (SomeClass is defined the same way as in the first example):

import responses
import json

MOCK_JSON_DATA = json.dumps({'collections': [{'id': 1}, {'id': 2}]})

@responses.activate
def test_some_class():
    responses.add(responses.GET,
        'https://repository.library.brown.edu/api/collections/',
        body=MOCK_JSON_DATA,
        status=200,
        content_type='application/json'
    )
    thing = SomeClass()
    assert thing.get_collection_ids() == [1, 2]

test_some_class()

Record & Replay Data

A different type of solution is to use a package to record the responses from your HTTP calls, and then replay those responses automatically for you.

  • VCR.py – VCR.py is a Python version of the Ruby VCR library, and it supports various HTTP clients, including requests.

Here’s a VCR.py example, again using SomeClass from the first example:

import vcr

IDS = [674, 278, 280, 282, 719, 300, 715, 659, 468, 720, 716, 687, 286, 288, 290,
       296, 298, 671, 733, 672, 334, 328, 622, 318, 330, 332, 625, 740, 626, 336,
       340, 338, 725, 724, 342, 549, 284, 457, 344, 346, 370, 350, 656, 352, 354,
       356, 358, 406, 663, 710, 624, 362, 721, 700, 661, 364, 660, 718, 744, 702,
       688, 366, 667]

with vcr.use_cassette('vcr_cassettes/cassette.yaml'):
    thing = SomeClass()
    fetched_ids = thing.get_collection_ids()
    assert sorted(fetched_ids) == sorted(IDS)
  • betamax – From the documentation: “Betamax is a VCR imitation for requests.” Note that it is more limited than VCR.py, since it only works for the requests package.

Here’s a betamax example (note: I modified the code in order to test it – maybe there’s a way to test the code with betamax without modifying it?):

import requests

class SomeClass:
    def __init__(self, session=None):
        self.data = self._fetch_data(session)

    def _fetch_data(self, session=None):
        if session:
            r = session.get('https://repository.library.brown.edu/api/collections/')
        else:
            r = requests.get('https://repository.library.brown.edu/api/collections/')
        return r.json()

    def get_collection_ids(self):
        return [c['id'] for c in self.data['collections']]

import betamax

CASSETTE_LIBRARY_DIR = 'betamax_cassettes'
IDS = [674, 278, 280, 282, 719, 300, 715, 659, 468, 720, 716, 687, 286, 288, 290,
       296, 298, 671, 733, 672, 334, 328, 622, 318, 330, 332, 625, 740, 626, 336,
       340, 338, 725, 724, 342, 549, 284, 457, 344, 346, 370, 350, 656, 352, 354,
       356, 358, 406, 663, 710, 624, 362, 721, 700, 661, 364, 660, 718, 744, 702,
       688, 366, 667]

session = requests.Session()
recorder = betamax.Betamax(
    session, cassette_library_dir=CASSETTE_LIBRARY_DIR
)
with recorder.use_cassette('our-first-recorded-session', record='none'):
    thing = SomeClass(session)
    fetched_ids = thing.get_collection_ids()
    assert sorted(fetched_ids) == sorted(IDS)

Integration Test

Note that with all the solutions I listed above, it’s probably safest to cover the HTTP calls with an integration test that interacts with the real service, in addition to whatever you do in your unit tests.

Another possible solution is to test as much as possible with unit tests without testing the HTTP call, and then just rely on the integration test(s) to test the HTTP call. If you’ve constructed your application so that the HTTP call is only a small, isolated part of the code, this may be a reasonable option.

Here’s an example where the class fetches the data if needed, but the data can easily be put into the class for testing the rest of the functionality (without any mocking or external packages):

import requests

class SomeClass:
    def __init__(self):
        self._data = None

    @property
    def data(self):
        if not self._data:
            r = requests.get('https://repository.library.brown.edu/api/collections/')
            self._data = r.json()
        return self._data

    def get_collection_ids(self):
        return [c['id'] for c in self.data['collections']]

import json

MOCK_DATA = {'collections': [{'id': 1}, {'id': 2}]}

def test_some_class():
    thing = SomeClass()
    # Inject the data directly, so no HTTP call is made when the property is read.
    thing._data = MOCK_DATA
    assert thing.get_collection_ids() == [1, 2]

test_some_class()
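As a complement to the unit test above, here is a minimal sketch of the kind of integration test mentioned earlier; it reuses the SomeClass just defined, makes a real HTTP call, and only asserts on the shape of the response so it stays stable as the collection data changes. How and when to run it (for example, on a schedule rather than with every unit test run) is up to your setup.

# An integration test against the live service; it makes a real HTTP request,
# so keep it separate from the fast unit tests.
def test_fetch_data_integration():
    thing = SomeClass()
    data = thing.data  # the property triggers the real request on first access
    assert 'collections' in data
    assert all('id' in c for c in data['collections'])

test_fetch_data_integration()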

Islandora: Islandora at Open Repositories 2017

Tue, 2017-06-20 14:50

Next week the Open Repositories conference in Brisbane, Australia, will begin. Islandora will be there, with an exhibit table where you can stop by to chat and pick up the latest and greatest in laptop stickers, an electronic poster in the Cube, and several sessions that may be of interest to our community.

For the full schedule and more information about the conference, check out the Open Repositories 2017 website.

District Dispatch: Librarian speaks with Rep. Eshoo at net neutrality roundtable

Tue, 2017-06-20 11:27

“There’s nothing broken about existing net neutrality rules that needs to be fixed,” opined Congresswoman Anna Eshoo (D-CA-18) at a roundtable she convened in her district to discuss the impacts of the policy and the consequences of gutting it.

Director of the Redwood City Public Library Derek Wolfgram joined Chris Riley, director of Public Policy at Mozilla; Gigi Sohn, former counselor to FCC Chairman Tom Wheeler; Andrew Scheuermann, CEO and co-founder of Arch Systems; Evan Engstrom, executive director of Engine; Vlad Pavlov, CEO and co-founder of rollApp; Nicola Boyd, co-founder of VersaMe; and Vishy Venugopalan, vice president of Citi Ventures in the discussion.

Mozilla hosted Congresswoman Anna Eshoo for a discussion about net neutrality.

On May 18, FCC Chairman Ajit Pai began the process of overturning critical net neutrality rules—which ensure internet service providers must treat all internet traffic the same—a move which the assembled panelists agreed would hurt businesses and consumers. The Congresswoman singled out anchor institutions—libraries and schools in particular—as important voices in the current discussion because libraries are “there for everyone.”

“I was honored to have the opportunity to contribute a library perspective to Congresswoman Eshoo’s roundtable discussion on net neutrality,” said Wolfgram. “The Congresswoman clearly understands the value of libraries as ‘anchor institutions’ in this country’s educational infrastructure and recognizes the potential consequences of the erosion of equitable access to information if net neutrality were to be eliminated. Having seen the impacts of the digital divide in my own community, I felt that it was very important to highlight the value of net neutrality in breaking down barriers, rather than creating new ones, for families and small businesses to connect with educational resources, employment access and opportunities for innovation.”

In his comments to the roundtable, Wolfgram identified two reasons strong, enforceable net neutrality rules are core to libraries’ public missions: preserving intellectual freedom and promoting equitable access to information. The Redwood City Library connects patrons to all manner of content served by the internet and many of these content providers, he fears, would not have the financial resources to compete against corporate content providers. Without net neutrality, high-quality educational resources could be relegated to second tier status.

Like so many libraries across the country, the Redwood City Library provides low-cost access to the internet for members of the community who otherwise couldn’t connect. Students, even in the heart of Silicon Valley, sit in cars outside the library and depend on library-provided WiFi to get their work done, said Wolfgram. Redwood City Library has recently started loaning internet hot spots, focusing on school-age children and families in an effort to bridge this gap.

“I would hate to see this big step forward, then the students get second-class access or don’t have a full connection to the resources they need,” said Wolfgram. “The internet should contribute to the empowerment of all.”

Congresswoman Eshoo agreed, calling the current net neutrality rules, “a celebration of the First Amendment.”

Former FCC official Sohn indicated the stakes are even higher. At issue, she said, is whether the FCC will have any role in overseeing the dominant communications network of our lifetimes. The FCC’s current proposal puts at risk subsidies for providing broadband to rural residents and people with low incomes through the Lifeline program. It is, as one panelist commented, like “replacing the real rules with no rules.”

The panel concluded with a call to action and a reminder of how public comment matters: the FCC has to follow a rulemaking process and future legal challenges will depend on the robust record developed now. “It’s essential to build a record to win,” Sohn said.

And she’s right. On June 9, we published guidance on how you can comment at the FCC on why net neutrality matters to your library. You can also blog, tweet, post and talk to your community about the importance of net neutrality and show the overwhelming support for this shared public platform.

Rep. Eshoo’s office reached out to ALA to identify a librarian to participate in her roundtable. ALA, on behalf of the library community, deeply appreciates the invitation and her continuing support of libraries and the public interest.

The post Librarian speaks with Rep. Eshoo at net neutrality roundtable appeared first on District Dispatch.

HangingTogether: Securing the Collective Print Books Collection: Progress to Date

Tue, 2017-06-20 01:04

In 2012, Sustainable Collection Services (SCS) and the Michigan Shared Print Initiative (MI-SPI) undertook one of the first shared print monographs projects in the US. Seven libraries came together under the auspices of Midwest Collaborative for Library Services (MCLS) to identify and retain 736,000 monograph holdings for an initial period of 15 years. This work laid the cornerstone of a secure North American (and ultimately international) collective print book collection.

Since then, ten other groups have quietly continued this important work, with the help of SCS (part of OCLC since 2015) and the GreenGlass Model Builder. The results speak for themselves:

  • 11 Shared Print Programs (some with multiple projects)
  • 143 Institutions participating (almost all below the research level)
  • 7.6 million distinct editions identified for long-term retention
  • 19.7 million title-holdings now under long term retention commitment

Models and retention criteria vary according to local and regional priorities, but most of the committed titles are secured under formal Memoranda of Understanding (MOU) for 15 years, often with review every five years. In some respects, these are grass-roots activities, designed to address local needs, but it seems clear that these programs can contribute significantly to a federated national or international solution, such as that envisioned by the MLA’s Working Group on The Future of the Print Record.

Organizations at the forefront of shared print monographs retention to date include:

In addition, the HathiTrust Shared Print Program has made excellent progress, with 50 libraries proposing 16 million monograph volumes for 25-year retention. That work continues, and will ultimately secure multiple holdings of all 7.8 million distinct monograph titles in the HathiTrust digital archive. OCLC/SCS has additional group projects underway in Maryland and Nova Scotia, and both EAST and SCELC are about to bring additional libraries into their shared print programs. As shown in the maps below, construction of the secure collective monographs collection is well underway.

US Print Monograph Retentions by State (June 2017) OCLC Sustainable Collection Services

 

Canadian Print Monograph Retentions by Province (June 2017) OCLC SCS

In subsequent posts, I’ll examine patterns of overlap and geographic distribution of retention commitments, as well as registration of those commitments in WorldCat. I’ll also share some thoughts about managing the collective collection holistically. For now, congratulations and thanks to the many librarians and consortial staff whose hard work has brought the community so far so quickly.

[Special thanks to my SCS colleague Andy Breeding, who compiled the data and maps.]

DuraSpace News: New Logo for DSpace!

Tue, 2017-06-20 00:00

Introducing the new DSpace logo.

DuraSpace News: Pssst: Save $100-Register for the VIVO Conference by June 30

Tue, 2017-06-20 00:00

From the organizers of the 2017 VIVO Conference

Prices for attending the 2017 VIVO Conference, the VIVO event of the year, are going up! Take advantage of the advance registration price of $275 until June 30. The cost for Conference registration increases to $375 after June 30.

Richard Wallis: A Discovery Opportunity for Archives?

Mon, 2017-06-19 17:16

A theme of my last few years has been enabling the increased discoverability of Cultural Heritage resources by making the metadata about them more open and consumable.

Much of this work has been at the libraries’ end of the sector, but I have always had an eye on the broader Libraries, Archives, and Museums world, not forgetting Galleries of course.

Two years ago at the LODLAM Summit 2015 I ran a session to explore whether it would be possible to duplicate in some way the efforts of the Schema Bib Extend W3C Community Group, which proposed and introduced an extension and enhancements to the Schema.org vocabulary to improve its capability for describing bibliographic resources, but this time for archives: physical, digital, and web.

Interest was sufficient for me to set up and chair a new W3C Community Group, Schema Architypes. The main activity of the group has been the creation of, and discussion around, a straw-man proposal for adding new types to the Schema.org vocabulary.

Not least the discussion has been focused on how the concepts from the world of archives (collections, fonds, etc.) can be represented by taking advantage of the many modelling patterns and terms that are already established in that generic vocabulary, and what few things would need to be added to expose archive metadata to aid discovery.

Coming up next week is LODLAM Summit 2017, where I have proposed a session to further the discussion on the proposal.

So why am I now suggesting that there may be an opportunity for the discovery of archives and their resources?

In web terms, for something to be discoverable it, or a description of it, needs to be visible on a web page somewhere. To take advantage of the current structured web data revolution, driven by the search engines and the knowledge graphs they are building, those pages should contain structured metadata in the form of Schema.org markup.

Through initiatives such as ArchivesSpace and their application, and ArcLight it is clear that many in the world of archives have been focused on web based management, search, and delivery views of archives and the resources and references they hold and describe. As these are maturing it is clear that the need for visibility on the web is starting to be addressed.

So archives are now well placed to seize the opportunity and take advantage of Schema.org to aid discovery of their archives and what they contain. At least with these projects, they have the pages on which to embed that structured web data, once a consensus around the proposals from the Schema Architypes Group has formed.

I call out to those involved with the practical application of systems for the management, searching, and delivery of archives to at least take a look at the work of the group and possibly engage on a practical basis, exploring the potential and challenges for implementing Schema.org.

So if you want to understand more behind this opportunity, and how you might get involved, either join the W3C Group or contact me direct.

 

*Image acknowledgement to The Oberlin College Archives

Evergreen ILS: Evergreen 3.0 development update #10

Mon, 2017-06-19 16:45

Ducks and geese. Photo courtesy Andrea Neiman

Since the previous update, another 8 patches have been committed to the master branch.

As it happened, all of the patches in question are concerned with fixing various bugs with the web staff client. Evergreen 3.0 will round out the functionality available in the web staff client by adding support for serials and offline circulation, but a significant amount of the effort will also include dealing with bugs now that some libraries are starting to use the web staff client in limited production as of version 2.12.

Launchpad is, of course, used to keep track of the bugs, and I would like to highlight some of the tags used:

  • webstaffclient, which is a general catch-all tag for all bugs and feature requests related to the web staff client.
  • webstaffprodcirc, which is for bugs that significantly affect the use of the web staff client’s circulation module.
  • fixedinwebby, which is for bugs that are fixed in the web staff client but which have a XUL client side that will likely not be fixed. As a reminder, the plan is to deprecate the XUL staff client with the release of 3.0 and remove it entirely with the Fall 2018 release.
Duck trivia

Even ducks were not immune to the disco craze of the 70s.

Submissions

Updates on the progress to Evergreen 3.0 will be published every Friday until general release of 3.0.0. If you have material to contribute to the updates, please get them to Galen Charlton by Thursday morning.

DuraSpace News: Atmire acquires Longsight DSpace business

Mon, 2017-06-19 00:00

Longsight, Inc. and Atmire are pleased to announce the transition of Longsight’s DSpace business to Atmire.

DuraSpace News: VIVO Updates for June 18–VIVO 1.10 Beta Available, Conference, JIRA++

Mon, 2017-06-19 00:00

From Mike Conlon, VIVO Project Director
