
Feed aggregator

DuraSpace News: 4Science Update: New Cooperation, Releases, Features, Webinars and Presentations!

planet code4lib - Wed, 2017-06-28 00:00

From Michele Mennielli, International Business Developer, 4Science

Conal Tuohy: Analysis & Policy Online

planet code4lib - Tue, 2017-06-27 23:45

Notes for my Open Repositories 2017 conference presentation. I will edit this post later to flesh it out into a proper blog post.
Follow along at:

  • Early discussion with Amanda Lawrence of APO (which at that time stood for “Australian Policy Online”) about text mining, at the 2015 LODLAM Summit in Sydney.
  • They needed automation to help with the cataloguing work, to improve discovery.
  • They needed to understand their own corpus better.
  • I suggested a particular technical approach based on previous work.
  • In 2016, APO contracted me to advise and help them build a system that would “mine” metadata from their corpus, and use Linked Data to model and explore it.
  • Openness
  • Integrate metadata from multiple text-mining processes, plus manually created metadata
  • Minimal dependency on their current platform (Drupal 7, now Drupal 8)
  • Lightweight; easy to make quick changes
technical approach
  • Use an entirely external metadata store (a SPARQL Graph Store)
  • Use a pipeline! Extract, Transform, Load
  • Use standard protocol to extract data (first OAI-PMH, later sitemaps)
  • In fact, use web services for everything; the pipeline is then just a simple script that passes data between web services
  • Sure, XSLT and SPARQL Query, but what the hell is XProc?!
  • Configured Apache Tika as a web service, using the Stanford Named Entity Recognizer (NER) toolkit
  • Built XProc pipeline to harvest from Drupal’s OAI-PMH module, download digital objects, process them with Stanford NER via Tika, and store the resulting graphs in Fuseki graph store
  • Harvested, and produced a graph of part of the corpus, but …
  • Turned out the Drupal OAI-PMH module was broken! So we used sitemaps instead
  • “Related” list added to APO dev site (NB I’ve seen this isn’t working in all browsers, and obviously needs more work, perhaps using an iframe is not the best idea. Try Chrome if you don’t see the list of related pages on the right)
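The pipeline-of-web-services approach above can be sketched in a few lines. The original system used XProc; this is a rough Python analogue, and the endpoint URLs, the NER metadata key, and the entity URI scheme are all illustrative assumptions, not the project's actual configuration:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoints; substitute your own deployment.
TIKA_RMETA = "http://localhost:9998/rmeta"     # Tika server (NamedEntityParser enabled)
FUSEKI_GSP = "http://localhost:3030/apo/data"  # Fuseki SPARQL Graph Store endpoint

def entities_to_ntriples(doc_uri, entities):
    """Turn NER entity names into N-Triples linking each entity to the source document."""
    triples = []
    for name in entities:
        ent = "http://example.org/entity/" + name.replace(" ", "_")
        triples.append(f"<{doc_uri}> <http://purl.org/dc/terms/subject> <{ent}> .")
        triples.append(f'<{ent}> <http://www.w3.org/2000/01/rdf-schema#label> "{name}" .')
    return "\n".join(triples)

def http(method, url, data=None, headers=None):
    """Minimal HTTP helper so each pipeline step is just a web-service call."""
    req = urllib.request.Request(url, data=data, headers=headers or {}, method=method)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def process_document(doc_url):
    """One pipeline step: fetch a document, extract entities, load the graph."""
    body = http("GET", doc_url)
    meta = json.loads(http("PUT", TIKA_RMETA, data=body,
                           headers={"Accept": "application/json"}))
    # With Tika's NamedEntityParser enabled, recognized names appear in the
    # returned metadata under keys like "NER_PERSON" (exact keys vary by config).
    entities = meta[0].get("NER_PERSON", [])
    graph_url = FUSEKI_GSP + "?graph=" + urllib.parse.quote(doc_url, safe="")
    http("POST", graph_url, data=entities_to_ntriples(doc_url, entities).encode(),
         headers={"Content-Type": "application/n-triples"})
```

Because every step is an HTTP call, swapping OAI-PMH for sitemaps only changes how the list of document URLs is produced; the rest of the pipeline is untouched.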
next steps
  • Visualize the graph
  • Integrate more of the manually created metadata into the RDF graph
  • Add topic modelling (using MALLET) alongside the NER
Let’s see the code

Questions?

(if there’s any time remaining)

David Rosenthal: Wall Street Journal vs. Google

planet code4lib - Tue, 2017-06-27 15:00
After we worked together at Sun Microsystems, Chuck McManis worked at Google then built another search engine (Blekko). His contribution to the discussion on Dave Farber's IP list about the argument between the Wall Street Journal and Google is very informative. Chuck gave me permission to quote liberally from it in the discussion below the fold.

The background to the discussion is that since 2005 Google has provided paywalled news sites with three options:
  1. First click free:
    We've worked with subscription-based news services to arrange that the very first article seen by a Google News user (identifiable by referrer) doesn't require a subscription. Although this first article can be seen without subscribing, any further clicks on the article page will prompt the user to log in or subscribe to the news site. ... A user coming from a host matching [**] or [**] must be able to see a minimum of 3 articles per day. ... Otherwise, your site will be treated as a subscription site.
  2. Snippets:
    we will display the "subscription" label next to the publication name of all sources that greet our users with a subscription or registration form. ... If you prefer this option, please display a snippet of your article that is at least 80 words long and includes either an excerpt or a summary of the specific article. ... we will only crawl and display your content based on the article snippets you provide.
  3. Robots.txt:
    you could put the subscription content under one section of your site, and apply robots.txt to that section. In this case, Googlebot would not crawl or index anything on that section, not even for snippets.
Until recently, the WSJ took the first option. Google News readers could see three articles a day free, and the WSJ ranked high in Google's search results. Then:
The Journal decided to stop letting people read articles free from Google after discovering nearly 1 million people each month were abusing the three-article limit. They would copy and paste Journal headlines into Google and read the articles for free, then clear their cookies to reset the meter and read more.

The result was:
the Wall Street Journal’s subscription business soared, with a fourfold increase in the rate of visitors converting into paying customers. But there was a trade-off: Traffic from Google plummeted 44 percent.

In the great Wall Street tradition of "Greed is Good", the WSJ wasn't satisfied with a "fourfold increase in ... paying customers". They wanted to have their cake and eat it too:
Executives at the Journal, owned by Rupert Murdoch's News Corp., argue that Google's policy is unfairly punishing them for trying to attract more digital subscribers. They want Google to treat their articles equally in search rankings, despite being behind a paywall.

We're the Wall Street Journal, the rules don't apply to us!

There were comments on the IP list to the effect that it would be easy for the WSJ to exempt the googlebot from the paywall. Chuck's post pointed out that things were a lot more complex than they appeared. He explained the context:
First (and perhaps foremost) there is a big question of business models, costs, and value. ... web advertising generates significantly (as in orders of magnitude) less revenue [for newspapers] than print advertising. Subscription models have always had 'leakage' where content was shared when a print copy was handed around (or lended in the case of libraries), content production costs (those costs that don't include printing and distribution of printed copies) have gone up, and information value (as a function of availability) has gone down. ... publications like the Wall Street journal are working hard to maximize the value extracted within the constraints of the web infrastructure.

Second, there is a persistent tension between people who apply classical economics to the system and those who would like to produce a financially viable work product.

And finally, there is a "Fraud Surface Area" component that is enabled by the new infrastructure that is relatively easily exploited without a concomitant level of risk to the perpetrators.

Chuck explains the importance to Google of fraud prevention, and one way they approach the problem:
Google is a target for fraudsters because subverting its algorithm can enable advertising click fraud, remote system compromise, and identity theft. One way that arose early on in Google's history were sites that present something interesting when the Google Crawler came through reading the page, something malicious when an individual came through. The choice of what to show in response to an HTTP protocol request was determined largely from meta-data associated with the connection such as "User Agent", "Source Address", "Protocol options", and "Optional headers." To combat this Google has developed a crawling infrastructure that will crawl a web page and then at a future date audit that page by fetching it from an address with metadata that would suggest a human viewer. When the contents of a page change based on whether or not it looks like a human connection, Google typically would immediately dump the page and penalize the domain in terms of its Page Rank.

But surely the WSJ isn't a bad actor, and isn't it important enough for Google to treat differently from run-of-the-mill sites?
Google is also a company that doesn't generally like to put "exemptions" in for a particular domain. They have had issues in the past where an exemption was added and then the company went out of business and the domain acquired by a bad actor who subsequently exploited the exemption to expose users to malware laced web pages. As a result, (at least as of 2010 when I left) the policy was not to provide exceptions and not to create future problems when the circumstances around a specific exemption might no longer apply. As a result significant co-ordination between the web site and Google is required to support anything out of the ordinary, and that costs resources which Google is not willing to donate to solve the web site's problems.

So this is a business, not a technical issue. There is no free lunch; Google isn't going to do work for the WSJ without getting paid:
both Google and the [WSJ] are cognizant of the sales conversion opportunity associated with a reader *knowing* because of the snippet that some piece of information is present in the document, and then being denied access to that document for free. It connects the dots between "there is something here I want to know" and "you can pay me now and I'll give it to you." As a result, if Google were to continue to rank the WSJ article into the first page of results it would be providing a financial boost to the WSJ and yet not benefiting itself financially at all.

The bottom line is, as it usually is, that there is a value here and the market maker is unwilling to cede all of it to the seller. Google has solved this problem with web shopping sites by telling them they have to pay Google a fee to appear in the first page of results, no doubt if the WSJ was willing to pay Google an ongoing maintenance fee Google would be willing to put the WSJ pages back into the first page of results (even without them being available if you clicked on them).

Chuck explains the three ways you can pay Google:
As has been demonstrated in the many interactions between Google and the newspapers of the world, absent any externally applied regulation, there are three 'values' Google is willing to accept. You can give Google's customers free access to a page found on Google (the one click free rule) which Google values because it keeps Google at the top of everyone's first choice for searching for information. Alternatively you can allow only Google advertising on your pages which Google values because it can extract some revenue from the traffic they send your way. Or you can just pay Google for the opportunity to be in the set of results that the user sees first.

Interestingly, what I didn't see in the discussion on the IP list was the implication of:
discovering nearly 1 million people each month were abusing the three-article limit. ... clear their cookies to reset the meter and read more,

which is that the WSJ’s software is capable of detecting that a reader has cleared cookies after reading their three articles. One way to do so is via browser fingerprinting, but there are many others. If the WSJ can identify readers who cheat in this way, they could be refused content. Google would still see that "First Click Free" readers saw their three articles per day, so it would continue to index and drive traffic to the WSJ. But, clearly, the WSJ would rather whine about Google's unfairness than implement a simple way to prevent cheating.
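One crude way such detection could work is to key the article meter on a fingerprint derived from request attributes rather than on a cookie. This is a toy sketch under stated assumptions (real fingerprinting uses many more signals, such as canvas rendering, installed fonts, and plugin lists, and must handle collisions and churn):

```python
import hashlib

def fingerprint(user_agent, accept_language, screen, timezone_offset):
    """Derive a stable identifier from browser attributes that survive a cookie wipe."""
    raw = "|".join([user_agent, accept_language, screen, str(timezone_offset)])
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

class Meter:
    """Per-fingerprint article meter that clearing cookies does not reset."""
    def __init__(self, limit=3):
        self.limit = limit
        self.counts = {}

    def allow(self, fp):
        """Return True and count the read if this fingerprint is under the limit."""
        n = self.counts.get(fp, 0)
        if n >= self.limit:
            return False
        self.counts[fp] = n + 1
        return True
```

Because the fingerprint is recomputed from the browser itself on every request, a visitor who wipes their cookies still maps to the same meter entry.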

Library of Congress: The Signal: Innovate, Integrate, and Legislate: Announcing an App Challenge

planet code4lib - Tue, 2017-06-27 13:08

This is a guest post from John Pull, Communications Officer of the Office of the Chief Information Officer.

This morning, on Tuesday, June 27, 2017, Library of Congress Chief Information Officer Bernard A. Barton, Jr., is scheduled to make an in-person announcement to the attendees of the 2017 Legislative Data & Transparency Conference in the CVC.  Mr. Barton will deliver a short announcement about the Library’s intention to launch a legislative data App Challenge later this year.  This pre-launch announcement will encourage enthusiasts and professionals to bring their app-building skills to an endeavor that seeks to create enhanced access and interpretation of legislative data.

The themes of the challenge are INNOVATE, INTEGRATE, and LEGISLATE.  Mr. Barton’s remarks are below:

Here in America, innovation is woven into our DNA.  A week from today our nation celebrates its 241st birthday, and those years have been filled with great minds who surveyed the current state of affairs, analyzed the resources available to them, and created devices, systems, and ways of thinking that created a better future worldwide.

The pantheon includes Benjamin Franklin, George Washington Carver, Alexander Graham Bell, Bill Gates, and Steve Jobs.  It includes first-generation Americans like Nikola Tesla and Albert Einstein, for whom the nation was an incubator of innovation.  And it includes brilliant women such as Grace Hopper, who led the team that invented the first computer language compiler, and Shirley Jackson, whose groundbreaking research with subatomic particles enabled the inventions of solar cells, fiber-optics, and the technology that brings us things we use every day: call waiting and caller ID.

For individuals such as these, the drive to innovate takes shape through understanding the available resources, surveying the landscape for what’s currently possible, and taking it to the next level.  It’s the 21st Century, and society benefits every day from new technology, new generations of users, and new interpretations of the data surrounding us.  Social media and mobile technology have rewired the flow of information, and some may say it has even rewired the way our minds work.  So then, what might it look like to rewire the way we interpret legislative data?

It can be said that the legislative process – at a high level – is linear.  What would it look like if these sets of legislative data were pushed beyond a linear model and into dimensions that are as-yet-unexplored?  What new understandings wait to be uncovered by such interpretations?  These understandings could have the power to evolve our democracy.

That’s a pretty grand statement, but it’s not without basis.  The sets of data involved in this challenge are core to a legislative process that is centuries old.  It’s the source code of American government.  An informed citizenry is better able to participate in our democracy, and this is a very real opportunity to contribute to a better understanding of the work being done in Washington.  It may even provide insights for the people doing the work around the clock, both on the Hill, and in state and district offices.  Your innovation and integration may ultimately benefit the way our elected officials legislate for our future.

Improve the future, and be a part of history.

The 2017 Legislative Data App Challenge will launch later this summer.  Over the next several weeks, information will be made available at, and individuals are invited to connect via

DuraSpace News: AVAILABLE: Release Candidate Version of DSpace-CRIS 6

planet code4lib - Tue, 2017-06-27 00:00

From Andrea Bollini, Chief Technology and Innovation Officer, 4Science

DSpace-CRIS is an open-source extension of DSpace that includes functionality useful for CRIS (Current Researcher Information System) users. DSpace-CRIS enables ingestion, storage, display and management of metadata and fulltext of all research-related components that can include publications, projects, grants, patents, organization units and researcher profiles (people).

Access Conference: Closing Keynote – Nora Young

planet code4lib - Mon, 2017-06-26 22:44

The closing keynote and Dave Binkley Memorial Lecture for 2017 will be Nora Young, host of CBC Radio’s Spark  and author of The Virtual Self. Nora will be talking about Information and Meaning in the Data Boom.

Nora Young is an informed and ideal guide for anyone looking to examine—and plan for—the ever-changing high-tech landscape; she helps audiences understand trends in social media, big data, wearable tech and more, while showing them how to better protect their privacy in our increasingly digital world. The host and creator of Spark on CBC Radio, and the author of The Virtual Self, she demystifies technology and explains how it is shaping our lives and the larger world in which we live.

Young was the founding host of CBC Radio’s Definitely Not The Opera, where she often discussed topics related to new media and technology. Her work has appeared online, on television, and in print. Along with Cathi Bond, she has been a hobby podcaster of The Sniffer since 2005. Her favourite technology is her bicycle.

You can follow her on Twitter @nora3000

Open Knowledge Foundation: New Open Knowledge Network chapter in Nepal

planet code4lib - Mon, 2017-06-26 15:41

We are happy to announce that this month a new Chapter at the Open Knowledge Network is being launched officially: welcome Open Knowledge Nepal in this new stage!

Since February 2013, Open Knowledge Nepal has been involved in research, advocacy, training, organizing meetups and hackathons, and developing tools related to Open Data, Open Government Data, Open Source, Open Education, Open Access, Open Development, Open Research and others.

The organization also helps and supports Open Data entrepreneurs and startups to solve different kinds of data related problems they are facing through counseling, training and by developing tools for them.

Nikesh Balami, CEO of Open Knowledge Nepal, tells us: “from random groups of people to build a core team of diverse backgrounds, starting from messy thoughts to realistic plans and long-term goals, we have become more organized and robust. [We] Identified ourselves as a positive influence towards the community and nation. After being incorporated as a Chapter, we now can reach out extensively among interested groups and also expect to create impact in a most structured way in national and international level. Our main goal is to establish ourselves as a well-known open data organization/network in Nepal.”

Pavel Richter, CEO of Open Knowledge International, underscored the importance of chapters: “Most of the work to improve people’s lives is and has to happen in local communities and on a national level. It is therefore hugely important to build a lasting structure for this work, and I am particularly happy to welcome Nepal as a Chapter of the growing Open Knowledge Family.”

Chapters are the Open Knowledge Network’s most developed form, they have legal independence from the organization and are affiliated by a Memorandum of Understanding. For a full list of our current chapters, see here and to learn more about their structure visit the network guidelines.

The Open Knowledge global network now includes groups in over 40 countries. Twelve of these groups have now affiliated as chapters. This network of dedicated civic activists, openness specialists, and data diggers are at the heart of the Open Knowledge International mission, and at the forefront of the movement for Open.

Check out the work OK Nepal does at

DuraSpace News: NYC this Summer: VIVO 2017 Plus Symplectic North American Conferences

planet code4lib - Mon, 2017-06-26 00:00

From the community organizers of the VIVO 2017 Conference.

Attend both the VIVO and Symplectic conferences at Weill Cornell Medicine and hit two fantastic events in the same trip!

LITA: 2017 LITA Forum – Save the Date

planet code4lib - Sun, 2017-06-25 15:50

Save the date and you can …

Register now for the 2017 LITA Forum 
Denver, CO
November 9-12, 2017

Registration is Now Open!

Join us in Denver, Colorado, at the Embassy Suites by Hilton Denver Downtown Convention Center, for the 2017 LITA Forum, a three-day education and networking event featuring 2 preconferences, 2 keynote sessions, more than 50 concurrent sessions and 20 poster presentations. It’s the 20th annual gathering of the highly regarded LITA Forum for technology-minded information professionals. Meet with your colleagues involved in new and leading edge technologies in the library and information technology field. Registration is limited in order to preserve the important networking advantages of a smaller conference. Attendees take advantage of the informal Friday evening reception, networking dinners and other social opportunities to get to know colleagues and speakers.

Keynote Speakers:

The Preconference Workshops:

  • IT Security and Privacy in Libraries: Stay Safe From Ransomware, Hackers & Snoops, with Blake Carver, Lyrasis
  • Improving Technology Services with Design Thinking: A Workshop, with Michelle Frisque, Chicago Public Library

Comments from past attendees:

“Best conference I’ve been to in terms of practical, usable ideas that I can implement at my library.”
“I get so inspired by the presentations and conversations with colleagues who are dealing with the same sorts of issues that I am.”
“It was a great experience. The size was perfect, and it was nice to be able to talk shop with people doing similar work in person over food and drinks.”
“After LITA I return to my institution excited to implement solutions I find here.”
“This is always the most informative conference! It inspires me to develop new programs and plan initiatives.”
“I thought it was great. I’m hoping I can come again in the future.”

Get all the details, register and book a hotel room at the 2017 Forum website.

Questions or Comments?

Contact LITA at (312) 280-4268 or Mark Beatty,

See you in Denver.

Evergreen ILS: Evergreen 3.0 development update #11: meet us in Chicago

planet code4lib - Sat, 2017-06-24 13:53

Mallard duck from the book Birds and nature (1904). Public domain image

Since the previous update, another 23 patches have been committed to the master branch.

This week also marks two maintenance releases, Evergreen 2.11.6 and 2.12.3, and most of the patches pushed were bug fixes for the web staff client.

I’m currently in Chicago for the American Library Association’s Annual Conference, and the Evergreen community is holding an event today! Specifically, on Saturday, 24 June from 4:30 to 5:30 in room W177 of McCormick Place, Ron Gagnon and Elizabeth Thomsen of NOBLE will be moderating a discussion of a recent feature in Evergreen to adjust the sorting of catalog search results based on the popularity of resources. Debbie Luchenbill of MOBIUS will also discuss Evergreen’s group formats and editions feature. Come see us!

Duck trivia

The color and patterns of duck plumage have long been studied as examples of sexual selection as a factor in evolution.


Updates on the progress to Evergreen 3.0 will be published every Friday until general release of 3.0.0. If you have material to contribute to the updates, please get them to Galen Charlton by Thursday morning.

District Dispatch: Libraries across the U.S. are Ready to Code

planet code4lib - Fri, 2017-06-23 19:09

This post was originally published on Google’s blog The Keyword.

“It always amazes me how interested both parents and kids are in coding, and how excited they become when they learn they can create media on their own–all by using code.” – Emily Zorea, Youth Services Librarian, Brewer Public Library

Emily Zorea is not a computer scientist. She’s a Youth Services Librarian at the Brewer Public Library in Richland Center, Wisconsin, but when she noticed that local students were showing an interest in computer science (CS), she started a coding program at the library. Though she didn’t have a CS background, she understood that coding, collaboration and creativity were critical skills for students to approach complex problems and improve the world around them. Because of Emily’s work, the Brewer Public Library is now Ready to Code. At the American Library Association, we want to give librarians like Emily the opportunity to teach these skills, which is why we are thrilled to partner with Google on the next phase of the Libraries Ready to Code initiative — a $500,000 sponsorship from Google to develop a coding toolkit and make critical skills more accessible for students across 120,000 libraries in the U.S.

Libraries will receive funding, consulting expertise, and operational support from Google to pilot a CS education toolkit that equips any librarian with the ability to implement a CS education program for kids. The resources aren’t meant to transform librarians into expert programmers but will support them with the knowledge and skills to do what they do best: empower youth to learn, create, problem solve, and develop the confidence and future skills to succeed in their future careers.

For libraries, by libraries
Librarians and staff know what works best for their communities, so we will rely on them to help us develop the toolkit. This summer a cohort of libraries will receive coding resources, like CS First, a free video-based coding club that doesn’t require CS knowledge, to help them facilitate CS programs. Then we’ll gather feedback from the cohort so that we can build a toolkit that is useful and informative for other libraries who want to be Ready to Code. The cohort will also establish a community of schools and libraries who value coding, and will use their knowledge and expertise to help that community.

Critical thinking skills for the future
Though not every student who studies code will become an engineer, critical thinking skills are essential in all career paths. That is why Libraries Ready to Code also emphasizes computational thinking: a basic set of problem-solving skills, beyond coding itself, that is at the heart of connecting the libraries’ mission of fostering critical thinking with computer science.

“Ready to Code means having the resources available so that if someone is interested in coding or wants to explore it further they are able to. Knowing where to point youth can allow them to begin enjoying and exploring coding on their own.”- Jason Gonzales, technology specialist, Muskogee Public Library

Many of our library educators, like Jason Gonzales, a technology specialist at the Muskogee Public Library, already have exemplary programs that combine computer science and computational thinking. His community is located about 50 miles outside of Tulsa, Oklahoma, so the need for new programming was crucial, given that most youth are not able to travel to the city to pursue their interests. When students expressed an overwhelming interest in video game design, he knew what the focus of a new summer coding camp would be. Long-term, he hopes students will learn more digital literacy skills so they are comfortable interacting with technology and applying it to other challenges now and in the future.

From left to right: Jessie ‘Chuy’ Chavez of Google, Inc. with Marijke Visser and Alan Inouye of ALA’s OITP at the Google Chicago office.

When the American Library Association and Google announced the Libraries Ready to Code initiative last year, it began as an effort to learn about CS activities, like the ones that Emily and Jason led. We then expanded to work with university faculty at Library and Information Science (LIS) schools to integrate CS content into their tech and media courses. Our next challenge is scaling these successes to all our libraries, which is where our partnership with Google, and the development of a toolkit, becomes even more important. Keep an eye out in July for a call for libraries to participate in developing the toolkit. We hope it will empower any library, regardless of geography, expertise, or affluence to provide access to CS education and ultimately, skills that will make students successful in the future.

The post Libraries across the U.S. are Ready to Code appeared first on District Dispatch.

LITA: Apply to be the next ITAL Editor

planet code4lib - Fri, 2017-06-23 17:01

Applications and nominations are invited for the position of editor of Information Technology And Libraries (ITAL), the flagship publication of the Library Information Technology Association (LITA).

LITA seeks an innovative, experienced editor to lead its top-tier, open access journal with an eye to the future of library technology and scholarly publishing. The editor is appointed for a three-year term, which may be renewed for an additional three years. Duties include:

  • Chairing the ITAL Editorial Board
  • Managing the review and publication process:
    • Soliciting submissions and serving as the primary point of contact for authors
    • Assigning manuscripts for review, managing review process, accepting papers for publication
    • Compiling accepted and invited articles into quarterly issues
  • Liaising with service providers including the journal publishing platform and indexing services
  • Marketing and promoting the journal
  • Participating as a member of and reporting to the LITA Publications Committee

Some funding for editorial assistance plus a $1,500/year stipend are provided.

Please express your interest or nominate another person for the position using this online form:

Applications and nominations that are received by July 21 will receive first consideration. Applicants and nominees will be contacted by the search committee and an appointment will be made by the LITA Board of Directors upon the recommendation of the search committee and the LITA Publications Committee. Applicants must be a member of ALA and LITA at the time of appointment.

Contact with any questions.

Information Technology and Libraries (ISSN 2163-5226) publishes material related to all aspects of information technology in all types of libraries. Topic areas include, but are not limited to, library automation, digital libraries, metadata, identity management, distributed systems and networks, computer security, intellectual property rights, technical standards, geographic information systems, desktop applications, information discovery tools, web-scale library services, cloud computing, digital preservation, data curation, virtualization, search-engine optimization, emerging technologies, social networking, open data, the semantic web, mobile services and applications, usability, universal access to technology, library consortia, vendor relations, and digital humanities.

HangingTogether: Seeking a Few Brass Tacks: Measuring the Value of Resource Sharing

planet code4lib - Fri, 2017-06-23 04:19

At the two most recent American Library Association conferences, I’ve met with a small ad hoc group of librarians to discuss how we might measure and demonstrate the value that sharing our collections delivers to various stakeholders: researchers, library administrators, parent organizations, and service/content providers.

First we described our current collection sharing environment and how it is changing (Orlando, June 2016).

Then we walked through various ways in which, using data, we might effectively measure and demonstrate the value of interlending – and how some in our community are already doing it (Atlanta, January 2017).

Our next logical step will be to settle on some concrete actions we can take – each by ourselves, or working among the group, or collaborating with others outside the group – to begin to measure and demonstrate that value in ways that tell a meaningful and compelling story.

As the group prepares to meet for a third time – at ALA Annual in Chicago this weekend – I thought it might be useful to share our sense of what some of these actions might eventually look like, and what group members have been saying about the possibilities during our conversations.

“Value of Resource Sharing” discussion topics: Round III

We demonstrate value best by documenting the value we deliver to our patrons.

o “One could fruitfully explore how what patrons value (speed, convenience, efficiency, ease) determines whether resource sharing is ultimately perceived as valuable.”
o “Rather than focusing on systems and exploring the life cycle of the request, we should look at that of the learner.”
o “We need to support our value not just with numbers, which are important, but with human examples of how we make a difference with researchers.”
o “We are now sharing this [citation study] work with our faculty and learning a lot, such as their choice not to use the best, but most accessible material.”
o “Did they value what we provided, and, if so, why?”
o “We know that resource sharing supports research, course completion, and publishing, but it is usually a one-way street: we provide information on demand but don’t see the final result, the contribution of that material to the final product.”
o “We need to collect and tell the stories of how the material we obtain for our users transforms their studies or allows them to succeed as researchers.”
o “I think we need to explore how we can make the process smoother for both the patrons and library staff. We talk about the cost of resource sharing a lot but we haven’t really talked about how it could be easier or how policies get in the way or how our processes are so costly because they make so much busy work.”

Common methods of measuring and demonstrating value include determining how much it costs a library to provide a service, or how much a library service would cost if the patron had to purchase it from a commercial provider.

o Q: “How much did you spend on textbooks?” A: “None! ILL!”
o “Why not measure that expense [of providing access to academic databases to students]?”
o “Build an equation to calculate the costs of various forms of access: shelve/retrieve on campus, shelve/retrieve remotely, etc.”
o “Paul Courant did a study of what it cost to keep a book on the shelf on campus as opposed to in offsite storage….Are the numbers in the Courant study still right?”

Collections have long been a way for libraries to demonstrate value – by counting them and publicizing their size. Numbers of volumes to which you have access via consortia is becoming a more useful metric. Collections can have different values for an organization, depending upon context: where they are housed, how quickly they can be provided to users, and who wants access to them.

o “How can access to legacy print in high density storage be monetized? Perhaps a change in mindset is in order – to lower costs for institutions committed to perpetual preservation and access, and raise costs for institutions that do not.”
o “What would be the cost to retain a last copy in a secure climate controlled environment? Would we then be counting on ARLs to do the work of preserving our cultural heritage? We already know there are unique material not held by ARLs, so how do the pieces fit together? How do we incorporate public libraries which also have many unique materials in their collections? How do we equitably share the resources and costs?”
o “We rely on redundancy…65% of…requests are for things…already owned.”

We can demonstrate value by providing new services to patrons that make their experience more like AmaZoogle.

o “How do we create delivery predictability models like everyone in e-commerce already offers? Are we just afraid to predict because we don’t want to be wrong? Or do we really not know enough to offer delivery information to users?”
o “I’m interested in focusing on the learning moments available throughout the resource sharing workflows and integrating stronger information literacy into the users’ experience…’We’ve begun processing your request for a dissertation…Did you know your library provides access to these peer reviewed journal articles that you might find helpful?’ or ‘You can expect this article to hit your inbox within 24 hours – are you ready to evaluate and cite it? You might find these research guides helpful…'”

What ideas do you have for measuring the value of sharing collections? We’d love to hear from you about this. Please leave us a comment below.

I’ll report out about takeaways from the group’s third meeting soon after ALA.

Ed Summers: Implications/Questions

planet code4lib - Fri, 2017-06-23 04:00

… we are concerned with the argument, implicit if not explicit in many discussions about the pitfalls of interdisciplinary investigation, that one primary measure of the strength of social or cultural investigation is the breadth of implications for design that result (Dourish, 2006). While we have both been involved in ethnographic work carried out for this explicit purpose, and continue to do so, we nonetheless feel that this is far from the only, or even the most significant, way for technological and social research practice to be combined. Just as from our perspective technological artifacts are not purely considered as “things you might want to use,” from their investigation we can learn more than simply “what kinds of things people want to use.” Instead, perhaps, we look to some of the questions that have preoccupied us throughout the book: Who do people want to be? What do they think they are doing? How do they think of themselves and others? Why do they do what they do? What does technology do for them? Why, when, and how are those things important? And what roles do and might technologies play in the social and cultural worlds in which they are embedded?

These investigations do not primarily supply ubicomp practitioners with system requirements, design guidelines, or road maps for future development. What they might provide instead are insights into the design process itself; a broader view of what digital technologies might do; an appreciation for the relevance of social, cultural, economic, historical, and political contexts as well as institutions for the fabric of everyday technological reality; a new set of conceptual resources to bring to bear within the design process; and a new set of questions to ask when thinking about technology and practice.

Dourish & Bell (2011), p. 191-192

I’m very grateful to Jess Ogden for pointing me at this book by Dourish and Bell when I was recently bemoaning the fact that I struggled to find any concrete implications for design in Summers & Punzalan (2017).


Dourish, P. (2006). Implications for design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 541–550). ACM.

Dourish, P., & Bell, G. (2011). Divining a digital future: Mess and mythology in ubiquitous computing. MIT Press.

Summers, E., & Punzalan, R. (2017). Bots, seeds and people: Web archives as infrastructure. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (pp. 821–834). New York, NY, USA: ACM.

Dan Cohen: Irrationality and Human-Computer Interaction

planet code4lib - Thu, 2017-06-22 20:07

When the New York Times let it be known that their election-night meter—that dial displaying the real-time odds of a Democratic or Republican win—would return for Georgia’s 6th congressional district runoff after its notorious November 2016 debut, you could almost hear a million stiff drinks being poured. Enabled by the live streaming of precinct-by-precinct election data, the dial twitches left and right, pauses, and then spasms into another movement. It’s a jittery addition to our news landscape and the source of countless nightmares, at least for Democrats.

We want to look away, and yet we stare at the meter for hours, hoping, praying. So much so that, perhaps late at night, we might even believe that our intensity and our increasingly firm grip on our iPhones might affect the outcome, ever so slightly.

Which is silly, right?

*          *          *

Thirty years ago I opened a bluish-gray metal door and entered a strange laboratory that no longer exists. Inside was a tattered fabric couch, which faced what can only be described as the biggest pachinko machine you’ve ever seen, as large as a giant flat-screen TV. Behind a transparent Plexiglas front was an array of wooden pegs. At the top were hundreds of black rubber balls, held back by a central gate. At the bottom were vertical slots.

A young guy—like me, a college student—sat on the couch in a sweatshirt and jeans. He was staring intently at the machine. So intently that I just froze, not wanting to get in the way of his staring contest with the giant pinball machine.

He leaned in. Then the balls started releasing from the top at a measured pace and they chaotically bounced around and down the wall, hitting peg after peg until they dropped into one of the columns at the bottom. A few minutes later, those hundreds of rubber balls had formed a perfectly symmetrical bell curve in the columns.

The guy punched the couch and looked dispirited.

I unfroze and asked him the only phrase I could summon: “Uh, what’s going on?”

“I was trying to get the balls to shift to the left.”

“With what?”

With my mind.”

*          *          *

This was my first encounter with the Princeton Engineering Anomalies Research program, or PEAR. PEAR’s stated mission was to pursue an “experimental agenda of studying the interaction of human consciousness with sensitive physical devices, systems, and processes,” but that prosaic academic verbiage cloaked a far cooler description: PEAR was on the hunt for the Force.

This was clearly bananas, and also totally enthralling for a nerdy kid who grew up on Star Wars. I needed to know more. Fortunately that opportunity presented itself through a new course at the university: “Human-Computer Interaction.” I’m not sure I fully understood what it was about before I signed up for it.

The course was team-taught by prominent faculty in computer science, psychology, and engineering. One of the professors was George Miller, a founder of cognitive psychology, famous for observing that human short-term memory can hold only about seven items, plus or minus two. And it included engineering professor Robert Jahn, who had founded PEAR and had rather different notions of our mental capacity.

*          *          *

One of the perks of being a student in Human-Computer Interaction was that you were not only welcome to stop by the PEAR lab, but you could also engage in the experiments yourself. You would just sign up for a slot and head to the basement of the engineering quad, where you would eventually find the bluish-gray door.

By the late 1980s, PEAR had naturally started to focus on whether our minds could alter the behavior of a specific, increasingly ubiquitous machine in our lives: the computer. Jahn and PEAR’s co-founder, Brenda Dunne, set up several rooms with computers and shoebox-sized machines with computer chips in them that generated random numbers on old-school red LED screens. Out of the box snaked a cord with a button at the end.

You would book your room, take a seat, turn on the random-number generator, and flip on the PC sitting next to it. Once the PC booted up, you would type in a code—as part of the study, no proper names were used—to log each experiment. Then the shoebox would start showing numbers ranging from 0.00 to 2.00 so quickly that the red LED became a blur. You would click on the button to stop the digits, and then that number was recorded by the computer.

The goal was to try to stop the rapidly rotating numbers on a number over 1.00, to push the average up as far as possible. Over dozens of turns the computer’s monitor showed how far that average diverged from 1.00.

That’s a clinical description of the experiment. In practice, it was a half-hour of tense facial expressions and sweating, a strange feeling of brow-beating a shoebox with an LED, and some cursing when you got several sub-1.00 numbers in a row. It was human-computer interaction at its most emotional.

Jahn and Dunne kept the master log of the codes and the graphs. There were rumors that some of the codes—some of the people those codes represented—had discernable, repeatable effects on the random numbers. Over many experiments, they were able to make the average rise, ever so slightly but enough to be statistically significant.

In other words, there were Jedis in our midst.

Unfortunately, over several experiments—and a sore thumb from clicking on the button with increasing pressure and frustration—I had no luck affecting the random numbers. I stared at the graph without blinking, hoping to shift the trend line upwards with each additional stop. But I ended up right in the middle, as if I had flipped a coin a thousand times and gotten 500 heads and 500 tails. Average.

*          *          *

Jahn and Dunne unsurprisingly faced sustained criticism and even some heckling, on campus and beyond. When PEAR closed in 2007, all the post-mortems dutifully mentioned the editor of a journal who said he could accept a paper from the lab “if you can telepathically communicate it to me.” It’s a good line, and it’s tempting to make even more fun of PEAR these many years later.

The same year that PEAR closed its doors, the iPhone was released, and with it a new way of holding and touching and communicating with a computer. We now stare intently at these devices for hours a day, and much of that interaction is—let’s admit it—not entirely rational.

We see those three gray dots in a speech bubble and deeply yearn for a good response. We open the stocks app and, in the few seconds it takes to update, pray for green rather than red numbers. We go to the New York Times on election eve and see that meter showing live results, and more than anything we want to shift it to the left with our minds.

When asked by what mechanism the mind might be able to affect a computer, Jahn and Dunne hypothesized that perhaps there was something like an invisible Venn diagram, whereby the ghost in the machine and the ghost in ourselves overlapped ever so slightly. A permeability between silicon and carbon. An occult interface through which we could ever so slightly change the processes of the machine itself and what it displays to us seconds later.

A silly hypothesis, perhaps. But we often act like it is all too real.

Harvard Library Innovation Lab: at IIPC

planet code4lib - Thu, 2017-06-22 19:54

At IIPC last week, Jack Cushman (LIL developer) and Ilya Kreymer (former LIL summer fellow) shared their work on security considerations for web archives, including a sandbox for developers interested in exploring web archive security.

Slides: repo:

David Rosenthal of Stanford also has a great write-up on the presentation:

LITA: Megan Ozeran wins 2017 LITA / Ex Libris Student Writing Award

planet code4lib - Thu, 2017-06-22 17:30

Megan Ozeran has been selected as the winner of the 2017 Student Writing Award sponsored by Ex Libris Group and the Library and Information Technology Association (LITA) for her paper titled “Managing Metadata for Philatelic Materials.” Ozeran is an MLIS candidate at the San Jose State University School of Information.

“Megan Ozeran’s paper was selected as the winner because it takes a scholarly look at an information technology topic that is new and unresolved. Ms. Ozeran’s discussion offers a thorough examination of the current state of cataloging stamps and issues related to their discoverability,” said Rebecca Rose, the Chair of this year’s selection committee.

The LITA/Ex Libris Student Writing Award recognizes outstanding writing on a topic in the area of libraries and information technology by a student or students enrolled in an ALA-accredited library and information studies graduate program. The winning manuscript will be published in Information Technology and Libraries (ITAL), LITA’s open access, peer reviewed journal, and the winner will receive $1,000 and a certificate of merit.

The Award will be presented during the LITA Awards Ceremony & President’s Program at the ALA Annual Conference in Chicago (IL), on Sunday, June 25, 2017.

The members of the 2017 LITA/Ex Libris Student Writing Award Committee are: Rebecca Rose (Chair), Ping Fu, and Mary Vasudeva.

Thank you to Ex Libris for sponsoring this award.

David Rosenthal: WAC2017: Security Issues for Web Archives

planet code4lib - Thu, 2017-06-22 15:00
Jack Cushman and Ilya Kreymer's Web Archiving Conference talk Thinking like a hacker: Security Considerations for High-Fidelity Web Archives is very important. They discuss 7 different security threats specific to Web archives:
  1. Archiving local server files
  2. Hacking the headless browser
  3. Stealing user secrets during capture
  4. Cross site scripting to steal archive logins
  5. Live web leakage on playback
  6. Show different page contents when archived
  7. Banner spoofing
Below the fold, a brief summary of each to encourage you to do two things:
  1. First, view the slides.
  2. Second, visit the sandbox, a local version of Webrecorder that has not been patched to fix known exploits, and work through its challenges to learn how these attacks might apply to web archives in general.
Archiving local server files

A page being archived might have links that, when interpreted in the context of the crawler, point to local resources that should not end up in public Web archives. Examples include:
  • http://localhost:8080/
  • file:///etc/passwd
It is necessary to implement restrictions in the crawler to prevent it collecting from local addresses or from protocols other than http(s). It is also a good idea to run the crawler in an isolated container or VM to maintain control over the set of resources local to the crawler.
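Such a restriction can be sketched as a pre-crawl URL filter. This is a minimal illustration, not Webrecorder's or any crawler's actual code; the function name and rule set are assumptions:

```python
import ipaddress
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}
BLOCKED_HOSTS = {"localhost"}

def is_safe_capture_url(url):
    """Return True only if the URL is plausibly safe for a public crawl."""
    parts = urlparse(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        return False  # rejects file://, ftp://, data:, etc.
    host = parts.hostname or ""
    if host in BLOCKED_HOSTS:
        return False
    try:
        addr = ipaddress.ip_address(host)
        # reject loopback, RFC 1918 private, and link-local IP literals
        if addr.is_loopback or addr.is_private or addr.is_link_local:
            return False
    except ValueError:
        # host is a name, not an IP literal; DNS rebinding can still
        # resolve it to a private address, so re-check after lookup
        pass
    return True
```

Note the caveat in the code: a hostname can pass this check and still resolve to a private address (DNS rebinding), so the resolved IP should be validated again at connect time, which is another reason to keep the crawler in an isolated container or VM.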
Hacking the headless browser

Nowadays collecting many Web sites requires executing the content in a headless browser such as PhantomJS. They all have vulnerabilities, only some of which are known at any given time. The same is true of the virtualization infrastructure. Isolating the crawler in a VM or a container does add another layer of complexity for the attacker, who now needs exploits not just for the headless browser but also for the virtualization infrastructure. But it also means both layers need to be kept up to date. This isn't a panacea, just risk reduction.
Stealing user secrets during capture

User-driven Web recorders place user data at risk, because they typically hand URLs to be captured to the recording process as suffixes to a URL for the Web recorder, thus vitiating the normal cross-domain protections. Everything (login pages, third-party ads, etc.) is regarded as part of the Web recorder domain.

Mitigating this risk is complex, potentially including rewriting cookies, intercepting Javascript's access to cookies, and manipulating sessions.
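A tiny illustration of why the protections collapse (the domains here are made up): the browser scopes cookies and same-origin checks by hostname, and once a target URL is appended to the recorder's URL, every captured page appears to live on the recorder's own domain:

```python
from urllib.parse import urlparse

# a capture URL built by appending the target to the recorder's URL
recorded = "https://recorder.example/record/https://bank.example/login"
live = "https://bank.example/login"

# live on the web, the bank is its own origin...
print(urlparse(live).hostname)      # bank.example
# ...but during capture it collapses into the recorder's origin,
# alongside every ad network and tracker on the same page
print(urlparse(recorded).hostname)  # recorder.example
```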
Cross site scripting to steal archive logins

Similarly, the URLs used to replay content must be carefully chosen to avoid the risk of cross-site scripting attacks on the archive. When replaying preserved content, the archive must serve all preserved content from a different top-level domain from that used by users to log in to the archive and for the archive to serve the parts of a replay page (e.g. the Wayback Machine's timeline) that are not preserved content. The preserved content should be isolated in an iframe. For example:
  • Archive domain:
  • Content domain:
Live web leakage on playback

Especially with Javascript in archived pages, it is hard to make sure that all resources in a replayed page come from the archive, not from the live Web. If live Web Javascript is executed, all sorts of bad things can happen. Malicious Javascript could exfiltrate information from the archive, track users, or modify the content displayed.

Injecting the Content-Security-Policy (CSP) header into replayed content can mitigate these risks by preventing compliant browsers from loading resources except from the specified domain(s), which would be the archive's replay domain(s).
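A minimal sketch of that mitigation (the function and domain names are assumptions, not any particular archive's code): before serving a rewritten page, the replayer adds a policy limiting every resource load to the archive's content domain:

```python
def add_replay_csp(headers, content_domain):
    """Return response headers with a CSP restricting all resource
    loads to the archive's replay/content domain (a sketch only)."""
    policy = f"default-src 'self' https://{content_domain}"
    # drop any CSP the archived page itself carried, then add ours
    out = [(k, v) for k, v in headers
           if k.lower() != "content-security-policy"]
    out.append(("Content-Security-Policy", policy))
    return out

headers = add_replay_csp([("Content-Type", "text/html")],
                         "content.archive.example")
```

A compliant browser receiving this header will refuse to fetch scripts, images, or frames from the live web, so leakage degrades into blocked requests rather than silent contamination.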
Show different page contents when archived

I wrote previously about the fact that these days the content of almost all web pages depends not just on the browser, but also on the user, the time, the state of the advertising network, and other things. Thus it is possible for an attacker to create pages that detect when they are being archived, so that the archive's content will be unrepresentative and possibly hostile. Alternatively, the page can detect that it is being replayed, and display different content or attack the replayer.

This is another reason why both the crawler and the replayer should be run in isolated containers or VMs. The bigger question of how crawlers can be configured to obtain representative content from personalized, geolocated, advert-supported web-sites is unresolved, but out of scope for Cushman and Kreymer's talk.
Banner spoofing

When replayed, malicious pages can overwrite the archive's banner, misleading the reader about the provenance of the page.

LibreCat/Catmandu blog: Introducing FileStores

planet code4lib - Thu, 2017-06-22 12:35

Catmandu is always our tool of choice when working with structured data. Using the Elasticsearch or MongoDB Catmandu::Store-s it is quite trivial to store and retrieve metadata records. Storing and retrieving YAML, JSON (and by extension XML, MARC, CSV, …) files can be as easy as the commands below:

$ catmandu import YAML to database < input.yml
$ catmandu import JSON to database < input.json
$ catmandu import MARC to database <
$ catmandu export database to YAML > output.yml

A catmandu.yml  configuration file is required with the connection parameters to the database:

$ cat catmandu.yml
---
store:
  database:
    package: ElasticSearch
    options:
      client: '1_0::Direct'
      index_name: catmandu
...

Given these tools to import and export and even transform structured data, can this be extended to unstructured data? In institutional repositories like LibreCat we would like to manage metadata records and binary content (for example PDF files related to the metadata).  Catmandu 1.06 introduces the Catmandu::FileStore as an extension to the already existing Catmandu::Store to manage binary content.

A Catmandu::FileStore is a Catmandu::Store where each Catmandu::Bag acts as a “container” or a “folder” that can contain zero or more records describing File content. The file records themselves contain pointers to a backend storage implementation capable of serialising and streaming binary files. Out of the box, one Catmandu::FileStore implementation is available: Catmandu::Store::File::Simple, or File::Simple for short, which stores files in a directory.

Some examples. To add a file to a FileStore, the stream command needs to be executed:

$ catmandu stream /tmp/myfile.pdf to File::Simple --root /data --bag 1234 --id myfile.pdf

In the command above, /tmp/myfile.pdf is the file to be uploaded to the File::Store. File::Simple is the name of the File::Store implementation, which requires one mandatory parameter, --root /data, the root directory where all files are stored. The --bag 1234 is the “container” or “folder” which holds the uploaded files (with the numeric identifier 1234). And the --id myfile.pdf is the identifier for the newly created file record.

To download the file from the File::Store, the stream command needs to be executed in opposite order:

$ catmandu stream File::Simple --root /data --bag 1234 --id myfile.pdf to /tmp/file.pdf


or, equivalently, using shell redirection:

$ catmandu stream File::Simple --root /data --bag 1234 --id myfile.pdf > /tmp/file.pdf

On the file system the files are stored in some deep nested structure to be able to spread out the File::Store over many disks:

/data
`--/000
   `--/001
      `--/234
         `--/myfile.pdf
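The mapping from container id to path can be sketched as follows. This is a guess at the scheme implied by the example above; the padding width and chunk size are assumptions, not File::Simple's documented defaults. The zero-padded numeric id is split into fixed-size chunks, each becoming one directory level:

```python
import os

def nested_path(root, container_id, filename, width=9, chunk=3):
    """Map a numeric container id to a deep directory path,
    e.g. 1234 -> root/000/001/234/filename."""
    padded = str(container_id).zfill(width)
    parts = [padded[i:i + chunk] for i in range(0, width, chunk)]
    return os.path.join(root, *parts, filename)

print(nested_path("/data", 1234, "myfile.pdf"))
# /data/000/001/234/myfile.pdf
```

Keeping each directory level small (at most 1000 entries with 3-digit chunks) is what lets the store spread over many disks or mount points without any single huge directory.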

A listing of all “containers” can be retrieved by requesting an export of the default (index) bag of the File::Store:

$ catmandu export File::Simple --root /data to YAML
_id: 1234

A listing of all files in the container “1234” can be done by adding the bag name to the export command:

$ catmandu export File::Simple --root /data --bag 1234 to YAML
_id: myfile.pdf
_stream: !!perl/code '{ "DUMMY" }'
content_type: application/pdf
created: 1498125394
md5: ''
modified: 1498125394
size: 883202

Each File::Store implementation supports at least the fields presented above:

  • _id: the name of the file
  • _stream: a callback function to retrieve the content of the file (requires an IO::Handle as input)
  • content_type: the MIME-Type of the file
  • created: a timestamp when the file was created
  • modified: a timestamp when the file was last modified
  • size: the byte length of the file
  • md5: optionally, an MD5 checksum

We envision in Catmandu that many implementations of FileStores can be created to be able to store files in GitHub, BagIts, Fedora Commons and more backends.

Using the Catmandu::Plugin::SideCar, Catmandu::FileStore-s and Catmandu::Store-s can be combined as one endpoint. Using Catmandu::Store::Multi and Catmandu::Store::File::Multi, many different implementations of Stores and FileStores can be combined.

This is a short introduction, but I hope you will experiment a bit with the new functionality and provide feedback to our project.

