planet code4lib


OCLC Dev Network: Learning Linked Data: SPARQL

Thu, 2014-11-20 16:15

One thing you realize pretty quickly is that it is very hard to work with Linked Data and confine your explorations to a single site or data set. The links inevitably lead you on a pilgrimage from one data set to another and another. In the case of the WorldCat Discovery API, my pilgrimage led me from WorldCat to FAST and VIAF, and from VIAF on to dbpedia. Dbpedia is an amazingly fun data set to play with. Using it to provide additional richness and context to the discovery experience has been enlightening.

HangingTogether: Libraries & Research: Changes in libraries

Thu, 2014-11-20 14:16

[This is the fourth in a short series on our 2014 OCLC Research Library Partnership meeting, Libraries and Research: Supporting Change/Changing Support. You can read the first, second, and third posts and also refer to the event webpage that contains links to slides, videos, photos, and a Storify summary.]

And now, onward to the final session of the meeting, which focused appropriately enough on changes in libraries, which include new roles and preparing to support future service demands. Libraries are engaging in new alliances and are restructuring themselves to prepare for change in accordance with their strategic plans.

[Paul-Jervis Heath, Lynn Silipigni Connaway, and Jim Michalko]

Lynn Silipigni Connaway (Senior Research Scientist, OCLC Research) [link to video] shared the results of several studies that identify the importance of user-centered assessment and evaluation. Lynn has been working actively in this area since 2003, looking at not only researchers but also future researchers (students!). In interviews on virtual reference, focusing on prospective users, Lynn and her team found that students use Google and Wikipedia but also rely on human resources: other students, advisers, graduate students and faculty. Years of data show that interviewees tend to use generic terms like “database” and refer to specific tools and sources only when they are further along in their career. This doesn’t mean they don’t use them; rather, they get used to using more sophisticated terminology as they go along. No surprise, convenience trumps everything; researchers at all levels are eager to optimize their time, so many “satisfice” if the assignment or task doesn’t warrant extra time spent. From my perspective, one of the most interesting findings from Lynn’s studies relates to students’ somewhat furtive use of Wikipedia, which she calls the Learning Black Market (students look up something in Google, find sources in Wikipedia, copy and paste the citation into their paper!). Others use Facebook to get help. There are some interesting demographic differences: more established researchers use Twitter, and use of Wikipedia declines as researchers get more experience. With regard to the library, engagement around new issues (like data management) causes researchers to think anew about ways the library might be useful. Although researchers of all stripes will reach out to humans for help, librarians rank low on that list. Given all of these challenges, there are opportunities for librarians and library services: be engaging and be where researchers are, both physically and virtually.
We should always assess what we are doing: keep doing what’s working, cut or reinvent what is not. Lynn’s presentation provides plenty of links and references for you to check out.

Paul-Jervis Heath (Head of Innovation & Chief Designer, University of Cambridge) [link to video] spoke from the perspective of a designer, not a librarian (he has worked on smart homes, for example). He shared findings from recent work with the Cambridge University libraries. Because of disruption, libraries face a perfect storm of change in teaching, funding, and scholarly communications. User expectations are formed by consumer technology. While we look for teachable moments, Google and tech companies do not; they try to create intuitive experiences. Despite all the changes, libraries don’t need to sit on the sidelines; they can be engaged players. Design research is important and is distinguished from market research in that it doesn’t measure how people think but how they act. From observation studies, we can see that students want to study together in groups, even if they are doing their own thing. The library needs to be optimized for that. Another technique employed was asking students to use diaries to document their days. Many students prefer the convenience of studying in their room, but what propels them to the library is the desire to be with others in order to focus. At Cambridge, students have a unique geographic triangle defined by where they live, the department where they go to class, and the market they prefer to shop in. Perceptions about how far something (like the library) is outside of the triangle are relative. Depending on how far apart your triangle points are, life can be easy or hard. Students are not necessarily up on technology, so don’t make assumptions. It turns out that books (the regular, paper kind) are great for studying! But students use ebooks to augment their paper texts, or will use them when all paper books are gone. Shadowing (with permission) is another technique, which allows you to immerse yourself in a researcher’s life and understand their mental models.
Academics wear a lot of different hats, play different roles within the university and are too pressed for time to learn new systems. It’s up to the library to create efficiencies and make life easier for researchers. Paul closed by emphasizing six strategic themes: transition from physical to digital; library spaces; sustainable classic library services; supporting research and scholarly communications; making special collections more available; and creating touchpoints that will bring people back to the library seamlessly.

Jim Michalko (Vice President, OCLC Research Library Partnership) [link to video] talked about his recent work looking at library organizational structures and restructuring. (Jim will be blogging about this work soon, so I won’t give more than a few highlights.) For years, libraries have been making choices about what to do and how to do it, and libraries have been reorganizing themselves to get this (new) work done. Jim gathered feedback from 65 institutions in the OCLC Research Library Partnership and conducted interviews with a subset of those, in order to find out if structure indeed follows strategy. Do new structures represent markets or adjacent strategies (in business speak)? We see libraries developing capacities in customer relationship management and we see this reflected in user-focused activities. Almost all institutions interviewed were undertaking restructuring based on changes external to the library, such as new constituencies and expectations. Organizations are orienting themselves to be more user centered, and to align themselves with a new direction taken by the university. We see many libraries bringing in skill sets beyond those normally found in the library package. Many institutions charged a senior position with helping to run a portion of a regional or national service. Other similarities: all had a lot of communication about restructuring. Almost all also related to a space plan.

This session was followed by a discussion session and I invite you to watch it, and also to watch this lovely summary of our meeting delivered by colleague Titia van der Werf (less than 7 minutes long and worth watching!):

If you attended the meeting or were part of the remote viewing audience for all or part of it, or if you watched any of the videos, I hope you will leave some comments with your reactions. Thanks for reading!

About Merrilee Proffitt


Library of Congress: The Signal: All the News That’s Fit to Archive

Thu, 2014-11-20 14:03

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.

The Library has had a web archiving program since the early 2000s.  As with other national libraries, the Library of Congress web archiving program started out harvesting the web sites of its national election campaigns, followed by some collections to harvest sites for a period of time connected with events (for example, an Iraq War web archive and a papal transition 2005 web archive), along with collecting the sites of the U.S. House and Senate and the legislative branch of government more broadly.

An American of the 1930s getting his news by reading a newspaper. These days he’d likely be looking at a computer screen. Photo courtesy of the Library of Congress Prints and Photographs division.

The question for the Library of Congress of “what else” to harvest beyond these collections is harder to answer than one might think because of the relatively small web archiving capacity of the Library of Congress (which is influenced by our permissions approach) compared to the vast immenseness of the Internet.  About six years ago we started a collection now known as the Public Policy Topics, for which we would acquire sites with content reflecting different viewpoints and research on a broad selection of public policy questions, including the sites of national political parties, selected advocacy organizations and think tanks and other organizations with a national voice in America’s policy discussions that could be of interest to future researchers.  We are adding more sites to Public Policy Topics continuously.

Eventually I decided to include some news web sites that contained significant discussion of policy issues from particular points of view – sites ranging from to, from to  We started crawling these sites on a weekly basis to try to assure complete capture over time and to build a representation of how the site looked as different news events came and went in the public consciousness (and on these web sites).  We have been able to assess the small number of such sites that we have crawled and have decided that the results are acceptable.  But this was obviously not a very large-scale effort compared to the increasing number of sites presenting general news on the Internet -for many people, their current equivalent of a newspaper.

Newspapers – they are a critical source for historical research and the Library of Congress has a long history of collecting and providing access to U.S. (and other countries’) newspapers.  Having started to collect a small number of “newspaper-like” U.S. news sites for the Public Policy Topics collection, I began a conversation with three reference librarian colleagues from the Newspaper & Current Periodical Reading Room – Amber Paranick, Roslyn Pachoca and Gary Johnson – about expanding this effort to a new collection, a “General News on the Internet” web archive.  They explained to me:

Our newspaper collections are invaluable to researchers.  Newspapers provide a first-hand draft of history.  They provide supplemental information that cannot be found anywhere else.  They ‘fill in the gaps,’ so to speak. The way people access news has been changing and evolving ever since newspapers were first being published. We recognized the need to capture news published in another format.  It is reasonable to expect us to continue to connect these kinds of resources to our current and future patrons. Websites tend to be ephemeral and may disappear completely.  Without a designated archive, critical news content may be lost.

In short, my colleagues shared my interest, concern and enthusiasm for starting a larger collection of Internet-only general news sites as a web archiving collection.  I’ll let them explain their thinking further:

When we first got started on the project, we weren’t sure how to proceed.  Once we established clear boundaries on what to include, what types of news sites would be within scope for this collection, our selection process became easier. We asked for help in finding websites from our colleagues. 

We felt it was important to include sites that focus on general news with significant national presence where there are articles that have an author’s voice, such as with or (even as some of these sites also contain articles that are meant to attract visitors, so-called “click bait”).  We wanted to include a variety of sites that represent more cutting edge ways of presenting general news, such as and TheVerge, and we felt sites that focus on parody such as were also important to have represented.  Of course, these sites are not the only sources from which people obtain their news, but we tried to choose a variety that included more trendy or popular sources as well as the conventional or traditional types.  Again, the idea is to assure future users have access to a significant representation of how Americans accessed news at this time using the Internet.

The Library of Congress has an internal process for proposing new web archiving collections.  I worked with Amber, Roslyn and Gary and they submitted a “General News on the Internet” project proposal and it was approved.  Yay!  Then the work began – Amber, Roslyn and Gary describe some of the hurdles:

We understand that archiving video content is a problem. We thought websites like could be great candidates but in effect, because they contained so much video and a kind of Tumblr-like portal entry point for news, we had to reject them.  Since we do not do “one hop out” crawling, the linked-to content that is the substantive content (i.e., the news) would be entirely missed.   Also, websites like change their content so frequently, it might be impossible to capture all of its content.

In addition, it was decided that sites chosen would not include general news sites associated primarily with other delivery vehicles, such as or  Many of these types also have paywalls and therefore obviously would create limitations when trying to archive.

We also encountered another type of challenge with  Since it is primarily a news-aggregator with most of the site consisting of links to news on other sites it would be tough to include the many links with the limitations in crawling (again, the “one hop” limitation – we don’t harvest links that are on a different URL).  In the end we decided to proceed in archiving The Drudge Report site since it is well known for the content that is original to that site.

The harvesting for this collection has now been underway for several months; we are examining the results.  We look forward to making an archived version of today’s news as brought to you by the Internet available to Library of Congress patrons for many tomorrows.

What news sites do you think we should collect?

David Rosenthal: Talk "Costs: Why Do We Care?"

Thu, 2014-11-20 09:22
Investing in Opportunity: Policy Practice and Planning for a Sustainable Digital Future sponsored by the 4C project and the Digital Preservation Coalition featured a keynote talk each day. The first, by Fran Berman, is here.

Mine was the second, entitled Costs: Why Do We Care? It was an update and revision of The Half-Empty Archive, stressing the importance of collecting, curating and analyzing cost data. Below the fold, an edited text with links to the sources.

Introduction

I'm David Rosenthal from the LOCKSS (Lots Of Copies Keep Stuff Safe) Program at the Stanford University Libraries, and I have two reasons for being especially happy to be here today. First, I'm a Londoner. Second, under the auspices of JISC the UK has been a very active participant in the LOCKSS program since 2006. As with all my talks, you don't need to take notes or ask for the slides. The text of the talk, with links to the sources, will go up on my blog shortly.

Why do I think I am qualified to stand here and pontificate about preservation costs? The LOCKSS Program develops, and supports users of, the LOCKSS digital preservation technology. This is a peer-to-peer system designed to let libraries collect and preserve copyright content published on the Web, such as e-journals and e-books. LOCKSS users participate in a number of networks customized for these and other forms of content including government documents, social science datasets, library special collections, and so on. One of these networks, the CLOCKSS archive, a community-managed dark archive of e-journals and e-books, was recently certified to the Trusted Repository Audit Criteria, equalling the previous highest score and gaining the first-ever perfect score for technology. The LOCKSS software is free open source; the LOCKSS team charges for support and services. On that basis, with no grant funding, for more than 7 years we have covered our costs and accumulated some reserves.

Because understanding and controlling our costs is very important for us, and because the LOCKSS system's Lots Of Copies trades using more disk space for using less of other resources (especially lawyers), I have been researching the costs of storage for some years.

Like all of you, the LOCKSS team has to plan and justify our budget each year. It is clear that economic failure is one of the most significant threats to the content we preserve, as it is even for the content national libraries preserve. For each of us individually the answer to "Costs: Why Do We Care?" is obvious. But I want to talk about why the work we are discussing over these two days, of collecting, curating, normalizing, analyzing and disseminating cost information about digital curation and preservation, is important not just at an individual level but for the big picture of preservation. What follows is in three sections:
  • The current situation.
  • Cost trends.
  • What can be done?

The Current Situation

How well are we doing at the task of preservation? Attempts have been made to measure the probability that content is preserved in some areas; e-journals, e-theses and the surface Web:
  • In 2010 the ARL reported that the median research library received about 80K serials. Stanford's numbers support this. The Keepers Registry, across its 8 reporting repositories, reports just over 21K "preserved" and about 10.5K "in progress". Thus under 40% of the median research library's serials are at any stage of preservation.
  • Luis Faria and co-authors (PDF) compare information extracted from journal publishers' web sites with the Keepers Registry and conclude: "We manually repeated this experiment with the more complete Keepers Registry and found that more than 50% of all journal titles and 50% of all attributions were not in the registry and should be added."
  • The Prelida project Hiberlink project studied the links in 46,000 US theses and determined that about 50% of the linked-to content was preserved in at least one Web archive.
  • Scott Ainsworth and his co-authors tried to estimate the probability that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?" They generated lists of "random" URLs using several different techniques including sending random words to search engines and random strings to the URL shortening service. They then:
    • tried to access the URL from the live Web.
    • used Memento to ask the major Web archives whether they had at least one copy of that URL.
    Their results are somewhat difficult to interpret, but for their two more random samples they report: URIs from search engine sampling have about 2/3 chance of being archived [at least once] and URIs just under 1/3.
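
The first estimate in the list above is simple arithmetic on the quoted figures, and can be checked in a few lines of Python. This is only a sketch: the names are illustrative, and the inputs are the rounded values quoted above (about 80K serials, just over 21K preserved, about 10.5K in progress).

```python
# Rounded figures quoted in the text; the names are illustrative only.
MEDIAN_LIBRARY_SERIALS = 80_000
KEEPERS_PRESERVED = 21_000
KEEPERS_IN_PROGRESS = 10_500


def preservation_coverage(preserved, in_progress, total):
    """Fraction of titles that are at any stage of preservation."""
    return (preserved + in_progress) / total


coverage = preservation_coverage(
    KEEPERS_PRESERVED, KEEPERS_IN_PROGRESS, MEDIAN_LIBRARY_SERIALS
)
print(f"{coverage:.1%} of the median library's serials")
```

With these rounded inputs the result lands just under 40%, consistent with the claim in the first bullet.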
So, are we preserving half the stuff that should be preserved? Unfortunately, there are a number of reasons why this simplistic assessment is wildly optimistic.

An Optimistic Assessment

First, the assessment isn't risk-adjusted:
  • As regards the scholarly literature, librarians, who are concerned with post-cancellation access rather than with preserving the record of scholarship, have directed resources to subscription rather than open-access content, and within the subscription category, to the output of large rather than small publishers. Thus they have driven resources towards the content at low risk of loss, and away from content at high risk of loss. Preserving Elsevier's content makes it look like a huge part of the record is safe because Elsevier publishes a huge part of the record. But Elsevier's content is not at any conceivable risk of loss, and is at very low risk of cancellation, so what have those resources achieved for future readers?
  • As regards Web content, the more links to a page, the more likely the crawlers are to find it, and thus, other things such as robots.txt being equal, the more likely it is to be preserved. But equally, the less at risk of loss.
Second, the assessment isn't adjusted for difficulty:
  • A similar problem of risk-aversion is manifest in the idea that different formats are given different "levels of preservation". Resources are devoted to the formats that are easy to migrate. But precisely because they are easy to migrate, they are at low risk of obsolescence.
  • The same effect occurs in the negotiations needed to obtain permission to preserve copyright content. Negotiating once with a large publisher gains a large amount of low-risk content, where negotiating once with a small publisher gains a small amount of high-risk content.
  • Similarly, the web content that is preserved is the content that is easier to find and collect. Smaller, less linked web-sites are probably less likely to survive.
Harvesting the low-hanging fruit directs resources away from the content at risk of loss.

Third, the assessment is backward-looking:
  • As regards scholarly communication it looks only at the traditional forms, books and papers. It ignores not merely published data, but also all the more modern forms of communication scholars use, including workflows, source code repositories, and social media. These are mostly both at much higher risk of loss than the traditional forms that are being preserved, because they lack well-established and robust business models, and much more difficult to preserve, since the legal framework is unclear and the content is either much larger, or much more dynamic, or in some cases both.
  • As regards the Web, it looks only at the traditional, document-centric surface Web rather than including the newer, dynamic forms of Web content and the deep Web.
Fourth, the assessment is likely to suffer measurement bias:
  • The measurements of the scholarly literature are based on bibliographic metadata, which is notoriously noisy. In particular, the metadata was apparently not de-duplicated, so there will be some amount of double-counting in the results.
  • As regards Web content, Ainsworth et al describe various forms of bias in their paper.
As Cliff Lynch pointed out in his summing-up of the 2014 IDCC conference, the scholarly literature and the surface Web are genres of content for which the denominator of the fraction being preserved (the total amount of genre content) is fairly well known, even if it is difficult to measure the numerator (the amount being preserved). For many other important genres, even the denominator is becoming hard to estimate as the Web enables a variety of distribution channels:
  • Books used to be published through well-defined channels that assigned ISBNs, but now e-books can appear anywhere on the Web.
  • YouTube and other sites now contain vast amounts of video, some of which represents what in earlier times would have been movies.
  • Much music now happens on YouTube (e.g. Pomplamoose).
  • Scientific data is exploding in both size and diversity, and despite efforts to mandate its deposit in managed repositories much still resides on grad students' laptops.
Of course, "what we should be preserving" is a judgement call, but clearly even purists who wish to preserve only stuff to which future scholars will undoubtedly require access would be hard pressed to claim that half that stuff is preserved.

Preserving the Rest

Overall, it's clear that we are preserving much less than half of the stuff that we should be preserving. What can we do to preserve the rest of it?
  • We can do nothing, in which case we needn't worry about bit rot, format obsolescence, and all the other risks any more because they only lose a few percent. The reason why more than 50% of the stuff won't make it to future readers would be that we can't afford to preserve it.
  • We can more than double the budget for digital preservation. This is so not going to happen; we will be lucky to sustain the current funding levels.
  • We can more than halve the cost per unit content. Doing so requires a radical re-think of our preservation processes and technology.
Such a radical re-think requires understanding where the costs go in our current preservation methodology, and how they can be funded. As an engineer, I'm used to using rules of thumb. The one I use to summarize most of the research into past costs is that ingest takes half the lifetime cost, preservation takes one third, and access takes one sixth.

On this basis, one would think that the most important thing to do would be to reduce the cost of ingest. It is important, but not as important as you might think. The reason is that ingest is a one-time, up-front cost. As such, it is relatively easy to fund. In principle, research grants, author page charges, submission fees and other techniques can transfer the cost of ingest to the originator of the content, and thereby motivate them to explore the many ways that ingest costs can be reduced. But preservation and dissemination costs continue for the life of the data, for "ever". Funding a stream of unpredictable payments stretching into the indefinite future is hard. Reductions in preservation and dissemination costs will have a much bigger effect on sustainability than equivalent reductions in ingest costs.

Cost Trends

We've been able to ignore this problem for a long time, for two reasons. From at least 1980 to 2010 storage costs followed Kryder's Law, the disk analog of Moore's Law, dropping 30-40%/yr. This meant that, if you could afford to store the data for a few years, the cost of storing it for the rest of time could be ignored, because of course Kryder's Law would continue forever. The second is that as the data got older, access to it was expected to become less frequent. Thus the cost of access in the long term could be ignored.

Can we continue to ignore these problems?

Preservation

Kryder's Law held for three decades, an astonishing feat for exponential growth. Something that goes on that long gets built into people's model of the world, but as Randall Munroe points out, in the real world exponential curves cannot continue for ever. They are always the first part of an S-curve.

This graph, from Preeti Gupta of UC Santa Cruz, plots the cost per GB of disk drives against time. In 2010 Kryder's Law abruptly stopped. In 2011 the floods in Thailand destroyed 40% of the world's capacity to build disks, and prices doubled. Earlier this year they finally got back to 2010 levels. Industry projections are for no more than 10-20% per year going forward (the red lines on the graph). This means that disk is now about 7 times as expensive as was expected in 2010 (the green line), and that in 2020 it will be between 100 and 300 times as expensive as 2010 projections.

These are big numbers, but do they matter? After all, preservation is only about one-third of the total, and only about one-third of that is media costs.
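
To make the "one-third of one-third" point concrete, here is a small sketch using the rule-of-thumb split quoted earlier (ingest one half, preservation one third, access one sixth). The function `total_cost_multiplier` is a hypothetical what-if that scales only the media slice and holds everything else fixed; that is a simplifying assumption of this sketch, not a model from the talk.

```python
from fractions import Fraction

# Rule-of-thumb split of lifetime cost quoted in the talk.
INGEST = Fraction(1, 2)
PRESERVATION = Fraction(1, 3)
ACCESS = Fraction(1, 6)
# Share of preservation cost that is storage media, per the text.
MEDIA_WITHIN_PRESERVATION = Fraction(1, 3)

# The three parts should account for the whole lifetime cost.
assert INGEST + PRESERVATION + ACCESS == 1

# Media as a share of total lifetime cost.
media_share = PRESERVATION * MEDIA_WITHIN_PRESERVATION


def total_cost_multiplier(media_multiplier):
    """Hypothetical: scale only the media slice, keep all other costs fixed."""
    return (1 - media_share) + media_share * media_multiplier


print(media_share)               # 1/9
print(total_cost_multiplier(7))  # 5/3
```

On this static view, media is about a ninth of lifetime cost, so even a sevenfold miss on media prices raises the total by roughly two thirds rather than sevenfold. The static split hides the fact that storage costs recur every year, which is why the long-term view matters.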

Our models of the economics of long-term storage compute the endowment, the amount of money that, deposited with the data and invested at interest, would fund its preservation "for ever". This graph, from my initial rather crude prototype model, is based on hardware cost data from Backblaze and running cost data from the San Diego Supercomputer Center (much higher than Backblaze's) and Google. It plots the endowment needed for three copies of a 117TB dataset to have a 95% probability of not running out of money in 100 years, against the Kryder rate (the annual percentage drop in $/GB). The different curves represent policies of keeping the drives for 1,2,3,4,5 years. Up to 2010, we were in the flat part of the graph, where the endowment is low and doesn't depend much on the exact Kryder rate. This is the environment in which everyone believed that long-term storage was effectively free. But suppose the Kryder rate were to drop below about 20%/yr. We would be in the steep part of the graph, where the endowment needed is both much higher and also strongly dependent on the exact Kryder rate.
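
The shape of that curve can be illustrated with a much simpler calculation than the model described above. The sketch below is deterministic: it assumes a fixed interest rate (2%/yr here, chosen arbitrarily), a constant Kryder rate, and a single annual cost that falls at that rate. It ignores drive replacement policy, running costs and the 95% probability target, so it shows only the qualitative flat-then-steep behaviour, not the published numbers. The function name and parameters are illustrative.

```python
def endowment(first_year_cost, kryder_rate, interest_rate=0.02, years=100):
    """Up-front sum needed to pay each year's storage bill for `years` years.

    Simplifying assumptions (not the talk's model): the annual cost falls by
    `kryder_rate` every year, and unspent money earns `interest_rate` a year.
    """
    total = 0.0
    cost = first_year_cost
    discount = 1.0  # present value of one unit of money spent in that year
    for _ in range(years):
        total += cost * discount
        cost *= 1.0 - kryder_rate
        discount /= 1.0 + interest_rate
    return total


for rate in (0.40, 0.30, 0.20, 0.10, 0.05):
    multiple = endowment(1.0, rate)
    print(f"Kryder rate {rate:.0%}: endowment is {multiple:.1f}x the first-year cost")
```

Even in this toy version, moving between 40% and 30% barely changes the endowment, while moving between 20% and 10% changes it a great deal.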

We don't need to suppose. Preeti's graph and industry projections show that now and for the foreseeable future we are in the steep part of the graph. What happened to slow Kryder's Law? There are a lot of factors; we outlined many of them in a paper for UNESCO's Memory of the World conference (PDF). Briefly, both the disk and tape markets have consolidated to a couple of vendors, turning what used to be a low-margin, competitive market into one with much better margins. Each successive technology generation requires a much bigger investment in manufacturing, so requires bigger margins, so drives consolidation. And the technology needs to stay in the market longer to earn back the investment, reducing the rate of technological progress.

Thanks to aggressive marketing, it is commonly believed that "the cloud" solves this problem. Unfortunately, cloud storage is actually made of the same kind of disks as local storage, and is subject to the same slowing of the rate at which it was getting cheaper. In fact, when all costs are taken into account, cloud storage is not cheaper for long-term preservation than doing it yourself once you get to a reasonable scale. Cloud storage really is cheaper if your demand is spiky, but digital preservation is the canonical base-load application.

You may think that the cloud is a competitive market; in fact it is dominated by Amazon.
Jillian Mirandi, senior analyst at Technology Business Research Group (TBRI), estimated that AWS will generate about $4.7 billion in revenue this year, while comparable estimated IaaS revenue for Microsoft and Google will be $156 million and $66 million, respectively.

When Google recently started to get serious about competing, they pointed out that while Amazon's margins may have been minimal at introduction, by then they were extortionate:

cloud prices across the industry were falling by about 6 per cent each year, whereas hardware costs were falling by 20 per cent. And Google didn't think that was fair. ... "The price curve of virtual hardware should follow the price curve of real hardware."

Notice that the major price drop triggered by Google was a one-time event; it was a signal to Amazon that they couldn't have the market to themselves, and to smaller players that they would no longer be able to compete.

In fact commercial cloud storage is a trap. It is free to put data into a cloud service such as Amazon's S3, but it costs to get it out. For example, getting your data out of Amazon's Glacier without paying an arm and a leg takes 2 years. If you commit to the cloud as long-term storage, you have two choices. Either keep a copy of everything outside the cloud (in other words, don't commit to the cloud), or stay with your original choice of provider no matter how much they raise the rent.

Unrealistic expectations that we can collect and store the vastly increased amounts of data projected by consultants such as IDC within current budgets place currently preserved content at great risk of economic failure. Here are three numbers that illustrate the looming crisis in long-term storage, its cost:
Here's a graph that projects these three numbers out for the next 10 years. The red line is Kryder's Law, at IHS iSuppli's 20%/yr. The blue line is the IT budget, at's 2%/yr. The green line is the annual cost of storing the data accumulated since year 0 at the 60% growth rate projected by IDC, all relative to the value in the first year. 10 years from now, storing all the accumulated data would cost over 20 times as much as it does this year. If storage is 5% of your IT budget this year, in 10 years it will be more than 100% of your budget. If you're in the digital preservation business, storage is already way more than 5% of your IT budget.
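
The projection can be reproduced approximately from the three rates alone. The sketch below rests on one interpretive assumption: that the 60% growth applies to the amount of data added each year, and that everything added so far must be kept. That reading is an assumption of this sketch, chosen because it is consistent with the "over 20 times" and "more than 100%" figures quoted above. Names are illustrative.

```python
KRYDER_RATE = 0.20    # annual drop in $/GB (quoted in the text)
BUDGET_GROWTH = 0.02  # annual growth of the IT budget (quoted in the text)
DATA_GROWTH = 0.60    # annual growth in data (quoted in the text)


def project(years, storage_share_now=0.05):
    """Return (relative storage cost, storage share of budget) after `years`.

    Assumption of this sketch: each year's newly added data is DATA_GROWTH
    larger than the previous year's, and all data added so far is retained.
    """
    added = 1.0   # data added in year 0
    stored = 1.0  # data held in year 0
    for _ in range(years):
        added *= 1.0 + DATA_GROWTH
        stored += added
    price = (1.0 - KRYDER_RATE) ** years     # $/GB relative to year 0
    relative_cost = stored * price           # year-0 storage cost is 1.0
    budget = (1.0 + BUDGET_GROWTH) ** years  # budget relative to year 0
    share = storage_share_now * relative_cost / budget
    return relative_cost, share


cost, share = project(10)
print(f"storage cost x{cost:.1f}, share of IT budget {share:.0%}")
```

Under these assumptions, storage after 10 years costs a little over 30 times the year-0 figure, and a 5% storage share grows to more than the whole budget.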

Dissemination

The storage part of preservation isn't the only ongoing cost that will be much higher than people expect; access will be too. In 2010 the Blue Ribbon Task Force on Sustainable Digital Preservation and Access pointed out that the only real justification for preservation is to provide access. With research data this can be a real difficulty; the value of the data may not be evident for a long time. Shang dynasty astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this they derived a value for the viscosity of the Earth's mantle as it rebounds from the weight of the glaciers.

In most cases so far the cost of an access to an individual item has been small enough that archives have not charged the reader. Research into past access patterns to archived data showed that access was rare, sparse, and mostly for integrity checking.

But the advent of "Big Data" techniques means that, going forward, scholars increasingly want not to access a few individual items in a collection, but to ask questions of the collection as a whole. For example, the Library of Congress announced that it was collecting the entire Twitter feed, and almost immediately had 400-odd requests for access to the collection. The scholars weren't interested in a few individual tweets, but in mining information from the entire history of tweets. Unfortunately, the most the Library could afford to do with the feed is to write two copies to tape. There's no way they could afford the compute infrastructure to data-mine from it. We can get some idea of how expensive this is by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 2.5 times as expensive; until recently it was 5.5 times.

Ingest

Almost everyone agrees that ingest is the big cost element. Where does the money go? The two main cost drivers appear to be the real world, and metadata.

In the real world it is natural that the cost per unit content increases through time, for two reasons. The content that's easy to ingest gets ingested first, so over time the difficulty of ingestion increases. And digital technology evolves rapidly, mostly by adding complexity. For example, the early Web was a collection of linked static documents. Its language was HTML. It was reasonably easy to collect and preserve. The language of today's Web is Javascript, and much of the content you see is dynamic. This is much harder to ingest. In order to find the links much of the collected content now needs to be executed as well as simply being parsed. This is already significantly increasing the cost of Web harvesting, both because executing the content is computationally much more expensive, and because elaborate defenses are required to protect the crawler against the possibility that the content might be malign.

It is worth noting, however, that the very first US web site in 1991 featured dynamic content, a front-end to a database!

The days when a single generic crawler could collect pretty much everything of interest are gone; future harvesting will require more and more custom tailored crawling such as we need to collect subscription e-journals and e-books for the LOCKSS Program. This per-site custom work is expensive in staff time. The cost of ingest seems doomed to increase.

Worse, the W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it.

Metadata in the real world is widely known to be of poor quality, both format and bibliographic kinds. Efforts to improve the quality are expensive, because they are mostly manual and, inevitably, reducing entropy after it has been generated is a lot more expensive than not generating it in the first place.

What can be done?

We are preserving less than half of the content that needs preservation. The cost per unit content of each stage of our current processes is predicted to rise. Our budgets are not predicted to rise enough to cover the increased cost, let alone to more than double in order to preserve the remaining half or more. We need to change our processes to greatly reduce the cost per unit content.

Preservation

It is often assumed that, because it is possible to store and copy data perfectly, only perfect data preservation is acceptable. There are two problems with this expectation.

To illustrate the first problem, let's examine the technical problem of storing data in its most abstract form. Since 2007 I've been using the example of "A Petabyte for a Century". Think about a black box into which you put a Petabyte, and out of which a century later you take a Petabyte. Inside the box there can be as much redundancy as you want, on whatever media you choose, managed by whatever anti-entropy protocols you want. You want to have a 50% chance that every bit in the Petabyte is the same when it comes out as when it went in.

Now consider every bit in that Petabyte as being like a radioactive atom, subject to a random process that flips it with a very low probability per unit time. You have just specified a half-life for the bits. That half-life is about 60 million times the age of the universe. Think for a moment how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.
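
The half-life figure can be checked with a short calculation. This is a sketch that assumes each bit decays independently and exponentially, that a Petabyte is 10**15 bytes, and that the universe is roughly 13.8 billion years old; the variable names are illustrative only.

```python
import math

BITS = 8 * 10**15             # one Petabyte, assumed to be 10**15 bytes
YEARS = 100                   # the century in "A Petabyte for a Century"
UNIVERSE_AGE_YEARS = 1.38e10  # assumed approximate age of the universe

# If each bit flips at a constant rate, the chance that none of the bits
# has flipped after YEARS is exp(-rate * BITS * YEARS). Setting that to 0.5:
rate = math.log(2) / (BITS * YEARS)   # flips per bit per year

# The half-life of a single bit under the same decay rate.
half_life_years = math.log(2) / rate  # algebraically equal to BITS * YEARS

ratio = half_life_years / UNIVERSE_AGE_YEARS
print("bit half-life: %.1e years" % half_life_years)
print("multiples of the age of the universe: %.1e" % ratio)
```

The result is a half-life of about 8 x 10^17 years, a little under 60 million times the assumed age of the universe.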

Here's some back-of-the-envelope hand-waving. Amazon's S3 is a state-of-the-art storage system. Its design goal is an annual probability of loss of a data object of 10⁻¹¹. If the average object is 10K bytes, the bit half-life is about a million years, way too short to meet the requirement but still really hard to measure.

Note that the 10⁻¹¹ is a design goal, not the measured performance of the system. There's a lot of research into the actual performance of storage systems at scale, and it all shows them under-performing expectations based on the specifications of the media. Why is this? Real storage systems are large, complex systems subject to correlated failures that are very hard to model.

Worse, the threats against which they have to defend their contents are diverse and almost impossible to model. Nine years ago we documented the threat model we use for the LOCKSS system. We observed that most discussion of digital preservation focused on these threats:
  • Media failure
  • Hardware failure
  • Software failure
  • Network failure
  • Obsolescence
  • Natural Disaster
but that the experience of operators of large data storage facilities was that the significant causes of data loss were quite different:
  • Operator error
  • External Attack
  • Insider Attack
  • Economic Failure
  • Organizational Failure
To illustrate the second problem, consider that building systems to defend against all these threats combined is expensive, and can't ever be perfectly effective. So we have to resign ourselves to the fact that stuff will get lost. This has always been true; it should not be a surprise. And it is subject to the law of diminishing returns. Coming back to the economics, how much should we spend reducing the probability of loss?

Consider two storage systems with the same budget over a decade, one with a loss rate of zero, the other half as expensive per byte but which loses 1% of its bytes each year. Clearly, you would say the cheaper system has an unacceptable loss rate.

However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has preserved more than 5 times as much at the same cost.
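
The decade figure can be reproduced with a small simulation. This is a sketch under one assumed reading of the comparison: both systems get the same yearly budget, the cheaper one therefore adds two units of content per year against the other's one, and the 1% loss is applied to everything held at the end of each year. The function name and units are hypothetical, and the sketch is checked only against the ten-year figure.

```python
def preserved(years, units_per_year, annual_loss):
    """Content still held after `years`, adding then losing a fraction each year."""
    content = 0.0
    for _ in range(years):
        content += units_per_year      # this year's ingest
        content *= 1.0 - annual_loss   # this year's losses on everything held
    return content


perfect = preserved(10, units_per_year=1.0, annual_loss=0.0)
cheap = preserved(10, units_per_year=2.0, annual_loss=0.01)
ratio = cheap / perfect

print("perfect system holds %.2f units" % perfect)
print("cheaper system holds %.2f units" % cheap)
print("ratio: %.2f" % ratio)
```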

Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users is currently the 150th most visited site, whereas is the 1519th. For UK users is currently the 131st most visited site, whereas is the 2744th.

Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive, it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more really is better.

Unrealistic expectations for how well data can be preserved make the best be the enemy of the good. We spend money reducing even further the small probability of even the smallest loss of data that could instead preserve vast amounts of additional data, albeit with a slightly higher risk of loss.

Within the next decade all current popular storage media, disk, tape and flash, will be up against very hard technological barriers. A disruption of the storage market is inevitable. We should work to ensure that the needs of long-term data storage will influence the result. We should pay particular attention to the work underway at Facebook and elsewhere that uses techniques such as erasure coding, geographic diversity, and custom hardware based on mostly spun-down disks and DVDs to achieve major cost savings for cold data at scale. 

Every few months there is another press release announcing that some new, quasi-immortal medium such as fused silica glass or stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.

There is one long-term storage medium that might eventually make sense. DNA is very dense, very stable in a shirtsleeve environment, and best of all it is very easy to make Lots Of Copies to Keep Stuff Safe. DNA sequencing and synthesis are improving at far faster rates than magnetic or solid state storage. Right now the costs are far too high, but if the improvement continues DNA might eventually solve the archive problem. But access will always be slow enough that the data would have to be really cold before being committed to DNA.

The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. You can't:

  • Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
  • Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly. As we have seen, current media are many orders of magnitude too unreliable for the task ahead.
Even if you could ignore failures, it wouldn't make economic sense. As Brian Wilson, CTO of BackBlaze points out, in their long-term storage environment:
Double the reliability is only worth 1/10th of 1 percent cost increase. ... Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
Moral of the story: design for failure and buy the cheapest components you can. :-)

Dissemination

The real problem here is that scholars are used to having free access to library collections and research data, but what scholars now want to do with archived data is so expensive that they must be charged for access. This in itself has costs, since access must be controlled and accounting undertaken. Further, data-mining infrastructure at the archive must have enough performance for the peak demand but will likely be lightly used most of the time, increasing the cost for individual scholars. A charging mechanism is needed to pay for the infrastructure. Fortunately, because the scholar's access is spiky, the cloud provides both suitable infrastructure and a charging mechanism.

For smaller collections, Amazon provides Free Public Datasets: Amazon stores a copy of the data at no charge, charging scholars accessing the data for the computation rather than charging the owner of the data for storage.

Even for large and non-public collections it may be possible to use Amazon. Suppose that in addition to keeping the two archive copies of the Twitter feed on tape, the Library of Congress kept one copy in S3's Reduced Redundancy Storage simply to enable researchers to access it. For this year, it would have averaged about $4100/mo, or about $50K. Scholars wanting to access the collection would have to pay for their own computing resources at Amazon, and the per-request charges; because the data transfers would be internal to Amazon there would not be bandwidth charges. The storage charges could be borne by the library or charged back to the researchers. If they were charged back, the 400 initial requests would each need to pay about $125 for a year's access to the collection, not an unreasonable charge. If this idea turned out to be a failure it could be terminated with no further cost, the collection would still be safe on tape. In the short term, using cloud storage for an access copy of large, popular collections may be a cost-effective approach. Because the Library's preservation copy isn't in the cloud, they aren't locked-in.
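
The charge-back arithmetic is simple enough to check directly. This sketch uses only the two figures quoted above (the average monthly storage charge and the number of initial requests); the variable names are illustrative, and compute and per-request charges are left out because the text gives no figures for them.

```python
MONTHLY_STORAGE_USD = 4100  # average monthly storage charge quoted in the text
INITIAL_REQUESTS = 400      # number of initial access requests quoted in the text

annual_storage_usd = MONTHLY_STORAGE_USD * 12
per_requester_usd = annual_storage_usd / INITIAL_REQUESTS

print("annual storage cost: $%d" % annual_storage_usd)
print("share per requester: $%.2f" % per_requester_usd)
```

The annual cost comes to $49,200 and the share per requester to $123, consistent with the rounded figures of about $50K and about $125.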

In the near term, separating the access and preservation copies in this way is a promising way not so much to reduce the cost of access, but to fund it more realistically by transferring it from the archive to the user. In the longer term, architectural changes to preservation systems that closely integrate limited amounts of computation into the storage fabric have the potential for significant cost reductions to both preservation and dissemination. There are encouraging early signs that the storage industry is moving in that direction.

Ingest

There are two parts to the ingest process, the content and the metadata.

The evolution of the Web that poses problems for preservation also poses problems for search engines such as Google. Where they used to parse the HTML of a page into its Document Object Model (DOM) in order to find the links to follow and the text to index, they now have to construct the CSS object model (CSSOM), including executing the Javascript, and combine the DOM and CSSOM into the render tree to find the words in context. Preservation crawlers such as Heritrix used to construct the DOM to find the links, and then preserve the HTML. Now they also have to construct the CSSOM and execute the Javascript. It might be worth investigating whether preserving a representation of the render tree rather than the HTML, CSS, Javascript, and all the other components of the page as separate files would reduce costs.
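
To make the difficulty concrete, here is a minimal sketch of link extraction by parsing alone, using Python's standard html.parser module. It is not how Heritrix or any search engine is implemented; the class name, function name, and sample page are hypothetical. The page contains one static link and one link that exists only after its script runs.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags found by parsing static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the links visible to a parser that does not execute scripts."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: the second link is created by script at render time.
PAGE = """
<html><body>
<a href="/static.html">static</a>
<script>
  var a = document.createElement("a");
  a.href = "/dynamic.html";
  document.body.appendChild(a);
</script>
</body></html>
"""

links = extract_links(PAGE)
print(links)
```

Parsing finds only the static link. Discovering the second one requires executing the script, which is where the extra computation and the need to sandbox untrusted content come from.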

It is becoming clear that there is much important content that is too big, too dynamic, too proprietary or too DRM-ed for ingestion into an archive to be either feasible or affordable. In these cases where we simply can't ingest it, preserving it in place may be the best we can do; creating a legal framework in which the owner of the dataset commits, for some consideration such as a tax advantage, to preserve their data and allow scholars some suitable access. Of course, since the data will be under a single institution's control it will be a lot more vulnerable than we would like, but this type of arrangement is better than nothing, and not ingesting the content is certainly a lot cheaper than the alternative.

Metadata is commonly regarded as essential for preservation. For example, there are 52 criteria for ISO 16363 Section 4. Of these, 29 (56%) are metadata-related. Creating and validating metadata is expensive:
  • Manually creating metadata is impractical at scale.
  • Extracting metadata from the content scales better, but it is still expensive since:
  • In both cases, extracted metadata is sufficiently noisy to impair its usefulness.
We need less metadata so we can have more data. Two questions need to be asked:
  • When is the metadata required? The discussions in the Preservation at Scale workshop contrasted the pipelines of Portico and the CLOCKSS Archive, which ingest much of the same content. The Portico pipeline is far more expensive because it extracts, generates and validates metadata during the ingest process. CLOCKSS, because it has no need to make content instantly available, implements all its metadata operations as background tasks, to be performed as resources are available.
  • How important is the metadata to the task of preservation? Generating metadata because it is possible, or because it looks good in voluminous reports, is all too common. Format metadata is often considered essential to preservation, but if format obsolescence isn't happening, or if it turns out that emulation rather than format migration is the preferred solution, it is a waste of resources. If the reason to validate the formats of incoming content using error-prone tools is to reject allegedly non-conforming content, it is counter-productive. The majority of content in formats such as HTML and PDF fails validation but renders legibly.
The LOCKSS and CLOCKSS systems take a very parsimonious approach to format metadata. Nevertheless, the requirements of ISO 16363 forced us to expend resources implementing and using FITS, whose output does not in fact contribute to our preservation strategy, and whose binaries are so large that we have to maintain two separate versions of the LOCKSS daemon, one with FITS for internal use and one without for actual preservation. Further, the demands we face for bibliographic metadata mean that metadata extraction is a major part of ingest costs for both systems. These demands come from requirements for:
  • Access via bibliographic (as opposed to full-text) search, for example OpenURL resolution.
  • Meta-preservation services such as the Keepers Registry.
  • Competitive marketing.
Bibliographic search, preservation tracking and bragging about exactly how many articles and books your system preserves are all important, but whether they justify the considerable cost involved is open to question. Because they are cleaning up after the milk has been spilt, digital preservation systems are poorly placed to improve metadata quality.

Resources should be devoted to avoiding spilling milk rather than cleanup. For example, given how much the academic community spends on the services publishers allegedly provide in the way of improving the quality of publications, it is an outrage that even major publishers cannot spell their own names consistently, cannot format DOIs correctly, get authors' names wrong, and so on.

The alternative is to accept that metadata correct enough to rely on is impossible, downgrade its importance to that of a hint, and stop wasting resources on it. One of the reasons full-text search dominates bibliographic search is that it handles the messiness of the real world better.

Conclusion

Attempts have been made, for various types of digital content, to measure the probability of preservation. The consensus is about 50%. Thus the rate of loss to future readers from "never preserved" will vastly exceed that from all other causes, such as bit rot and format obsolescence. This raises two questions:
  • Will persisting with current preservation technologies improve the odds of preservation? At each stage of the preservation process current projections of cost per unit content are higher than they were a few years ago. Projections for future preservation budgets are at best no higher. So clearly the answer is no.
  • If not, what changes are needed to improve the odds? At each stage of the preservation process we need to at least halve the cost per unit content. I have set out some ideas, others will have different ideas. But the need for major cost reductions needs to be the focus of discussion and development of digital preservation technology and processes.
Unfortunately, any way of making preservation cheaper can be spun as "doing worse preservation". Jeff Rothenberg's Future Perfect 2012 keynote is an excellent example of this spin in action. Even if we make large cost reductions, institutions have to decide to use them, and "no-one ever got fired for choosing IBM".

We live in a marketplace of competing preservation solutions. A very significant part of the cost of both not-for-profit systems such as CLOCKSS or Portico, and commercial products such as Preservica is the cost of marketing and sales. For example, TRAC certification is a marketing check-off item. The cost of the process CLOCKSS underwent to obtain this check-off item was well in excess of 10% of its annual budget.

Making the tradeoff of preserving more stuff using "worse preservation" would need a mutual non-aggression marketing pact. Unfortunately, the pact would be unstable. The first product to defect and sell itself as "better preservation than those other inferior systems" would win. Thus private interests work against the public interest in preserving more content.

To sum up, we need to talk about major cost reductions. The basis for this conversation must be more and better cost data.

SearchHub: Stump The Chump: D.C. Winners

Wed, 2014-11-19 22:55

Last week was another great Stump the Chump session at Lucene/Solr Revolution in DC. Today, I’m happy to announce the winners:

  • First Prize: Jeff Wartes ($100 Amazon gift certificate)
  • Second Prize: Fudong Li ($50 Amazon gift certificate)
  • Third Prize: Venkata Marrapu ($25 Amazon gift certificate)

Keep an eye on the Lucidworks YouTube page to watch the video as soon as it is available and see the winning questions.

I want to thank everyone who participated — either by sending in your questions, or by being there in person to heckle me. But I would especially like to thank the judges, and our moderator Cassandra Targett, who had to do all the hard work preparing the questions.

See you next year!

The post Stump The Chump: D.C. Winners appeared first on Lucidworks.

Nicole Engard: Bookmarks for November 19, 2014

Wed, 2014-11-19 20:30

Today I found the following resources and bookmarked them on

Digest powered by RSS Digest

The post Bookmarks for November 19, 2014 appeared first on What I Learned Today....

Related posts:

  1. Add weather warnings to your calendar
  2. Amazon Shopping List
  3. Get a wiki for your family

SearchHub: Lucidworks Fusion v1.1 Now Available

Wed, 2014-11-19 19:53

Hot on the heels of the v1 release of Lucidworks Fusion, we’re back with a whole new set of features and improvements to help you design, build, and deploy search apps with lightning speed. Here’s what’s new in Fusion v1.1:

Windows Support

We know some of you were a little miffed that Fusion didn’t support Windows out of the gate. With the release of v1.1, Fusion now supports Windows 7, Windows 8.1, Windows 2008 Server, and Windows 2012 Server.

Enhanced Signal Processing Framework

Fusion’s signal processing framework has added several improvements to allow more complex interactions between signal types, giving your users higher relevancy and insights. These include co-occurrence aggregations, extensive new math options, and alternative integration options.

UI Updates

A new streamlined interface lets you edit and configure schemas right in the browser – without using the command line or editing a config file. This lets non-technical users access the power and flexibility of Fusion.

Quick Start

Getting started with Fusion is easier than ever with our new Quick Start, which walks you through creating your first collection, indexing data, and getting your first app up and running.

Relevancy Workbench

Our new relevancy workbench provides a more intuitive interface to make it easier than ever to increase relevancy and fine-tune results – even for non-technical users.

Connector Bonanza

Fusion v1.0 shipped with over 25 connectors so you can index your data no matter where it lives. Fusion v1.1 now ships with connectors for Sharepoint 2010 and 2013, Subversion 1.8 and greater, Google Drive, Couchbase and Jive.

Grab it now!

Lucidworks Fusion v1.1 is now available for download.

The post Lucidworks Fusion v1.1 Now Available appeared first on Lucidworks.

DPLA: New Uses for Old Advertising

Wed, 2014-11-19 18:00

3 Feeds One Cent, International Stock Food Company, Minneapolis, Minnesota, ca.1905. Courtesy of Hennepin County Library’s James K. Hosmer Special Collections Library via the Minnesota Digital Library.

Digitization efforts in the US have, to date, been overwhelmingly dominated by academic libraries, but public libraries are increasingly finding a niche by looking to their local collections as sources for original content. The Hennepin County Library has partnered with the Minnesota Digital Library (MDL)—and now the Digital Public Library of America—to bring thousands of items to the digital realm from its extensive holdings in the James K. Hosmer Special Collections Department. These items include maps, atlases, programs, annual reports, photographs, diaries, advertisements, and trade catalogs.

Our partnership with MDL has not only provided far greater access to these hidden parts of our collections, it has also made patrons much more aware of the significance of our collections and the large number of materials that we could be digitizing. The link to DPLA has further increased our awareness of the potential reach of our collections: DPLA is already the second largest source of referrals to our digital content on MDL. All this has motivated us to increase our digitization activities and place greater emphasis on the role of digital content in our services.

Recently, we have been contributing hundreds of items related to local businesses in the form of large advertising posters, trade catalogs, and over 300 business trade cards from Minneapolis companies. These vividly illustrated materials provide a fascinating view of advertising techniques, local businesses, consumer and industrial goods, social mores and popular culture from the late 19th and early 20th centuries.

Hennepin County Library is committed to serving as Hennepin County’s partner in lifelong learning with programs for babies to seniors, new immigrants, small business owners and students of all ages. It comprises 41 libraries, and has holdings of more than five million books, CDs, and DVDs in 40 world languages. It manages around 1,750 public computers, has 11 library board members, and is one great system serving 1.1 million residents of Hennepin County.

Featured image credit: Detail of 1893 Minneapolis Industrial Exposition Catalog, Minneapolis, Minnesota. Courtesy of Hennepin County Library’s James K. Hosmer Special Collections Library via the Minnesota Digital Library.

All written content on this blog is made available under a Creative Commons Attribution 4.0 International License. All images found on this blog are available under the specific license(s) attributed to them, unless otherwise noted.

Eric Lease Morgan: My second Python script,

Wed, 2014-11-19 17:54

This is my second Python script,, and it illustrates where common words appear in a text.

#!/usr/bin/env python2

# - illustrate where common words appear in a text
#
# usage: ./ <file>

# Eric Lease Morgan <>
# November 19, 2014 - my second real python script; "Thanks for the idioms, Don!"

# configure
MAXIMUM = 25
POS     = 'NN'

# require
import nltk
import operator
import sys

# sanity check
if len( sys.argv ) != 2 :
  print "Usage:", sys.argv[ 0 ], "<file>"
  quit()

# get input
file = sys.argv[ 1 ]

# initialize; read the whole file, then split it into sentences
with open( file, 'r' ) as handle : text = handle.read()
sentences = nltk.sent_tokenize( text )
pos       = {}

# process each sentence
for sentence in sentences :

  # POS the sentence and then process each of the resulting words
  for word in nltk.pos_tag( nltk.word_tokenize( sentence ) ) :

    # check for configured POS, and increment the dictionary accordingly
    if word[ 1 ] == POS : pos[ word[ 0 ] ] = pos.get( word[ 0 ], 0 ) + 1

# sort the dictionary
pos = sorted( pos.items(), key = operator.itemgetter( 1 ), reverse = True )

# do the work; create a dispersion chart of the MAXIMUM most frequent pos words
text = nltk.Text( nltk.word_tokenize( text ) )
text.dispersion_plot( [ p[ 0 ] for p in pos[ : MAXIMUM ] ] )

# done
quit()

I used the program to analyze two works: 1) Thoreau’s Walden, and 2) Emerson’s Representative Men. From the dispersion plots displayed below, we can conclude a few things:

  • The words “man”, “life”, “day”, and “world” are common between both works.
  • Thoreau discusses water, ponds, shores, and surfaces together.
  • Emerson seemingly discusses man and nature in the same breath, but none of his core concepts are discussed as densely as Thoreau’s.

Thoreau’s Walden

Emerson’s Representative Men

Python’s Natural Language Toolkit (NLTK) is a good library to get started with for digital humanists. I have to learn more though. My jury is still out regarding which is better, Perl or Python. So far, they have more things in common than differences.

OCLC Dev Network: Now Playing: Finding a Common Language

Wed, 2014-11-19 17:00

If you missed our recent webinar on Finding a Common Language you can now view the full recording.

Harvard Library Innovation Lab: Link roundup November 19, 2014

Wed, 2014-11-19 16:34

Yipes cripes we’ve got our winter coats on today. Sit down with a hot beverage and enjoy these internet finds.

The FES Watch Is an E-Ink Chameleon – Design Milk

An E-Ink watch. Why isn’t E-Ink used in more places?

The Ingenuity and Beauty of Creative Parchment Repair in Medieval Books | Colossal

Acknowledge the imperfect object. Could be some creative ways to repair damaged children’s books.

A Brief History of Failure

I love the idea that failed tech can loop back around. Who knows what we’ve tossed in the trash bin.

Lost At The Museum? This Ingenious 3-D Map Makes Navigation A Cinch

This would be a killer map of the stacks.

Letterpress Printers Are Running Out Of @ Symbols And Hashtags

The boom of the @ sign.

John Miedema: Genre, gender and agency analysis using Parts of Speech in Watson Content Analytics. A simple demonstration.

Wed, 2014-11-19 16:03

Genre is often applied as a static classification: fiction, non-fiction, mystery, romance, biography, and so on. But the edges of genre are “blurry” (Underwood). The classification of genre can change over time and situation. Ideally, genre and all classifications could be modeled dynamically during content analysis. How can IBM’s Watson Content Analytics (WCA) help analyze genre? Here is a simple demonstration.

In WCA I created a collection of 1368 public domain novels from Open Library. For this demonstration, I obtained author metadata and expressed it as a WCA facet. I did not obtain existing genre metadata. I will demonstrate that I can use author gender to dynamically classify genre for a specific analytical question. In particular, I follow the research of Matthew Jockers and the Nebraska Literary Lab. Can genre be distinguished by the gender of the author? How are action and agency treated differently in male and female genres? This simple demonstration does not answer these questions, but shows how WCA can be used to give insight into literature.

In Figure 1, the WCA Author facet is used to filter the collection to ten male authors: Walter Scott, Robert Louis Stevenson, and others. The idea is to dynamically generate a male genre by the selection of male authors. (Simple, but note that a complex array of facets could be used to quickly define a male genre.)

In Figure 2, the WCA Parts-of-Speech analysis lists frequently used verbs in the collection subset, the male genre: tempt, condemn, struggle. Some values might be considered action verbs, but further analysis is required.


In Figure 3, the verb “struggle” is seen in the context of its source, the Waverley novels: “the Bohemian struggled to detain Quentin”, “to struggle with the sea”. This view can be used to determine the gender of characters, identify the actions they are performing, and interpret agency.


In Figure 4, a new search is performed, this time filtering for female authors: Jane Austen, Maria Edgeworth, Susan Ferrier, and others. In this case, the idea is to dynamically generate a female genre by selecting female authors.


In Figure 5, the WCA Parts-of-Speech analysis lists frequently used verbs in the female genre: mix, soothe, furnish. At a glance, there is an obvious difference in quality from the verbs in the male genre.

Finally, in Figure 6, the verb “furnish” is seen in the context of its source in Jane Austen’s Letters, “Catherine and Lydia … a walk to Meryton was necessary to amuse their morning hours and furnish conversation.” In this case, furnish does not refer to the literal furnishing of a house, but to the facilitation of dialog. As before, detailed content inspection is needed to analyze and interpret agency.
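The faceting and verb-listing steps shown in the figures can be roughly approximated outside WCA. The following is only a minimal Python sketch, not a description of how WCA works internally. It assumes the texts have already been tokenized, lemmatized and tagged with parts of speech by some other tool, and the sample documents, tag names and counts are invented for illustration. The names `DOCS` and `top_verbs` are hypothetical.

```python
from collections import Counter

# Hypothetical, hand-made sample data. Each document carries an author-gender
# facet and a few lemmatized tokens already tagged with a part of speech.
# The counts are invented and do not come from the 1368-novel collection.
DOCS = [
    {"author": "Walter Scott", "gender": "male",
     "tagged": [("struggle", "VERB"), ("tempt", "VERB"), ("struggle", "VERB")]},
    {"author": "Robert Louis Stevenson", "gender": "male",
     "tagged": [("struggle", "VERB"), ("sea", "NOUN"), ("condemn", "VERB")]},
    {"author": "Jane Austen", "gender": "female",
     "tagged": [("furnish", "VERB"), ("conversation", "NOUN"),
                ("soothe", "VERB"), ("furnish", "VERB")]},
    {"author": "Maria Edgeworth", "gender": "female",
     "tagged": [("furnish", "VERB"), ("mix", "VERB")]},
]


def top_verbs(docs, gender, n=3):
    """Filter documents by the gender facet, then count their verbs."""
    counts = Counter(
        word.lower()
        for doc in docs
        if doc["gender"] == gender      # the facet filter (Figures 1 and 4)
        for word, tag in doc["tagged"]
        if tag == "VERB"                # the part-of-speech filter (Figures 2 and 5)
    )
    return counts.most_common(n)


if __name__ == "__main__":
    for facet in ("male", "female"):
        print(facet, top_verbs(DOCS, facet))
```

As in the WCA demonstration, a frequency list like this is only a starting point. Reading each verb in its source context is still needed to interpret agency.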

HangingTogether: Libraries & Research: Supporting change in the university

Wed, 2014-11-19 14:43

[This is the third in a short series on our 2014 OCLC Research Library Partnership meeting, Libraries and Research: Supporting Change/Changing Support. You can read the first and second posts and also refer to the event webpage that contains links to slides, videos, photos, and a Storify summary.]

[Driek Heesakkers, Paolo Manghi, Micah Altman, Paul Wouters, and John Scally]

As if changes in research are not enough, changes are also coming at the university level and at the national level. The new imperatives of higher education around Open Access, Open Data and Research Assessment are impacting the roles of libraries in managing and providing access to e-research outputs, in helping define the university’s data management policies, and demonstrating value in terms of research impact. This session explored these issues and more!

John MacColl (University Librarian at University of St Andrews) [link to video] opened the session, speaking briefly about the UK context to illustrate how libraries are taking up new roles within academia. John presented this terse analysis of the landscape (and I thank him for providing notes!):

  • Professionally, we live increasingly in an inside-out environment. But our academic colleagues still require certification and fixity, and their reputation is based on a necessarily conservative world view (tied up with traditional modes of publishing and tenure)
  • Business models are in transition. The first phase of transition was from publisher print to publisher digital. We are now in a phase which he terms deconstructive, based on a reassessment of the values of scholarly publishing, driven by the high cost of journals.
  • There are several reasons for this: among the main ones are the high costs of publisher content, and our responsibility as librarians for the sustainability of the scholarly record; another is the emergence of public accountability arguments – the public has paid for this scholarship, they have the right to access outputs.
  • What these three new areas of research library activity have in common is the intervention of research funders into the administration of research within universities, although the specifics vary considerably in different nations.

John Scally (Director of Library and University Collections, University of Edinburgh) [link to video] added to the conversation, speaking about the role of the research library in research data management (RDM) at the University of Edinburgh. From John’s perspective, the library is a natural place for RDM work to happen because the library has been in the business of managing and curating stuff for a long time and services are at the core of the library. Naturally, making content available in different ways is a core responsibility of the library. Starting research data conversations around policy and regulatory compliance is difficult — it’s easier to frame as a problem around storage, discovery and reuse of data. At Edinburgh they tried to frame discussions around how can we help, how can you be more competitive, do better research? If a researcher comes to the web page about data management plans (say at midnight, the night before a grant proposal is due) that webpage should do something useful at the time of need, not direct researchers to come to the library during the day. Key takeaways: Blend RDM into core services, not a side business. Make sure everyone knows who is leading. Make sure the money is there, and you know who is responsible. Institutional policy is a baby step along the way, implementation is most important. RDM and open access are ways of testing (and stressing) your systems and procedures – don’t ignore fissures and gaps. An interesting correlation between RDM and the open access repository – since RDM has been implemented at Edinburgh, deposits of papers have increased.

Driek Heesakkers (Project Manager at the University of Amsterdam Library) [link to video] told us about RDM at the University of Amsterdam and in the Netherlands. The Netherlands differs from other landscapes and is characterized as “bland”: there are not a lot of differences between institutions in terms of research outputs. There is a rather complicated array of institutions for humanities, social science, health science, etc., all trying to define their roles in RDM. For organizations that are mandated to capture data, it’s vital that they not just show up at the end of the process to scoop up data, but that they be embedded in the environment where the work is happening, where tools are being used. Policy and infrastructure need to be rolled out together. Don’t reinvent the wheel: if there are commercial partners or cloud services that do the work well, that’s all for the good. What’s the role of the library? We are not in the lead with policy but we help to interpret and implement, and similarly with technology. The big opportunity is in the support: if you have faculty liaisons, you should be using them for data support. Storage is boring but necessary. The market for commercial solutions is developing, which is good news; he’d prefer to buy, not build, when appropriate. This is a time for action; we can’t be wary or cautious.

Switching gears away from RDM, Paul Wouters (Director of the Centre for Science and Technology Studies at the University of Leiden) [link to video] spoke about the role of libraries in research assessment. His organization combines fundamental research and services for institutions and individual researchers. With research becoming increasingly international and interdisciplinary, it’s vital that we develop methods of monitoring novel indicators. Some researchers have become, ironically and paradoxically, fond of assessment (may be tied up with the move towards the quantified self?). However, self-assessment can be nerve-wracking and may not return useful information. Managers are also interested in individual assessment because it may help them give feedback. Altmetrics do not correlate closely with citation metrics, and can vary considerably across disciplines. It’s important to think about the meaning of various ways of measuring impact. As an example of other ways of measuring, Paul presented the ACUMEN (Academic Careers Understood through Measurement and Norms) project, which allows researchers to take the lead and tell a story, given evidence from their portfolios. An ACUMEN profile includes a career narrative supported by expertise, outputs, and influence. Giving a stronger voice to researchers is more positive than researchers not being involved in or misunderstanding (and resenting) indicators.

Micah Altman (Director of Research, Massachusetts Institute of Technology Libraries) [link to video] discussed the importance of researcher identification and the need to uniquely identify researchers in order to manage the scholarly record and to support assessment. Micah spoke in part as a member of a group that OCLC Research colleague Karen Smith-Yoshimura led, the Registering Researchers Task Group (their report, Registering Researchers in Authority Files, is now available). It explored motivations, state of the practice, observations and recommendations. The problem is that there is more stuff, more digital content, and more people (the average number of authors on journal articles has gone up, in some cases way up). To put it mildly, disambiguating names is not a small problem. A researcher may have one or more identifiers, which may not link to one another and may come from different sources. The task group looked at the problem not only from the perspective of the library, but also from the perspective of various stakeholders (publishers, universities, researchers, etc.). Approaches to managing name identifiers result in some very complicated (and not terribly efficient) workflows. Normalizing and regularizing this data has big potential payoffs in terms of reducing errors in analytics, and creating a broad range of new (and more accurate) measures. Fortunately, with a recognition of the benefits, interoperability between identifier systems is increasing, as is the practice of assigning identifiers to researchers. One of the missing pieces is not only identifying researchers but also their roles in a given piece of work (this is a project that Micah is working on with other collaborators). What are steps that libraries can take? Prepare to engage! Work across stakeholder communities; demand more than PDFs from publishers. And prepare for more (and different) types of measurement.

Paolo Manghi (Researcher at Institute of Information Science and Technologies “A. Faedo” (ISTI), Italian National Research Council) [link to video] talked about the data infrastructures that support access to the evolving scholarly record and the requirements needed for different data sources (repositories, CRIS systems, data archives, software archives, etc.) to interoperate. Paolo spoke as a researcher, but also as the technical manager of the EU-funded OpenAIRE project. This project started in 2009 out of a strong open access push from the European Commission. The project initially collected metadata and information about access to research outputs. The scope was expanded to include not only articles but also other research outputs. The work is done by human input and also technical infrastructure. They rely on input from repositories and also use software developed elsewhere. Information is funneled via 32 national open access desks. They have developed numerous guidelines (for metadata, for data repositories, and for CRIS managers to export data to be compatible with OpenAIRE). The project fills three roles: a help desk for national agencies, a portal (linking publications to research data and information about researchers) and a repository for data and articles that are otherwise homeless (Zenodo). Collecting all this information into one place allows for some advanced processes like deduplication, identifying relationships, demonstrating productivity, compliance, and geographic distribution. OpenAIRE interacts with other repository networks, such as SHARE (US) and ANDS (Australia). The forthcoming Horizon 2020 framework will cause some significant challenges for researchers and service providers because it puts a larger emphasis on access for non-published outputs.

The session was followed by a panel discussion.

I’ll conclude tomorrow with a final posting, wrapping up this series.

About Merrilee Proffitt


LITA: Cataloging Board Games

Wed, 2014-11-19 13:00

Since September, I have been immersed in the world of games and learning. I co-wrote a successful grant application to create a library-based Center for Games and Learning.

The project is being funded through a Sparks Ignition! Grant from the Institute of Museum and Library Services.

One of our first challenges has been to decide how to catalog the games. I located this presentation on SlideShare. We have decided to catalog the games as Three Dimensional Objects (Artifact) and use the following MARC fields:

  • MARC 245  Title Statement
  • MARC 260  Publication, Distribution, Etc.
  • MARC 300  Physical Description
  • MARC 500  General Note
  • MARC 508  Creation/Production Credits
  • MARC 520  Summary, Etc.
  • MARC 521  Target Audience
  • MARC 650  Topical Term
  • MARC 655  Index Term—Genre/Form

There are many other fields that we could use, but we decided to keep it as simple as possible. We decided not to interfile the games and instead to create a separate collection for the Center for Games and Learning. Because of this, we will not be assigning a Library of Congress Classification to them, but will instead be shelving the games in alphabetical order. We also created a material type of “board games.”
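To make the simplified field profile concrete, here is a minimal Python sketch of how such records could be modeled and put into alphabetical shelf order. It is an illustration only, not the Center's actual workflow or any library system's API. The helper names (`make_record`, `shelf_order`, `describe`) are hypothetical, and the two sample games and their field values are invented.

```python
# The nine MARC tags chosen above, with their labels.
FIELD_LABELS = {
    "245": "Title Statement",
    "260": "Publication, Distribution, Etc.",
    "300": "Physical Description",
    "500": "General Note",
    "508": "Creation/Production Credits",
    "520": "Summary, Etc.",
    "521": "Target Audience",
    "650": "Topical Term",
    "655": "Index Term-Genre/Form",
}


def make_record(fields):
    """Return a record dict keyed by MARC tag, limited to the simplified profile."""
    unknown = set(fields) - set(FIELD_LABELS)
    if unknown:
        raise ValueError(f"Tags outside the simplified profile: {sorted(unknown)}")
    if "245" not in fields:
        raise ValueError("A 245 title statement is needed for alphabetical shelving")
    return dict(fields)


def shelf_order(records):
    """Sort records alphabetically by title, ignoring case (no classification number)."""
    return sorted(records, key=lambda record: record["245"].casefold())


def describe(record):
    """Render one record as plain text, one field per line, in tag order."""
    return "\n".join(
        f"{tag} {FIELD_LABELS[tag]}: {record[tag]}" for tag in sorted(record)
    )


# Invented sample games, for illustration only.
GAMES = [
    make_record({"245": "Zebra Dash", "300": "1 game (board, 40 cards, 4 pawns)",
                 "521": "Ages 8 and up", "655": "Board games"}),
    make_record({"245": "apple harvest", "300": "1 game (board, 2 dice)",
                 "655": "Board games"}),
]

if __name__ == "__main__":
    for game in shelf_order(GAMES):
        print(describe(game), end="\n\n")
```

A real catalog would normally also skip leading articles such as “The” when filing titles, which MARC records express through the nonfiling-characters indicator on field 245. This sketch leaves that out to stay short.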

For the Center for Games and Learning we are also working on a website that will be live in the next few months. The project is still in its infancy and I will be sharing more about this project in upcoming blog posts.

Do any LITA blog readers have board games in your libraries? If so, what MARC fields do you use to catalog the games?







State Library of Denmark: SB IT Preservation at ApacheCon Europe 2014 in Budapest

Wed, 2014-11-19 12:56

Ok, actually only two of us are here. It would be great to have the whole department at the conference; then we could cover more tracks and start discussing what we will be using next week ;-)

The first keynote was mostly introduction to The Apache Software Foundation along with some key numbers. The second keynote (in direct extension of the first) was an interview with best selling author Hugh Howey, who self-published ‘Wool’, in 2011. A very inspiring interview! Maybe I could be an author too – with a little help from you? One of the things he talked about was how he thinks

“… the future looks more and more like the past”

in the sense that storytelling in the past was collaborative storytelling around the camp fire. Today open source software projects are collaborative, and maybe authors should try it too? Hugh Howey’s book has grown with help from fans and fan fiction.

The coffee breaks and lunches have been great! And the cake has been plentiful!

Time to celebrate the Apache Software Foundation’s 15th birthday!

Did someone say that Hungary is known for its cakes?

And yes, there have also been lots and lots of interesting presentations of lots and lots of interesting Apache tools. Where to start? There is one that I want to start using on Monday: Apache Tez. The presentation was by Hitesh Shah from Hortonworks and the slides are available online.

There are quite a few that I want to look into a bit more and experiment with, such as Spark and Cascading, and I think my colleague can add a few more. There are some that we will tell our colleagues at home about, and hope that they have time to experiment… And now I’ll go and hear about Quadrupling your Elephants!

Note: most of the slides are online. Just look at

Open Knowledge Foundation: The Public Domain Review brings out its first book

Wed, 2014-11-19 12:44

Open Knowledge project The Public Domain Review is very proud to announce the launch of its very first book! Released through the newly born spin-off project the PDR Press, the book is a selection of weird and wonderful essays from the project’s first three years, and shall be (we hope) the first of an annual series showcasing in print form essays from the year gone by. Given that there’s three years to catch up on, the inaugural incarnation is a special bumper edition, coming in at a healthy 346 pages, and jam-packed with 146 illustrations, more than half of which are newly sourced especially for the book.

Spread across six themed chapters – Animals, Bodies, Words, Worlds, Encounters and Networks – there is a total of thirty-four essays from a stellar line up of contributors, including Jack Zipes, Frank Delaney, Colin Dickey, George Prochnik, Noga Arikha, and Julian Barnes.

What’s inside? Volcanoes, coffee, talking trees, pigs on trial, painted smiles, lost Edens, the social life of geometry, a cat called Jeoffry, lepidopterous spying, monkey-eating poets, imaginary museums, a woman pregnant with rabbits, an invented language drowning in umlauts, a disgruntled Proust, frustrated Flaubert… and much much more.

Order by 26th November to benefit from a special reduced price and delivery in time for Christmas.

If you want to get the book in time for Christmas (and we do think it is a fine addition to any Christmas list!), then please make sure to order before midnight (PST) on 26th November. Orders placed before this date will also benefit from a special reduced price!

Please visit the dedicated page on The Public Domain Review site to learn more and also buy the book!

FOSS4Lib Recent Releases: PERICLES Extraction Tool - 1.0

Wed, 2014-11-19 01:02

Last updated November 18, 2014. Created by Peter Murray on November 18, 2014.

Package: PERICLES Extraction Tool
Release Date: Thursday, October 30, 2014

FOSS4Lib Updated Packages: PERICLES Extraction Tool

Wed, 2014-11-19 01:01

Last updated November 18, 2014. Created by Peter Murray on November 18, 2014.

The PERICLES Extraction Tool (PET) is open source (Apache 2 licensed) Java software for extracting significant information from the environment where digital objects are created and modified. This information supports object use and reuse, e.g. for better long-term preservation of data. The Tool was developed entirely for the PERICLES EU project by Fabio Corubolo, University of Liverpool, and Anna Eggers, Göttingen State and University Library.

Package Type: Data Preservation and Management
License: Apache 2.0
Development Status: In Development
Operating System: Linux, Mac, Windows
Programming Language: Java, Perl

Library Tech Talk (U of Michigan): Quick Links and Search Frequency

Wed, 2014-11-19 00:00
Does adding links to popular databases change user searching behavior? An October 2013 change to the University of Michigan Library’s front page gave us the opportunity to conduct an empirical study, which shows that user behavior has changed since the new front page design was launched.