Planet Code4Lib - http://planet.code4lib.org

OCLC Dev Network: Software Development Practices: Testing for Behavior, Not Just Success

Fri, 2014-09-26 19:30

This is the fourth and final post in our software development practices series. In our most recent post we discussed how Acceptance Criteria could be used to encapsulate the details of the user experience that the system should provide. This week we'll talk about how developers can use tests to determine whether or not the system is satisfying the Acceptance Criteria.

District Dispatch: Celebrating the National Student Poets Program

Fri, 2014-09-26 19:24

Last week, I had the pleasure of attending a dinner to honor the National Student Poets. Each year, the National Student Poets Program recognizes five extraordinary high school students, who receive college scholarships and opportunities to present their work at writing and poetry events across the country—which includes events at libraries.

To qualify for the National Student Poets Program, students must demonstrate excellence in poetry, provide evidence of prior awards for their work, and successfully navigate a multi-level selection process. The program is sponsored by the President’s Committee on the Arts and the Humanities, the Institute of Museum and Library Services, Scholastic Art & Writing Awards, and several other groups, with the dinner hosted at the fabulous new Google Washington office—altogether an interesting collaboration.

The students began the day at the White House, and they read their poetry in the Blue Room, hosted by the First Lady. Then they met with a group of White House speechwriters to talk about the creation of a different kind of “poetry.” At the dinner, I sat next to one of the incoming (2014) National Student Poets, Cameron Messinides, a 17-year-old from Greenville, South Carolina. He and the other honorees exhibited impressive, almost intimidating ability and poise in their presentations and informal conversation.

The advent of the digital age does not, of course, negate important forms of intellectual endeavor such as poetry, but it does raise questions about how these forms of traditional communication extend online. And for the American Library Association (ALA), there are further questions about how libraries may best participate in this extension. Then there is the question of how to convey such library possibilities to decision makers and influencers. Thus, under the rubric of our Policy Revolution! Initiative as well as a new Office for Information Technology Policy program, we are exploring the needs and opportunities of children and youth with respect to technology and libraries, with an eye toward engaging national decision makers and influencers.

Well, OK, the event was fun too. With all due deference to our Empress of E-rate (Marijke Visser, who is the associate director of the ALA Office for Information Technology Policy), one cannot spend all of one’s time on E-rate and such matters, though even so, admittedly one can see a plausible link between E-rate, libraries, and poetry. So even at this dinner, E-rate did lurk in the back of my mind… I guess there is no true escape from E-rate.

Score one for the Empress.


Library of Congress: The Signal: Library to Launch 2015 Class of NDSR

Fri, 2014-09-26 19:05

Last year’s class of Residents, along with LC staff, at the ALA Mid-winter conference

The Library of Congress Office of Strategic Initiatives, in partnership with the Institute of Museum and Library Services, has recently announced the 2015 National Digital Stewardship Residency program, which will be held in the Washington, DC area starting in June 2015.

As you may know (NDSR was well represented on the blog last year), this program is designed for recent graduates with an advanced degree who are interested in the field of digital stewardship. This will be the fourth class of residents for this program overall: the first, in 2013, was held in Washington, DC, and the second and third classes, which started in September 2014, are being held concurrently in New York and Boston.

The five 2015 residents will each be paired with an affiliated host institution for a 12-month program that will provide them with an opportunity to develop, apply and advance their digital stewardship knowledge and skills in real-world settings. The participating hosts and projects for the 2015 cohort will be announced in early December and the applications will be available  shortly after.  News and updates will be posted to the NDSR webpage, and here on The Signal.

In addition to providing great career benefits for the residents, the successful NDSR program also provides benefits to the institutions involved as well as the library and archives field in general. For an example of what the residents have accomplished in the past, see this previous blog post about a symposium held last spring, organized entirely by last year’s residents.

Another recent success for the program – all of the former residents now have substantive jobs or fellowships in a related field.  Erica Titkemeyer, a former resident who worked at the Smithsonian Institution Archives, now has a position at the University of North Carolina at Chapel Hill as the Project Director and AV Conservator for the Southern Folklife Collection. Erica said the Residency provided the opportunity to utilize skills gained through her graduate education and put them to practical use in an on-the-job setting.  In this case, she was involved in research and planning for preservation of time-based media art at the Smithsonian.

Erica notes some other associated benefits. “I had a number of chances to network within the D.C. area through the Library of Congress, external digital heritage groups and professional conferences as well,” she said. “I have to say, I am most extremely grateful for having had a supportive group of fellow residents. The cohort was, and still remains, a valuable resource for knowledge and guidance.”

This residency experience no doubt helped Erica land her new job, one that includes a lot of responsibility for digital library projects. “Currently we are researching options and planning for mass-digitization of the collection, which contains thousands of recordings on legacy formats pertaining to the music and culture of the American South,” she said.

George Coulbourne, Executive Program Officer at the Library of Congress, remarked on the early success of this program: “We are excited with the success of our first class of residents, and look forward to continuing this success with our upcoming program in Washington, DC. The experience gained by the residents along with the tangible benefits for the host institution will help set the stage for a national residency model in digital preservation that can be replicated in various public and private sector environments.”

So, this is a heads-up to graduate students and all interested institutions – start thinking about how you might want to participate in the 2015 NDSR.  Keep checking our website and blog for updated information, applications, dates, etc. We will post this information as it becomes available.

(See the Library’s official press release.)

Andromeda Yelton: what I’m looking for in Emerging Leader candidates

Fri, 2014-09-26 18:34

One of my happier duties as a LITA Board member is reviewing Emerging Leader applications to decide whom the division should sponsor. I just finished this year’s round of review this morning, and now that my choices are safely submitted (but fresh on my mind) I can share what I’m looking for, in hopes that it’s useful to future Emerging Leader candidates as you develop your applications.

But first, a caveat: last year and this, I would have been happy with LITA sponsoring at least half of the candidates I saw, if only we could. Really the only unpleasant part of reviewing applications is that we can’t sponsor everyone we’d like to; I see so many extraordinarily talented, driven people in the application pile, and it’s actually painful not to be able to put all of them at the top of my list.

Okay! That said…

Things I want to see
  • People who have gotten things done.
  • People who haven’t just done an excellent job with duties as assigned, but who have perceived a need and initiated something to solve it.
  • People who have marshaled resources and buy-in, even though they are (as is the case for most EL candidates) in a junior position, or outside a formal hierarchy.
  • Letters of recommendation that speak to the things you can’t credibly address about yourself (communication, leadership skills), using specific examples.
  • Since these are specifically LITA Emerging Leader candidates, I want to see some kind of facility with technology. I’m very open-minded about what this is, but it must go beyond standard office-worker technology proficiency. I want to see that you can use technologies to create things, or that you can create technology. Tell me about that time you set up an institutional repository, or crafted a social media strategy, or did a pile of digitization, or used your video editing skills to launch your library’s marketing campaign, or automated some kind of metadata workflow, or taught yourself Javascript — seriously, I don’t care what technologies you’re using or whether you’re using them in a technology-librarian context, but you have to have some sort of technological proficiency and creativity.

Diversity is a specific (and large) part of the rubric I’m asked to use, and I’m going to give it extended treatment here. First, not gonna lie: most people in the pool are white women, and you have an uphill battle to prove your understanding of diversity if you’re one of them. (I am also a white woman, and the same goes for me.) Second, I’m not looking for evidence that you care about diversity or think it’s a good thing (of course you do. what are you, some kind of a jerk? no). I’m looking for concrete evidence that you actually get it. Tell me that you wrote a thesis on some topic that required you to grapple with primary sources and major thinkers on some diversity-related topic. Tell me about the numerous conference presentations you’ve done that required this kind of thinking. Tell me about the work, whether paid or volunteer, that you’ve done with diverse populations. Tell me about how you’ve gone out of your way, and maybe out of your comfort zone, to actually do something that deepens your awareness, develops your skills, and diversifies your network.

If you belong to a population that gives you special insight about some axis of diversity (and many white women do!), tell me about that, too. I don’t give full credit for that – I’d still like to see that you’ve theorized or worked on some sort of diversity issue – but it does give me faith that you have some sort of relevant insight and experience.

There are many kinds of diversity that have shown up in EL apps and there’s no one that matters most to me, nor do I expect any candidate to have experience with all of them. But you need to have done something. And if you really haven’t, at least acknowledge and problematize that fact; if you do this and the rest of your application is exemplary you may still be in the running for me.

Things I do not want to see

I had 20 applications to review this year. I am reviewing them as a volunteer, amidst the multiple proposals I am writing this month and the manuscript due in November and the course and webinar I’ll be teaching soon and my regular duties on two boards and helping lead a major fundraising campaign and writing code for a couple of clients and the usual housework and childcare things and occasionally even having a life and, this week, some pretty killer insomnia. Seriously, if you give me any excuse to stop reading your application, I will take it.

Do not give me the excuse to stop reading.

Some things that will make me stop reading:

  • If your application is in any way incomplete (didn’t answer all the questions, missing one or more references, no resume).
  • Significant or frequent errors of grammar, spelling, or usage.
  • Shallow treatment of the diversity question (see above).

I also might stop reading overly academic prose, particularly if it reads like you’re not 100% comfortable with that (admittedly pretty weird) genre. I do want to see that you’re smart and have a good command of English, but communication within associations is a different genre than journal articles. Talk to me in your voice (but get someone to proofread). Particularly if you’re a current student or a recent graduate: I give you permission to not write an academic paper. (I implore you not to write an academic paper.) My favorite EL applications sparkle with personality. They speak with humility or confidence or questioning or insight or elegance. A few even make me laugh.

I would prefer it if you spell out acronyms, at least on their first occurrence. You can assume that I recognize ALA and its divisions, but there are a lot of acronyms in the library world, and they’re not all clear outside their context. If you’re active in CLA, is that California or Colorado or Connecticut? Or Canada?

Some information about mechanics

Pulling back the curtain for a moment here: the web site where I access your application materials does not have super-awesome design or usability, and this impacts (sometimes unfairly) how I rate your answers.

Your answers to the questions are displayed in all bold letters. This makes it hard to read long paragraphs. Please use paragraph breaks thoughtfully.

Your recommenders’ text appears to be displayed without any paragraph breaks at all, if they’ve typed it directly into the site. Ow. Please ask them to upload letters as files instead.

Speaking of which: I use Pages. On a Mac. Your .docx file will probably look wrong to me. If you’ve invested time and graphic design skills in lovingly crafting a resume, I want to see! Please upload your resume as .pdf, and ask your recommenders to upload their letters as .pdf too. (On reflection I feel bad about this because it’s a famously poor format for accessibility. But seriously, your .docx looks bad.)

Whew! Glad I got to say all that. Hope this helps future EL candidates. I look forward to reading your applications next year!

Cherry Hill Company: Why Drupal, and some digression.

Fri, 2014-09-26 18:01

Recently, there was a thread started by a frustrated Drupal user on the Code4Lib (Code for Libraries) mailing list. It drew many thoughtful and occasionally passionate responses. This was mine:

I think that it is widely conceded that it is a good idea to use the most suitable tool for a given task. But what does that mean? There is a long list of conditions and factors that go into selecting tools, some reflecting immediate needs, some reflecting long term needs and strategy, and others reflecting the availability of resources, and these interact in many ways, many of them problematic.

I have given the genesis of Cherry Hill’s tech evolution at the end of this missive. The short version is that we started focused on minimizing size and complexity while maximizing performance, and over time have moved to an approach that balances those against building and maintenance cost along with human and infrastructure resource usage.

Among the lessons we have learned in...

Read more »

Richard Wallis: Baby Steps Towards A Library Graph

Fri, 2014-09-26 15:36

It is one thing to have a vision (regular readers of this blog will know I have them all the time); it is yet another to see it starting to form through the mist into a reality. Several times in the recent past I have spoken of some of the building blocks for bibliographic data to play a prominent part in the Web of Data.  The Web of Data that is starting to take shape and drive benefits for everyone.  Benefits that for many are hiding in plain sight on the results pages of search engines. In those informational panels with links to people’s parents, universities, and movies, or maps showing the location of mountains, and retail outlets; incongruously named Knowledge Graphs.

Building blocks such as Schema.org; Linked Data in WorldCat.org; moves to enhance Schema.org capabilities for bibliographic resource description; recognition that Linked Data has a beneficial place in library data and initiatives to turn that into a reality; the release of Work entity data mined from, and linked to, the huge WorldCat.org data set.

OK, you may say, we’ve heard all that before, so what is new now?

As always it is a couple of seemingly unconnected events that throw things into focus.

Event 1:  An article by David Weinberger in the DigitalShift section of Library Journal entitled Let The Future Go.  An excellent article telling libraries that they should not be so parochially focused on their own domain whilst looking to how they are going to serve their users’ needs in the future.  Get our data out there, everywhere, so it can find its way to those users, wherever they are.  Making it accessible to all.  David references three main ways to provide this access:

  1. APIs – to allow systems to directly access our library system data and functionality
  2. Linked Data – can help us open up the future of libraries. By making clouds of linked data available, people can pull together data from across domains
  3. The Library Graph – an ambitious project libraries could choose to undertake as a group that would jump-start the web presence of what libraries know: a library graph. A graph, such as Facebook’s Social Graph and Google’s Knowledge Graph, associates entities (“nodes”) with other entities

(I am fortunate to be a part of an organisation, OCLC, making significant progress on making all three of these a reality – the first one is already baked into the core of OCLC products and services)

It is the 3rd of those, however, that triggered recognition for me.  Personally, I believe that we should not be focusing on a specific ‘Library Graph’ but more on the ‘Library Corner of a Giant Global Graph’  – if graphs can have corners that is.  Libraries have rich specialised resources and have specific needs and processes that may need special attention to enable opening up of our data.  However, when opened up in context of a graph, it should be part of the same graph that we all navigate in search of information whoever and wherever we are.

Event 2: A posting by ZBW Labs, “Other editions of this work: An experiment with OCLC’s LOD work identifiers”, detailing their experiments in using the OCLC WorldCat Works data.

ZBW contributes to WorldCat, and has 1.2 million OCLC numbers attached to its bibliographic records. So it seemed interesting how many of these editions link to works, and furthermore to other editions of the very same work.

The post is interesting from a couple of points of view.  Firstly, the simple steps they took to get at the data, really well demonstrated by the command-line calls used to access the data – get OCLCNum data from WorldCat.org in JSON format – extract the schema:exampleOfWork link to the Work – get the Work data from WorldCat, also in JSON – parse out the links to other editions of the work and compare with their own data.  Command-line calls that were no doubt embedded in simple scripts.
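
Those four steps translate into only a few lines of script. Below is a minimal Python sketch of the same workflow; the URL patterns (/oclc/{number}.jsonld, Work URIs that also serve .jsonld) and the schema.org property names (exampleOfWork, workExample) are assumptions drawn from the description above, not a documented API reference.

    import requests

    def fetch_jsonld(url):
        """Fetch a JSON-LD description of a WorldCat resource."""
        resp = requests.get(url, headers={"Accept": "application/ld+json"}, timeout=30)
        resp.raise_for_status()
        return resp.json()

    def as_ids(value):
        """Normalise a JSON-LD value (string, dict, or list) to a list of URI strings."""
        if value is None:
            return []
        if isinstance(value, str):
            return [value]
        if isinstance(value, dict):
            return [value["@id"]] if "@id" in value else []
        return [uri for item in value for uri in as_ids(item)]

    def values_of(node, prop):
        """Look up a schema.org property under bare, prefixed, and full-URI spellings."""
        for key in (prop, "schema:" + prop, "http://schema.org/" + prop):
            if key in node:
                return as_ids(node[key])
        return []

    def other_editions(oclc_number):
        """Follow edition -> work -> editions, as the ZBW Labs post describes."""
        # 1. Get the record for one edition, by OCLC number, in JSON-LD.
        bib = fetch_jsonld("http://www.worldcat.org/oclc/%s.jsonld" % oclc_number)
        nodes = bib.get("@graph", [bib])
        # 2. Extract the schema:exampleOfWork link(s) to the Work entity.
        work_uris = {uri for node in nodes for uri in values_of(node, "exampleOfWork")}
        editions = set()
        for work_uri in work_uris:
            # 3. Get the Work description from WorldCat, also in JSON-LD.
            work = fetch_jsonld(work_uri.rstrip("/") + ".jsonld")
            # 4. Parse out the links to the other editions of that work.
            for node in work.get("@graph", [work]):
                editions.update(values_of(node, "workExample"))
        return editions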

Secondly, there was the implicit way that the corpus of WorldCat Work entity descriptions, and their canonical identifying URIs, is used as an authoritative hub for Works and their editions.  The concept is not new in the library world; we have been doing this sort of thing with names and person identities via other authoritative hubs, such as VIAF, for ages.  What is new here is that it is a hub for Works and their relationships, and the bidirectional nature of those relationships – work to edition, edition to work – in the beginnings of a library graph linked to other hubs for subjects, people, etc.

The ZBW Labs experiment is interesting in its own way – simple approach, enlightening results.  What is more interesting for me is that it demonstrates a baby step towards the way the Library corner of that Global Web of Data will not only naturally form (as we expose and share data in this way – linked entity descriptions), but naturally fit into future library workflows with all sorts of consequential benefits.

The experiment is exactly the type of initiative that we hoped to stimulate by releasing the Works data.  Using it for things we never envisaged, delivering unexpected value to our community.  I can’t wait to hear about other initiatives like this that we can all learn from.

So who is going to be doing this kind of thing – describing entities and sharing them to establish these hubs (nodes) that will form the graph?  Some are already there, in the traditional authority file hubs: The Library of Congress LC Linked Data Service for authorities and vocabularies (id.loc.gov), VIAF, ISNI, FAST, the Getty vocabularies, etc.

As previously mentioned Work is only the first of several entity descriptions that are being developed in OCLC for exposure and sharing.  When others, such as Person, Place, etc., emerge we will have a foundation of part of a library graph – a graph that can and will be used, and added to, across the library domain and then on into the rest of the Global Web of Data.  An important authoritative corner, of a corner, of the Giant Global Graph.

As I said at the start these are baby steps towards a vision that is forming out of the mist.  I hope you and others can see it too.

(Toddler image: Harumi Ueda)

DPLA: Remembering the Little Rock Nine

Fri, 2014-09-26 14:21

This week, 57 years ago, was a tumultuous one for nine African American students at Central High School in Little Rock, Arkansas. Now better known as the Little Rock Nine, these high school students were part of a several-year battle to integrate the Little Rock School District after the landmark 1954 Brown v. Board of Education Supreme Court ruling.

From that ruling on, it was a tough uphill battle to get the Little Rock School District to integrate. On a national level, all eight congressmen from Arkansas were part of the “Southern Manifesto,” encouraging Southern states to resist integration. On a local level, white citizens’ councils, like the Capital Citizens Council and the Mothers’ League of Central High School, were formed in Little Rock to protest desegregation. They also lobbied politicians, in particular Arkansas Governor Orval Faubus, who went on to block the 1957 desegregation of Central High School.

These tensions escalated throughout September 1957—which saw the Little Rock Nine barred from entering the school by Arkansas National Guard troops sent by Faubus. Eventually, Federal District Judge Ronald Davies was successful in ordering Faubus to stop interfering with desegregation. Integration began during this week, 57 years ago.

On September 23, 1957, the nine African American students entered Central High School by a side door, while a mob of more than 1,000 people crowded the building. Local police were overwhelmed, and the protesters began attacking African American reporters outside the school building.

President Eisenhower, via Executive Order 10730, sent the U.S. Army to Arkansas to escort the Little Rock Nine into school on September 25, 1957. The students attended classes with soldiers by their side. By the end of the month, a now-federalized National Guard had mostly taken over protection of the students. While the protests eventually died down, the abuse and tension did not. The school was shut down from 1958 through fall 1959 as the struggle over segregation continued.

Through the DPLA, you can get a better sense of what that struggle and tension was like. In videos from our service hub, Digital Library of Georgia, you can view news clips recorded during this historic time in Little Rock. These videos are a powerful testament to the struggle of the Little Rock Nine, and the Civil Rights movement as a whole.

Related items in DPLA

  • Reporters interview students protesting the integration of Central High School
  • Police hold back rioters during the protest
  • White students burn an effigy of a black student, while African American students are escorted by police into the high school
  • President Dwight D. Eisenhower makes a statement about the Little Rock Nine and integration at Central High School
  • Arkansas Governor Orval Faubus calls Arkansas “an occupied territory” and a “defenseless state” against the federal troops sent by President Eisenhower
  • Georgia Governor Marvin Griffin condemns federal troops in Little Rock, promises to maintain segregation in Georgia schools

David Rosenthal: Plenary Talk at 3rd EUDAT Conference

Fri, 2014-09-26 10:04
I gave a plenary talk at the 3rd EUDAT Conference's session on sustainability entitled Economic Sustainability of Digital Preservation. Below the fold is an edited text with links to the sources.



I'm David Rosenthal from the LOCKSS (Lots Of Copies Keep Stuff Safe) Program at the Stanford Libraries. We've been sustainably preserving digital information for a reasonably long time, and I'm here to talk about some of the lessons we've learned along the way that are relevant for research data.

In May 1995 Stanford Libraries' HighWire Press pioneered the shift of academic journals to the Web by putting the Journal of Biological Chemistry on-line. Almost immediately librarians, who pay for this extraordinarily expensive content, saw that the Web was a far better medium than paper for their mission of getting information to current readers. But they have a second mission, getting information to future readers. There were both business and technical reasons why, for this second mission, the Web was a far worse medium than paper:
  • The advent of the Web forced libraries to change from purchasing a copy of the content to renting access to the publisher's copy. If the library stopped paying the rent, it would lose access to the content.
  • Because in the Web the publisher stored the only copy of the content, and because it was on short-lived, easily rewritable media, the content was at great risk of loss and damage.
As a systems engineer, I found the paper library system interesting as an example of fault-tolerance. It consisted of a loosely-coupled network of independent peers. Each peer stored copies of its own selection of the available content on durable, somewhat tamper-evident media. The more popular the content, the more peers stored a copy. There was a market in copies; as content had fewer copies, each copy became more valuable, encouraging the peers with a copy to take more care of it. It was easy to find a copy, but it was hard to be sure you had found all copies, so undetectably altering or deleting content was difficult. There was a mechanism, inter-library loan and copy, for recovering from loss or damage to a copy.

The LOCKSS Program started in October 1998 with the goal of replicating the paper library system for the Web. We built software that allowed libraries to deploy a PC, a LOCKSS box, that was the analog for the Web of the paper library's stacks. By crawling the Web, the box collected a copy of the content to which the library subscribed and stored it. Readers could access their library's copy if for any reason they couldn't get to the publisher's copy. Boxes at multiple libraries holding the same content cooperated in a peer-to-peer network to detect and repair any loss or damage.
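
To make the detect-and-repair idea concrete, here is a deliberately simplified sketch. It is not the real LOCKSS protocol (which uses a far more careful peer-to-peer voting scheme), just an illustration of boxes comparing digests of the same content and repairing a damaged copy from a peer.

    import hashlib

    class Box:
        """A toy stand-in for a LOCKSS box holding copies of subscribed content."""
        def __init__(self, name):
            self.name = name
            self.store = {}                      # URL -> content bytes

        def collect(self, url, content):
            """Simulate crawling the publisher and storing a copy."""
            self.store[url] = content

        def digest(self, url):
            return hashlib.sha256(self.store[url]).hexdigest()

    def audit_and_repair(url, boxes):
        """Compare copies of one URL across boxes; repair any disagreeing minority."""
        votes = {}
        for box in boxes:
            votes.setdefault(box.digest(url), []).append(box)
        if len(votes) == 1:
            return                               # all copies agree
        majority = max(votes, key=lambda h: len(votes[h]))
        good_peer = votes[majority][0]
        for digest, damaged in votes.items():
            if digest != majority:
                for box in damaged:              # fetch a repair from a healthy peer
                    box.store[url] = good_peer.store[url]

    boxes = [Box("library-%d" % i) for i in range(5)]
    for box in boxes:
        box.collect("http://example.org/journal/issue1", b"the published issue")
    boxes[3].store["http://example.org/journal/issue1"] = b"bit-rotted copy"   # simulate damage
    audit_and_repair("http://example.org/journal/issue1", boxes)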

The program was developed and went into early production with initial funding from the NSF, and then major funding from the Mellon Foundation, the NSF and Sun Microsystems. But grant funding isn't a sustainable business model for digital preservation. In 2005, the Mellon Foundation gave us a grant with two conditions: we had to match it dollar-for-dollar, and by the end of the grant in 2007 we had to be completely off grant funding. We met both conditions, and we have (with one minor exception which I will get to later) been off grant funding and in the black ever since. The LOCKSS Program has two businesses:
  • We develop, and support libraries that use, our open-source software for digital preservation. The software is free, libraries pay for support. We refer to this as the "Red Hat" business model
  • Under contract to a separate not-for-profit organization called CLOCKSS run jointly by publishers and libraries, we use our software to run a large dark archive of e-journals and e-books. This archive has recently been certified as a "Trustworthy Repository" after a third-party audit which awarded it the first-ever perfect score in the Technologies, Technical Infrastructure, Security category.
The first lesson that being self-sustaining for 7 years has taught us is "It's The Economics, Stupid". Research in two areas of preservation, e-journals and the public Web, indicates that in each area the combination of all current efforts preserves less than half the content that should be preserved. Why less than half? The reason is that the budget for digital preservation isn't adequate to preserve even half using current technology. This leaves us with three choices:
  • Do nothing. In that case we can stop worrying about bit rot, format obsolescence, operator error and all the other threats digital preservation systems are designed to combat. These threats are dwarfed by the threat of "can't afford to preserve". It is going to mean that more than 50% of the stuff that should be available to future readers isn't.
  • Double the budget for digital preservation. This is so not going to happen. Even if it did, it wouldn't solve the problem because, as I will show, the cost per unit content is going to rise.
  • Halve the cost per unit content of current systems. This can't be done with current architectures. Yesterday morning I gave a talk at the Library of Congress describing a radical re-think of long-term storage architecture that might do the trick. You can find the text of the talk on my blog.
Unfortunately, the structure of research funding means that economics is an even worse problem for research data than for our kind of content. There's been quite a bit of research into the costs of digital preservation, but it isn't based on a lot of good data. Remedying that is important. I'm on the advisory board of an EU-funded project called 4C that is trying to remedy that. If you have any kind of cost data you can share please go to http://www.4cproject.eu/ and submit it to the Curation Cost Exchange.

As an engineer, I'm used to using rules of thumb. The one I use to summarize most of the cost research is that ingest takes half the lifetime cost, preservation takes one third, and access takes one sixth.


Research grants might be able to fund the ingest part; it is a one-time, up-front cost. But preservation and access are ongoing costs for the life of the data, so grants have no way to cover them. We've been able to ignore this problem for a long time, for two reasons. The first is that from at least 1980 to 2010 storage costs followed Kryder's Law, the disk analog of Moore's Law, dropping 30-40%/yr. This meant that, if you could afford to store the data for a few years, the cost of storing it for the rest of time could be ignored, because of course Kryder's Law would continue forever. The second is that as the data got older, access to it was expected to become less frequent. Thus the cost of access in the long term could be ignored.

Kryder's Law held for three decades, an astonishing feat for exponential growth. Something that goes on that long gets built into people's model of the world, but as Randall Munroe points out, in the real world exponential curves cannot continue for ever. They are always the first part of an S-curve.

This graph, from Preeti Gupta of UC Santa Cruz, plots the cost per GB of disk drives against time. In 2010 Kryder's Law abruptly stopped. In 2011 the floods in Thailand destroyed 40% of the world's capacity to build disks, and prices doubled. Earlier this year they finally got back to 2010 levels. Industry projections are for no more than 10-20% per year going forward (the red lines on the graph). This means that disk is now about 7 times as expensive as was expected in 2010 (the green line), and that in 2020 it will be between 100 and 300 times as expensive as 2010 projections.
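
The "about 7 times" figure is easy to check; here is a quick sketch, taking the 40%/yr end of the historical range as the assumed Kryder rate. (The 2020 multiple depends on which ends of the quoted ranges and which baseline you pick, so only the 2014 figure is reproduced here.)

    def relative_price(annual_drop, years):
        """Relative cost per GB after `years`, given a constant fractional annual price drop."""
        return (1.0 - annual_drop) ** years

    # Had Kryder's Law continued from 2010 at 40%/yr, disk in 2014 would cost
    # roughly this fraction of the 2010 price:
    expected_2014 = relative_price(0.40, 4)        # ~0.13
    # In reality, prices only got back to 2010 levels in 2014:
    actual_2014 = 1.0
    print(round(actual_2014 / expected_2014, 1))   # ~7.7, i.e. "about 7 times as expensive"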

Thanks to aggressive marketing, it is commonly believed that "the cloud" solves this problem. Unfortunately, cloud storage is actually made of the same kind of disks as local storage, and is subject to the same slowing of the rate at which it was getting cheaper. In fact, when all costs are taken in to account, cloud storage is not cheaper for long-term preservation than doing it yourself once you get to a reasonable scale. Cloud storage really is cheaper if your demand is spiky, but digital preservation is the canonical base-load application.

You may think that cloud storage is a competitive market; in fact it is dominated by Amazon. When Google recently started to get serious about competing, they pointed out that while Amazon's margins on S3 may have been minimal at introduction, by then they were extortionate:
cloud prices across the industry were falling by about 6 per cent each year, whereas hardware costs were falling by 20 per cent. And Google didn't think that was fair. ... "The price curve of virtual hardware should follow the price curve of real hardware."

Notice that the major price drop triggered by Google was a one-time event; it was a signal to Amazon that they couldn't have the market to themselves, and to smaller players that they would no longer be able to compete.

In fact commercial cloud storage is a trap. It is free to put data into a cloud service such as Amazon's S3, but it costs to get it out. For example, getting your data out of Amazon's Glacier without paying an arm and a leg takes 2 years (Glacier's free retrieval allowance is roughly 5% of your stored data per month). If you commit to the cloud as long-term storage, you have two choices. Either keep a copy of everything outside the cloud (in other words, don't commit to the cloud), or stay with your original choice of provider no matter how much they raise the rent.

The storage part of preservation isn't the only ongoing cost that will be much higher than people expect; access will be too. In 2010 the Blue Ribbon Task Force on Sustainable Digital Preservation and Access pointed out that the only real justification for preservation is to provide access. With research data this is a difficulty: the value of the data may not be evident for a long time. Shang dynasty astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this they derived a value for the viscosity of the Earth's mantle as it rebounds from the weight of the glaciers.

In most cases so far the cost of an access to an individual item has been small enough that archives have not charged the reader. Research into past access patterns to archived data showed that access was rare, sparse, and mostly for integrity checking.

But the advent of "Big Data" techniques means that, going forward, scholars increasingly want not to access a few individual items in a collection, but to ask questions of the collection as a whole. For example, the Library of Congress announced that it was collecting the entire Twitter feed, and almost immediately had 400-odd requests for access to the collection. The scholars weren't interested in a few individual tweets, but in mining information from the entire history of tweets. Unfortunately, the most the Library of Congress can afford to do with the feed is to write two copies to tape. There's no way they can afford the compute infrastructure to data-mine from it. We can get some idea of how expensive this is by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 2.5 times as expensive; until recently it was 5.5 times.

The real problem here is that scholars are used to having free access to library collections, but what scholars now want to do with archived data is so expensive that they must be charged for access. This in itself has costs, since access must be controlled and accounting undertaken. Further, data-mining infrastructure at the archive must have enough performance for the peak demand but will likely be lightly used most of the time, increasing the cost for individual scholars. A charging mechanism is needed to pay for the infrastructure. Fortunately, because the scholar's access is spiky, the cloud provides both suitable infrastructure and a charging mechanism.

For smaller collections, Amazon provides Free Public Datasets: Amazon stores a copy of the data at no charge, and charges scholars accessing the data for the computation rather than charging the owner of the data for storage.

Even for large and non-public collections it may be possible to use Amazon. Suppose that in addition to keeping the two archive copies of the Twitter feed on tape, the Library kept one copy in S3's Reduced Redundancy Storage simply to enable researchers to access it. For this year, it would have averaged about $4100/mo, or about $50K for the year. Scholars wanting to access the collection would have to pay for their own computing resources at Amazon, and the per-request charges; because the data transfers would be internal to Amazon there would not be bandwidth charges. The storage charges could be borne by the library or charged back to the researchers. If they were charged back, the 400 initial requests would each need to pay about $125 for a year's access to the collection, not an unreasonable charge. If this idea turned out to be a failure it could be terminated with no further cost; the collection would still be safe on tape. In the short term, using cloud storage for an access copy of large, popular collections may be a cost-effective approach. Because the Library's preservation copy isn't in the cloud, they aren't locked in.
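
The arithmetic behind those numbers is straightforward; the only inputs are the figures quoted above:

    monthly_s3_cost = 4100                        # USD/month for the S3 RRS access copy
    annual_s3_cost = 12 * monthly_s3_cost
    print(annual_s3_cost)                         # 49,200 -> "about $50K" for the year

    initial_requests = 400
    print(annual_s3_cost / initial_requests)      # 123 -> "about $125" per researcher per year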

One thing it should be easy to agree on about digital preservation is that you have to do it with open-source software; closed-source preservation has the same fatal "just trust me" aspect that closed-source encryption (and cloud storage) suffer from. Sustaining open source preservation software is interesting, because unlike giants like Linux, Apache and so on it is a niche market with little commercial interest.

We have managed to sustain open-source preservation software well for 7 years, but have encountered one problem. This brings me to the exception I mentioned earlier. To sustain the free software, paid support model you have to deliver visible value to your customers regularly and frequently. We try to release updated software every 2 months, and new content for preservation weekly. But this makes it difficult to commit staff resources to major improvements to the infrastructure. These are needed to address problems that don't impact customers yet, but will in a few years unless you work on them now.

The Mellon Foundation supports a number of open-source initiatives, and after discussing this problem with them they gave us a small grant specifically to work on enhancements to the LOCKSS system such as support for collecting websites that use AJAX, and for authenticating users via Shibboleth. Occasional grants of this kind may be needed to support open-source preservation infrastructure generally, even if pay-for-support can keep it running.

Unfortunately, economics aren't the only hard problem facing the long-term storage of data. There are serious technical problems too. Let's start by examining the technical problem in its most abstract form. Since 2007 I've been using the example of "A Petabyte for a Century". Think about a black box into which you put a Petabyte, and out of which a century later you take a Petabyte. Inside the box there can be as much redundancy as you want, on whatever media you choose, managed by whatever anti-entropy protocols you want. You want to have a 50% chance that every bit in the Petabyte is the same when it comes out as when it went in.

Now consider every bit in that Petabyte as being like a radioactive atom, subject to a random process that flips it with a very low probability per unit time. You have just specified a half-life for the bits. That half-life is about 60 million times the age of the universe. Think for a moment how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.
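
The required half-life follows from a short calculation; here is a sketch, assuming a petabyte of 10^15 bytes and treating bit flips as independent events:

    import math

    bits = 8 * 10**15            # one petabyte, as bits
    years = 100                  # the century
    survival_target = 0.5        # 50% chance that every bit comes out unchanged

    # Each bit independently survives a period T with probability (1/2)**(T / half_life),
    # so all of them survive with probability (1/2)**(bits * years / half_life).
    # Solving for the half-life that makes that probability equal survival_target:
    half_life = bits * years * math.log(2) / -math.log(survival_target)   # in years

    age_of_universe = 1.38e10    # years, roughly
    print(half_life / age_of_universe)            # ~5.8e7, i.e. about 60 million times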

Here's some back-of-the-envelope hand-waving. Amazon's S3 is a state-of-the-art storage system. Its design goal is an annual probability of loss of a data object of 10^-11. If the average object is 10K bytes, the bit half-life is about a million times the age of the universe, way too short to meet the requirement but still really hard to measure.

Note that the 10^-11 is a design goal, not the measured performance of the system. There's a lot of research into the actual performance of storage systems at scale, and it all shows them under-performing expectations based on the specifications of the media. Why is this? Real storage systems are large, complex systems subject to correlated failures that are very hard to model.

Worse, the threats against which they have to defend their contents are diverse and almost impossible to model. Nine years ago we documented the threat model we use for the LOCKSS system. We observed that most discussion of digital preservation focused on these threats:
  • Media failure
  • Hardware failure
  • Software failure
  • Network failure
  • Obsolescence
  • Natural Disaster
but that the experience of operators of large data storage facilities was that the significant causes of data loss were quite different:
  • Operator error
  • External Attack
  • Insider Attack
  • Economic Failure
  • Organizational Failure
Building systems to defend against all these threats combined is expensive, and can't ever be perfectly effective. So we have to resign ourselves to the fact that stuff will get lost. This has always been true; it should not be a surprise. And it is subject to the law of diminishing returns. Coming back to the economics, how much should we spend reducing the probability of loss?

Consider two storage systems with the same budget over a decade, one with a loss rate of zero, the other half as expensive per byte but which loses 1% of its bytes each year. Clearly, you would say the cheaper system has an unacceptable loss rate.

However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has preserved more than 5 times as much at the same cost.
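
One simple way to model that comparison (the exact accounting is an assumption, but it reproduces the decade figure): each year both systems spend the same budget, and the cheaper one adds twice as much content and then loses 1% of everything it holds.

    def preserved(units_per_year, annual_loss, years):
        """Content surviving after annual additions followed by an annual fractional loss."""
        total = 0.0
        for _ in range(years):
            total = (total + units_per_year) * (1.0 - annual_loss)
        return total

    reliable = preserved(1, 0.00, 10)   # full-price system, no loss
    cheap = preserved(2, 0.01, 10)      # half the price per byte, 1%/yr loss
    print(cheap / reliable)             # ~1.89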

Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users archive.org is currently the 150th most visited site, whereas loc.gov is the 1519th. For UK users archive.org is currently the 131st most visited site, whereas bl.uk is the 2744th.

Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive, it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more really is better.

To sum up, the good news is that sustainable preservation of digital content such as research data is possible, and the LOCKSS Program is an example.

The bad news is that people's expectations are way out of line with reality. It isn't possible to preserve nearly as much as people assume is already being preserved, nor as reliably as they assume it is already being done. This mismatch is going to increase. People don't expect more resources, yet they do expect a lot more data. They expect that the technology will get a lot cheaper, but the experts no longer believe it will.

Research data, libraries and archives are a niche market. Their problems are technologically challenging but there isn't a big payoff for solving them, so neither industry nor academia are researching solutions. We end up cobbling together preservation systems out of technology intended to do something quite different, like backups.

Meredith Farkas: Whistleblowers and what still isn’t transparent

Fri, 2014-09-26 03:41

Social media is something I have in common with popular library speaker Joe Murphy. We’ve both given talks about the power of social media at loads of conferences. I love the radical transparency that social media enables. It allows for really authentic connection and also really authentic accountability. So many bad products and so much bad behavior have come to light because of social media. Everyone with a cell phone camera can now be an investigative reporter. So much less can be swept under the rug. It’s kind of an amazing thing.

But what’s disturbing is what has not become more transparent. Sexual harassment for one. When a United States senator doesn’t feel like she can name the man who told her not to lose weight after having her baby because “I like my girls chubby,” then we know this problem is bigger than just libraryland.

It’s been no secret among many women (and some men) who attend and speak at conferences like Internet Librarian and Computers in Libraries that Joe Murphy has a reputation for using these conferences as his own personal meat markets. Whether it’s true or not, I don’t know. I’ve known these allegations since before 2010, which was when I had the privilege of attending a group dinner with him.

He didn’t sexually harass anyone at the table that evening, but his behavior was entitled, cocky, and rude. He barely let anyone else get a word in edgewise because apparently what he had to say (in a group with some pretty freaking illustrious people) was more important than what anyone else had to say. The host of the dinner apologized to me afterwards and said he had no idea what this guy was like. And that was the problem. This information clearly wasn’t getting to the people who needed it most; particularly the people who invited him to speak at conferences. For me, it only cemented the fact that it’s a man’s world (even in our female-dominated profession) and men can continue to get away with and profit from offering more flash than substance and behaving badly.

Why don’t we talk about sexual harassment in the open? I can only speak from my own experience not revealing a public library administrator who sexually harassed me at a conference. First, I felt embarrassed, like maybe I’d encouraged him in some way or did something to deserve it. Second, he was someone I’d previously liked and respected and a lot of other people liked and respected him, and I didn’t want to tarnish his reputation over something that didn’t amount to that much. Maybe also the fact that he was so respected also made me scared to say something, because, in the end, it could end up hurting me.

People who are brave enough to speak out about sexual harassment and name names are courageous. As Barbara Fister wrote, they are whistleblowers. They protect other women from suffering a similar fate, which is noble. When Lisa Rabey and nina de jesus (AKA #teamharpy) wrote about behavior from Joe Murphy that many of us had been hearing about for years, they were acting as whistleblowers, though whistleblowers who had only heard about the behavior second or third-hand, which I think is an important distinction. I believe they shared this information in order to protect other women. And now they’re being sued by Joe Murphy for 1.25 million dollars in damages for defaming his character. You can read the statement of claim here. I assume he is suing them in Canada because it’s easier to sue for libel and defamation outside of the U.S.

On his blog, Wayne Biven’s Tatum wonders “whether the fact of the lawsuit might hurt Murphy within the librarian community more than any accusations of sexual harassment.” Is it the Streisand effect, whereby Joe Murphy is bringing more attention to his alleged behavior by suing these women? It’s possible that this will bite him in the ass more than the original tweets and blog post (which I hadn’t seen prior) ever could. 

I fear the impact of this case will be that women feel even less safe speaking out against sexual harassment if they believe that they could be sued for a million or more dollars. In the end, how many of us really have “proof” that we were sexually harassed other than our word? If you know something that substantiates their allegations of sexual predatory behavior, consider being a witness in #teamharpy’s case. If you don’t but still want to help, contribute to their defense fund.

That said, that this information comes second or third-hand does concern me. I don’t know for a fact that Joe Murphy is a sexual predator. Do you? Here’s what I do know. Did he creep me out when I interacted with him? Yes. Did he creep out other women at conferences? Yes. Did he behave like an entitled jerk at least some of the time? Yes. Do many people resent the fact that a man with a few years of library experience who hasn’t worked at a library in years is getting asked to speak at international conferences when all he offers is style and not substance? Yes.

While all of the rumors about him that have been swirling around for at least the past 4-5 years may be 100% true, I don’t know if they are. I don’t know if anyone has come out and said they were harassed by him beyond the general “nice shirt” comment that creeped out many women. As anyone who has read my blog for a while knows, I am terrified of groupthink. So I feel really torn when it comes to this case. Part of me wonders whether my dislike of Joe Murphy makes me more prone to believe these things. Another part of me feels that these allegations are very consistent with my experience of him and with the rumors over these many years. But I’m not going to decide whether the allegations are true without hearing it from someone who experienced it first-hand.

I wish I could end this post on a positive note, but this is pretty much sad for everyone. Sad for the two librarians who felt they were doing a courageous thing (and may well have been) by speaking out and are now being threatened by a tremendously large lawsuit. Sad for the victims of harassment who may be less likely to speak out because of this lawsuit. And sad for Joe Murphy if he is truly innocent of what he’s been accused (and imagine for a moment the consequences of tarring and feathering an innocent man). I wish we lived in a world where we felt as comfortable reporting abuse and sexual harassment as we do other wrongdoing. I wish as sharp a light was shined on this as has recently been shined on police brutality, corporate misbehavior, and income inequality. And maybe the only positive is that this is shining a light on the fact that this happens and many women, even powerful women, do not feel empowered to report it.

Photo credit: She whispered into the wrong ears by swirling thoughts

Galen Charlton: Banned books and the library of Morpheus

Fri, 2014-09-26 03:23

A notion that haunts me is found in Neil Gaiman’s The Sandman: the library of the Dreaming, wherein can be found books that no earth-bound librarian can collect.  Books that caught existence only in the dreams – or passing thoughts – of their authors. The Great American Novel. Every Great American Novel, by all of the frustrated middle managers, farmers, and factory workers who had their heart attack too soon. Every Great Nepalese Novel.  The conclusion of the Wheel of Time, as written by Robert Jordan himself.

That library has a section containing every book whose physical embodiment was stolen.  All of the poems of Sappho. Every Mayan and Olmec text – including the ones that, in the real world, did not survive the fires of the invaders.

Books can be like cockroaches. Text thought long-lost can turn up unexpectedly, sometimes just by virtue of having been left lying around until someone thinks to take a closer look. It is not an impossible hope that one day another Mayan codex may make its reappearance, thumbing its nose at the colonizers and censors who despised it and the culture and people it came from.

Books are also fragile. Sometimes the censors do succeed in utterly destroying every last trace of a book. Always, entropy threatens all.  Active measures against these threats are required; therefore, it is appropriate that librarians fight the suppression, banning, and challenges of books.

Banned Books Week is part of that fight, and it is important that folks be aware of their freedom to read what they choose – and be aware that it is a continual struggle to protect that freedom.  Indeed, perhaps “Freedom to Read Week” better expresses the proper emphasis on preserving intellectual freedom.

But it’s not enough.

I am also haunted by the books that are not to be found in the Library of the Dreaming – because not even the shadow of their genesis crossed the mind of those who could have written them.

Because their authors were shot for having the wrong skin color.

Because their authors were cheated of an education.

Because their authors were sued into submission for daring to challenge the status quo.  Even within the profession of librarianship.

Because their authors made the decision to not pursue a profession in the certain knowledge that the people who dominated it would challenge their every step.

Because their authors were convinced that nobody would care to listen to them.

Librarianship as a profession must consider and protect both sides of intellectual freedom. Not just consumption – the freedom to read and explore – but also the freedom to write and speak.

The best way to ban a book is to ensure that it never gets written. Justice demands that we struggle against those who would not just ban books, but destroy the will of those who would write them.

CrossRef: CrossRef Indicators

Thu, 2014-09-25 20:35

Updated September 23, 2014

Total no. participating publishers & societies: 5,363
Total no. voting members: 2,609
% of non-profit publishers: 57%
Total no. participating libraries: 1,902
No. journals covered: 36,035
No. DOIs registered to date: 69,191,919
No. DOIs deposited in previous month: 582,561
No. DOIs retrieved (matched references) in previous month: 35,125,120
DOI resolutions (end-user clicks) in previous month: 79,193,741
