
Eric Hellman: Attribution Meets Open Access

planet code4lib - Mon, 2014-09-22 18:24
Credits Dancer (see on YouTube)It drives my kids crazy, but I always stay for the credits after the movie. I'm writing this while on a plane over the Atlantic, and I just watched Wes Anderson's Grand Budapest Hotel. Among the usual credits for the actors, the producers, the directors, writers, editors, composers, designers, musicians, key grips, best boys, animators, model makers and the like, Michael Taylor is credited as the painter of "Johannes von Hoytl's Boy with Apple" along with his model, Ed Munro. "The House of Waris" is credited for "Brass Knuckle-dusters and Crossed Key Pins". There's a "Drapesmaster", a Millener and two "Key Costume Cutters". There are even "Photochrom images courtesy of The Library of Congress". To reward me for watching to the end there's a funny Russian dancer over the balalaika chorus.

It says a lot about the movie industry that so much work has gone into the credits. They are a fitting recognition of the miracle of a myriad of talents collaborating to result in a Hollywood movie. But the maturity of the film industry is also reflected in the standardization of the form of this attribution.

The importance of attribution is similarly reflected by its presence in each of the Creative Commons licenses. But many of the digital media that have adopted Creative Commons licensing have not reached the sort of attribution maturity seen in the film industry. The book publishing industry, for example, hides the valuable contributions of copy editors, jacket designers, research assistants and others. It's standard practice to attribute a work to the author alone. If someone spends time to make an ebook work well, that generally doesn't get a credit alongside the author.

The Creative Commons licenses require attribution, but don't specify much about how the attribution is to be done, and it's taken a while for media specific conventions to emerge. It seems to be accepted practice, for example, that CC licensed blog posts require a back-link to the original blog post. People who use CC licensed photos to illustrate a slide presentation typically have a credits page with links to the sources at the end.

Signs of maturation were omnipresent at the 6th Conference for Open Access Scholarly Publishing, which I'm just returning from. Prominent in the list of achievements was the announcement of a "Shared Statement and Community Principles on Expectations of Scholarly Standards on Attribution", a set of attribution principles for open access scholarly publications, signed by all the important open access scholarly publishers.

The four agreed-upon principles are as follows:

  1. Researchers choosing Open Access and using liberal licenses do so because they wish to maximise access to and re-use of their work. We acknowledge the tradition of both freely giving knowledge to our communities and also the expectation that contributions will be respected and that full credit is given according to scholarly norms.
  2. Authors choose Creative Commons licenses in part to ensure attribution and the assignment of credit. The community expects that where a work is reprinted, collected, aggregated or otherwise re-used substantially as a whole that the original source, location and free availability of the original version will be both made explicit and emphasised.
  3. The community expects that where modifications have been made to an article that this will be made explicit and every practicable effort will be made to make the nature and scope of modifications explicit. Where a derivative is digital all practicable efforts should be made to make comparison with the original version as easy as possible for the user.
  4. The community assumes, consistent with the terms of the Creative Commons licenses, that unless noted otherwise authors have not endorsed any republication or modification of their original work. Where authors have explicitly endorsed the republication or modified version this should be made explicit in a way which is separate to the attribution.

These principles, and the implementation guidelines that will result from further consultations, are particularly needed because many scholars, while supporting the reuse enabled by CC BY licenses, are concerned about possible misuse. The principles reinforce that when a work is modified, the substance of the modifications should be made clear to the end user, and that further, there must be no implication that republication carries any endorsement by the original authors.

One thing that is likely to emerge from this process is the use of CrossRef DOIs as attribution URLs. DOIs can be resolved (via redirection) to an authoritative web address and can be maintained by the publisher so that links needn't break when content moves.

As scholarly content gets remixed, revised and repurposed, there will increasingly be a need to track contributions every bit as elaborate as for Grand Budapest Hotel. Imagine a paper by Alice analyzing data from Bob on a sample by Carol, with later corrections by Eve. Luckily we live in the future and there's already a technology and user framework that shows how it can be done. That technology, the future of attribution (I hope), is Distributed Version Control. A subsequent post will discuss why every serious publisher needs to understand GitHub.

The emphasis on community in the "Shared Statement" is vitally important. With consultation and shared values, we'll soon all be dancing at the end of the credits.

Manage Metadata (Diane Hillmann and Jon Phipps): Late to the party?

planet code4lib - Mon, 2014-09-22 17:01

In my post last week, I mentioned a paper that Gordon Dunsire, Jon Phipps and I had written for the IFLA Satellite Meeting in Paris last month “Linked Data in Libraries: Let’s make it happen!” (note the videos!). I wanted to talk about the paper and why we wrote it, but I’m not just going to summarize it–I wouldn’t want to spoil the paper for anyone!

The paper, “Versioning Vocabularies in a Linked Data World”, was written in part because we’d seen far too many examples of vocabulary management and distribution that paid little or no attention to the necessity to maintain vocabularies over time and to make them available (over and over again, of course) to the data providers using them. It goes without saying that the vocabularies were expected to change over time, but in too many cases, vocabulary owners distributed changes in document form, or as files with new data embedded but no indication of what had changed, or worse: nothing.

We have been thinking about this problem for a long time. Even the earliest instance of the NSDL Registry (precursor of the current Open Metadata Registry, or OMR, as we like to call it) incorporated a ‘history’ view of the data, basically the ‘who, what, when’ of every change made in every vocabulary. Later on, we added the ability to declare ‘versions’ of the vocabularies themselves, taking advantage of that granular history data, for those trying to manage the updating of their ‘product’ in a rational manner. Sadly enough, not very many of our users took advantage of that feature, and we’re not entirely sure why not, but there it was. Jon has always been frustrated with our first passes at this problem, and after Gordon and I discussed the problem with others at DC-2013 last year, and my rant about the lack of version control came out, it seemed time to think about the issue again.

At that point we were also planning our own big time versioning event: the unpublished first version of the RDA Element Sets were about to make their re-debut in ‘published’ form, reorganized, and with new URIs. Jon was also working on the GitHub connection with the OMR underlying the new RDA Registry site, working in a more automated mode as planned. He and Gordon and I had been discussing a new approach for some time, based on the way software is versioned and distributed, which is well-supported in Git and GitHub. So, as we drove back from ALA Midwinter in Philadelphia in January of last year, Jon and I blocked out the paper we’d agreed to do with Gordon on how we thought versioning should work in the semantic vocabulary world.

Consider: how do all of us computer nerds update our applications? Do we have to go to all sorts of websites (sometimes, but not always, prompted by an email) to determine which applications have changed and invoke an update? Well, sure, sometimes we do (particularly when they want more money!), but since the advent of the App Store and Google Play, we can do our updates much more easily. For the most part those updates are ‘pushed’ to us: we are told in a general way what has changed, we decide whether we want to update, and we click … and it’s done.

This is the way updates should happen in the Semantic Web data world, increasingly dependent on element sets and value vocabularies to provide descriptions of products of all kinds in order to provide access, drive sales or eyeballs, or support effective connections between resources. Now that we’re all reconciled to using URIs instead of text (even if our data hasn’t yet made that transition), shouldn’t we consider an important upside of that change, a simpler and more useful way to update our data?

So, I’ll quit there–go read the paper and let us know what you think. Don’t miss Gordon’s slides from Paris, available on his website. Note especially the last question on his final slide: “Is it time to get serious about linked data management?” We think it’s past time. After all, ‘management’ is our middle name.

LITA: LITA Members: take the LITA Education Survey

planet code4lib - Mon, 2014-09-22 16:42

LITA members, please participate in the LITA Education Survey. The survey was first sent out 2 weeks ago to all current LITA members. Another reminder will appear in LITA members' inboxes soon, or you can click the links in this posting. The survey should take no more than 10 minutes of your time and will help your LITA colleagues develop continuing education programs that meet your needs.

LITA Education Survey 2014

In our continuing efforts to make LITA education offerings meet the needs and wishes of our membership, we ask that you, the LITA members, take a few minutes to fill out the linked survey. We are looking for information on education offerings you have participated in recently and would like to know what topics, methods and calendar times work best for you.

The more responses we get, the better chance we have to create education offerings that provide excellent value to you, the LITA membership. We appreciate you taking 10 minutes of your time to complete the LITA Education Survey 2014.

Thank you for your time and input.

LITA Education Committee

Library of Congress: The Signal: 18 Years of Kairos Webtexts: An interview with Douglas Eyman & Cheryl E. Ball

planet code4lib - Mon, 2014-09-22 14:05

Cheryl E. Ball, associate professor of digital publishing studies at West Virginia University, is editor of Kairos

Since 1996 the electronic journal Kairos has published a diverse range of webtexts, scholarly pieces made up of a range of media and hypermedia. The 18 years of digital journal texts are both interesting in their own right and as a collection of complex works of digital scholarship that illustrate a range of sophisticated issues for ensuring long-term access to new modes of publication. Douglas Eyman, Associate Professor of Writing and Rhetoric at George Mason University is senior editor and publisher of Kairos. Cheryl E. Ball, associate professor of digital publishing studies at West Virginia University, is editor of Kairos. In this Insights Interview, I am excited to learn about the kinds of issues that this body of work exposes for considering long-term access to born-digital modes of scholarship. [There was also a presentation on Kairos at the Digital Preservation 2014 meeting.]

Trevor: Could you describe Kairos a bit for folks who aren’t familiar with it? In particular, could you tell us a bit about what webtexts are and how the journal functions and operates?

Doug: Webtexts are texts that are designed to take advantage of the web-as-concept, web-as-medium, and web-as-platform. Webtexts should engage a range of media and modes and the design choices made by the webtext author or authors should be an integral part of the overall argument being presented. One of our goals (that we’ve met with some success I think) is to publish works that can’t be printed out — that is, we don’t accept traditional print-oriented articles and we don’t post PDFs. We publish scholarly webtexts that address theoretical, methodological or pedagogical issues which surface at the intersections of rhetoric and technology, with a strong interest in the teaching of writing and rhetoric in digital venues.

Douglas Eyman, Associate Professor of Writing and Rhetoric at George Mason University is senior editor and publisher of Kairos

(As an aside, there was a debate in 1997-98 about whether we were publishing hypertexts, which then tended to be available in proprietary formats and platforms and not freely available on the WWW or not; founding editor Mick Doherty argued that we were publishing much more than only hypertexts, so we moved from calling what we published ‘hypertexts’ to ‘webtexts’ — Mick tells that story in the 3.1 loggingon column).

Cheryl: WDS (What Doug said.) One of the ways I explain webtexts to potential authors and administrators is that the design of a webtext should, ideally, enact authors’ scholarly arguments, so that the form and content of the work are inseparable.

Doug: The journal was started by an intrepid group of graduate students, and we’ve kept a fairly DIY approach since that first issue appeared on New Year’s day in 1996. All of our staff contribute their time and talents and help us to publish innovative work in return for professional/field recognition, so we are able to sustain a complex venture with a fairly unique economic model where the journal neither takes in nor spends any funds. We also don’t belong to any parent organization or institution, and this allows us to be flexible in terms of how the editors choose to shape what the journal is and what it does.

Cheryl: We are lucky to have a dedicated staff who are scattered across (mostly) the US: teacher-scholars who want to volunteer their time to work on the journal, and who implement the best practices of pedagogical models for writing studies into their editorial work. At any given time, we have about 25 people on staff (not counting the editorial board).

Doug: Operationally, the journal functions much like any other peer-reviewed scholarly journal: we accept submissions, review them editorially, pass on the ones that are ready for review to our editorial board, engage the authors in a revision process (depending on the results of the peer-review) and then put each submission through an extensive and rigorous copy-, design-, and code-editing process before final publication. Unlike most other journals, our focus on the importance of design and our interest in publishing a stable and sustainable archive mean that we have to add those extra layers of support for design-editing and code review: our published webtexts need to be accessible, usable and conform to web standards.

Trevor: Could you point us to a few particularly exemplary works in the journal over time for readers to help wrap their heads around what these pieces look like? They could be pieces you think are particularly novel or interesting or challenging or that exemplify trends in the journal. Ideally, you could link to it, describe it and give us a sentence or two about what you find particularly significant about it.

Cheryl: Sure! We sponsor an award every year for Best Webtext, and that’s usually where we send people to find exemplars, such as the ones Doug lists below.

Doug: From our peer-reviewed sections, we point readers to the following webtexts (the first two are especially useful for their focus on the process of webtext authoring and editing):

Cheryl: From our editorially (internally) reviewed sections, here are a few other examples:

Trevor: Given the diverse range of kinds of things people might publish in a webtext, could you tell us a bit about the kinds of requirements you have enforced upfront to try and ensure that the works the journal publishes are likely to persist into the future? For instance, any issues that might come up from embedding material from other sites, or running various kinds of database-driven works or things that might depend on external connections to APIs and such.

Doug: We tend to discourage work that is in proprietary formats (although we have published our fair share of Flash-based webtexts) and we ask our authors to conform to web standards (XHTML or HTML5 now). We think it is critical to be able to archive any and all elements of a given webtext on our server, so even in cases where we’re embedding, for instance, a YouTube video, we have our own copy of that video and its associated transcript.

One of the issues we are wrestling with at the moment is how to improve our archival processes so we don’t rely on third-party sites. We don’t have a streaming video server, so we use YouTube now, but we are looking at other options because YouTube allows large corporations to apply bogus copyright-holder notices to any video they like, regardless of whether there is any infringing content (as an example, an interview with a senior scholar in our field was flagged and taken down by a record company; there wasn’t even any background audio that could account for the notice. And since there’s a presumption of guilt, we have to go through an arduous process to get our videos reinstated.) What’s worse is when the video *isn’t* taken down, but the claimant instead throws ads on top of our authors’ works. That’s actually copyright infringement against us that is supported by YouTube itself.

Another issue is that many of the external links in works we’ve published (particularly in older webtexts) tend to migrate or disappear. We used to replace these where we could with links to (aka The Wayback Machine), but we’ve discovered that their archive is corrupted because they allow anyone to remove content from their archive without reason or notice.[1] So, despite its good intentions, it has become completely unstable as a reliable archive. But we don’t, alas, have the resources to host copies of everything that is linked to in our own archives.

Cheryl: Kairos holds the honor within rhetoric and composition of being the longest-running, and most stable, online journal, and our archival and technical policies are a major reason for that. (It should be noted that many potential authors have told us how scary those guidelines look. We are currently rewriting the guidelines to make them more approachable while balancing the need to educate authors on their necessity for scholarly knowledge-making and -preservation on the Web.)

Of course, being that this field is grounded in digital technology, not being able to use some of that technology in a webtext can be a rather large constraint. But our authors are ingenious and industrious. For example, Deborah Balzhiser et al created an HTML-based interface to their webtext that mimicked Facebook’s interface for their 2011 webtext, “The Facebook Papers.” Their self-made interface allowed them to do some rhetorical work in the webtext that Facebook itself wouldn’t have allowed. Plus, it meant we could archive the whole thing on the Kairos server in perpetuity.

Trevor: Could you give us a sense of the scope of the files that make up the issues? For instance, the total number of files, the range of file types you have, the total size of the data, and or a breakdown of the various kinds of file types (image, moving image, recorded sound, text, etc.) that exist in the run of the journal thus far?

Doug: The whole journal is currently around 20 GB — newer issues are larger in terms of data size because there has been an increase in the use of audio and video (luckily, HTML and CSS files don’t take up a whole lot of room, even with a lot of content in them). At last count, there are 50,636 files residing in 4,545 directories (this count includes things like all the system files for WordPress installs and so on). A quick summary of primary file types:

  • HTML files: 12247
  • CSS: 1234
  • JPG files: 5581
  • PNG: 3470
  • GIF: 7475
  • MP2/3/4: 295
  • MOV: 237
  • PDF: 191

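A tally like the one above can be produced with standard shell tools. This sketch builds a tiny throwaway directory (the file names are illustrative) and counts files per extension; pointed at a real journal tree, the same pipeline would yield the breakdown listed above:

```shell
# Count files by (lowercased) extension, most common first.
set -e
tmp=$(mktemp -d)
touch "$tmp/a.html" "$tmp/b.HTML" "$tmp/c.css" "$tmp/d.jpg"
find "$tmp" -type f -name '*.*' \
  | sed 's/.*\.//' \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -rn
```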
Cheryl: In fact, our presentation at Digital Preservation 2014 this year [was] partly about the various file types we have. A few years ago, we embarked on a metadata-mining project for the back issues of Kairos. Some of the fields we mined for included Dublin Core standards such as MIMEtype and DCMIType. DCMIType, for the most part, didn’t reveal too much of interest from our perspective (although I am sure librarians will see it differently!), but the MIMEtype search revealed both the range of filetypes we had published and how that range has changed over the journal’s 18-year history. Every webtext has at least one HTML file. Early webtexts (from 1996-2000ish) that have images generally have GIFs and, less prominently, JPEGs. But since PNGs rose to prominence (becoming an international standard in 2003), we began to see more and more of them. The same with CSS files around 2006, after web-standards groups started enforcing their use elsewhere on the Web. As we have all this rich data about the history of webtextual design, and too many research questions to cover in our lifetimes, we’ve released the data in Dropbox (until we get our field-specific data repository completed).

Trevor: In the 18 years that have transpired since the first issue of Kairos a lot has changed in terms of web standards and functionality. I would be curious to know if you have found any issues with how earlier works render in contemporary web browsers. If so, what is your approach to dealing with that kind of degradation over time?

Cheryl: If we find something broken, we try to fix it as soon as we can. There are lots of 404s to external links that we will never have the time or human resources to fix (anyone want to volunteer??), but if an author or reader notifies us about a problem, we will work with them to correct the glitch. One of the things we seem to fix often is repeating backgrounds. lol. “Back in the days…” when desktop monitors were tiny and resolutions were tinier, it was inconceivable that a background set to repeat at 1200 pixels would ever actually repeat. Now? Ugh.

But we do not change designs for the sake of newer aesthetics. In that respect, the design of a white-text-on-black-background from 1998 is as important a rhetorical point as the author’s words in 1998. And, just as the ideas in our scholarship grow and mature as we do, so do our designs, which have to be read in the historical context of the surrounding scholarship.

Of course, with the bettering of technology also comes our own human degradation in the form of aging and poorer eyesight. We used to mandate webtexts not be designed over 600 pixels wide, to accommodate our old branding system that ran as a 60-pixel frame down the left-hand side of all the webtexts. That would also allow for a little margin around the webtext. Now, designing for specific widths — especially ones that small — seems ludicrous (and too prescriptive), but I often find myself going into authors’ webtexts during the design-editing stage and increasing their typeface size in the CSS so that I can even read it on my laptop. There’s a balance I face, as editor, of retaining the authors’ “voice” through their design and making the webtext accessible to as many readers as possible. Honestly, I don’t think the authors even notice this change.
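The design-editing tweak described here is small and surgical. A hypothetical example of the kind of CSS change involved (the selector and values are illustrative, not taken from an actual webtext):

```css
/* Bump the base type size for readability while leaving the
   author's layout, colors, and typefaces untouched. */
body {
    font-size: 112.5%; /* roughly 18px at common browser defaults */
    line-height: 1.5;
}
```

Using a percentage rather than a fixed pixel size keeps the change proportional to the author's existing type scale, which is one way to preserve the design "voice" while improving legibility.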

Trevor: I understand you recently migrated the journal from a custom platform to the Open Journal System platform. Could you tell us a bit about what motivated that move and issues that occurred in that migration?

Doug: Actually, we didn’t do that.

Cheryl: Yeah, I know it sounds like we did from our Digital Preservation 2014 abstract, and we started to migrate, but ended up not following through for technical reasons. We were hoping we could create plug-ins for OJS that would allow us to incorporate our multimedia content into its editorial workflow. But it didn’t work. (Or, at least, wasn’t possible with the $50,000 NEH Digital Humanities Start-Up Grant we had to work with.) We wanted to use OJS to help streamline and automate our editorial workflow — you know, the parts about assigning reviewers and copy-editors, etc. — and as a way to archive those processes.

I should step back here and say that Kairos has never used a CMS; everything we do, we do by hand — manually SFTPing files to the server, manually making copies of webtext folders in our kludgy way of version control, using YahooGroups (because it was the only thing going in 1998 when we needed a mail system to archive all of our collaborative editorial board discussions) for all staff and reviewer conversations, etc.–not because we like being old school, but because there were always too many significant shortcomings with any out-of-the-box systems given our outside-the-box journal. So the idea of automating, and archiving, some of these processes in a centralized database such as OJS was incredibly appealing. The problem is that OJS simply can’t handle the kinds of multimedia content we publish. And rewriting the code-base to accommodate any plug-ins that might support this work was not in the budget. (We’ve written about this failed experiment in a white paper for NEH.)

[1] will obey robots.txt files if they ask not to be indexed. So, for instance, early versions of Kairos itself are no longer available on because such a file is on the Texas Tech server where the journal lived until 2004. We put that file there because we want Google to point to the current home of the journal, but we actually would like that history to be in the Internet Archive. You can think of this as just a glitch, but here’s the more pressing issue: if I find someone has posted a critical blog post of my work, and I ever get ahold of the domain it was originally posted at, I can take it down there *and* retroactively make it unavailable on, even if it used to show up there. Even without such nefarious purposes, just the constant trade in domains and site locations means that no researcher can trust that archive when using it for history or any kind of digital scholarship.
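For reference, the exclusion mechanism described in this footnote is just a plain-text robots.txt file at the site root. A minimal sketch of the dilemma's usual workaround (the path is illustrative, and `ia_archiver` is the user-agent the Wayback Machine's crawler has historically announced): block general crawlers from the old location while explicitly permitting the archive's crawler.

```
# Keep search engines pointed at the journal's new home...
User-agent: *
Disallow: /kairos/

# ...but let the Internet Archive's crawler keep the history.
User-agent: ia_archiver
Disallow:
```

Note this only governs crawling going forward; as the footnote explains, the Wayback Machine's retroactive removal behavior is the part no robots.txt directive can fix.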

LITA: Taking the Edge Off of Tech

planet code4lib - Mon, 2014-09-22 13:00
Image courtesy of Tina Franklin. Flickr 2013.

E-readers and tablets have become an increasingly popular way for patrons to access digital media. Mobile technology has altered the landscape of the types of services offered to public library patrons. Digital media services and distributors (i.e. iBookstore, Audible, Overdrive and Hoopla) allow patrons to download and stream ebooks, audiobooks, video and music. After happening upon the article “Shape Up Your Skills and Shake Up Your Library,” by Marshall Breeding for Computers in Libraries, I’m reminded that information professionals in public libraries must sharpen their tech skills in order to better serve their patrons. If you belong to a library that subscribes to a digital media distributor, such as Overdrive and Hoopla, you are most likely first-tier technical support for issues concerning the application and the device itself. For patrons who are not familiar with tablets and e-readers, their expectation of your assistance goes beyond navigating the library’s subscription service. You may find yourself giving instruction on where to find the wireless settings or how to properly turn the device off. It is natural to become intimidated by the technology when you’re sitting with a patron desperately attempting to figure out what the issue could be.

Not all public libraries are fortunate enough to have e-readers and tablets to train with. In that case, I suggest looking into alternative forms of instruction. Though I cannot promise a complete instructional guide, I’ll attempt to help you brush up on the light technical skills you’ll need before having to phone the professionals.

Familiarity is key
The first step in getting a better understanding of the technology is to become familiar with the exact services that your library is subscribed to. In the case of Overdrive and Hoopla, their services can be accessed using a computer. That is a great opportunity to explore the different features of the service. Be ready to answer certain questions: Does the library offer downloadable ebooks, audiobooks or video? What formats are they available in? What devices can be used with the service? If all else fails, you can always contact the service provider and ask for training materials or frequently asked questions and answers. If not already available, you can create instructional handouts for use by colleagues and patrons.

Take advantage of free services
To add some edge to your skills, consider utilizing the live product displays at electronics stores.
• Visit the Apple Store to use their iPads, iPad mini, etc.
• Best Buy has live displays of various Android OS tablets
• Target stores often feature Kindles and iPads
• Barnes and Noble stores have Nook displays

There are a plethora of alternate stores to consider. The imperative is to root around with the technology until you’re comfortable with its features. You want to know where the settings are located for each device because that knowledge will be useful at some point. And while you’re there, don’t be afraid to ask the salesperson questions about the device’s functionality. There is a high chance you will ask a question that will later be asked of you.

I make no assumptions here. Not all libraries have access to instructional materials or handouts for patrons. My aim is to create a starting point for self-training and instruction that is free and can be passed along to colleagues and patrons.

David Rosenthal: Three Good Reads

planet code4lib - Mon, 2014-09-22 12:11
Below the fold I'd like to draw your attention to two papers and a post worth reading.

Cappello et al have published an update to their seminal 2009 paper Towards Exascale Resilience called Towards Exascale Resilience: 2014 Update. They review progress in some critical areas in the past five years. I've referred to the earlier paper as an example of the importance, and the difficulty, of fault-tolerance at scale. As scale increases, faults become part of the normal state of the system; they cannot be treated as an exception. It is nevertheless disappointing that the update, like its predecessor, deals only with exascale computation, not with exascale long-term storage. Their discussion of storage is limited to the performance of short-term storage in checkpointing applications. This is a critical issue, but a complete exascale system will need a huge amount of longer-term storage. The engineering problems in providing it should not be ignored.

Dave Anderson of Seagate first alerted me to the fact that, in the medium term, manufacturing economics make it impossible for flash to displace hard disk as the medium for most long-term near-line bulk storage. Fontana et al from IBM Almaden have now produced a comprehensive paper, The Impact of Areal Density and Millions of Square Inches (MSI) of Produced Memory on Petabyte Shipments of TAPE, NAND Flash, and HDD Storage Class Memories that uses detailed industry data on flash, disk and tape shipments, technologies and manufacturing investments from 2008-2012 to reinforce this message. They also estimate the scale of investment needed to increase production to meet an estimated 50%/yr growth in data. The mismatch between the estimates of data growth and the actual shipments of media on which to store it is so striking that they are forced to cast doubt on the growth estimates. It is clear from their numbers that the industry will not make the mistake of over-investing in manufacturing capacity, driving prices, and thus margins, down. This provides significant support for our argument that Storage Will Be Much Less Free Than It Used To Be.

Henry Newman has a post up at Enterprise Storage entitled Ensuring the Future of Data Archiving, discussing the software architecture that future data archives require. Although I agree almost entirely with Henry's argument, I think he doesn't go far enough. We need to fix the system, not just the software. I will present my much more radical view of future archival system architecture in a talk at the Library of Congress' Designing Storage Architectures workshop. The text will go up here in a few days.

Nick Ruest: Islandora and nginx

planet code4lib - Mon, 2014-09-22 11:40
Islandora and nginx Background

I have been doing a fair bit of scale testing for York University Digital Library over the last couple of weeks. Most of it has been focused on horizontal scaling of the traditional Islandora stack (Drupal, Fedora Commons, FedoraGSearch, Solr, and aDORe-Djatoka). The stack is traditionally run with Apache2 in front of it, which reverse proxies the parts of the stack that are Tomcat webapps. I was curious whether the stack would work with nginx, and whether we would get any noticeable improvements by just switching from Apache2 to nginx. The preliminary good news is that the stack works with nginx (I'll outline the config below). The not surprising news, according to this, is that I'm not seeing any noticeable improvements. If time permits, I'll do some real benchmarking.

Islandora nginx configurations

Having no experience with nginx, I started searching around and found a config by David StClair that worked. With a few slight modifications, I was able to get the stack up and running with no major issues. The only major item I needed to figure out was reverse proxying aDORe-djatoka so that it would play nice with the default settings for Islandora OpenSeadragon. All this turned out to be was figuring out what the ProxyPass and ProxyPassReverse directive equivalents were for nginx. It turns out to be very straightforward. With Apache2, we needed:

    # Fedora Commons/Islandora proxying
    ProxyRequests Off
    ProxyPreserveHost On
    Order deny,allow
    Allow from all
    ProxyPass /adore-djatoka
    ProxyPassReverse /adore-djatoka

This gives us a nice dog in a hat with Apache2.

With nginx we use the proxy_redirect directive.

    server {
        location /adore-djatoka {
            proxy_pass http://localhost:8080/adore-djatoka;
            proxy_redirect http://localhost:8080/adore-djatoka /adore-djatoka;
        }
    }

This gives us a nice dog in a hat with nginx.

That's really the only major modification I had to make to get the stack running with nginx. Here is my config, adapted from David StClair's example.

    server {
        server_name;
        root /path/to/drupal/install; ## <-- Your only path reference.

        # Enable compression, this will help if you have for instance advagg module
        # by serving Gzip versions of the files.
        gzip_static on;

        location = /favicon.ico {
            log_not_found off;
            access_log off;
        }

        location = /robots.txt {
            allow all;
            log_not_found off;
            access_log off;
        }

        # Very rarely should these ever be accessed outside of your lan
        location ~* \.(txt|log)$ {
            allow;
            deny all;
        }

        location ~ \..*/.*\.php$ {
            return 403;
        }

        # No no for private
        location ~ ^/sites/.*/private/ {
            return 403;
        }

        # Block access to "hidden" files and directories whose names begin with a
        # period. This includes directories used by version control systems such
        # as Subversion or Git to store control files.
        location ~ (^|/)\. {
            return 403;
        }

        location / {
            # This is cool because no php is touched for static content
            try_files $uri @rewrite;
            proxy_read_timeout 300;
        }

        location /adore-djatoka {
            proxy_pass http://localhost:8080/adore-djatoka;
            proxy_redirect http://localhost:8080/adore-djatoka /adore-djatoka;
        }

        location @rewrite {
            # You have 2 options here
            # For D7 and above:
            # Clean URLs are handled in drupal_environment_initialize().
            rewrite ^ /index.php;
            # For Drupal 6 and below:
            # Some modules enforce no slash (/) at the end of the URL
            # Else this rewrite block wouldn't be needed (GlobalRedirect)
            #rewrite ^/(.*)$ /index.php?q=$1;
        }

        # For Munin
        location /nginx_status {
            stub_status on;
            access_log off;
            allow;
            deny all;
        }

        location ~ \.php$ {
            fastcgi_split_path_info ^(.+\.php)(/.+)$;
            # NOTE: You should have "cgi.fix_pathinfo = 0;" in php.ini
            include fastcgi_params;
            fastcgi_param SCRIPT_FILENAME $request_filename;
            fastcgi_intercept_errors on;
            fastcgi_pass;
        }

        # Fighting with Styles? This little gem is amazing.
        # This is for D7 and D8
        location ~ ^/sites/.*/files/styles/ {
            try_files $uri @rewrite;
        }

        location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {
            expires max;
            log_not_found off;
        }
    }

tags: drupal, islandora, apache2, nginx

DuraSpace News: Open Repository: Work, Plus Hack Day Fun

planet code4lib - Mon, 2014-09-22 00:00

From James Evans, Open Repository, BioMed Central

Peter Sefton: Digital Object Pattern (DOP) vs chucking files in a database, approaches to repository design

planet code4lib - Sun, 2014-09-21 23:09

At work, in the eResearch team at the University of Western Sydney, we’ve been discussing various software options for working-data repositories for research data, and holding a series of ‘tool days’: informal hack-days where we try out various software packages. For the last few months we’ve been looking at “working-data repository” software for researchers in a principled way, searching for one or more perfect Digital Object Repositories for Academe (DORAs).

One of the things I’ve been ranting on about is the flexibility of the “Digital Object Pattern” (we always need more acronyms, so let’s call it DOP) for repository design, as implemented by the likes of ePrints, DSpace, Omeka, CKAN and many of the Fedora Commons based repository solutions. At its most basic, this means a repository that is built around a core set of objects (which might be called something like an Object, an ePrint, an Item, or a Data Set depending on which repository you’re talking to). These Digital Objects have:

  • Object level Metadata
  • One or more ‘files’ or ‘datastreams’ or ‘bitstreams’, which may themselves be metadata. Depending on the repository these may or may not have their own metadata.

Basic DOP Pattern

There are infinite ways to model a domain but this is a tried-and-tested pattern which is worth exploring for any repository, if only because it’s such a common abstraction that lots of protocols and user interface conventions have grown up around it.

I found this discussion of the Digital Object structure used in CNRI’s Digital Object Repository Server (DORS), obviously a cousin of DORA.

This data structure allows an object to have the following:

  • a set of key-value attributes that describe the object, one of which is the object’s identifier

  • a set of named ‘data elements’ that hold potentially large byte sequences (analogous to one or more data files)

  • a set of key-value attributes for each of the data elements

This relatively simple data structure allows for the simple case, but is sufficiently flexible and extensible to incorporate a wide variety of possible structures, such as an object with extensive metadata, or a single object which is available in a number of different formats. This object structure is general enough that existing services can easily map their information-access paradigm onto the structure, thus enhancing interoperability by providing a common interface across multiple and diverse information and storage systems. An example application of the DO data model is illustrated in Figure 1.
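The structure described above is simple enough to sketch directly. Here is a minimal, hypothetical Python model of such a Digital Object (the class and field names are illustrative, not taken from DORS or any repository's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class DataElement:
    """A named, potentially large byte sequence with its own key-value attributes."""
    name: str
    content: bytes
    attributes: dict = field(default_factory=dict)

@dataclass
class DigitalObject:
    """Object-level key-value attributes plus a set of named data elements."""
    identifier: str
    attributes: dict = field(default_factory=dict)
    elements: dict = field(default_factory=dict)  # element name -> DataElement

    def add_element(self, name, content, **attrs):
        self.elements[name] = DataElement(name, content, attrs)

# One object available in two formats, plus a derived thumbnail
obj = DigitalObject("obj:42", {"title": "Field survey photos"})
obj.add_element("img01.tiff", b"...", mimetype="image/tiff")
obj.add_element("img01.jpg", b"...", mimetype="image/jpeg")
obj.add_element("img01.thumb.jpg", b"...", mimetype="image/jpeg",
                relation="thumbnail-of img01.jpg")
```

Note how the element-level attributes can carry the "this is a thumbnail of that" relationship explicitly, rather than leaving it implicit in file-naming conventions.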

To the above list of features and advantages I’d add a couple of points on how to implement the ideal Digital Object repository:

  • Every modern repository should make it easy for people to do linked data. Instead of merely key-value attributes that describe the object, it would be better to allow for and encourage RDF-style predicate / object metadata where both the predicate and object are HTTP URIs with human-friendly labels. This is implemented natively in Fedora Commons v4. But when you are using the DOP it’s not essential as you can always add an RDF metadata data-element/file.
  • It’s nice if the files also get metadata as in the CNRI Digital Object, but using the DOP you can always add ‘file’ that describes the file relationships rather than relying on conventions like file-extensions or suffixes to say stuff like “This is a thumbnail preview of img01.jpg”
  • There really should be a way to do relationships with other objects but again, the DOP means you can DIY this feature with a ‘relationships’ data element.

(I’m trying to keep this post reasonably short, but just quickly, another really good repository pattern that complements DOP is to keep separate the concerns of Storing Stuff from Indexing Stuff for Search and Browse. That is, the Digital Objects should be stashed somewhere with all their metadata and data, and no matter what metadata type you’re using you build one or more discovery indexes from that. This is worth mentioning because as soon as some people see RDF they immediately think Triple Store, OK, but for repository design I think it’s more helpful to think Triple Index. That is, treat the RDF reasoner, SPARQL query endpoint etc as a separate concern from repositing.)
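That separation of concerns can be shown with a toy example: the store holds the Digital Objects as the source of truth, and the discovery index is just a derived artifact you can throw away and rebuild at will (a sketch with hypothetical names, not any particular repository's API):

```python
from collections import defaultdict

def build_index(objects):
    """Build a discovery index mapping (field, token) pairs to object identifiers.

    The store remains the source of truth; this index is disposable and can be
    regenerated from the stored objects at any time.
    """
    index = defaultdict(set)
    for obj_id, metadata in objects.items():
        for key, value in metadata.items():
            for token in str(value).lower().split():
                index[(key, token)].add(obj_id)
    return index

# A toy "store" of object-level metadata
store = {
    "obj:1": {"title": "Rainfall time series", "creator": "Gerry"},
    "obj:2": {"title": "Soil moisture series", "creator": "Jess"},
}
index = build_index(store)
print(sorted(index[("title", "series")]))  # ['obj:1', 'obj:2']
```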

The DOP contrasts with a file-centric pattern where every file is modelled separately, with its own metadata, which is the approach taken by HIEv, the environmental-science Working Data Repository we looked at last week. Theoretically, this gives you infinite flexibility, but in practice it makes it harder to build a usable data repository.

Files as primary repository objects

Once your repository starts having a lot of stuff in it like image thumbnails, derived files like OCRed text, and transcoded versions of files (say from the proprietary TOA5 format into NETCDF) then you’re faced with the challenge of indexing them all, for search and browse in a way that they appear to clump together. I think that as HIEv matures and more and more relationships between files become important then we’ll probably want to add container objects that automatically bundle together all the related bits and pieces to do with a single ‘thing’ in the repository. For example, a time series data set may have the original proprietary file format, some rich metadata, a derived file in a standard format, a simple plot to preview the file contents, and re-sampled data set at lower resolution, all of which really have more or less the same metadata about where they came from, when, and some shared utility. So, we’ll probably end up with something like this:

Adding an abstraction to group files into Objects (once the UI gets unmanageable)

Draw a box around that and what have you got?

The Digital Object Pattern, that’s what, albeit probably implemented in a somewhat fragile way.

With the DOP, as with any repository implementation pattern you have to make some design decisions. Gerry Devine asked at our tools day this week, what do you do about data-items that are referenced by multiple objects?

First of all, it is possible for one object to reference another, or data elements in another, but if there’s a lot of this going on then maybe the commonly re-used data elements could be put in their own object. A good example of this is the way WordPress, which is probably where you’re reading this, works. All images are uploaded into a media collection, and then referenced by posts and pages: an image doesn’t ‘belong’ to a document except by association, if the document calls it in. This is a common approach for content management systems, allowing for re-use of assets across objects, but if you were building a museum collection project with a Digital Object for each physical artefact, it might be better for practical reasons to store images of objects as data elements on the object, and other images which might be used for context etc separately as image objects.

Of course if you’re a really hardcore developer you’ll probably want to implement the most flexible possible pattern and put one file per object, with a ‘master object’ to tie them together. This makes development of a usable repository significantly harder. BTW, you can do it using the DOP with one-file per Digital Object, and lots of relationships. Just be prepared for orders of magnitude more work to build a robust, usable system.

Digital Object Pattern (DOP) vs chucking files in a database, approaches to repository design is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Code4Lib: Code4Lib North (Ottawa): Tuesday October 7th, 2014

planet code4lib - Sun, 2014-09-21 18:38


  • Mark Baker - Principal Architect at Zepheira will provide a brief overview of some of Zepheira’s BibFrame tools in development.
  • Jennifer Whitney - Systems Librarian at MacOdrum Library will present OpenRefine (formerly Google Refine) – a neat and powerful tool for cleaning up messy data.
  • Sarah Simpkin, GIS and Geography Librarian & Catherine McGoveran, Government Information Librarian (both from UOttawa Library) - will team up to present on a recent UOttawa sponsored Open Data Hackfest as well as to introduce you to Open Data Ottawa.

Date: Tuesday October 7th, 2014, 7:00PM (19h00)

Location: MacOdrum Library, Carleton University, 1125 Colonel By Drive, Ottawa, ON, Ottawa, ON (map)

RSVP: You can RSVP on code4lib Ottawa Meetup page

David Rosenthal: Utah State Archives has a problem

planet code4lib - Sun, 2014-09-21 04:55
A recent discussion on the NDSA mailing list covered the Utah State Archives' struggle with the costs of being forced to use Utah's state IT infrastructure for preservation. Below the fold, some quick comments.

Here's a summary of the situation the Archives finds itself in:
we actually have two separate copies of the AIP. One is on m-disc and the other is on spinning disk (a relatively inexpensive NAS device connected to our server, for which we pay our IT department each month). ... We have centralized IT, where there is one big data center and servers are virtualized. Our IT charges us a monthly rate for not just storage, but also all of their overhead to exist as a department. ... and we are required by statute to cooperate with IT in this model, so we can't just go out and buy/install whatever we want. For an archives, that's a problem, because our biggest need is storage but we are funded based upon the number of people we employ, not the quantity of data we need to store, and convincing the Legislature that we need $250,000/year for just one copy of 50 TB of data is a hard sell, never mind additional copies for SIP, AIP, and/or DIP.

Michelle Kimpton, who is in the business of persuading people that using DuraCloud is cheaper and better than doing it yourself, leaped at the opportunity this offered (my emphasis):
If I look at Utah State Archive storage cost, at $5,000 per year per TB vs. Amazon S3 at $370/year/TB it is such a big gap I have a hard time believing that Central IT organizations will be sustainable in the long run.  Not that Amazon is the answer to everything, but they have certainly put a stake in the ground regarding what spinning disk costs, fully loaded( meaning this includes utilities, building and personnel). Amazon S3 also provides 3 copies, 2 onsite and one in another data center.

I am not advocating by any means that S3 is the answer to it all, but it is quite telling to compare the fully loaded TB cost from an internal IT shop vs. the fully loaded TB cost from Amazon.
I appreciate you sharing the numbers Elizabeth and it is great your IT group has calculated what I am guessing is the true cost for managing data locally.

Elizabeth Perkes for the Archives responded:
I think using Amazon costs more than just their fees, because someone locally still has to manage any server space you use in the cloud and make sure the infrastructure is updated. So then you either need to train your archives staff how to be a system administrator, or pay someone in the IT community an hourly rate to do that job. Depending on who you get, hourly rates can cost between $75-150/hour, and server administration is generally needed at least an hour per week, so the annual cost of that service is an additional $3,900-$7,800. Utah's IT rate is based on all costs to operate for all services, as I understand it. We have been using a special billing rate for our NAS device, which reflects more of the actual storage costs than the overhead, but then the auditors look at that and ask why that rate isn't available to everyone, so now IT is tempted to scale that back. I just looked at the standard published FY15 rates, and they have dropped from what they were a couple of years ago. The official storage rate is now .2386/GB/month, which is $143,160/year for 50 TB, or $2,863.20 per TB/year.

But this doesn't get at the fundamental flaws in Michelle's marketing:
  • She suggests that Utah's IT charges reflect "the true cost for managing data locally". But that isn't what the Utah Archives are doing. They are buying IT services from a competitor to Amazon, one that they are required by statute to buy from. 
  • She compares Utah's IT with S3. S3 is a storage-only product. Using it cost-effectively, as Elizabeth points out, involves also buying AWS compute services, which is a separate business of Amazon's with its own P&L and pricing policies. For the Archives, Utah IT is in effect both S3 and AWS, so the comparison is misleading.
  • The comparison is misleading in another way. Long-term, reliable storage is not the business Utah IT is in. The Archives are buying storage services from a compute provider, not a storage provider. It isn't surprising that the pricing isn't competitive.
  • But more to the point, why would Utah IT bother to be competitive? Their customers can't go any place else, so they are bound to get gouged. I'm surprised that Utah IT is only charging 10 times the going rate for an inferior storage product.
  • And don't fall for the idea that Utah IT is only charging what they need to cover their costs. They control the costs, and they have absolutely no incentive to minimize them. If an organization can hire more staff and pass the cost of doing so on to customers who are bound by statute to pay for them, it is going to hire a lot more staff than an organization whose customers can walk.
As I've pointed out before, Amazon's margins on S3 are enviable. You don't need to be very big to have economies of scale enough to undercut S3, as the numbers from Backblaze demonstrate. The Archives' 50TB is possibly not enough to do this if they were actually managing the data locally.
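For what it's worth, the per-terabyte figures quoted above are easy to sanity-check with a few lines of arithmetic:

```python
# Published Utah IT FY15 storage rate: $0.2386/GB/month
rate_per_gb_month = 0.2386

# Per TB per year (1 TB = 1000 GB, as in the post's arithmetic)
per_tb_year = rate_per_gb_month * 1000 * 12
print(round(per_tb_year, 2))        # 2863.2  -> $2,863.20/TB/year

# Annual cost of 50 TB at that rate
print(round(per_tb_year * 50))      # 143160  -> $143,160/year

# Ratio to the quoted Amazon S3 figure of $370/TB/year
print(round(per_tb_year / 370, 1))  # roughly 7.7x, even at the reduced FY15 rate
```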

But the Archives might well employ a strategy similar to the one I suggested for the Library of Congress Twitter collection. They already keep a copy on m-disc. Suppose they kept two copies on m-disc, as the Library keeps two copies on tape, and regarded that as their preservation solution. Then they could use Amazon's Reduced Redundancy Storage and AWS virtual servers as their access solution. Running frequent integrity checks might take an additional small AWS instance, and any damage detected could be repaired from one of the m-disc copies.
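The integrity checking mentioned here boils down to fixity: compare fresh checksums of the access copies against a manifest recorded when the AIPs were written. A minimal sketch (the manifest format and `check_fixity` helper are hypothetical, not any archive's actual tooling):

```python
import hashlib
import json
import pathlib

def sha256(path, chunk=1 << 20):
    """Checksum a file in 1 MB chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def check_fixity(root, manifest_path):
    """Return the files whose current checksum no longer matches the manifest."""
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    root = pathlib.Path(root)
    return [name for name, digest in manifest.items()
            if sha256(root / name) != digest]
```

Any file this reports would be restored from one of the m-disc preservation copies.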

Using the cloud for preservation is almost always a bad idea. Preservation is a base-load activity whereas the cloud is priced as a peak-load product. But the spiky nature of current access to archival collections is ideal for the cloud.

John Miedema: “Book Was There” by Andrew Piper. If we’re going to have ebooks that distract us, we might as well have ones that help us analyse too.

planet code4lib - Sat, 2014-09-20 18:56

“I can imagine a world without books. I cannot imagine a world without reading” (Piper, ix). In these last few generations of print there is nothing keeping book lovers from reading print books. Yet with each decade the print book yields further to the digital. But there it is, we are the first few generations of digital, and we are still discovering what that means for reading. It is important to document this transition. In Book Was There: Reading in Electronic Times, Piper describes how the print book is shaping the digital screen and what it means for reading.

Book was there. It is a quote from Gertrude Stein, who understood that it matters deeply where one reads. Piper: “my daughter … will know where she is when she reads, but so too will someone else.” (128) It is a warm promise and an observation that could be ominous, but still being explored for possibilities.

The differences between print and digital are complex, and Piper is not making a case for or against books. The book is a physical container of letters. The print book is “at hand,” a continuous presence, available for daily reference and so capable of reinforcing new ideas. The word “digital” comes from “digits” (at least in English), the fingers of the hand. Digital technology is ambient, but could allow for more voices, more debate. On the other hand, “For some readers the [print] book is anything but graspable. It embodies … letting go, losing control, handing over.” (12) And internet users are known to flock together, reinforcing what they already believe, ignoring dissent. Take another example. Some criticize the instability of the digital. Turn off the power and the text is gone. Piper counters that digital text is incredibly hard to delete, with immolation of the hard drive being the NSA recommended practice.

Other differences are still debated. There is a basic two-dimensional nature to the book, with pages facing one another and turned. One wonders if this duality affords reflection. Does the return to one-dimensional scrolling of the web page numb the mind? Writing used to be the independent act of one or two writers. Reading was a separate event. Digital works like Wikipedia are written by many contributors, organized into sections. Piper wonders whether it is possible to have collaborative writing that is also tightly woven, like literature. There is the recent example of 10 PRINT, written by ten authors in one voice. Books have always been shared, a verb that has its origins in “shearing … an act of forking.” (88) With digital, books can be shared more easily, and readers can publish endings of their own. Books are forked into different versions. Piper cautions that over-sharing can lead to the forking that ended the development of Unix. But we now have the successful Unix. Is there a downside?

Scrolling aside, digital is really a multidimensional medium. Text has been rebuilt from the ground up, with numbers first. New deep kinds of reading are becoming possible. Twenty-five years ago a professor of mine lamented that he could not read all the academic literature in his discipline. Today he can. Piper introduces what is being called “distant reading”: the use of big data technologies, natural language processing, and visualization, to analyze the history of literature at the granular level of words. In his research, he calculates how language influences the writing of a book, and how in turn the book changes the language of its time. It measures a book in a way that was never possible with disciplined close reading or speed reading. “If we’re going to have ebooks that distract us, we might as well have ones that help us analyse too.” (148)

Piper embraces the fact that we now have new kinds of reading. He asserts that these practices need not replace the old. Certainly there will always be print books for those of us who love a good slow read. I do think, however, that trade-offs are being made. Books born digital are measurably shorter than print, more suited to quick reading and analysis by numbers. New authors are writing to digital readers. Readers and reading are being shaped in turn. The reading landscape is changing. These days I am doubtful that traditional reading of print books — or even ebooks — will remain a common practice. There it is.

District Dispatch: “Outside the Lines” at ICMA

planet code4lib - Fri, 2014-09-19 21:14

(From left) David Singleton, Director of Libraries for the Charlotte Mecklenburg Library, with Public Library Association (PLA) Past President Carolyn Anthony, PLA Director Barb Macikas and PLA President Larry Neal after a tour of ImaginOn.

This week, many libraries are inviting their communities to reconnect as part of a national effort called Outside the Lines (September 14-20). Since my personal experience of new acquaintances often includes an exclamation of “I didn’t know libraries did that,” and this experience is buttressed by Pew Internet Project research that finds that only about 23 percent of people who already visit our libraries feel they know all or most of what we do, the need to invite people to rethink libraries is clear.

On the policy front, this also is a driving force behind the Policy Revolution! initiative—making sure national information policy matches the current and emerging landscape of how libraries are serving their communities. One of the first steps is simply to make modern libraries more visible to key decision-makers and influencers.

One of these influential groups, particularly for public libraries, is the International City/County Management Association (ICMA), which concluded its 100th anniversary conference in Charlotte this past week. I enjoyed connecting with city and county managers and their professional staffs over several days, both informally and formally through three library-related presentations.

The Aspen Institute kicked off my conference experience with a preview and discussion of its work emerging from the Dialogue on Public Libraries. Without revealing any details that might diminish the national release of the Aspen Institute report to come in October, I can say it was a lively and engaged discussion with city and county managers from communities of all sizes across the globe. One theme that emerged and resonated throughout the conference was breaking down silos and increasing collaboration. One participant described this force factor as “one plus one equals three” and referenced the ImaginOn partnership between the Charlotte Mecklenburg Library and the Children’s Theatre of Charlotte.

A young patron enjoys a Sunday afternoon at ImaginOn.

While one might think that the level of library knowledge and engagement in the room was perhaps exceptional, throughout my conversations, city and county managers described new library building projects and renovations, efforts to increase local millages, and proudly touted the energy and expertise of the library directors they work with in building vibrant and informed communities. In fact, they sounded amazingly like librarians in their enthusiasm and depth of knowledge!

Dr. John Bertot and I shared findings and new tools from the Digital Inclusion Survey, with a particular focus on how local communities can use the new interactive mapping tools to connect library assets to community demographics and concerns. ICMA is a partner with the American Library Association (ALA) and the University of Maryland Information Policy & Access Center on the survey, which is funded by the Institute of Museum and Library Services (IMLS). Through our presentation (ppt), we explored the components of digital inclusion and key data related to technology infrastructure, digital literacy and programs and services that support education, civic engagement, workforce and entrepreneurship, and health and wellness. Of greatest interest was—again—breaking down barriers…in this case among diverse datasets relating libraries and community priorities.

Finally, I was able to listen in on a roundtable on Public Libraries and Community Building in which the Urban Libraries Council (ULC) shared the Edge benchmarks and facilitated a conversation about how the benchmarks might relate to city/county managers’ priorities and concerns. One roundtable participant from a town of about 3,300 discovered during a community listening tour that the library was the first place people could send a fax; and often where they used a computer and the internet for the first time. How could the library continue to be the “first place” for what comes next in new technology? The answer: you need to have facility and culture willing to be nimble. One part of preparing the facility was to upgrade to a 100 Mbps broadband connection, which has literally increased traffic to this community technology hub as people drive in with their personal devices.

I was proud to get Outside the Lines at the ICMA conference, and am encouraged that so many of these city and county managers already had “met” the 21st century library and were interested in working together for stronger cities, towns, counties and states. Thanks #ICMA14 for embracing and encouraging library innovation!

The post “Outside the Lines” at ICMA appeared first on District Dispatch.

FOSS4Lib Recent Releases: Evergreen - 2.5.7-rc1

planet code4lib - Fri, 2014-09-19 20:28

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

Package: Evergreen
Release Date: Friday, September 5, 2014

FOSS4Lib Recent Releases: Evergreen - 2.6.3

planet code4lib - Fri, 2014-09-19 20:27

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

Package: Evergreen
Release Date: Friday, September 5, 2014

FOSS4Lib Recent Releases: Evergreen - 2.7.0

planet code4lib - Fri, 2014-09-19 20:27

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

Package: Evergreen
Release Date: Thursday, September 18, 2014

FOSS4Lib Upcoming Events: Fedora 4.0 in Action at The Art Institute of Chicago and UCSD

planet code4lib - Fri, 2014-09-19 20:16
Date: Wednesday, October 15, 2014 - 13:00 to 14:00
Supports: Fedora Repository

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

Presented by: Stefano Cossu, Data and Application Architect, Art Institute of Chicago and Esmé Cowles, Software Engineer, University of California San Diego
Join Stefano and Esmé as they showcase new pilot projects built on Fedora 4.0 Beta at the Art Institute of Chicago and the University of California San Diego. These projects demonstrate the value of adopting Fedora 4.0 Beta and taking advantage of new features and opportunities for enhancing repository data.

HangingTogether: Talk Like a Pirate – library metadata speaks

planet code4lib - Fri, 2014-09-19 19:32

Pirate Hunter, Richard Zacks

Friday, 19 September is of course well known as International Talk Like a Pirate Day. In order to mark the day, we created not one but FIVE lists (rolled out over this whole week). This is part of our What In the WorldCat? series (#wtworldcat lists are created by mining data from WorldCat in order to highlight interesting and different views of the world’s library collections).

If you have a suggestion for something you’d like us to feature, let us know or leave a comment below.

About Merrilee Proffitt


FOSS4Lib Upcoming Events: VuFind Summit 2014

planet code4lib - Fri, 2014-09-19 19:18
Date: Monday, October 13, 2014 - 08:00 to Tuesday, October 14, 2014 - 17:00
Supports: VuFind

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

This year's VuFind Summit will be held on October 13-14 at Villanova University (near Philadelphia).

Registration for the two-day event is $40 and includes both morning refreshments and a full lunch for both days.

It is not too late to submit a talk proposal and, if accepted, have your registration fee waived.

State Library of Denmark: Sparse facet caching

planet code4lib - Fri, 2014-09-19 14:40

As explained in Ten times faster, distributed faceting in standard Solr is two-phase:

  1. Each shard performs standard faceting and returns the top limit*1.5+10 terms. The merger calculates the top limit terms. Standard faceting is a two-step process:
    1. For each term in each hit, update the counter for that term.
    2. Extract the top limit*1.5+10 terms by running through all the counters with a priority queue.
  2. Each shard returns the number of occurrences of each term in the top limit terms, calculated by the merger from phase 1. This is done by performing a mini-search for each term, which takes quite a long time. See Even sparse faceting is limited for details.
    1. Addendum: If the number for a term was returned by a given shard in phase 1, that shard is not asked for that term again.
    2. Addendum: If the shard returned a count of 0 for any term as part of phase 1, that means it has delivered all possible counts to the merger. That shard will not be asked again.
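The two phases and the refinement addenda above can be sketched in simplified form. This is an illustrative toy, not Solr's actual implementation (the real logic lives in Solr's FacetComponent); the class and data here are hypothetical, and each "shard" is just an in-memory term-count map:

```java
import java.util.*;
import java.util.stream.*;

// Simplified sketch of two-phase distributed faceting as described above.
public class TwoPhaseFacetSketch {

    // Phase 1: a shard returns its top (limit*1.5 + 10) term counts.
    static Map<String, Long> phase1(Map<String, Long> shardCounts, int limit) {
        int over = (int) (limit * 1.5) + 10; // over-request to reduce refinement
        return shardCounts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(over)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                                      (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        int limit = 2;
        List<Map<String, Long>> shards = List.of(
            Map.of("a", 5L, "b", 3L, "c", 1L),
            Map.of("a", 2L, "c", 4L, "d", 1L));

        // Merge the phase 1 responses into candidate counts.
        Map<String, Long> merged = new HashMap<>();
        List<Map<String, Long>> tops = new ArrayList<>();
        for (Map<String, Long> shard : shards) {
            Map<String, Long> top = phase1(shard, limit);
            tops.add(top);
            top.forEach((t, c) -> merged.merge(t, c, Long::sum));
        }

        // The merger calculates the top `limit` terms.
        List<String> candidates = merged.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(limit)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());

        // Phase 2: ask each shard only for candidate terms it did not
        // already report in phase 1 (addendum 1).
        for (int i = 0; i < shards.size(); i++) {
            for (String term : candidates) {
                if (!tops.get(i).containsKey(term)) {
                    merged.merge(term, shards.get(i).getOrDefault(term, 0L), Long::sum);
                }
            }
        }
        System.out.println(candidates + " " + merged.get("a"));
    }
}
```

With the tiny `limit` used here the over-request covers every term, so phase 2 has nothing left to refine; with realistic limits and many shards, the phase 2 "mini-search" per missing term is exactly the cost that sparse faceting attacks.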
Sparse speedup

Sparse faceting speeds up phase 1 step 2 by only visiting the updated counters. It also speeds up phase 2 by repeating phase 1 step 1, then extracting the counts directly for the wanted terms. Although it sounds heavy to repeat phase 1 step 1, the total time for phase 2 with sparse faceting is a lot lower than with standard Solr. But why repeat phase 1 step 1 at all?
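The "only visiting the updated counters" trick can be sketched with a counter structure that tracks which ordinals were touched. The names below are illustrative, not the actual classes in the sparse faceting patch:

```java
import java.util.Arrays;

// Sketch of a sparse counter: alongside the full counter array we keep a
// list of touched term ordinals, so extraction (phase 1 step 2) visits
// only the updated counters instead of scanning all of them.
public class SparseCounter {
    private final int[] counts;
    private final int[] touched;   // ordinals whose counter is non-zero
    private int touchedSize = 0;

    public SparseCounter(int numTerms) {
        counts = new int[numTerms];
        touched = new int[numTerms];
    }

    // Phase 1 step 1: for each term in each hit, update its counter.
    public void inc(int termOrdinal) {
        if (counts[termOrdinal]++ == 0) {   // first hit: remember the ordinal
            touched[touchedSize++] = termOrdinal;
        }
    }

    public int count(int termOrdinal) { return counts[termOrdinal]; }

    // Phase 1 step 2 iterates over this instead of the whole array.
    public int[] touchedOrdinals() { return Arrays.copyOf(touched, touchedSize); }

    // Re-use: reset only the touched counters, not the whole array.
    public void clear() {
        for (int i = 0; i < touchedSize; i++) counts[touched[i]] = 0;
        touchedSize = 0;
    }
}
```

When the result set touches few terms, both extraction and clearing are proportional to the number of touched counters, not the field's total term count, which is why the `skip` overhead in the graph below stays low for most searches.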


Today, caching of the counters from phase 1 step 1 was added to Solr sparse faceting. Caching is tricky business to get just right, especially since the sparse cache must contain a mix of empty counters (to avoid re-allocation of large structures on the Java heap) as well as filled structures (from phase 1, intended for phase 2). But theoretically, it is simple: When phase 1 step 1 is finished, the counter structure is kept and re-used in phase 2. So time for testing:
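Before the test results, the pooling logic described above can be sketched as a small cache that hands out counter structures and accepts them back either empty (ready for phase 1) or filled (keyed by query, awaiting phase 2). All names are hypothetical and this is far simpler than the real implementation:

```java
import java.util.*;

// Sketch of a counter cache mixing empty counters (to avoid re-allocating
// large structures on the Java heap) and filled counters (kept from
// phase 1, intended for phase 2). Counters are plain int[] here.
public class CounterCache {
    private final Deque<int[]> empty = new ArrayDeque<>();
    private final Map<String, int[]> filled = new HashMap<>();
    private final int maxCounters;
    private final int numTerms;

    public CounterCache(int maxCounters, int numTerms) {
        this.maxCounters = maxCounters;
        this.numTerms = numTerms;
    }

    // Phase 1: prefer a pooled empty counter over a fresh allocation.
    public int[] acquire() {
        int[] c = empty.pollFirst();
        return c != null ? c : new int[numTerms];
    }

    // After phase 1 step 1: keep the filled counters keyed by query so
    // phase 2 can pick them up without recounting.
    public void storeFilled(String queryKey, int[] counters) {
        if (empty.size() + filled.size() < maxCounters) {
            filled.put(queryKey, counters);
        } // else: discard, the cache is getting too big
    }

    // Phase 2: re-use the counters filled in phase 1, if cached.
    public int[] takeFilled(String queryKey) {
        return filled.remove(queryKey);
    }

    // A filled counter whose phase 2 never happened must be cleaned
    // before re-use (the overhead seen in the 1-1000 hits range below).
    public void release(int[] counters) {
        if (empty.size() + filled.size() < maxCounters) {
            Arrays.fill(counters, 0);
            empty.addFirst(counters);
        }
    }
}
```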

15TB index / 5B docs / 2565GB RAM, faceting on 6 fields, facet limit 25, unwarmed queries

Note that there are no measurements of standard Solr faceting in the graph. See the Ten times faster article for that. What we have here are 4 different types of search:

  • no_facet: Plain searches without faceting, just to establish the baseline.
  • skip: Only phase 1 sparse faceting. This means inaccurate counts for the returned terms, but as can be seen, the overhead is very low for most searches.
  • cache: Sparse faceting with caching, as described above.
  • nocache: Sparse faceting without caching.

For 1-1000 hits, nocache is actually a bit faster than cache. The peculiar thing about this hit-range is that chances are high that all shards return all possible counts (phase 2 addendum 2), so phase 2 is skipped for a lot of searches. When phase 2 is skipped, the cached filled counter structure is wasted: it must either be cleaned for re-use or discarded if the cache is getting too big, which adds a bit of overhead.

For more than 1000 hits, cache wins over nocache. Filter through the graph noise by focusing on the medians. As the difference between cache and nocache is that the base faceting time is skipped with cache, the difference of their medians should be about the same as the difference of the medians from no_facet and skip. Are they? Sorta-kinda. This should be repeated with a larger sample.


Caching with distributed faceting means a small performance hit in some cases and a larger performance gain in others. Nothing Earth-shattering, and since it works best when more memory is allocated for caching, it is not clear in general whether it is best to use it or not. Download a Solr sparse WAR from GitHub and try for yourself.

