New to CrossRef? Interested in learning more about the technical aspects of CrossRef? Please join us for one of our upcoming Introduction to CrossRef Technical Basics webinars. This webinar will provide a technical introduction to CrossRef and a brief outline of how CrossRef works. New members are especially encouraged to register. All of our webinars are free to attend. If the time of the webinar is not convenient for you, we will be recording the sessions and they will be made available after the webinars for those who register.
Introduction to CrossRef Technical Basics
Date: Wednesday, Feb 11, 2015
Time: 8:00 am (San Francisco), 11:00 am (New York), 4:00 pm (London)
Moderator: Patricia Feeney
Introduction to CrossRef Technical Basics
Date: Wednesday, Mar 18, 2015
Time: 8:00 am (San Francisco), 11:00 am (New York), 4:00 pm (London)
Moderator: Patricia Feeney
Additional CrossRef webinars are also listed on our webinar page.
We look forward to having you join us!
CrossCheck: iThenticate Admin Webinar
Date: Thursday, Feb 19, 2015
Time: 7:00 am (San Francisco), 10:00 am (New York), 3:00 pm (London)
CrossCheck, powered by iThenticate, now has over 600 members using the service to screen content for originality. In response to demand from these members, CrossRef is trialling a webinar for CrossCheck administrators and more experienced users that will cover:
- The scale of the current CrossCheck database
- CrossCheck participation and usage
- An overview of the newest features in iThenticate
- A run-through of more administrator-specific features in iThenticate
- Advice on interpreting the reports and common issues
- Details on support resources available for publishers
- Q&A session
Representatives from CrossRef and iParadigms will run the webinar, which will last one hour. We hope you can join us!
If you can't make this webinar, visit our webinar page for additional webinars.
Consider, for example, this request for "eResources for Minitex". Minitex is a "publicly supported network of academic, public, state government, and special libraries working cooperatively to improve library service for their users in Minnesota, North Dakota and South Dakota" and it negotiates database licenses for libraries throughout the three states.
Question number 172 in this Request for Proposals (RFP) was: "Password storage. Indicate how passwords are stored (e.g., plain text, hash, salted hash, etc.)."
To provide context for this question, you need to know just a little bit of security and cryptography.
I'll admit to having written code 15 years ago that saved passwords as plain text. This is a dangerous thing to do, because if someone were to get unauthorized access to the computer where the passwords were stored, they would have a big list of passwords. Since people tend to use the same password on multiple systems, the breached password list could be used, not only to gain access to the service that leaked the password file, but also to other services, which might include banks, stores and other sites of potential interest to thieves.
As a result, web developers are now strongly admonished never to save passwords as plain text. Doing so in a new system should be considered negligent, and could easily result in liability for the developer if the system security is breached. Unfortunately, many businesses would rather risk paying lawyers a lot of money to defend themselves should something go wrong than bite the bullet and pay some engineers a little money now to patch up the older systems.
To prevent the disclosure of passwords, the current standard practice is to "salt and hash" them.
A cryptographic hash function mixes up a password so that the password cannot be reconstructed. So, for example, the hash of 'my_password' is 'a865a7e0ddbf35fa6f6a232e0893bea4'. When a user enters their password, the hash of the password is recalculated and compared to the saved hash to determine whether the password is correct.
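The store-and-compare flow just described can be sketched in a few lines of Python. MD5 is used here only because it is the algorithm behind the example hashes in this post; the function names are hypothetical, and this is an illustration of the flow rather than a recommendation.

```python
import hashlib
import hmac


def hash_password(password: str) -> str:
    """Return the hex MD5 digest of a password (illustrative only; MD5 is weak)."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def check_password(candidate: str, stored_hash: str) -> bool:
    """Recompute the hash of the candidate and compare it to the saved hash."""
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(hash_password(candidate), stored_hash)


# Only the hash is saved; the password itself is never written to storage.
stored = hash_password("my_password")
print(check_password("my_password", stored))  # True
print(check_password("guess", stored))        # False
```

Because only the digest is kept, the service can check a login without ever being able to tell anyone what the password was.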
As a result of this strategy, the password can't be recovered. But it can be reset, and the fact that no one can recover the password eliminates a whole bunch of "social engineering" attacks on the security of the service.
Given a LOT of computer power, there are brute force attacks on the hash, but the easiest attack is to compute the hashes for the most common passwords. In a large file of passwords, you should be able to find some accounts that are breachable, even with the hashing. And so a "salt" is added to the password before the hash is applied. In the example above, a hash would be computed for 'SOME_CLEVER_SALTmy_password'. Which, of course, is '52b71cb6d37342afa3dd5b4cc9ab4846'.
To attack the salted password file, you'd need to know that salt. And since every application uses a different salt, each file of salted passwords is completely different. A successful attack on one hashed password file won't compromise any of the others.
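Here is a minimal sketch of salting and hashing, under two assumptions that go beyond the text above: the salt is random and stored per user next to the hash, and PBKDF2 from Python's standard library is substituted for a single MD5 pass. The function names and the iteration count are hypothetical choices for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor; tune for your own hardware


def make_record(password: str):
    """Return (salt, digest) for a new password; both are stored, the password is not."""
    salt = os.urandom(16)  # a fresh random salt for every user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify(password: str, salt: bytes, digest: bytes) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)


salt_a, digest_a = make_record("my_password")
salt_b, digest_b = make_record("my_password")
print(digest_a != digest_b)                     # same password, different salts
print(verify("my_password", salt_a, digest_a))
print(verify("wrong_guess", salt_a, digest_a))
```

With distinct salts, a table of precomputed hashes for common passwords no longer matches anything in the file, so each entry has to be attacked separately.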
Another standard practice for user-facing password management is to never send passwords unencrypted. The best way to do this is to use HTTPS, since web browser software alerts the user that their information is secure. Otherwise, any server between the user and the destination server (there might be 20-40 of these for typical web traffic) could read and store the user's password.
The Minitex RFP covers reference databases. For this reason, only a small subset of services offered to libraries are covered here. The authentication for these sorts of systems typically doesn't depend on the user creating a password; user accounts are used to save the results of a search, or to provide customization features. A Minitex patron can use many of the offered databases without providing any sort of password.
So here are the verbatim responses received for the Minitex RFP:
Response: "All passwords are stored using a salted hash. The salt is randomly generated and unique for each user."
My comment: This is a correct answer. However, the LearningExpress login sends passwords in the clear over HTTP.
Response: "Passwords are md5 hashed."
My comment: MD5 is the hash algorithm I used in my examples above. It's not considered very secure (see comments). OCLC Firstsearch does not force HTTPS and can send login passwords in the clear.
My comment: This just means that no passwords are used in the service.
Infogroup Library Division
Response: "Passwords are currently stored as plain text. This may change once we develop the customization for users within ReferenceUSA. Currently the only passwords we use are for libraries to access usage stats."
My comment: The user customization now available for ReferenceUSA appears at first glance to be done correctly.
EBSCO Information Services
Response: "EBSCOhost passwords in EBSCOadmin are stored in plain text."
My comment: I should note that EBSCOadmin is not an end-user-facing system. So if the EBSCO systems were compromised, only library administrator credentials would be exposed.
Encyclopaedia Britannica, Inc.
Response: "Passwords are stored as plain text."
My comment: I wonder if EB has an article on network security?
Response: "We store all passwords as plain text."
My comment: The ProQuest service available through my library creates passwords over HTTP but uses some client-side encryption. I have not evaluated the security of this encryption.
Scholastic Library Publishing, Inc.
Response: "Passwords are not stored. FreedomFlix offers a digital locker feature and is the only digital product that requires a login and password. The user creates the login and password. Scholastic Library Publishing, Inc does not have access to this information."
My comment: The "FreedomFlix" service not only sends user passwords unencrypted over HTTP, it sends them in a GET query string. This means that not only can anyone see the user passwords in transit, but log files will capture and save them for long-term perusal. Third-party sites will be sent the password in referrer headers. When used on a shared computer, subsequent users will easily see the passwords. "Scholastic Library Publishing" may not have access to user passwords, but everyone else will have them.
Response: "Passwords are stored in plain text."
My comment: Like FreedomFlix, the Gale Infotrac service from Cengage sends user passwords in the clear in a GET query string. But it asks the user to enter their library barcode in the password field, so users probably wouldn't be exposing their personal passwords.
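To make the GET query string problem concrete, the sketch below builds the same login request both ways using Python's standard library. The host, field names and credentials are made up for illustration and do not describe any vendor's actual form.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical credentials and endpoint, for illustration only.
credentials = {"user": "patron", "password": "hunter2"}

# GET: the credentials become part of the URL itself.
get_url = "http://example.org/login?" + urlencode(credentials)

# POST: the URL carries no secrets; the credentials travel in the request body.
post_url = "http://example.org/login"
post_body = urlencode(credentials).encode("ascii")

# Anything that records the URL (server logs, browser history, Referer
# headers) can recover the password from the GET form.
leaked = parse_qs(urlsplit(get_url).query)["password"][0]
print(get_url)
print(leaked)
print("hunter2" in post_url)  # False
```

Moving the credentials into a POST body keeps them out of logs and referrer headers, but they are still readable in transit unless the request is also sent over HTTPS.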
So, to sum up, adoption of up-to-date security practices is far from complete in the world of library databases. I hope that the laggards have improved since the submission date of this RFP (roughly a year ago) or at least have plans in place to get with the program. I would welcome comments to this post that provide updates. Libraries themselves deserve a lot of the blame, because for the most part the vendors that serve them respond to their requirements and priorities.
I think libraries issuing RFPs for new systems and databases should include specific questions about security and privacy practices, and make sure that contracts properly assign liability for data breaches with the answers to these questions in mind.
Note: This post is based on information shared by concerned librarians on the LITA Patron Privacy Technologies Interest Group list. Join if you care about this.
The Digital Public Library of America (DPLA) is excited to work with the Digital Library Federation (DLF) program of the Council on Library and Information Resources to offer DPLA + DLF Cross-Pollinator Travel Grants. The purpose of these grants is to extend the opportunity to attend DPLAfest 2015 to four DLF community members. Successful applicants should be able to envision and articulate a connection between the DLF community and the work of DPLA.
It is our belief that the key to sustainability of large-scale national efforts requires robust community support. Connecting DPLA’s work to the energetic and talented DLF community is a positive way to increase serendipitous collaboration around this shared digital platform.
The goal of the DPLA + DLF Travel Grants is to bring cross-pollinators—active DLF community contributors who can provide unique perspectives to our work and share the vision of DPLA from their perspective—to the conference. By teaming up with the DLF to provide travel grants, it is our hope to engage DLF community members and connect them to exciting areas of growth and opportunity at DPLA.
The travel grants include DPLAfest 2015 conference registration, travel costs, meals, and lodging in Indianapolis.
The DPLA + DLF Cross-Pollinator Travel Grants is the first of a series of collaborations between CLIR/DLF and DPLA.
AWARD
Four awards of up to $1,250 each to go towards the travel, board, and lodging expenses of attending DPLAfest 2015. Additionally, the grantees will each receive a complimentary full registration to the event ($75). Recipients will be required to write a blog post about their experience subsequent to DPLAfest; this blog post will be co-published by DLF and DPLA.
ELIGIBILITY
Applicants must be a staff member of a current DLF member organization and not currently working on a DPLA hub team.
APPLICATION
Send an email by March 5th, 5 pm EDT, containing the following items (in one document) to email@example.com, with the subject “DPLAFest Travel Grant: [Full Name]”:
- Cover letter of nomination from the candidate’s supervisor/manager or an institutional executive, including acknowledgement that candidate would not have been funded to attend DPLAfest.
- Personal statement from the candidate (ca. 500 words) explaining their educational background, what their digital library/collections involvement is, why they are excited about digital library/collections work, and how they see themselves benefiting from and participating in DPLAfest.
- A current résumé.
Applications may be addressed to the DPLA + DLF Cross-Pollinator Committee.
EVALUATION
Candidates will be selected by DPLA and DLF staff. In assessing the applications, we will look for a demonstrated commitment to digital work, and will consider the degree to which participation might enhance communication and collaboration between the DLF and DPLA communities. Applicants will be notified of their status no later than March 16, 2015.
Below the fold, some thoughts on the Klein et al paper.
As regards link rot, they write:
In order to combat link rot, the Digital Object Identifier (DOI) was introduced to persistently identify journal articles. In addition, the DOI resolver for the URI version of DOIs was introduced to ensure that web links pointing at these articles remain actionable, even when the articles change web location.
But even when used correctly, such as http://dx.doi.org/10.1371/journal.pone.0115253, DOIs introduce a single point of failure. This became obvious on January 20th when the doi.org domain name briefly expired. DOI links all over the Web failed, illustrating yet another fragility of the Web. It hasn't been a good time for access to academic journals for other reasons either. Among the publishers unable to deliver content to their customers in the last week or so were Elsevier, Springer, Nature, HighWire Press and Oxford Art Online.
I've long been a fan of Herbert van de Sompel's work, especially Memento. He's a co-author on the paper and we have been discussing it. Unusually, we've been disagreeing. We completely agree on the underlying problem of the fragility of academic communication in the Web era as opposed to its robustness in the paper era. Indeed, in the introduction of another (but much less visible) recent paper entitled Towards Robust Hyperlinks for Web-Based Scholarly Communication Herbert and his co-authors echo the comparison between the paper and Web worlds from the very first paper we published on the LOCKSS system a decade and a half ago. Nor am I critical of the research underlying the paper, which is clearly of high quality and which reveals interesting and disturbing properties of Web-based academic communication. All I'm disagreeing with Herbert about is the way the research is presented in the paper.
My problem with the presentation is that this paper, which has a far higher profile than other recent publications in this area, and which comes at a time of unexpectedly high visibility for web archiving, seems to me to be excessively optimistic, and to fail to analyze the roots of the problem it is addressing. It thus fails to communicate the scale of the problem.
The paper is, for very practical reasons of publication in a peer-reviewed journal, focused on links from academic papers to the web-at-large. But I see it as far too optimistic in its discussion of the likely survival of the papers themselves, and the other papers they link to (see Content Drift below). I also see it as far too optimistic in its discussion of proposals to fix the problem of web-at-large references that it describes (see Dependence on Authors below).
All the proposals depend on actions being taken either before or during initial publication by either the author or the publisher. There is evidence in the paper itself (see Getting Links Right below) that neither authors nor publishers can get DOIs right. Attempts to get authors to deposit their papers in institutional repositories notoriously fail. The LOCKSS team has met continual frustration in getting publishers to make small changes to their publishing platforms that would make preservation easier, or in some cases even possible. Viable solutions to the problem cannot depend on humans to act correctly. Neither authors nor publishers have anything to gain from preservation of their work.
In addition, the paper fails to even mention the elephant in the room, the fact that both the papers and the web-at-large content are copyright. The archives upon which the proposed web-at-large solutions rest, such as the Internet Archive, are themselves fragile. Not just for the normal economic and technical reasons we outlined nearly a decade ago, but because they operate under the DMCA's "safe harbor" provision and thus must take down content upon request from a claimed copyright holder. The archives such as Portico and LOCKSS that preserve the articles themselves operate instead with permission from the publisher, and thus must impose access restrictions.
This is the root of the problem. In the paper world in order to monetize their content the copyright owner had to maximize the number of copies of it. In the Web world, in order to monetize their content the copyright owner has to minimize the number of copies. Thus the fundamental economic motivation for Web content militates against its preservation in the ways that Herbert and I would like.
None of this is to suggest that developing and deploying partial solutions is a waste of time. It is what I've been doing the last quarter of my life. There cannot be a single comprehensive technical solution. The best we can do is to combine a diversity of partial solutions. But we need to be clear that even if we combine everything anyone has worked on we are still a long way from solving the problem. Now for some details.
Content Drift
As regards content drift, they write:
Content drift is hardly a matter of concern for references to journal articles, because of the inherent fixity that, especially PDF-formated, articles exhibit. Nevertheless, special-purpose solutions for long-term digital archiving of the digital journal literature, such as LOCKSS, CLOCKSS, and Portico, have emerged to ensure that articles and the articles they reference can be revisited even if the portals that host them vanish from the web. More recently, the Keepers Registry has been introduced to keep track of the extent to which the digital journal literature is archived by what memory organizations. These combined efforts ensure that it is possible to revisit the scholarly context that consists of articles referenced by a certain article long after its publication.
While I understand their need to limit the scope of their research to web-at-large resources, the last sentence is far too optimistic.
First, research using the Keepers Registry and other resources shows that at most 50% of all articles are preserved. So future scholars depending on archives of digital journals will encounter large numbers of broken links.
Second, even the 50% of articles that are preserved may not be accessible to a future scholar. CLOCKSS is a dark archive and is not intended to provide access to future scholars unless the content is triggered. Portico is a subscription archive; future scholars' institutions may not have a subscription. LOCKSS provides access only to readers at institutions running a LOCKSS box. These restrictions are a response to the copyright on the content and are not susceptible to technical fixes.
Third, the assumption that journal articles exhibit "inherent fixity" is, alas, outdated. Both the HTML and PDF versions of articles from state-of-the-art publishing platforms contain dynamically generated elements, even when they are not entirely generated on-the-fly. The LOCKSS system encounters this on a daily basis. As each LOCKSS box collects content from the publisher independently, each box gets content that differs in unimportant respects. For example, the HTML content is probably personalized ("Welcome Stanford University") and updated ("Links to this article"). PDF content is probably watermarked ("Downloaded by 192.168.1.100"). Content elements such as these need to be filtered out of the comparisons between the "same" content at different LOCKSS boxes. One might assume that the words, figures, etc. that form the real content of articles do not drift, but in practice it would be very difficult to validate this assumption.
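The kind of filtering this requires can be illustrated with a small sketch. It is not the LOCKSS implementation; the regular expressions, function name and sample pages are hypothetical, chosen to match the examples in the paragraph above.

```python
import hashlib
import re

# Hypothetical patterns for the volatile elements mentioned above.
VOLATILE = [
    re.compile(r"Welcome [^<\n]*"),                        # personalization banner
    re.compile(r"Downloaded by \d{1,3}(?:\.\d{1,3}){3}"),  # per-download watermark
]


def filtered_digest(content: str) -> str:
    """Hash the content after removing elements expected to differ per copy."""
    for pattern in VOLATILE:
        content = pattern.sub("", content)
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


copy_a = "<p>Welcome Stanford University</p><p>Article text</p>"
copy_b = "<p>Welcome Example College</p><p>Article text</p>"
print(copy_a == copy_b)                                    # False
print(filtered_digest(copy_a) == filtered_digest(copy_b))  # True
```

The weakness is visible even in the sketch: the comparison only vouches for whatever the filters leave in, so a filter that is too broad could hide real drift in the article itself.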
Soft-404 Responses
I've written before about the problems caused for archiving by "soft-403 and soft-404" responses by Web servers. These result from Web site designers who believe their only audience is humans, so instead of providing the correct response code when they refuse to supply content, they return a pretty page with a 200 response code indicating valid content. The valid content is a refusal to supply the requested content. Interestingly, PubMed is an example, as I discovered when clicking on the (broken) PubMed link in the paper's reference 58.
Klein et al define a live web page thus:
On the one hand, the HTTP transaction chain could end successfully with a 2XX-level HTTP response code. In this case we declared the URI to be active on the live web.
Their estimate of the proportion of links which are still live is thus likely to be optimistic, as they are likely to have encountered at least soft-404s if not soft-403s.
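One common heuristic for spotting soft-404s, which is not something Klein et al describe using, is to fetch a URL on the same host that almost certainly does not exist and see whether the page under test looks just like it. The sketch below assumes the two response bodies have already been fetched; the function name, threshold and sample pages are hypothetical.

```python
from difflib import SequenceMatcher


def looks_like_soft_404(status: int, body: str, probe_body: str,
                        threshold: float = 0.9) -> bool:
    """Heuristically flag a 2XX response that resembles a known-missing page.

    probe_body is the body returned for a URL on the same host that is
    almost certainly nonexistent (for example, a long random path).
    """
    if not 200 <= status < 300:
        return False  # a real error code is not a *soft* error
    similarity = SequenceMatcher(None, body, probe_body).ratio()
    return similarity >= threshold


error_page = "<html><h1>Sorry, we could not find that page.</h1></html>"
article = "<html><h1>An article title</h1><p>Body text of the article.</p></html>"
print(looks_like_soft_404(200, error_page, error_page))  # True
print(looks_like_soft_404(200, article, error_page))     # False
print(looks_like_soft_404(404, error_page, error_page))  # False
```

A check along these lines would lower the count of "live" links, though it can be fooled by sites whose error pages vary or whose real pages are mostly shared template.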
Getting Links Right
Even when the dx.doi.org resolver is working, its effectiveness in persisting links depends on its actually being used. Klein et al discover that in many cases it isn't:
one would assume that URI references to journal articles can readily be recognized by detecting HTTP URIs that carry a DOI, e.g., http://dx.doi.org/10.1007/s00799-014-0108-0. However, it turns out that references rather frequently have a direct link to an article in a publisher's portal, e.g. http://link.springer.com/article/10.1007%2Fs00799-014-0108-0, instead of the DOI link.
The direct link may well survive relocation of the content within the publisher's site. But journals are frequently bought and sold between publishers, causing the link to break. I believe there are two causes for these direct links: publishers' platforms inserting them so as not to risk losing the reader and, more importantly, the difficulty authors have in creating correct links. Cutting and pasting from the URL bar in their browser necessarily gets the direct link; creating the correct one via dx.doi.org requires the author to know that it should be hand-edited, and to remember to do it.
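For direct links that happen to embed the DOI, as the Springer example does, the hand-editing step can be automated. This is a sketch under that assumption; the regular expression is a rough heuristic rather than a full DOI grammar, the function name is hypothetical, and many publisher URLs carry no DOI at all.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Rough heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s?#]+")


def to_doi_link(url: str) -> Optional[str]:
    """Return a dx.doi.org link if the URL embeds something DOI-shaped."""
    match = DOI_PATTERN.search(unquote(url))  # %2F in the path becomes "/"
    if match is None:
        return None
    return "http://dx.doi.org/" + match.group(0)


print(to_doi_link("http://link.springer.com/article/10.1007%2Fs00799-014-0108-0"))
# http://dx.doi.org/10.1007/s00799-014-0108-0
print(to_doi_link("http://example.org/about"))
# None
```

A reference manager or submission system could apply a rewrite like this, but only for the subset of links where the DOI is recoverable from the URL.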
Attempts to ensure linked materials are preserved suffer from a similar problem:
The solutions component of Hiberlink also explores how to best reference archived snapshots. The common and obvious approach, followed by Webcitation and Perma.cc, is to replace the original URI of the referenced resource with the URI of the Memento deposited in a web archive. This approach has several drawbacks. First, through removal of the original URI, it becomes impossible to revisit the originally referenced resource, for example, to determine what its content has become some time after referencing. Doing so can be rather relevant, for example, for software or dynamic scientific wiki pages. Second, the original URI is the key used to find Mementos of the resource in all web archives, using both their search interface and the Memento protocol. Removing the original URI is akin to throwing away that key: it makes it impossible to find Mementos in web archives other than the one in which the specific Memento was deposited. This means that the success of the approach is fully dependent on the long term existence of that one archive. If it permanently ceases to exist, for example, as a result of legal or financial pressure, or if it becomes temporally inoperative as a result of technical failure, the link to the Memento becomes rotten. Even worse, because the original URI was removed from the equation, it is impossible to use other web archives as a fallback mechanism. As such, in the approach that is currently common, one link rot problem is replaced by another.
The paper, and a companion paper, describe Hiberlink's solution, which is to decorate the link to the original resource with an additional link to its archived Memento. Rene Voorburg of the KB has extended this by implementing robustify.js:
robustify.js checks the validity of each link a user clicks. If the linked page is not available, robustify.js will try to redirect the user to an archived version of the requested page. The script implements Herbert Van de Sompel's Memento Robust Links - Link Decoration specification (as part of the Hiberlink project) in how it tries to discover an archived version of the page. As a default, it will use the Memento Time Travel service as a fallback. You can easily implement robustify.js on your web pages in so that it redirects pages to your preferred web archive.
Note, however, that soft-403s and soft-404s pose the same problem for robustify.js as they do for all Web archiving technologies.
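As a rough illustration of link decoration, the sketch below emits an anchor that keeps the original URI in href and adds the snapshot as extra attributes. The attribute names data-versionurl and data-versiondate are my recollection of the Robust Links specification and should be checked against it; the function name and both URLs are hypothetical.

```python
from html import escape


def robust_link(original_url: str, memento_url: str,
                version_date: str, text: str) -> str:
    """Build an anchor that keeps the original URI and adds a snapshot link."""
    return '<a href="{}" data-versionurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(original_url, quote=True),
        escape(memento_url, quote=True),   # assumed attribute: archived copy
        escape(version_date, quote=True),  # assumed attribute: date of the copy
        escape(text),
    )


print(robust_link(
    "http://example.org/page",
    "http://archive.example.net/20150120/http://example.org/page",
    "2015-01-20",
    "the referenced page",
))
```

Because the original URI survives in href, it can still be used as the key to look for Mementos in other archives if the one named in the decoration disappears.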
Dependence on Authors
Many of the solutions that have been proposed to the problem of reference rot also suffer from dependence on authors:
Webcitation was a pioneer in this problem domain when, years ago, it introduced the service that allows authors to archive, on demand, web resources they intend to reference. ... But Webcitation has not been met with great success, possibly the result of a lack of authors' awareness regarding reference rot, possibly because the approach requires an explicit action by authors, likely because of both.
Webcitation is not the only one:
To a certain extent, portals like FigShare and Zenodo play in this problem domain as they allow authors to upload materials that might otherwise be posted to the web at large. The recent capability offered by these systems that allows creating a snapshot of a GitHub repository, deposit it, and receive a DOI in return, serves as a good example. The main drivers for authors to do so is to contribute to open science and to receive a citable DOI, and, hence potentially credit for the contribution. But the net effect, from the perspective of the reference rot problem domain, is the creation of a snapshot of an otherwise evolving resource. Still, these services target materials created by authors, not, like web archives do, resources on the web irrespective of their authorship. Also, an open question remains to which extent such portals truly fulfill a long term archival function rather than being discovery and access environments.
Hiberlink is trying to reduce this dependence:
In the solutions thread of Hiberlink, we explore pro-active archiving approaches intended to seamlessly integrate into the life cycle of an article and to require less explicit intervention by authors. One example is an experimental Zotero extension that archives web resources as an author bookmarks them during note taking. Another is HiberActive, a service that can be integrated into the workflow of a repository or a manuscript submission system and that issues requests to web archives to archive all web at large resources referenced in submitted articles.
But note that these services (and Voorburg's) depend on the author or the publisher installing them. Experience shows that authors are focused on getting their current paper accepted, large publishers are reluctant to implement extensions to their publishing platforms that offer no immediate benefit, and small publishers lack the expertise to do so.
Ideally, these services would be back-stopped by a service that scanned recently-published articles for web-at-large links and submitted them for archiving, thus requiring no action by author or publisher. The problem is that doing so requires the service to have access to the content as it is published. The existing journal archiving services, LOCKSS, CLOCKSS and Portico have such access to about half the published articles, and could in principle be extended to perform this service. In practice doing so would need at least modest funding. The problem isn't as simple as it appears at first glance, even for the articles that are archived. For those that aren't, primarily from less IT-savvy authors and small publishers, the outlook is bleak.
Acknowledgement
I have to thank Herbert van de Sompel for greatly improving this post through constructive criticism. But it remains my opinion alone.
Adam Field and Claire Knowles, OR2015 Developer Track Co-Chairs; Cool Tools, Daring Demos and Fab Features
Indianapolis, IN The OR2015 developer track presents an opportunity to share the latest developments across the technical community. We will be running informal sessions of presentations and demonstrations showcasing community expertise and progress:
We’ve just released “More Like This,” a major upgrade to LibraryThing for Libraries’ “Similar items” recommendations. The upgrade is free and automatic for all current subscribers to LibraryThing for Libraries Catalog Enhancement Package. It adds several new categories of recommendations, as well as new features.
We’ve got text about it below, but here’s a short (1:28) video:
Similar items now has a See more link, which opens More Like This. Browse through different types of recommendations, including:
- Similar items
- More by author
- Similar authors
- By readers
- Same series
- By tags
- By genre
You can also choose to show one or several of the new categories directly on the catalog page.
Click a book in the lightbox to learn more about it—a summary when available, and a link to go directly to that item in the catalog.
Rate the usefulness of each recommended item right in your catalog—hovering over a cover gives you buttons that let you mark whether it’s a good or bad recommendation.
Try it Out!
Click “See more” to open the More Like This browser in one of these libraries:
- Spokane County Library District
- Arapahoe Public Library
- Waukegan Public Library
- Cape May Public Library
- SAILS Library Network
Find out more
Find more details for current customers on what’s changing and what customizations are available on our help pages.
The following is a guest post by Barrie Howard, IT Project Manager at the Library of Congress.
This is the first post in a series about digital preservation training inspired by the Library’s Digital Preservation Outreach & Education (DPOE) Program. Today I’ll focus on some exceptional individuals, who among other things, have completed one of the DPOE Train-the-Trainer workshops and delivered digital preservation training. I am interviewing Stephanie Kom, North Dakota State Library; Carol Kussmann, University of Minnesota Libraries; and Sara Ring, Minitex (a library network providing continuing education and other services to MN, ND and SD), who recently led an introductory virtual course on digital preservation.
Barrie: Carol, you attended the inaugural DPOE Train-the-Trainer Workshop in Washington, and Stephanie and Sara, you attended the first regional event at the Indiana State Archives during the summer of 2012, correct? Can you tell the readers about your experiences and how you and others have benefited as a result?
Carol: In addition to learning about the DPOE curriculum itself, the most valuable aspect of these Train-the-Trainer workshops was meeting new people and building relationships. In the inaugural workshop, we met people from across the country, many of whom I have looked to for advice or worked with on other projects. Because of the Indiana regional training, we now have a sizable group of trainers in the Midwest that I feel comfortable with in talking about DPOE and other electronic record issues. We work with each other and provide feedback and assistance when we go out and train others or work on digital preservation issues in our own roles.
Stephanie: We were just starting a digital program at my institution so the DPOE training was beyond helpful in just informing me what needed to be done to preserve our future digital content. It gave me the tools to explain our needs to our IT department. I also echo Carol’s thoughts on the networking opportunities. It was a great way to meet people in the region that are working with the same issues.
Sara: As my colleagues mentioned, in addition to learning the DPOE curriculum, what was most valuable to me was meeting new colleagues and forming relationships to build upon after the workshop. Shortly after the training, about eight of us began meeting virtually on a regular basis to offer our first digital preservation course (using the DPOE curriculum). Our small upper Midwest collaborative included trainers from North Dakota, South Dakota, Minnesota and Wisconsin. We had trainers from libraries, state archives and a museum participating, and we found we all had different strengths to share with our audience. Our first virtual course, “Managing Digital Content Over Time: An Introduction to Digital Preservation,” reached about 35 organizations of all types, and our second virtual course reached about 20 organizations in the region.
Barrie: Since becoming official DPOE trainers, you have developed a virtual course to provide an introduction to digital preservation. Can you provide a few details about the course, and have you developed any other training materials from the DPOE Curriculum?
Stephanie, Carol, Sara: The virtual course we offered was broken into three sessions, scheduled every other week. Each session covered two of the DPOE modules. Using the DPOE workshop materials as a starting point, we added local examples from our own organizations and built in discussion questions and polls for the attendees so that we had plenty of interaction.
Evaluations from this first offering informed us that people wanted to know more about various tools used to manage and preserve digital content. In response, in our second offering of the course we built in more demonstrations of tools to help identify, manage and monitor digital content over time. Since we were discussing and demonstrating tools that dealt with metadata, we added more content about technical and preservation metadata standards. We also built in take-home exercises for attendees to complete between sessions. Attendees have responded well to these changes and find the take-home exercises that we have built in really useful.
We also created a Google Site for this course, with an up-to-date list of resources, best practices and class exercises. Carol created step-by-step guides that people can follow for understanding and using tools that can assist with managing and preserving their electronic records. These can be found on the University of Minnesota Libraries Digital Preservation Page.
Working through Minitex, we have developed three different classes related to digital preservation: An Introduction to Digital Preservation (webinar); the DPOE virtual course that was mentioned; and a full-day in-person DPOE-based workshop. We have presented each of these at least two times.
Barrie: The DPOE curriculum, which is built upon the OAIS Reference Model, recently underwent a revision. Have you noticed any significant changes in the materials since you attended the workshop in 2011 or 2012? What improvements have you observed?
Carol: What I like about DPOE is that it provides a framework for people to talk about common issues related to digital preservation. The main concepts have not changed, which is good, but there has been a significant increase in the number of examples and resources. The “Digital Preservation Trends” slides were not available in the 2011 training. Keeping up to date on what people are doing, exploring new resources and tools, and following changing best practices is very important as digital preservation continues to be a moving target.
Sara, Stephanie: We found the “Digital Preservation Trends” slides, the final module covered in the DPOE workshop, to be a nice addition to the baseline curriculum. We don’t think they existed when we attended the DPOE train-the-trainer workshop back in 2012. We both especially like the “Engaging with the Digital Preservation Community” section, which lists some of the organizations, listservs, and conferences that would be of interest to digital preservation practitioners. When you’re new to digital preservation (or the only one at your organization working with digital content), it can be overwhelming knowing where to start. Providing resources like this offers a way to get involved in the digital preservation community and to learn from each other. We always try to close our digital preservation classes by providing community resources like this.
Barrie: Regarding training opportunities, could you compare the strengths and challenges of traditional in-person learning environments to distance learning options?
Stephanie, Carol, Sara: Personally we all prefer in-person learning environments over virtual and believe that most people would agree. We saw this preference echoed in the DPOE 2014 Training Needs Assessment Survey (PDF).
The main strength of in-person is the interaction with the presenter and other participants; as a presenter you can adjust your presentation immediately based on audience reactions and their specific needs and understanding. As a participant you can meet and relate to other people in similar situations, and there are more opportunities at in-person workshops for having those types of discussions with colleagues during breaks or during lunch.
However, in-person learning is not always feasible given travel time and costs, and in this part of the country, weather often gets in the way (we have all had our share of driving through blizzard conditions in Minnesota and North Dakota). Convenience and timeliness are definitely benefits of distance learning; more people from a single institution can often attend for little or no additional cost. As trainers we have worked really hard to build in hands-on activities in our virtual digital preservation courses, but could probably do a lot more to encourage networking among the attendees.
Barrie: Are there plans to convene the “Managing Digital Content Over Time” series in 2015?
Stephanie, Carol, Sara: Yes, we plan on offering at least one virtual course this spring. We’ll be checking in with our upper Midwest collaborative of trainers to see who is interested in participating this time around. Minitex provides workshops on request, so we may do more virtual or in-person classes if there is demand.
Barrie: How has the DPOE program influenced and/or affected the work that you do at your organization?
Carol: The inaugural DPOE Training (2011) took place while I was working on an NDIIPP project, led by the Minnesota State Archives, to preserve and provide access to government digital records. The training provided me with additional tools to work with during the project. After the project ended, I continued to use the information I learned during the project and DPOE training to develop a workflow for processing and preserving digital records at the Minnesota State Archives.
Since then, I became a Digital Preservation Analyst at the University of Minnesota Libraries where I continue to focus on digital preservation workflows, education and training, and other related activities. Overall, the DPOE training helped to build a foundation from which to discuss digital preservation with others whether in a classroom setting, conference presentation or one-on-one conversations. I look forward to continuing to work with members of the DPOE community.
Sara: As a digitization and metadata training coordinator at Minitex, a large part of my job is developing and presenting workshops for library professionals in our region. Participating in the DPOE training (2012) was one of the first steps I took to build and expand our training program at Minitex to include digital preservation. The DPOE program has also given me the opportunity to build up our own small cohort of DPOE trainers in the region, so we can schedule regular workshops based on who is available to present at the time.
Stephanie: I started the digitization program at our institution in 2012. Digital preservation has become a main component of that program and I am still working to get a full-fledged plan moving. Our institution is responsible for preserving other digital content and I would like our preservation plan to encompass all aspects of our work here at the library. I think one of the great things about the DPOE training is that the different pieces can be implemented before starting to produce digital content or it can be retrofitted into an already-established digital program. It can be more work when you already have a lot of digital content but the training materials make each step seem manageable.
December 2014 saw the Sustainable Development Policy Institute and Alif Ailaan launch the Pakistan Data Portal at the 30th Annual Sustainable Development Conference. The portal, built using CKAN by Open Knowledge, provides an access point for viewing and sharing data relating to all aspects of education in Pakistan.
A particular focus of this project was to design an open data portal that could be used to support advocacy efforts by Alif Ailaan, an organisation dedicated to improving education outcomes in Pakistan.
The Pakistan Data Portal (PDP) is the definitive collection of information on education in Pakistan and collates datasets from private and public research organisations on topics including infrastructure, finance, enrollment, and performance, to name a few. The PDP is a single point of access against which change in Pakistani education can be tracked and analysed. Users, who include teachers, parents, politicians and policy makers, are able to browse historical data and compare and contrast it across regions and years to reveal a clear, customizable picture of the state of education in Pakistan. From this clear overview, the drivers and constraints of reform can be identified, which allows Alif Ailaan and others pushing for change in the country to focus their reform efforts.
Pakistan is facing an education emergency. It is a country with 25m children out of education, and 50% of girls of school age do not attend classes. A census has not been completed since 1998 and there are problems with the data that is available. It is outdated, incomplete, error-ridden and only a select few have access to much of it. An example that highlights this is a recent report from ASER, which estimates the number of children out of school at 16 million fewer than the number computed by Alif Ailaan in another report. NGOs and other advocacy groups have tended to only be interested in data when it can be used to confirm that the funds they are utilising are working. Whilst there is agreement on the overall problem, if people cannot agree on its scale, how can a consensus solution be hoped for?
Alif Ailaan believe that if you can’t measure the state of education in the country, you can’t hope to fix it. This forms the focus of their campaigning efforts. So whilst the quality of the data is a problem, some data is better than no data, and the PDP forms a focus for gathering quality information together and a platform from which to build and promote policy change: policy makers can make accurate decisions which are backed up by data.
The data accessible through the portal is supported by regular updates from the PDP team, who draw attention to timely key issues and analyse the data. A particular subject or dataset will be explored from time to time, and these general blog posts are supported by “The Week in Education”, which summarises the latest education news, data releases and publications.
CKAN was chosen as the portal best placed to meet the needs of the PDP. Open Knowledge were tasked with customising the portal and providing training and support to the team maintaining it. A custom dashboard system was developed for the platform in order to present data in an engaging visual format.
As explained by Asif Mermon, Associate Research Fellow at SDPI, the genius of the portal is the shell. As institutions start collecting data, or old data is uncovered, it can be added to the portal to continually improve the overall picture.
The PDP is in constant development to further promote the analysis of information in new ways and build on the improvement of the visualizations on offer. There are also plans to expand the scope of the portal, so that areas beyond education can also reap its benefits. A further benefit is that the shell can then be exported around the world so other countries will be able to benefit from the development.
The PDP initiative is part of the multi-year DFID-funded Transforming Education Pakistan (TEP) campaign aiming to increase political will to deliver education reform in Pakistan. Accadian, on behalf of HTSPE, appointed the Open Knowledge Foundation to build the data observatory platform and provide support in managing the upload of data including onsite visits to provide training in Pakistan.
We’re pleased to announce the release of Hydra 9.0.0. This Hydra gem brings together a set of compatible gems for working with Fedora 4. Amongst others it bundles Hydra-head 9.0.1 and Active-Fedora 9.0.0. In addition to working with Fedora 4, Hydra 9 includes many improvements and bug fixes. Especially notable is the ability to add RDF properties on repository objects themselves (no need for datastreams) and large-file streaming support.
The new gem represents almost a year of effort – our thanks to all those who made it happen!
Combining Islandora 7 and Fedora 4 offers key advantages for users and developers.
Charlottetown, PEI, CA Islandora is an open source software framework for managing and discovering digital assets utilizing a best-practices framework that includes Drupal, Fedora, and Solr. Islandora is implemented and built by an ever-growing international community.
Geoffrey Bilder @gbilder will be part of a panel entitled "Why is it taking so long?" The panel will explore why some types of change in curation practice take so long and why others happen quickly. The panel will be moderated by Carly Strasser @carlystrasser, Manager of Strategic Partnerships for DataCite. The panel will take place on Monday, February 9th at 16:30 at 30 Euston Square in London. Learn more. #idcc15
The British Parliament is celebrating the 800th anniversary of Magna Carta:
On Thursday 5 February 2015, the four surviving original copies of Magna Carta were displayed in the Houses of Parliament – bringing together the documents that established the principle of the rule of law in the place where law is made in the UK today. The closing speech of the ceremony in the House of Lords was given by Sir Tim Berners-Lee, who is reported to have said:
I invented the acronym LOCKSS more than a decade and a half ago. Thank you, Sir Tim!
On October 24, 2014 Linus Torvalds added overlayfs to release 3.18 of the Linux kernel. Various Linux distributions have implemented various versions of overlayfs for some time, but now it is an official part of Linux. Overlayfs is a simplified implementation of union mounts, which allow a set of file systems to be superimposed on a single mount point. This is useful in many ways, for example to make a read-only file system such as a CD-ROM appear to be writable by mounting a read-write file system "on top" of it.
Other Unix-like systems have had union mounts for a long time. BSD systems first implemented them in 4.4BSD-Lite two decades ago. The concept traces back five years earlier to my paper for the Summer 1990 USENIX Conference, Evolving the Vnode Interface, which describes a prototype implementation of "stackable vnodes". Among other things, it could implement union mounts as shown in the paper's Figure 10:
This use of stackable vnodes was in part inspired by work at Sun two years earlier on the Translucent File Service, a user-level NFS service by David Hendricks that implemented a restricted version of union mounts. All I did was prototype the concept, and like many of my prototypes it served mainly to discover that the problem was harder than I initially thought. It took others another five years to deploy it in SunOS and BSD. Because they weren't hamstrung by legacy code and semantics, by far the most elegant and sophisticated implementation was around the same time by Rob Pike and the Plan 9 team. Instead of being a bolt-on addition, union mounting was fundamental to the way Plan 9 worked.
About five years later Erez Zadok at Stony Brook led the FiST project, a major development of stackable file systems including two successive major releases of unionfs, a unioning file system for Linux.
About the same time I tried to use OpenBSD's implementation of union mounts early in the boot sequence to construct the root directory by mounting a RAM file system over a read-only root file system on a CD, but gave up on encountering deadlocks.
In 2009 Valerie Aurora published a truly excellent series of articles going into great detail about the difficult architectural and implementation issues that arise when implementing union mounts in Unix kernels. It includes the following statement, with which I concur:
The consensus at the 2009 Linux file systems workshop was that stackable file systems are conceptually elegant, but difficult or impossible to implement in a maintainable manner with the current VFS structure. My own experience writing a stacked file system (an in-kernel chunkfs prototype) leads me to agree with these criticisms.

Note that my original paper was only incidentally about union mounts, it was a critique of the then-current VFS structure, and a suggestion that stackable vnodes might be a better way to go. It was such a seductive suggestion that it took nearly two decades to refute it! My apologies for pointing down a blind alley.
The overlayfs implementation in 3.18 is minimal:
Overlayfs allows one, usually read-write, directory tree to be overlaid onto another, read-only directory tree. All modifications go to the upper, writable layer.

But given the architectural issues, doing one thing really well has a lot to recommend it over doing many things fairly well. This is, after all, the use case from my paper.
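The layering behaviour in that description can be modelled in a few lines of Python. This is only a toy model, not how the kernel implements anything: two dictionaries stand in for the directory trees, the file names and contents are made up for illustration, and `collections.ChainMap` supplies the "look in the upper layer first, write only to the upper layer" rule.

```python
from collections import ChainMap
from types import MappingProxyType

# Lower layer: a read-only tree, e.g. the contents of a CD-ROM.
# (Paths and contents are hypothetical examples.)
lower = MappingProxyType({"etc/hostname": "archive-box", "etc/motd": "welcome"})

# Upper layer: an initially empty, writable tree.
upper = {}

# Lookups try the upper layer first and fall through to the lower one;
# ChainMap sends every write to the first mapping only.
merged = ChainMap(upper, lower)

merged["etc/motd"] = "changed"      # modify a file that exists below
merged["var/log/boot"] = "started"  # create a brand-new file

print(sorted(merged))      # the merged view of both layers
print(merged["etc/motd"])  # the upper copy hides the lower one
print(dict(lower))         # the lower layer is untouched
```

Removing a file that exists only in the lower layer is where the toy model stops being useful: a real unioning file system has to record the removal in the upper layer, which is one of the complications the articles above discuss.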
It took a quarter of a century, but the idea has finally been accepted. And, even though I had to build a custom 3.18 kernel to do so, I am using it on a Raspberry Pi serving as part of the CLOCKSS Archive.
Thank you, Linus! And everyone else who worked on the idea during all that time!
References (date order):
- David Hendricks, The Translucent File Service, pp. 87-93, Proceedings of the Autumn 1988 EUUG Conference, Vienna, Austria, October 1988.
- David S. H. Rosenthal, Evolving the Vnode Interface, Proceedings of the Summer 1990 USENIX Conference, Anaheim, 1990.
- Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken Thompson, Howard Trickey & Phil Winterbottom, Plan 9 from Bell Labs, Computing Systems Vol. 8, No. 3, Summer 1995.
- Jan-Simon Pendry & Marshall Kirk McKusick, Union Mounts in 4.4BSD-Lite. Proceedings of the USENIX Technical Conference on UNIX and Advanced Computing Systems: pp. 25–33 December 1995.
- David S. H. Rosenthal, A Digital Preservation Network Appliance Based on OpenBSD, BSDCon, 2003.
- Charles P. Wright, Jay Dave, Puja Gupta, Harikesavan Krishnan, Erez Zadok, and Mohammad Nayyer Zubair, Versatility and Unix Semantics in a Fan-Out Unification File System, Stony Brook University Technical Report FSL-04-01b, November 2004.
- Valerie Aurora, Unioning file systems: Architecture, features, and design choices, lwn.net, March 2009.
- Valerie Aurora, Union file systems: Implementations, part I, lwn.net, March 2009.
- Valerie Aurora, Unioning file systems: Implementations, part 2, lwn.net, April 2009.
- Miklos Szeredi, overlay filesystem, October 2014.
The UNT Libraries have made use of the ARK identifier specification for a number of years and have used these identifiers throughout our infrastructure on a number of levels. This post is to give a little background about where, when, why and a little about how we assign our ARK identifiers.

Terminology
The first thing we need to do is get some terminology out of the way so that we can talk about the parts consistently. This is taken from the ARK documentation:

    http://example.org/ark:/12025/654xz321/s3/f8.05v.tiff
    \________________/ \__/ \___/ \______/ \____________/
      (replaceable)     |     |      |       Qualifier
           |       ARK Label  |      |    (NMA-supported)
           |                  |      |
     Name Mapping Authority   |    Name (NAA-assigned)
              (NMA)           |
                 Name Assigning Authority Number (NAAN)

The ARK syntax can be summarized:

    [http://NMA/]ark:/NAAN/Name[Qualifier]
For the UNT Libraries we were assigned a Name Assigning Authority Number (NAAN) of 67531, so all of our identifiers will start like this: ark:/67531/
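To make the parts concrete, here is a minimal sketch of splitting an ARK into the pieces named above. It is an illustration rather than a full implementation of the ARK specification: the `parse_ark` function, the `Ark` tuple and the `resolver` field name are made up for this example, and the pattern assumes an all-digit NAAN and a Name that contains no "/" or "." characters.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern for: [http://NMA/]ark:/NAAN/Name[Qualifier]
# Assumes a numeric NAAN and a Name without "/" or "." (a simplification).
ARK_RE = re.compile(r"ark:/(?P<naan>\d+)/(?P<name>[^/.]+)(?P<qualifier>[/.].*)?$")


class Ark(NamedTuple):
    resolver: Optional[str]   # whatever precedes "ark:/" (the replaceable part)
    naan: str                 # Name Assigning Authority Number
    name: str                 # NAA-assigned Name
    qualifier: Optional[str]  # NMA-supported qualifier, if any


def parse_ark(text: str) -> Ark:
    """Split an ARK, with or without a leading resolver URL, into its parts."""
    match = ARK_RE.search(text)
    if match is None:
        raise ValueError(f"no ARK found in {text!r}")
    resolver = text[: match.start()].rstrip("/") or None
    return Ark(resolver, match["naan"], match["name"], match["qualifier"])


print(parse_ark("http://example.org/ark:/12025/654xz321/s3/f8.05v.tiff"))
print(parse_ark("ark:/67531/metapth123"))
```

Because the resolver is simply everything before the ARK label, the same function works whether the identifier is written on its own or embedded in a longer URL.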
We mint Names for our ARKs locally with a home-grown system called a “Number Server”. This Python Web service receives a request for a new number, assigns that number a prefix based on which instance we pull from, and returns the new Name.

Namespaces
We have four different namespaces that we use for minting identifiers. They are the following: metapth, metadc, metarkv, and coda. Additionally we have a metatest namespace which we use when we need to test things out, but it isn’t used that often. Finally, we have a historic namespace, metacrs, that is no longer used. Here is the breakdown of how we use these namespaces.
We try to assign all items that end up on The Portal to Texas History with Names from the metapth namespace whenever possible. We assign all other public facing digital objects the metadc namespace. This means that the UNT Digital Library and The Gateway to Oklahoma History both share Names from the metadc namespace. The metarkv namespace is used for “archive only” objects that go directly into our archival repository system; these include large Web archiving datasets. The coda namespace is used within our archival repository called Coda. As was stated earlier, the metatest namespace is only used for testing and these items are thrown away after processing.

Name assignment
We assign Names in our systems in programmatic ways; this is always done as part of our digital item ingest process. We tend to process items in batches; most often we try to process several hundred items at any given time and sometimes we process several thousand items. When we process items they are processed in parallel and therefore there is no logical order to how the Names are assigned to objects. They are in the order that they were processed but may have no logical order past that.
We also don’t assume that our Names are continuous. If you have identifiers metapth123 and metapth125 we don’t assume that there is an item metapth124. Sure, it may be there, but it also may never have been assigned. When we first started with these systems we would get worked up if we assigned several hundred or a few thousand identifiers and then had to delete those items. Now this isn’t an issue at all, but that took some time to get over.
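The behaviour described so far (a counter per namespace, a prefix added to each number, parallel callers, and gaps that are simply left alone) can be sketched as follows. This is not the actual Number Server: the `NumberServer` class and its `mint` method are hypothetical names, the counters live in memory rather than in persistent storage, and a single object serves every namespace, which may differ from how the real service is deployed.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class NumberServer:
    """Toy in-memory minter: one ever-increasing counter per namespace."""

    def __init__(self, namespaces):
        self._counters = {ns: itertools.count(1) for ns in namespaces}
        self._lock = threading.Lock()

    def mint(self, namespace):
        # The lock guarantees that parallel callers never share a number.
        with self._lock:
            return f"{namespace}{next(self._counters[namespace])}"


server = NumberServer(["metapth", "metadc"])

# Three items are ingested, then the second one is deleted.
names = [server.mint("metapth") for _ in range(3)]
names.remove("metapth2")

# The counter only moves forward, so the gap is never filled.
names.append(server.mint("metapth"))
print(names)

# A batch processed in parallel still receives unique Names, although
# which item ends up with which Name depends on processing order.
with ThreadPoolExecutor(max_workers=8) as pool:
    batch = list(pool.map(lambda _: server.mint("metadc"), range(100)))
print(len(set(batch)))
```

A production minter would need to persist its counters so that a restart cannot hand out a number twice.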
Another assumption that can’t be made in our systems is about sequence: if Newspaper Vol. 1 Issue 2 has an identifier of metapth333, there is no guarantee that Newspaper Vol. 1 Issue 3 will have metapth334. It might, but it isn’t guaranteed. Another thing that happens in our systems is that items can be shared between systems, and membership in the Portal, UNT Digital Library or Gateway is noted in the descriptive metadata. Therefore you can’t say all metapth* identifiers are Portal or all metadc* identifiers are not the Portal; you have to look them up based on the metadata.
Once a number is assigned it is never assigned again. This sounds like a silly thing to say but it is important to remember: we don’t try to save identifiers, or reuse them as if we will run out of them.

Level of assignment
We currently assign an ARK identifier at the level of the intellectual object. So for example, a newspaper issue gets an ARK, a photograph gets an ARK, and a book, a map, a report, an audio recording or a video recording each gets an ARK. The sub-parts of an item are not given further unique identifiers because the way that we tend to interface with them is in the form of formatted URLs such as those described here or from other URL based patterns such as the URLs we use to retrieve items from Coda.

    http://coda.library.unt.edu/bag/ark:/67531/codanaf8/manifest-md5.txt
    http://coda.library.unt.edu/bag/ark:/67531/codanaf8/coda_directives.py
    http://coda.library.unt.edu/bag/ark:/67531/codanaf8/bagit.txt
    http://coda.library.unt.edu/bag/ark:/67531/codanaf8/bag-info.txt
    http://coda.library.unt.edu/bag/ark:/67531/codanaf8/0=untl_aip_1.0
    http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/data/01_data/queries.xlsx
    http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/data/01_data/README.txt
    http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadata.xml
    http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadata/ba3ce7a1-0e3b-44cb-8b41-5d9d1b0438fe.jhove.xml
    http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadata/7fe68777-54a2-4c71-95b2-aa33204ae84b.jhove.xml
    http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadc498968.aip.mets.xml

Lessons Learned

Things I would do again.
- I would most likely use just an incrementing counter for assigning identifiers. Name minters such as Noid are also an option but I like the numbers with a short prefix.
- I would not use a prefix such as UNT, to stay away from branding as much as possible. Even metapth is way too branded (see below).
- I would only have one namespace for non-archival items. Two namespaces for production data just invite someone to screw up (usually me) and then suddenly the reason for having one namespace over the other is meaningless. Just manage one namespace and move on.
- I would not have a six or seven character prefix. metapth and metadc came as baggage from our first system; we decided that the 30k identifiers we had already minted had set our path. Now after 1,077,975 identifiers in those namespaces, it seems a little silly that the first 3% of our items would have such an effect on us still today.
- I would not brand our namespaces so closely to our system names, such as metapth, metadc, and the legacy metacrs. People read too much into the naming convention. This is a big reason for opaque Names in the first place, and is pretty important.
- I would probably pad my identifiers out to eight digits. While you can’t rely on the ARKs to be generated in a given order, once they are assigned it is helpful to be able to sort by them and have a consistent order. metapth1, metapth100, metapth100000 don’t always sort nicely like metapth00000001, metapth00000100, metapth00100000 do. But then again, longer runs of zeros are harder to transcribe and I had a tough time just writing this example. Maybe I wouldn’t do this.
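The sorting point in the last item is easy to see in a short Python sketch. The helper names `split_name`, `pad_name` and `natural_key` are made up for this illustration, and metapth20 is added to the example Names because it makes the problem with a plain string sort visible.

```python
import re


def split_name(name):
    """Split a Name such as 'metapth100' into ('metapth', 100)."""
    prefix, digits = re.fullmatch(r"([a-z]+)(\d+)", name).groups()
    return prefix, int(digits)


def pad_name(name, width=8):
    """Rewrite a Name with its number zero-padded to a fixed width."""
    prefix, number = split_name(name)
    return f"{prefix}{number:0{width}d}"


def natural_key(name):
    """Sort key that orders unpadded Names by their numeric part."""
    return split_name(name)


names = ["metapth100000", "metapth20", "metapth100", "metapth1"]

# A plain string sort compares character by character.
print(sorted(names))
# → ['metapth1', 'metapth100', 'metapth100000', 'metapth20']

# Padded Names sort correctly as plain strings.
print(sorted(pad_name(n) for n in names))
# → ['metapth00000001', 'metapth00000020', 'metapth00000100', 'metapth00100000']

# Unpadded Names can still be sorted numerically with a key function.
print(sorted(names, key=natural_key))
# → ['metapth1', 'metapth20', 'metapth100', 'metapth100000']
```

The key function is a way to get a consistent order without padding, at the cost of every consumer of the identifiers having to know about it.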
I don’t think any of this post applies only to ARK identifiers as most identifier schemes at some level have to have a decision made about how you are going to mint unique names for things. So hopefully this is useful to others.
If you have any specific questions for me let me know on twitter.
Last updated February 7, 2015. Created by Peter Murray on February 7, 2015.
From the release announcement
I'm pleased to announce the release of Hydra 9.0.0! This is the first release of the Hydra gem for Fedora 4 and represents almost a year of effort. In addition to working with Fedora 4, Hydra 9 includes many improvements and bug fixes. Especially notable is the ability to add RDF properties on repository objects themselves (no need for datastreams) and large-file streaming support.
Semantic enrichment is an active area of development for many publishers. Our enrichment processes are based on the use of different Knowledge Models (e.g., an ontology or thesaurus) which provide the terms required to describe different subject disciplines.
The CrossRef Taxonomy Interest Group is a collaboration among publishers, and sponsored by CrossRef, to share the Knowledge Models they are using, creating opportunities for standardization, collaboration and interoperability. Please join the webinar to get an introduction to the work this group is doing, use cases for the information collected and learn how your organization can contribute to the project.
Christian Kohl - Director Information and Publishing Technology, De Gruyter
Graham McCann - Head of Content and Platform Management, IOP Publishing
The webinar will take place on Tuesday, March 3rd at 11 am ET.