
Feed aggregator

Library of Congress: The Signal: Digital Forensics and Digital Preservation: An Interview with Kam Woods of BitCurator.

planet code4lib - Fri, 2015-05-15 15:44

We’ve written about the BitCurator project a number of times, but the project has recently entered a new phase and it’s a great time to check in again. The BitCurator Access project began in October 2014 with funding through the Mellon Foundation. BitCurator Access is building on the original BitCurator project to develop open-source software that makes it easier to access disk images created as part of a forensic preservation process.

Kam Woods, speaking during a workshop at UNC.

Kam Woods has been a part of BitCurator from the beginning as its Technical Lead, and he’s currently a Research Scientist in the School of Information and Library Science at the University of North Carolina at Chapel Hill. As part of our Insights Interview series we talked with Woods about the latest efforts to apply digital forensics to digital preservation.

Butch: How did you end up working on the BitCurator project?

Kam: In late 2010, I took a postdoc position in the School of Information and Library Science at UNC, sponsored by Cal Lee and funded by a subcontract from an NSF grant awarded to Simson Garfinkel (then at the Naval Postgraduate School). Over the following months I worked extensively with many of the open source digital forensics tools written by Simson and others, and it was immediately clear that there were natural applications to the issues faced by collecting organizations preserving born-digital materials. The postdoc position was only funded for one year, so – in early 2011 – Cal and I (along with eventual Co-PI Matthew Kirschenbaum) began putting together a grant proposal to the Andrew W. Mellon Foundation describing the work that would become the first BitCurator project.

Butch: If people have any understanding at all of digital forensics it’s probably from television or movies, but I suspect the actions you see there are pretty unrealistic. How would you describe digital forensics for the layperson? (And as an aside, what do people on television get “most right” about digital forensics?)

Kam: Digital forensics commonly refers to the process of recovering, analyzing, and reporting on data found on digital devices. The term is rooted in law enforcement and corporate security practices: tools and practices designed to identify items of interest (e.g. deleted files, web search histories, or emails) in a collection of data in order to support a specific position in a civil or criminal court case, to pinpoint a security breach, or to identify other kinds of suspected misconduct.

The goals differ when applying these tools and techniques within archives and data preservation institutions, but there are a lot of parallels in the process: providing an accurate record of chain of custody, documenting provenance, and storing the data in a manner that resists tampering, destruction, or loss. I would direct the interested reader to the excellent and freely available 2010 Council on Library and Information Resources report Digital Forensics and Born-Digital Content in Cultural Heritage Institutions (pdf) for additional detail.

You’ll occasionally see some semblance of a real-world tool or method in TV shows, but the presentation is often pretty bizarre. As far as day-to-day practices go, discussions I’ve had with law enforcement professionals often include phrases like “huge backlogs” and “overextended resources.” Sound familiar to any librarians and archivists?

Butch: Digital forensics has become a hot topic in the digital preservation community, but I suspect that it’s still outside the daily activity of most librarians and archivists. What should librarians and archivists know about digital forensics and how it can support digital preservation?

Forensic write-blockers used to capture disk images from physical media.

Kam: One of the things Cal Lee and I emphasize in workshops is the importance of avoiding unintentional or irreversible changes to source media. If someone brings you a device such as a hard disk or USB drive, a hardware write-blocker will ensure that if you plug that device into a modern machine, nothing can be written to it, either by you or some automatic process running on your operating system. Using a write-blocker is a baseline risk-reducing practice for anyone examining data that arrives on writeable media.

Creating a disk image – a sector-by-sector copy of a disk – can support high-quality preservation outcomes in several ways. A disk image retains the entirety of any file system contained within the media, including directory structures and timestamps associated with things like when particular files were created and modified. Retaining a disk image ensures that as your external tools (for example, those used to export files and file system metadata) improve over time, you can revisit a “gold standard” version of the source material to ensure you’re not losing something of value that might be of interest to future historians or researchers.

Disk imaging also mitigates the risk of hardware failure during an assessment. There’s no simple, universal way to know how many additional access events an older disk may withstand until you try to access it. If a hard disk begins to fail while you’re reading it, chances of preserving the data are often higher if you’re in the process of making a sector-by-sector copy in a forensic format with a forensic imaging utility. Forensic disk image formats embed capture metadata and redundancy checks to ensure a robust technical record of how and when that image was captured, and improve survivability over raw images if there is ever damage to your storage system. This can be especially useful if you’re placing material in long-term offline storage.
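To make the idea of a sector-by-sector copy concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a substitute for a forensic imaging utility: it produces a raw image rather than a forensic format, it does not recover from read errors, and it assumes the source is attached through a write-blocker. The function name `image_source` and the 512-byte sector size are assumptions made for this example.

```python
import hashlib

SECTOR_SIZE = 512  # assumed sector size; some media use 4096-byte sectors


def image_source(source_path, image_path, sector_size=SECTOR_SIZE):
    """Copy source_path to image_path one sector at a time.

    Returns (sectors_read, sha256_hex) so the hash can be recorded
    alongside the image as a simple fixity check.
    """
    digest = hashlib.sha256()
    sectors = 0
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            block = src.read(sector_size)
            if not block:  # end of the source
                break
            dst.write(block)
            digest.update(block)
            sectors += 1
    return sectors, digest.hexdigest()
```

Recording the hash at capture time is what lets you confirm later that the stored image still matches what was read from the media. Forensic formats go further by embedding checksums and capture metadata inside the image itself.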

There are many situations where it’s not practical, necessary, or appropriate to create a disk image, particularly if you receive a disk that is simply being used as an intermediary for data transfer, or if you’re working with files stored on a remote server or shared drive. Most digital forensics tools that actually analyze the data you’re acquiring (for example, Simson Garfinkel’s bulk extractor, which searches for potentially private and sensitive information and other items of interest) will just as easily process a directory of files as they would a disk image. Being aware of these options can help guide informed processing decisions.

Finally, collecting institutions spend a great deal of time and money assessing, hiring and training professionals to make complex decisions about what to preserve, how to preserve it and how to effectively provide and moderate access in ways that serve the public good. Digital forensics software can reduce the amount of manual triage required when assessing new or unprocessed materials, prioritizing items that are likely to be preservation targets or require additional attention.

Butch: How does BitCurator Access extend the work of the original phases of the BitCurator project?

Kam: One of the development goals for BitCurator Access is to provide archives and libraries with better mechanisms to interact with the contents of complex digital objects such as disk images. We’re developing software that runs as a web service and allows any user with a web browser to easily navigate collections of disk images in many different formats. This includes: providing facilities to examine the contents of the file systems contained within those images; interact with visualizations of file system metadata and organization (including timelines indicating changes to files and folders); and download items of interest. There’s an early version and installation guide in the “Tools” section of

We’re also working on software to automate the process of redacting potentially private and sensitive information – things like Social Security Numbers, dates of birth, bank account numbers and geolocation data – from these materials based on reports produced by digital forensics tools. Automatic redaction is a complex problem that often requires knowledge of specific file format structures to do correctly. We’re using some existing software libraries to automatically redact where we can, flag items that may require human attention and prepare clear reports describing those actions.
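As a rough illustration of the "redact where we can, report what we did" idea, the following Python sketch masks strings that look like Social Security Numbers in plain text and returns the offsets it changed. It is not the BitCurator Access implementation and it does not use bulk extractor. The pattern, the mask and the function name `redact_ssns` are assumptions chosen for this example, and a plain-text substitution like this would not be safe on structured file formats.

```python
import re

# Illustrative pattern only: matches the common NNN-NN-NNNN layout.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssns(text, mask="XXX-XX-XXXX"):
    """Return (redacted_text, report).

    report is a list of (start, end) offsets into the original text,
    so a reviewer can see what was changed without the report itself
    repeating the sensitive values.
    """
    report = [(m.start(), m.end()) for m in SSN_PATTERN.finditer(text)]
    return SSN_PATTERN.sub(mask, text), report
```

For example, `redact_ssns("SSN 123-45-6789 on file")` returns the text with the number masked plus a one-entry report covering offsets 4 to 15.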

Finally, we’re exploring ways in which we can incorporate emulation tools such as those developed at the University of Freiburg using the Emulation-as-a-Service model.

Butch: I’ve heard archivists and curators express ethical concerns about using digital forensics tools to uncover material that an author may not have wished be made available (such as earlier drafts of written works). Do you have any thoughts on the ethical considerations of using digital forensics tools for digital preservation and/or archival purposes?

The Digital Forensics Laboratory at UNC SILS.

Kam: There’s a great DPC Technology Watch report from 2012, Digital Forensics and Preservation (pdf), in which Jeremy Leighton John frames the issue directly: “Curators have always been in a privileged position due to the necessity for institutions to appraise material that is potentially being accepted for long-term preservation and access; and this continues with the essential and judicious use of forensic technologies.”

What constitutes “essential and judicious” is an area of active discussion. It has been noted elsewhere (see the CLIR report I mentioned earlier) that the increased use of tools with these capabilities may necessitate revisiting and refining the language in donor agreements and ethics guidelines.

As a practical aside, the Society of American Archivists Guide to Deeds of Gift includes language alerting donors to concerns regarding deleted content and sensitive information on digital media. Using the Wayback Machine, you can see that this language was added mid-2013, so that provides some context for the impact these discussions are having.

Butch: An area that the National Digital Stewardship Alliance has identified as important for digital preservation is the establishment of testbeds for digital preservation tools and processes. Do you have some insight into how got established, and how valuable it is for the digital forensics and preservation communities?

Kam: was originally created by Simson Garfinkel to serve as a home for corpora he and others developed for use in digital forensics education and research. The set of materials on the site has evolved over time, but several of the currently available corpora were captured as part of scripted, simulated real-world scenarios in which researchers and students played out roles involving mock criminal activities using computers, USB drives, cell phones and network devices.

These corpora strike a balance between realism and complexity, allowing students in digital forensics courses to engage with problems similar to those they might encounter in their professional careers while limiting the volume of distractors and irrelevant content. They’re freely distributed, contain no actual confidential or sensitive information, and in certain cases have exercises and solution guides that can be distributed to instructors. There’s a great paper linked in the Bibliography section of that site entitled “Bringing science to digital forensics with standardized forensic corpora” (pdf) that goes into the need for such corpora in much greater detail.

Various media sitting in the UNC SILS Digital Forensics Laboratory.

We’ve used disk images from one corpus in particular – the “M57-Patents Scenario” – in courses taken by LIS students at UNC and in workshops run by the Society of American Archivists. They’re useful in illustrating various issues you might run into when working with a hard drive obtained from a donor, and in learning to work with various digital forensics tools. I’ve had talks with several people about the possibility of building a realistic corpus that simulated, say, a set of hard drives obtained from an artist or author. This would be expensive and require significant planning, for reasons that are most clearly described in the paper linked in the previous paragraph.

Butch: What are the next steps the digital preservation community should address when it comes to digital forensics?

Kam: Better workflow modeling, information sharing and standard vocabularies to describe actions taken using digital forensics tools are high up on the list. A number of institutions do currently document and publish workflows that involve digital forensics, but differences in factors like language and resolution make it difficult to compare them meaningfully. It’s important to be able to distinguish those ways in which workflows differ that are inherent to the process, rather than the way in which that process is described.

Improving community-driven resources that document and describe the functions of various digital forensics tools as they relate to preservation practices is another big one. Authors of these tools often provide comprehensive documentation, but they don’t necessarily emphasize those uses or features of the tools that are most relevant to collecting institutions. Of course, a really great tool tutorial doesn’t really help someone who doesn’t know about that tool, or isn’t familiar with the language being used to describe what it does, so you can flip this: describing a desired data processing outcome in a way that feels natural to an archivist or librarian, and linking to a tool that solves part or all of the related problem. We have some of this already, scattered around the web; we just need more of it, and better organization.

Finally, a shared resource for robust educational materials that reflect the kinds of activities students graduating from LIS programs may undertake using these tools. This one more or less speaks for itself.

LITA: The ‘I’ Word: Internships

planet code4lib - Fri, 2015-05-15 14:00
Image courtesy of DeKalb CSB.

Two weeks ago, I completed a semester-long, advanced TEI internship where I learned XSLT and utilized it to migrate two digital collections (explained more here, and check out the blog here) in the Digital Library Program. During these two weeks, I’ve had time to reflect on the impact that internships, especially tech-based, have on students.

At Indiana University, a student must complete an internship to graduate with a dual degree or specialization. However, this is my number one piece of advice for any student, but especially library students: do as many internships as you possibly can. The hands-on experience obtained during an internship is invaluable moving to a real-life position, and something we can’t always experience in courses. This is especially true for internships introducing and refining tech skills.

I’m going to shock you: learning new technology is difficult. It takes time. It takes patience. It takes a project application. Every new computer program or tech skill I’ve learned has come with a “drowning period,” also known as the learning period. Since technology exists in a different space, it is difficult to conceptualize how it works, and therefore how to learn and understand it.

An internship, usually unpaid or for student-paid credit, is the perfect safe zone for this drowning period. The student has time to fail and make mistakes, but also learn from them in a fairly low-pressure situation. They work with the people actually doing the job in the real world who can serve as a guide for learning the skills, as well as a career mentor.

The supervisors and departments also benefit from free labor, even if it does take time. Internships are also a chance for supervisors to revisit their own knowledge and solidify it by teaching others. They can look at their standards and see if anything needs to be updated or changed. Supervisors can directly influence the next generation of librarians, teaching them skills and hacks it took them years to figure out.

My two defining internships were: the Walt Whitman Archive at the University of Nebraska-Lincoln, introducing me to digital humanities work, and the Digital Library Program, solidifying my future career. What was your defining library internship? What kinds of internships does your institution offer to students and recent graduates? How does your institution support continuing education and learning new tech skills?

D-Lib: Semantic Description of Cultural Digital Images: Using a Hierarchical Model and Controlled Vocabulary

planet code4lib - Fri, 2015-05-15 11:44
Article by Lei Xu and Xiaoguang Wang, Wuhan University, Hubei, China

D-Lib: Metamorph: A Transformation Language for Semi-structured Data

planet code4lib - Fri, 2015-05-15 11:44
Article by Markus Michael Geipel, Christophe Boehme and Jan Hannemann, German National Library

D-Lib: Linked Data URIs and Libraries: The Story So Far

planet code4lib - Fri, 2015-05-15 11:44
Article by Ioannis Papadakis, Konstantinos Kyprianos and Michalis Stefanidakis, Ionian University

D-Lib: Facing the Challenge of Web Archives Preservation Collaboratively: The Role and Work of the IIPC Preservation Working Group

planet code4lib - Fri, 2015-05-15 11:44
Article by Andrea Goethals, Harvard Library, Clément Oury, International ISSN Centre, David Pearson, National Library of Australia, Barbara Sierman, KB National Library of the Netherlands and Tobias Steinke, Deutsche Nationalbibliothek

D-Lib: Olio

planet code4lib - Fri, 2015-05-15 11:44
Editorial by Laurence Lannom, CNRI

D-Lib: An Assessment of Institutional Repositories in the Arab World

planet code4lib - Fri, 2015-05-15 11:44
Article by Scott Carlson, Rice University

D-Lib: Helping Members of the Community Manage Their Digital Lives: Developing a Personal Digital Archiving Workshop

planet code4lib - Fri, 2015-05-15 11:44
Article by Nathan Brown, New Mexico State University Library

Open Library Data Additions: Amazon Crawl: part ee

planet code4lib - Fri, 2015-05-15 09:17

Part ee of Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

District Dispatch: How do library patrons feel about digital content?

planet code4lib - Fri, 2015-05-15 06:26

Join a panel of Book Industry Study Group (BISG) and American Library Association (ALA) leaders at the 2015 ALA Annual Conference in San Francisco when they discuss the results of a newly-released study on public library patrons’ use of digital content.

During the conference session “Digital Content in Public Libraries: What Do Patrons Think?” panelists will discuss the results of a new study by the BISG and ALA that was designed to provide invaluable insight into how readers interact with e-books in a library environment. The session takes place from 3:00 to 4:00 p.m. on Sunday, June 28, 2015, at the Moscone Convention Center in room 131 of the North Building.

The digital content survey was developed to understand the behavior of library patrons, including their use of digital resources and other services offered by public libraries. The study examined the impact of digital consumption behaviors, including the adoption of new business models, on library usage across America.

  • Kathy Rosa, director, Office for Research and Statistics, American Library Association
  • Carrie Russell, program director, Public Access to Information, Office for Information Technology Policy, American Library Association
  • Nadine Vassallo, project manager, Research & Information, Book Industry Study Group

View all ALA Washington Office conference sessions

The post How do library patrons feel about digital content? appeared first on District Dispatch.

Jonathan Rochkind: On approaches to Bridging the Gap in access to licensed resources

planet code4lib - Thu, 2015-05-14 20:57

A previous post I made reviewing the Ithaka report “Streamlining access to Scholarly Resources” got a lot of attention. Thanks!

The primary issue I’m interested in there: Getting our patrons from a paywalled scholarly citation on the open unauthenticated web, to an authenticated library-licensed copy, or other library services. “Bridging the gap”.

Here, we use Umlaut to turn our “link resolver” into a full-service landing page offering library services for both books and articles:  Licensed online copies, local print copies, and other library services.

This means we’ve got the “receiving” end taken care of — here’s a book and an article example of an Umlaut landing page — the problem reduces to getting the user from the open unauthenticated web to an Umlaut page for the citation in question.

Which is still a tricky problem.  In this post, brief discussion of two things: 1) The new “Google Scholar Button” browser extension from Google, which is interesting in this area, but I think ultimately not enough of a solution to keep me from looking for more, and 2) Possibilities of Zotero open source code toward our end.

The Google Scholar Button

In late April Google released a browser plugin for Chrome and Firefox called the “Google Scholar Button”.

This plugin will extract the title of an article from a page (either text you’ve selected on the page first, or it will try to scrape a title from HTML markup), and give you search results for that article title from Google Scholar, in a little popup window.
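I don't know how the Google Scholar Button actually does its scraping, but the general technique is easy to sketch. The Python below prefers text the user has selected, then a `citation_title` meta tag (the kind of tag many publisher pages include for Google Scholar indexing), then the page's `<title>`. The class and function names are my own, and the fallback order is an assumption, not a description of Google's code.

```python
from html.parser import HTMLParser


class TitleScraper(HTMLParser):
    """Collect a citation_title meta tag and the <title> text, if present."""

    def __init__(self):
        super().__init__()
        self.meta_title = None
        self.page_title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "citation_title":
            self.meta_title = attrs.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.page_title += data


def scrape_title(html, selected_text=None):
    """Return the best guess at an article title, or None if nothing is found."""
    if selected_text and selected_text.strip():
        return selected_text.strip()  # the user's selection wins
    scraper = TitleScraper()
    scraper.feed(html)
    return scraper.meta_title or scraper.page_title.strip() or None
```

The `<title>` fallback is the weak link, since page titles often carry publisher or site names along with the article title.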

Interestingly, this is essentially the same thing a couple of third party software packages have done for a while: The LibX “Magic Button”, and Lazy Scholar.  But now we get it in an official Google release, instead of hacky workarounds to Google’s lack of API from open source.

The Google Scholar Button is basically trying to bridge the same gap we are; it provides a condensed version of Google Scholar search results, with a link to an open access PDF if Google knows about one (I am still curious how many of these open access PDFs are not-entirely-licensed copies put up by authors or professors without publisher permissions).

And in some cases it provides an OpenURL link to a library link resolver, which is just what we’re looking for.

However, it’s got some limitations that keep me from considering it a satisfactory ‘Bridging the Gap’ solution:

  • In order to get the OpenURL link to your local library link resolver while you are off campus, you have to set your Google Scholar preferences in your browser, which is pretty confusing to do.
  • The title has to match in Google Scholar’s index of course. Which is definitely extensive enough to still be hugely useful, as evidenced by the open source predecessors to Google Scholar Button trying to do the same thing.
  • But most problematic of all, Google Scholar Button results will only show the local library link resolver link for some citations: the ones that have been registered as having institutional fulltext access in your institutional holdings registered with Google. I want to get users to the Umlaut landing page for any citation they want, even if we don’t have licensed fulltext (and we might even if Google doesn’t think we do; the holdings registrations are not always entirely accurate). I want to show them local physical copies (especially for books), and ILL and other document delivery services.
    • The full Google Scholar gives a hard-to-find but at least it’s there OpenURL link for “no local fulltext” under a ‘more’ link, but the Google Scholar Button version doesn’t offer even this.
    • Books/monographs might not be the primary use case, but I really want a solution that works for books too — and books are something users may be especially interested in a physical copy instead of online fulltext for, and books are also something that our holdings registration with Google pretty much doesn’t include, even ebooks.  And book titles are a lot less likely to return hits in Google Scholar at all.

I really want a solution that works all or almost all of the time to get the patron to our library landing page, not just some of the time, and my experiments with Google Scholar Button revealed more of a ‘sometimes’ experience.

I’m not sure if the LibX or Lazy Scholar solutions can provide an OpenURL link in all cases, regardless of Google institutional holdings registration. They are both worth further inquiry for sure. But Lazy Scholar isn’t open source and I find its UI not great for our purposes. And I find LibX a bit too heavyweight for solving this problem, and have some other concerns about it.

So let’s consider another avenue for “Bridging the Gap”….

Zotero’s scraping logic

Instead of trying to take a title and find a hit in a mega-corpus of scholarly citations  like the Google Scholar Button approach, another approach would be to try to extract the full citation details from the source page, and construct an OpenURL to send straight to our landing page.
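The "construct an OpenURL" half of that idea is the easy part. Here's a minimal Python sketch that turns already-scraped citation fields into an OpenURL 1.0 (Z39.88-2004) key/encoded-value link for a journal article. The resolver address is a made-up placeholder and `build_openurl` is just a name I picked; a real version would also need to handle books and multiple authors.

```python
from urllib.parse import urlencode

# Hypothetical resolver address; substitute your institution's link resolver.
RESOLVER_BASE = "https://resolver.example.edu/openurl"

# Journal-article fields from the OpenURL KEV journal format.
ARTICLE_KEYS = ("atitle", "jtitle", "aulast", "date",
                "volume", "issue", "spage", "issn")


def build_openurl(citation, base=RESOLVER_BASE):
    """Build an OpenURL link from a dict of scraped citation fields."""
    params = {
        "url_ver": "Z39.88-2004",
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
    }
    # Only send the fields that were actually scraped.
    for key in ARTICLE_KEYS:
        value = citation.get(key)
        if value:
            params["rft." + key] = value
    return base + "?" + urlencode(params)
```

Usage would look like `build_openurl({"atitle": "Some Article", "jtitle": "Some Journal", "date": "2015"})`. The hard part, of course, is getting reliable values into that dict from arbitrary publisher pages.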

And, hey, it has occurred to me, there’s some software that already can scrape citation data elements from quite a long list of web sites our patrons might want to start from.  Zotero. (And Mendeley too for that matter).

In fact, you could use Zotero as a method of ‘Bridging the Gap’ right now. Sign up for a Zotero account, install the Zotero extension. When you are on a paywalled citation page on the unauthenticated open web (or a search results page on Google Scholar, Amazon, or other places Zotero can scrape from), first import your citation into Zotero. Then go into your Zotero library, find the citation, and — if you’ve properly set up your OpenURL preferences in Zotero — it’ll give you a link to click on that will take you to your institutional OpenURL resolver. In our case, our Umlaut landing page.

We know from some faculty interviews that some faculty definitely use Zotero, though it’s hard to say if a majority do or not. I do not know how many have managed to set up their OpenURL preferences in Zotero, if this is part of their use of it.

Even of those who have, I wonder how many have figured out on their own that they can use Zotero to “bridge the gap” in this way. But even if we undertook an education campaign, it is a somewhat cumbersome process. You might not want to actually import into your Zotero library; you might want to take a look at the article first. And not everyone chooses to use Zotero, and we don’t want to require them to for a ‘bridging the gap’ solution.

But that logic is there in Zotero, the pretty tricky task of compiling and maintaining ‘scraping’ rules for a huge list of sites likely to be desirable as ‘Bridging the Gap’ sources. And Zotero is open source, hmm.

We could imagine adding a feature to Zotero that let the user choose to go right to an institutional OpenURL link after scraping, instead of having to import and navigate to their Zotero library first. But I’m not sure such a feature would match the goals of the Zotero project, or how to integrate it into the UX in a clear way without distracting from Zotero’s core functionality.

But again, it’s open source.  We could imagine ‘forking’ Zotero, or extracting just the parts of Zotero that matter for our goal, into our own product that did exactly what we wanted. I’m not sure I have the local resources to maintain a ‘forked’ version of plugins for several browsers.

But Zotero also offers a bookmarklet.  Which doesn’t have as good a UI as the browser plugins, and which doesn’t support all of the scrapers. But which unlike a browser plugin you can install on iOS and Android mobile browsers (although it’s a bit confusing to do so, at least it’s possible).  And which it’s probably ‘less expensive’ for a developer to maintain a ‘fork’ of — we really just want to take Zotero’s scraping behavior, implemented via bookmarklet, and completely replace what you do with it after it’s scraped. Send it to our institutional OpenURL resolver.

I am very intrigued by this possibility; it seems at least worth some investigatory prototypes to have patrons test. But I haven’t yet figured out where to actually find the bookmarklet code, and related code in Zotero that may be triggered by it, let alone the next step of figuring out if it can be extracted into a ‘fork’. I’ve tried looking around on the Zotero repo, but I can’t figure out what’s what. (I think all of Zotero is open source?).

Anyone know the Zotero devs, and want to see if they want to talk to me about it with any advice or suggestions? Or anyone familiar with the Zotero source code themselves and want to talk to me about it?

Filed under: General

Patrick Hochstenbach:

planet code4lib - Thu, 2015-05-14 15:31
Our webshop at Filed under: Comics Tagged: cartoon, logo, webshop

Patrick Hochstenbach: Figure drawing on mondays

planet code4lib - Thu, 2015-05-14 14:31
Filed under: Figure Drawings Tagged: art, art model, Nude, Nudes, sketchbook

