
Library of Congress: The Signal: Read Collections as Data Report Summary

planet code4lib - Wed, 2017-02-15 18:47

Our Collections as Data event in September 2016, which explored the computational use of library collections, was a success on several levels, including helping steer our team at National Digital Initiatives along our path of action.

We are pleased to release the following summary report, which includes an executive summary of the event, an outline of our work in this area over the past five months, and the work of our colleagues Oliver Baez Bendorf, Dan Chudnov, Michelle Gallinger, and Thomas Padilla. If you are interested in what we mean when we talk about collections as data, or in the infrastructure necessary to support this work, this is for you.

The format of this summary report is itself an experiment. We contracted authors and artists to comment on this important topic from their diverse perspectives in order to create a holistic resource reflective of what made the symposium so great. You will read a reflection on the event from keynote speaker Thomas Padilla, a recommendation on how to implement a computational environment for scholars by Dan Chudnov and Michelle Gallinger, as well as a series of collages by the artist Oliver Baez Bendorf representing key themes from the day.

Mark your calendars for the next #AsData event on July 24th-25th at the Library of Congress. By featuring stories from humanities researchers, journalists, students, social scientists, artists, and storytellers who have used library collections computationally, we hope to communicate the possibilities of this approach to a broad general audience.

DOWNLOAD THE FULL REPORT HERE.

READ

ON A COLLECTIONS AS DATA IMPERATIVE | THOMAS PADILLA

LIBRARY OF CONGRESS DIGITAL SCHOLARS PILOT PROJECT REPORT | DAN CHUDNOV AND MICHELLE GALLINGER

#ASDATA ON THE SIGNAL

POSTER SERIES

DOWNLOAD THE #ASDATA POSTER SERIES | OLIVER BAEZ BENDORF

WATCH

ALL SPEAKER PRESENTATIONS AVAILABLE FOR STREAMING ON THE LIBRARY OF CONGRESS YOUTUBE PAGE

MORE

ARCHIVED SCHEDULE, TRANSCRIPTS AND SLIDE DECKS ON EVENT SITE

 

Islandora: Islandoracon Logo!

planet code4lib - Wed, 2017-02-15 16:52

The Islandoracon Planning Committee is very pleased to unveil the logo that will grace our conference t-shirts in Hamilton, Ontario this May:

With all due credit to both the remarkable musical that inspired the image, and to the entirely different man named Hamilton who actually founded the city, the concept for this image comes from Bryan Brown at FSU. It also brings back the now-ubiquitous Islandora lobster (also known as the CLAWbster), who was created for the first Islandoracon and has gone on to dominate Islandora CLAW repositories in many different guises.

Cynthia Ng: Attempting to Prevent the Feeling of Unproductiveness Even When We Are Productive

planet code4lib - Wed, 2017-02-15 16:31
This piece was originally published on February 15, 2017 as part of The Human in the Machine publication project. It seems it is not uncommon to finish a full day of work and feel completely unproductive. Sometimes I wonder if that’s simply a symptom of how we define what’s “productive”. The Problem I admit, … Continue reading Attempting to Prevent the Feeling of Unproductiveness Even When We Are Productive

DPLA: Travel Funding Available to attend DPLAfest 2017

planet code4lib - Wed, 2017-02-15 14:45

At DPLA, it is very important to us that DPLAfest bring together a broad array of professionals and advocates who care about access to culture to discuss everything from technology and open access to copyright, public engagement, and education. We celebrate the diversity of our DPLA community of partners and users and want to ensure that these perspectives are represented at DPLAfest, which is why we are thrilled to announce three fully funded travel awards to attend DPLAfest 2017.  

Our goal is to use this funding opportunity to promote the widest possible range of views represented at DPLAfest. We require that applicants represent one or more of the following:

  • Professionals from diverse ethnic and/or cultural backgrounds representing one or more of the following groups: American Indian/Alaska Native, Asian, Black/African American, Hispanic/Latino or Native Hawaiian/Other Pacific Islander
  • Professionals whose work or institutions primarily serve and/or represent historically underserved populations including, but not limited to, LGBTQ communities, incarcerated people and ex-offenders, people of color, people with disabilities, or Native American and tribal communities
  • Individuals who would not otherwise have the financial capacity to attend DPLAfest 2017
  • Graduate students and/or early career professionals
  • Students or professionals who live and/or work in the Greater Chicago metro area

Requirements:

  • Visit the DPLAfest Scholarships page and complete the application form by March 1, 2017.
  • Award recipients must attend the entire DPLAfest, taking place on Thursday, April 20 between 9:00am and 5:45pm and Friday, April 21 between 9:00am and 3:30pm.
  • Award recipients agree to write a blog post about their experience at the event to be published on DPLA’s blog within two weeks following the end of the event.

Please note that this funding opportunity will provide for airfare and/or other required travel to Chicago if needed, lodging for two nights in one of the event hotels, and complimentary registration for DPLAfest. The award will not provide for meals or incidental expenses such as cab fares or local public transportation.  All applicants will be notified regarding their application status during the week of March 13.

We appreciate your help sharing this opportunity with interested individuals in your personal and professional networks.

Click here to apply

District Dispatch: Concerns about FCC E-rate letter on fiber broadband deployment

planet code4lib - Tue, 2017-02-14 22:34

While we anticipated that the Federal Communications Commission (FCC) would take a look at its Universal Service Fund (USF) programs once Chairman Pai was in place, we did not anticipate how quickly moves to review and evaluate previous actions would occur. After the Commission retracted the “E-rate Modernization Report,” our E-rate ears have been itching with concern that our bread-and-butter USF program would attract undue attention. We did not have long to wait.


Last week, FCC Commissioner Michael O’Rielly sent a letter (pdf) to the Universal Service Administrative Company (USAC) seeking detailed information on libraries and schools that applied in 2016 for E-rate funding for dark fiber and self-provisioned fiber. Our main concern is that the tenor of the Commissioner’s inquiries calls into question the need for these fiber applications. The FCC’s December 2014 E-rate Modernization Order allowed libraries and schools to apply for E-rate funding of self-construction costs for dark fiber and applicant-owned fiber. Allowing E-rate eligibility of self-construction costs “levels the playing field” with the more typical leased fiber service offered by a third party, like a local telecommunications carrier. Because we know from our members that the availability of high-capacity broadband at reasonable cost continues to be a significant barrier for libraries that want to increase their broadband capacity, ALA advocated for this change in several filings with the FCC.

We find Commissioner O’Rielly’s concern about overbuilding to be misplaced. The real issue is getting the best broadband service at the lowest cost, thus ensuring the most prudent use of limited E-rate and local funds. As we explained in our September 2013 comments (pdf) filed in response to then Acting Chair Mignon Clyburn’s opening of the E-rate modernization proceeding, “It is not a good stewardship of E-rate funds (or local library funds) to pay more for leasing a circuit when ownership is less expensive.”

To help ensure that applicants get the lowest cost for their fiber service, the FCC already has detailed E-rate bidding regulations in place that require that cost be the most important factor when evaluating bids from providers. As the Commission stated in its December 2014 E-rate Modernization Order (pdf), incumbent providers “are free to offer dark-fiber service themselves, or to price their lit-fiber service at competitive rates to keep or win business – but if they choose not to do so, it is market forces and their own decisions, not the E-rate rules” that preclude their ability to compete with a self-construction option. The Commission’s reforms to allow self-construction costs for dark fiber and applicant-owned fiber were correct in 2014 and remain so. In addition, applicants will evaluate and select the best, most cost-effective fiber option for their library or school.

If the last few weeks are any indication of activity at the FCC, we’re in for a busy spring.

The post Concerns about FCC E-rate letter on fiber broadband deployment appeared first on District Dispatch.

David Rosenthal: RFC 4810

planet code4lib - Tue, 2017-02-14 16:00
A decade ago next month Wallace et al published RFC 4810 Long-Term Archive Service Requirements. Its abstract is:
There are many scenarios in which users must be able to prove the existence of data at a specific point in time and be able to demonstrate the integrity of data since that time, even when the duration from time of existence to time of demonstration spans a large period of time. Additionally, users must be able to verify signatures on digitally signed data many years after the generation of the signature. This document describes a class of long-term archive services to support such scenarios and the technical requirements for interacting with such services.

Below the fold, a look at how it has stood the test of time.

The RFC's overview of the problem a long-term archive (LTA) must solve is still exemplary, especially in its stress on the limited lifetime of cryptographic techniques (Section 1):
Digital data durability is undermined by continual progress and change on a number of fronts. The useful lifetime of data may exceed the life span of formats and mechanisms used to store the data. The lifetime of digitally signed data may exceed the validity periods of public-key certificates used to verify signatures or the cryptanalysis period of the cryptographic algorithms used to generate the signatures, i.e., the time after which an algorithm no longer provides the intended security properties. Technical and operational means are required to mitigate these issues.

But note the vagueness of the very next sentence:
A solution must address issues such as storage media lifetime, disaster planning, advances in cryptanalysis or computational capabilities, changes in software technology, and legal issues.

There is no one-size-fits-all affordable digital preservation technology, something the RFC implicitly acknowledges. But it does not even mention the importance of basing decisions on an explicit threat model when selecting or designing an appropriate technology. More than 18 months before the RFC was published, the LOCKSS team made this point in Requirements for Digital Preservation Systems: A Bottom-Up Approach. Our explicit threat model was very useful in the documentation needed for the CLOCKSS Archive's TRAC certification.

How to mitigate the threats? Again, Section 1 is on point:
A long-term archive service aids in the preservation of data over long periods of time through a regimen of technical and procedural mechanisms designed to support claims regarding a data object. For example, it might periodically perform activities to preserve data integrity and the non-repudiability of data existence by a particular point in time or take actions to ensure the availability of data. Examples of periodic activities include refreshing time stamps or transferring data to a new storage medium.

Section 4.1.1 specifies a requirement that is still not implemented in any ingest pipeline I've encountered:
The LTA must provide an acknowledgement of the deposit that permits the submitter to confirm the correct data was accepted by the LTA.

It is normal for Submission Information Packages (SIPs) to include checksums of their components; bagit is typical in this respect. The checksums allow the archive increased confidence that the submission was not corrupted in transit. But they don't do anything to satisfy RFC 4810's requirement that the submitter be reassured that the archive got the right data. Even if the archive reported the checksums to the submitter, this doesn't tell the submitter anything useful. The archive could simply have copied the checksums from the submission without validating them.

SIPs should include a nonce. The archive should prepend the nonce to each checksummed item, and report the resulting checksum back to the submitter, who can validate them, thus mitigating among others the threat that the SIP might have been tampered with in transit. This is equivalent to the first iteration of Shah et al's audit technology.
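A minimal sketch of that acknowledgement handshake in Python follows; the file name, nonce size, and choice of SHA-256 are illustrative assumptions of mine, not anything specified by the RFC or by Shah et al:

    import hashlib
    import os

    def salted_digest(nonce: bytes, payload: bytes) -> str:
        # Hash the nonce prepended to the item's bytes.
        return hashlib.sha256(nonce + payload).hexdigest()

    # Submitter: generate a fresh nonce and include it in the SIP.
    nonce = os.urandom(16)
    payload = open("article.pdf", "rb").read()       # hypothetical submitted item
    expected = salted_digest(nonce, payload)

    # Archive: compute the digest over the bytes actually received and
    # report it back in the deposit acknowledgement.
    received = open("article.pdf", "rb").read()
    acknowledged = salted_digest(nonce, received)

    # Submitter: a match shows the archive read the same bytes that were sent,
    # rather than echoing a checksum copied from the SIP manifest.
    assert acknowledged == expected

The nonce is what does the work here: it forces the archive to recompute the digest over the content it actually holds, since a digest lifted from the SIP manifest would not include it.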

Note also how the RFC follows (without citing) the OAIS Reference Model in assuming a "push" model of ingest.

The RFC correctly points out that an LTA will rely on, and trust, services that must not be provided by the LTA itself, for example (Section 4.2.1):
Supporting non-repudiation of data existence, integrity, and origin is a primary purpose of a long-term archive service. Evidence may be generated, or otherwise obtained, by the service providing the evidence to a retriever. A long-term archive service need not be capable of providing all evidence necessary to produce a non-repudiation proof, and in some cases, should not be trusted to provide all necessary information. For example, trust anchors [RFC3280] and algorithm security policies should be provided by other services. An LTA that is trusted to provide trust anchors could forge an evidence record verified by using those trust anchors.

and (Section 2):

Time Stamp: An attestation generated by a Time Stamping Authority (TSA) that a data item existed at a certain time. For example, [RFC3161] specifies a structure for signed time stamp tokens as part of a protocol for communicating with a TSA.

But the RFC doesn't explore the problems that this reliance causes. Among these are:
  • Recursion. These services, which depend on encryption technologies that decay over time, must themselves rely on long-term archiving services to maintain, for example, a time-stamped history of public key validity. The RFC does not cite Petros Maniatis' 2003 Ph.D. thesis Historic integrity in distributed systems on precisely this problem.
  • Secrets. The encryption technologies depend on the ability to keep secrets for extended periods even if not, as the RFC explains, for the entire archival period. Keeping secrets is difficult and it is more difficult to know whether, or when, they leaked. The damage to archival integrity which the leak of secrets enables may only be detected after the fact, when recovery may not be possible. Or it may not be detected, because the point at which the secret leaked may be assumed to be later than it actually was.
These problems were at the heart of the design of the LOCKSS technology. Avoiding reliance on external services or extended secrecy led to its peer-to-peer, consensus-based audit and repair technology.

Encryption poses many problems for digital preservation. Section 4.5.1 identifies another:
A long-term archive service must provide means to ensure confidentiality of archived data objects, including confidentiality between the submitter and the long-term archive service. An LTA must provide a means for accepting encrypted data such that future preservation activities apply to the original, unencrypted data. Encryption, or other methods of providing confidentiality, must not pose a risk to the associated evidence record.

Easier said than done. If the LTA accepts encrypted data without the decryption key, the best it can do is bit-level preservation. Future recovery of the data depends on the availability of the key which, being digital information, will itself need to have been stored in an LTA. Another instance of the recursive nature of long-term archiving.

"Mere bit preservation" is often unjustly denigrated as a "solved problem". The important archival functions are said to be "active", modifying the preserved data and therefore requiring access to the plaintext. Thus, on the other hand, the archive might have the key and, in effect, store the plaintext. The confidentiality of the archived data then depends on the archive's security remaining impenetrable over the entire archival period, something about which one might reasonably be skeptical.

Section 4.2.2 admits that "mere bit preservation" is the sine qua non of long-term archiving:
Demonstration that data has not been altered while in the care of a long-term archive service is a first step towards supporting non-repudiation of data.

and goes on to note that "active preservation" requires another service:

Certification services support cases in which data must be modified, e.g., translation or format migration. An LTA may provide certification services.

It isn't clear why the RFC thinks it is appropriate for an LTA to certify the success of its own operations. A third-party certification service would also need access to pre- and post-modification plaintext, increasing the plaintext's attack surface and adding another instance of the problems caused by LTAs relying on external services discussed above.

Overall, the RFC's authors did a pretty good job. Time has not revealed significant inadequacies beyond those knowable at the time of publication.

Hydra Project: Rebranding the Hydra Project

planet code4lib - Tue, 2017-02-14 08:57

Some of you may be aware that the Hydra Project has been attempting to trademark its “product” in the US and in Europe.  During this process we became aware of MPDV, a German company that holds a wide-ranging trademark on the use of ‘Hydra’ for computer software and whose claim to the word considerably predates ours.  Following discussions with their lawyers, our attorney advised that we should agree to MPDV’s demand that we cease use of the name “Hydra” and, having sought a second opinion, we have agreed that we will do so.  Accordingly, we need to embark on a program to rebrand ourselves.  MPDV have given us six months to do this, which our lawyer deems “generous”.

The Steering Group, in consultation with the Hydra Partners, has already started mapping out a process to follow over the coming months but will welcome input from the Hydra Community – particularly help in identifying a new name, a matter of some urgency.  We will be especially interested in hearing from anyone with prior success in any naming and (re-)branding initiatives!  Rather than seeing this as a setback we are looking at the process as a way to refocus and re-invigorate the project ahead of new, exciting developments such as cloud-hosted delivery.

Please share your ideas via any of Hydra’s mailing lists.  If you use Slack you may like to look at a new Hydra channel called #branding where some interesting ideas are being discussed.

Terry Reese: MarcEdit Update: All Versions

planet code4lib - Tue, 2017-02-14 05:59

All versions have been updated.  For specific information about workstream work, please see: MarcEdit Workstreams: MacOS and Windows/Linux

MarcEdit Mac Changelog:

*************************************************
** 2.2.35
*************************************************
* Bug Fix: Delimited Text Translator: The 3rd delimiter wasn’t being set reliably. This should be corrected.
* Enhancement: Accessibility: Users can now change the font and font sizes in the application.
* Enhancement: Delimited Text Translator: Users can enter position and length on all fields.

MarcEdit Windows/Linux Changelog:

6.2.460
* Enhancement: Plugin management: automated updates, support for 3rd party plugins, and better plugin management has been added.
* Bug Fix: Delimited Text Translator: The 3rd delimiter wasn’t being set reliably. This should be corrected.
* Update: Field Count: Field count has been updated to improve counting when dealing with formatting issues.
* Enhancement: Delimited Text Translator: Users can enter position and length on all fields.

Downloads are available via the automated updating tool or via the Downloads (http://marcedit.reeset.net/downloads) page.

–tr

 

Library Tech Talk (U of Michigan): The Joy of Insights: How to harness qualitative data in your work

planet code4lib - Tue, 2017-02-14 00:00

Quantitative data gives you the hard numbers: what, how many times, when, generally who, and where. Quantitative data also leaves out the biggest and possibly most important factor: why.

DuraSpace News: REMINDER: VIVO 2017 Conference Proposals Due on March 26

planet code4lib - Tue, 2017-02-14 00:00

Austin, TX  Mark your calendars to submit abstracts for presentations, workshops, posters and demos for the 8th Annual VIVO Conference by March 26, 2017!  

DuraSpace News: New Faces at Hydra-In-A-Box

planet code4lib - Tue, 2017-02-14 00:00

Austin, TX  As the Hydra-in-a-Box project prepares for major developments in 2017 – release of the Hyku repository minimum viable product, a HykuDirect hosted service pilot program, and a higher-performing aggregation system at DPLA – we welcome three stars who recently joined the project team. Please join us in welcoming Michael Della Bitta, Heather Greer Klein, and Kelcy Shepherd.

DuraSpace News: Announcing: DuraSpace Hot Topics Webinar Series, "Introducing DSpace 7: Next Generation UI"

planet code4lib - Tue, 2017-02-14 00:00

DuraSpace is pleased to announce its latest Hot Topics Webinar Series, "Introducing DSpace 7: Next Generation UI"

Curated by Claire Knowles, Library Digital Development Manager, The University of Edinburgh.

DuraSpace News: We’re Hiring: DuraSpace Seeks Business Development Manager

planet code4lib - Tue, 2017-02-14 00:00

Austin, TX  A cornerstone of the DuraSpace mission is focused around expanding collaborations with academic, scientific, cultural, technology, and research communities in support of projects and services to help ensure that current and future generations will have access to our collective digital heritage. The DuraSpace organization seeks a Business Development Manager to cultivate and deepen those relationships and partnerships particularly with international organizations and consortia to elevate the organization’s profile and to expand the services it offers.

LibUX: Listen: LITA Persona Task Force (7:48)

planet code4lib - Mon, 2017-02-13 22:44

This week’s episode of Metric: A User Experience Podcast with Amanda L. Goodman (@godaisies) gives you a peek into the work of the LITA Persona Task Force, who are charged with defining and developing personas that are to be used in growing membership in the Library and Information Technology Association.

You can also download the MP3 or subscribe to Metric: A UX Podcast on Overcast, Stitcher, iTunes, YouTube, Soundcloud, Google Music, or just plug our feed straight into your podcatcher of choice.

Jonathan Rochkind: bento_search 1.7.0 released

planet code4lib - Mon, 2017-02-13 19:04

bento_search is the gem for making embedding of external searches in Rails a breeze, focusing on search targets and use cases involving ‘scholarly’ or bibliographic citation results.

Bento_search isn’t dead, it just didn’t need much updating. But thanks to some work for a client using it, I had the opportunity to do some updates.

Bento_search 1.7.0 includes testing under Rails 5 (the earlier versions probably would have worked fine in Rails 5 already), some additional configuration options, a lot more fleshing out of the EDS adapter, and a new ConcurrentSearcher demonstrating proper use of the new Rails 5 concurrency API (the older BentoSearch::MultiSearcher is now deprecated).

See the CHANGES file for full list.

As with all releases of bento_search to date, it should be strictly backwards compatible and an easy upgrade. (Although if you are using Rails earlier than 4.2, I’m not completely confident, as we aren’t currently doing automated testing of those).


Filed under: General

Karen Coyle: Miseducation

planet code4lib - Mon, 2017-02-13 17:04
There's a fascinating video created by the Southern Poverty Law Center (in January 2017) that focuses on Google but is equally relevant to libraries. It is called The Miseducation of Dylann Roof.

 

In this video, the speaker shows that by searching on "black on white violence" in Google the top items are all from racist sites. Each of these links only to other racist sites. The speaker claims that Google's algorithms will favor similar sites to ones that a user has visited from a Google search, and that eventually, in this case, the user's online searching will be skewed toward sites that are racist in nature. The claim is that this is what happened to Dylann Roof, the man who killed 9 people at an historic African-American church - he entered a closed information system that consisted only of racist sites. It ends by saying: "It's a fundamental problem that Google must address if it is truly going to be the world's library."

I'm not going to defend or deny the claims of the video, and you should watch it yourself because I'm not giving a full exposition of its premise here (and it is short and very interesting). But I do want to question whether Google is or could be "the world's library", and also whether libraries do a sufficient job of presenting users with a well-rounded information space.

It's fairly easy to dismiss the first premise - that Google is or should be seen as a library. Google is operating in a significantly different information ecosystem from libraries. While there is some overlap between Google and library collections, primarily because Google now partners with publishers to index some books, there is much that is on the Internet that is not in libraries, and a significant amount that is in libraries but not available online. Libraries pride themselves on providing quality information, but we can't really take the lion's share of the credit for that; the primary gatekeepers are the publishers from whom we purchase the items in our collections. In terms of content, most libraries are pretty staid, collecting only from mainstream publishers.

I decided to test this out and went looking for works promoting Holocaust denial or Creationism in a non-random group of libraries. I was able to find numerous books about deniers and denial, but only research libraries seem to carry the books by the deniers themselves. None of these come from mainstream publishing houses. I note that the subject heading, Holocaust denial literature, is applied both to items written from the denial point of view and to ones analyzing or debating that view.

Creationism gets a bit more visibility; I was able to find some creationist works in public libraries in the Bible Belt. Again, there is a single subject heading, Creationism, that covers both the pro- and the con-. Finding pro- works in WorldCat is a kind of "needle in a haystack" exercise.

Don't dwell too much on my findings - this is purely anecdotal, although a true study would be fascinating. We know that libraries to some extent reflect their local cultures, such as the presence of the Gay and Lesbian Archives at the San Francisco Public Library.  But you often hear that libraries "cover all points of view," which is not really true.

The common statement about libraries is that we gather materials on all sides of an issue. Another statement is that users will discover them because they will reside near each other on the library shelves. Is this true? Is this adequate? Does this guarantee that library users will encounter a full range of thoughts and facts on an issue?

First, just because the library has more than one book on a topic does not guarantee that a user will choose to engage with multiple sources. There are people who seek out everything they can find on a topic, but as we know from the general statistics on reading habits, many people will not read voraciously on a topic. So the fact that the library has multiple items with different points of view doesn't mean that the user reads all of those points of view.

Second, there can be a big difference between what the library holds and what a user finds on the shelf. Many public libraries have a high rate of circulation of a large part of their collection, and some books have such long holds lists that they may not hit the shelf for months or longer. I have no way to predict what a user would find on the shelf in a library that had an equal number of books expounding the science of evolution vs those promoting the biblical concept of creation, but it is frightening to think that what a person learns will be the result of some random library bookshelf.

But the third point is really the key one: libraries do not cover all points of view, if by points of view you include the kind of mis-information that is described in the SPLC video. There are many points of view that are not available from mainstream publishers, and there are many points of view that are not considered appropriate for anything but serious study. A researcher looking into race relations in the United States today would find the sites that attracted Roof to provide important insights, as SPLC did, but you will not find that same information in a "reading" library.

Libraries have an idea of "appropriate" that they share with the publishing community. We are both scientific and moral gatekeepers, whether we want to admit it or not. Google is an algorithm functioning over an uncontrolled and uncontrollable number of conversations. Although Google pretends that its algorithm is neutral, we know that it is not. On Amazon, which does accept self-published and alternative press books, certain content like pornography is consciously kept away from promotions and best seller lists. Google has "tweaked" its algorithms to remove Holocaust denial literature from view in some European countries that forbid the topic. The video essentially says that Google should make wide-ranging cultural, scientific and moral judgments about the content it indexes.

I am of two minds about the idea of letting Google or Amazon be a gatekeeper. On the one hand, immersing a Dylann Roof in an online racist community is a terrible thing, and we see the result (although the cause and effect may be hard to prove as strongly as the video shows). On the other hand, letting Google and Amazon decide what is and what is not appropriate does not sit well at all. As I've said before, having gatekeepers whose motivations are trade secrets that cannot be discussed is quite dangerous.

There has been a lot of discussion lately about libraries and their supposed neutrality. I am very glad that we can have that discussion. With all of the current hoopla about fake news, Russian hackers, and the use of social media to target and change opinion, we should embrace the fact of our collection policies, and admit widely that we and others have thought carefully about the content of the library. It won't be the most radical in many cases, but we care about veracity, and that's something that Google cannot say.

Library of Congress: The Signal: IEEE Big Data Conference 2016: Computational Archival Science

planet code4lib - Mon, 2017-02-13 14:32

This is a guest post by Meredith Claire Broadway, a consultant for the World Bank.

Jason Baron, Drinker Biddle & Reath LLP, “Opening Up Dark Digital Archives Through The Use of Analytics To Identify Sensitive Content,” 2016. Photo by Meredith Claire Broadway.

Computational Archival Science can be regarded as the intersection between the archival profession and “hard” technical fields, such as computer science and engineering. CAS applies computational methods and resources to large-scale records and archives processing, analysis, storage, long-term preservation and access. In short: big data is a big deal for archivists, particularly because old-school pen-and-paper methodologies don’t apply to digital records. To keep up with big data, the archival profession is called upon to open itself up to new ideas and collaborate with technological professionals.

Naturally, collaboration was the theme of the IEEE Big Data Conference ’16: Computational Archival Science workshop. There were many speakers with projects that drew on the spirit of collaboration by applying computational methods — such as machine learning, visualization and natural language processing — to archival problems. Subjects ranged from improving optical-character-recognition efforts with topic modeling to utilizing vector-space models so that archives can better anonymize PII and other sensitive content.

For example, “Content-based Comparison for Collections Identification” was presented by a team led by Maria Esteva of the Texas Advanced Computing Center. Maria and her team created an automated method of discovering the identity of datasets that appear to be similar or identical but may be housed in two different repositories or parts of different collections. This service is important to archivists because datasets often exist in multiple formats and versions and in different stages of completion. Traditionally, archives determine issues such as these through manual effort and metadata entry. A shift to automation of content-based comparison allows archivists to identify changes, connections and differences between digital records with greater accuracy and efficiency.

The team’s algorithm operates in a straightforward manner. First, the two collections are analyzed to determine the types of records they contain, and a list is generated for each collection. Next, the analysis creates a list of record pairs drawn from the two collections for comparison. Finally, a summary report is created to show the differences between the collections.

Weijia Xu, Ruizhu Huang, Maria Esteva, Jawon Song, Ramona Walls, “Content-based Comparison for Collections Identification,” 2015.
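A rough Python sketch of that pipeline follows; treating a file's extension as its record type and hashing file contents for the comparison are simplifications of mine, not details of the team's implementation:

    from pathlib import Path
    from itertools import product
    import hashlib

    def inventory(collection_dir: str) -> dict:
        # Group each record in a collection by a simple record type (its extension).
        records = {}
        for path in Path(collection_dir).rglob("*"):
            if path.is_file():
                records.setdefault(path.suffix.lower(), []).append(path)
        return records

    def digest(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def compare(dir_a: str, dir_b: str) -> list:
        # Pair records of the same type across the two collections and note
        # which pairs are identical and which differ.
        inv_a, inv_b = inventory(dir_a), inventory(dir_b)
        report = []
        for rtype in sorted(set(inv_a) | set(inv_b)):
            for a, b in product(inv_a.get(rtype, []), inv_b.get(rtype, [])):
                same = digest(a) == digest(b)
                report.append((rtype, a.name, b.name, "identical" if same else "differs"))
        return report

    if __name__ == "__main__":
        # "collection_one" and "collection_two" are hypothetical directories.
        for row in compare("collection_one", "collection_two"):
            print(*row, sep="\t")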

To briefly summarize Maria’s findings, metadata alone isn’t enough for the content-based comparison algorithm to determine whether a dataset is unique. The algorithm needs more information from datasets to make improved comparisons.

Automated collection-based comparison is the future of digital archives. Naturally, it raises questions, among them, “What is the best way for archivists to meet automated methods?” and  ”How can current archival workflows be aligned with computational efforts?”

The IEEE Computational Archival Science session ended on a contemplative note. Keynote speaker Mark Conrad, of the National Archives and Records Administration, asked the gathering about what skills they thought the new generation of computational archival scientists should be taught. Topping the list were answers such as “coding,” “text mining” and “history of archival practice.”

What interested me most was the ensuing conversation about how CAS deserves its own academic track. The assembly agreed that CAS differs enough from the traditional Library and Information Science and Archival tracks, in both the United States and Canada, that it qualifies as a new area of study.

CAS differs from the LIS and Archival fields in large part due to its technology-centric nature. “Hard” technical skills take more than two years (the usual time it takes to complete an LIS master’s program) to develop, a fact I can personally attest to as a former LIS student and R beginner. It makes sense, then, that for CAS students to receive a robust education they should have a unique curriculum.

If the CAS, LIS and Archival Science fields were merged into a single track, the assumption is that they would run the risk of taking an "inch-deep, mile-wide" approach to studies. Our assembly agreed that, in this case, "less is more" if it allows students to cultivate fully developed skills.

Of course these were just the opinions of those present at the IEEE workshop. As the session emphasized, CAS encourages collaboration, discussion and differing opinions. If you have something to add to any of my points, please leave a comment.

DuraSpace News: VIVO Updates for Feb 12–VIVO 1.9.2, Helping each other, VIVO Camp

planet code4lib - Mon, 2017-02-13 00:00

From Mike Conlon, VIVO Project Director

VIVO 1.9.2 Released: VIVO 1.9.2 is a maintenance release addressing several bugs. Upgrading to 1.9.2 should be straightforward; there are no ontology or functional changes. Bugs fixed:

Dan Scott: schema.org, Wikidata, Knowledge Graph: strands of the modern semantic web

planet code4lib - Sun, 2017-02-12 21:05

My slides from Ohio DevFest 2016: schema.org, Wikidata, Knowledge Graph: strands of the modern semantic web

And the video, recorded and edited by the incredible amazing Patrick Hammond:

In November, I had the opportunity to speak at Ohio DevFest 2016. One of the organizers, Casey Borders, had invited me to talk about schema.org, structured data, or something in that subject area based on a talk about schema.org and RDFa he had seen me give at the DevFest Can-Am in Waterloo a few years prior. Given the Google-oriented nature of the event and the 50-minute time slot, I opted to add in coverage of the Google Knowledge Graph and its API, which I had been exploring from time to time since its launch in late 2014.

Alas, the Google Knowledge Graph Search API is still quite limited; it returns minimal data in comparison to the rich cards that you see in regular Google search results. The JSON results include only a link to an image, a link to a corresponding Wikipedia page, and the ID of the entity. I also uncovered errors that had lurked in the documentation for quite some time; happily, the team quickly responded to correct those problems.

So I dug back in time and also covered Freebase, the database of linked and structured data that had both allowed individual contributions and made its contents freely available--until it was purchased by Google, fed into the Knowledge Graph, and shut down. Not many people knew what we had once had until it was gone (Ed Summers did, for one), but such is the way of commercial entities.

In that context, Wikidata looks something like the Second Coming of an open (for contribution and access) linked and structured database, with sustainability derived financially from the Wikimedia Foundation and structurally from its role in underpinning Wikipedia and Wikimedia Commons. Google also did a nice thing by putting resources into adding to Wikidata the appropriately licensed data it could liberate from Freebase: approximately 19 million statements and IDs.

The inclusion of Google Knowledge Graph IDs in Wikidata means that we can use the Knowledge Graph Search API to find an entity ID, then pull the corresponding richer data from Wikidata for that ID to populate relationships and statements. You can get there from here! Ultimately, my thesis is that Wikidata can and will play a very important role in the modern (much more pragmatic) semantic web.
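As a hedged sketch of that round trip in Python (the API key, the sample query, and my recollection that Wikidata stores "/g/"-style Knowledge Graph IDs under property P2671 and Freebase-era "/m/" IDs under P646 are all assumptions on my part, not anything from the talk):

    import requests

    API_KEY = "YOUR_GOOGLE_API_KEY"   # assumes the Knowledge Graph Search API is enabled

    # Step 1: ask the Knowledge Graph Search API for an entity ID.
    kg = requests.get(
        "https://kgsearch.googleapis.com/v1/entities:search",
        params={"query": "Douglas Adams", "key": API_KEY, "limit": 1},
    ).json()
    kg_id = kg["itemListElement"][0]["result"]["@id"]   # e.g. "kg:/m/..." (illustrative)

    # Step 2: find the Wikidata item that carries that ID and pull its label;
    # richer statements and relationships can be fetched the same way.
    prop = "P2671" if kg_id.startswith("kg:/g/") else "P646"
    sparql = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:%s "%s" .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """ % (prop, kg_id.replace("kg:", ""))

    wd = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": sparql, "format": "json"},
        headers={"User-Agent": "kg-wikidata-demo/0.1"},
    ).json()

    for row in wd["results"]["bindings"]:
        print(row["item"]["value"], row["itemLabel"]["value"])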
