ARCHITECTING ScholarSphere: How We Built a Repository App That Doesn't Feel Like Yet Another Janky Old Repository App

Slide presentation

Pitfall! Working with Legacy Born Digital Materials in Special Collections

Slide presentation

Hacking the DPLA

Slide presentation

  • Nate Hill, Chattanooga Public Library, nathanielhill AT gmail.com
  • Sam Klein, Wikipedia, metasj AT gmail.com

The Digital Public Library of America is a growing open-source platform to support digital libraries and archives of all kinds. DPLA-alpha is available for testing, with data from six initial Hubs. New APIs and data feeds are in development, with the next release scheduled for April.

Come learn what we are doing, how to contribute or hack the DPLA roadmap, and how you (or your favorite institution) can draw from and publish through it. Larger institutions can join as a (content or service) hub, helping to aggregate and share metadata and services from across their {region, field, archive-type}. We will discuss current challenges and possibilities (UI and API suggestions wanted!), apps being built on the platform, and related digitization efforts.

DPLA has a transparent community and planning process; new participants are always welcome. Half the time will be for suggestions and discussion. Please bring proposals, problems, partnerships and possible paradoxes to discuss.

EAD without XSLT: A Practical New Approach to Web-Based Finding Aids

Slide presentation

The Avalon Media System: A Next Generation Hydra Head For Audio and Video Delivery

Slide presentation

Practical Relevance Ranking for 10 million books

  • Tom Burton-West, University of Michigan Library, tburtonw@umich.edu

HathiTrust Full-text search indexes the full text and metadata of over 10 million books. Tuning relevance ranking for a collection of this size poses many challenges. This talk will discuss some of the underlying issues, some of our experiments to improve relevance ranking, and our ongoing efforts to develop a principled framework for testing changes to relevance ranking.

Some of the topics covered will include:

  • Length normalization for indexing the full-text of book-length documents
  • Indexing granularity for books
  • Testing new features in Solr 4.0:
    • New ranking formulas that should work better with book-length documents: BM25 and DFR.
    • Grouping/Field Collapsing. Can we index 3 billion pages and then use Solr's field collapsing feature to rank books according to the most relevant page(s)?
    • Finite State Automata/Block Trees for storing the in-memory index to the index. Will this let us support wildcards/truncation despite over 2 billion unique terms per index?
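To make the length-normalization topic above concrete, here is a minimal sketch of the per-term Okapi BM25 score. It is not taken from the talk or from HathiTrust's configuration: the function name `bm25_term_score`, the defaults `k1=1.2` and `b=0.75`, and all numbers in the usage note are assumptions chosen for illustration. The IDF form with `1 +` inside the logarithm is one common variant.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document (illustrative BM25 sketch).

    tf          -- occurrences of the term in the document
    doc_len     -- document length in tokens
    avg_doc_len -- average document length in the collection
    n_docs      -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    k1, b       -- assumed tuning parameters, not values from the talk
    """
    # Inverse document frequency; the "1 +" keeps the value non-negative.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b = 0 ignores document length entirely,
    # b = 1 scales the term frequency by the full length ratio.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # Term-frequency saturation: the score approaches idf * (k1 + 1) as tf grows.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)
```

With made-up numbers, `bm25_term_score(5, 100_000, 1_000, 1_000, 10)` scores lower than `bm25_term_score(5, 100, 1_000, 1_000, 10)`, because the same five occurrences count for less in a book-length document. Setting `b=0` makes the two scores equal, which is why the choice of `b` matters so much when documents are whole books rather than pages.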

n Characters in Search of an Author

  • Jay Luker, IT Specialist, Smithsonian Astrophysics Data System, jluker@cfa.harvard.edu

When it comes to author names, the disconnect between our metadata and what a user might enter into a search box makes it hard to maximize both precision and recall [0]. When indexing a paper written by "Wäterwheels, A", a goal should be to preserve as much of the original information as possible. However, users searching by author name may frequently omit the diaeresis and search for simply "Waterwheels". The reverse of this scenario is also possible, i.e., your decrepit metadata contains only the ASCII "Supybot, Zoia", whereas the user enters "Supybot, Zóia". If recall is your highest priority, the simple solution is to always downgrade to ASCII when indexing and querying. However, this strategy sacrifices precision, as you will be unable to provide an "exact" search, which is necessary in cases where "Hacker, J" and "Häcker, J" really are two distinct authors.
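The trade-off described above can be sketched in a few lines. The following is only an illustration, not the approach presented in the talk: it assumes a toy in-memory index that stores each author name twice, once as entered and once with diacritics stripped, so that an exact search and a recall-oriented search can coexist. The names `strip_diacritics` and `AuthorIndex` are hypothetical.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(name):
    """Drop combining marks after NFKD decomposition ('Häcker' -> 'Hacker').

    Letters without a decomposition (for example 'ø') pass through unchanged,
    so this is a simplification of full ASCII folding.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class AuthorIndex:
    """Toy index keeping both the original and the folded form of each name."""

    def __init__(self):
        self.exact = defaultdict(set)   # normalized original name -> doc ids
        self.folded = defaultdict(set)  # diacritic-free name -> doc ids

    @staticmethod
    def _key(author):
        # NFC makes precomposed and decomposed input compare equal;
        # casefold makes the match case-insensitive.
        return unicodedata.normalize("NFC", author).casefold()

    def add(self, doc_id, author):
        key = self._key(author)
        self.exact[key].add(doc_id)
        self.folded[strip_diacritics(key)].add(doc_id)

    def search(self, author, exact=False):
        key = self._key(author)
        if exact:
            # Precision: only names whose diacritics match the query.
            return set(self.exact.get(key, set()))
        # Recall: fold the query the same way the index was folded.
        return set(self.folded.get(strip_diacritics(key), set()))
```

For example, after adding "Hacker, J" as `p1` and "Häcker, J" as `p2`, `search("Hacker, J")` returns both papers, while `search("Häcker, J", exact=True)` returns only `p2`. The cost of this design is a second index field.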

Evolving Towards a Consortium MARCR Redis Datastore

Slide presentation

Citation search in SOLR and second-order operators

  • Roman Chyla, Astrophysics Data System, roman.chyla AT (cfa.harvard.edu|gmail.com)

Citation search is basically about connections (Is the paper read by a friend of mine more important than others? Can I get a paper read by somebody who cites many papers or is cited by many papers?), but the implementation of the citation search is surprisingly useful in many other areas.

I will show the 'guts' of the new citation search for astrophysics. It is generic and can be applied recursively to any Lucene query. Some people would call it a second-order operation because it works with the results of the previous (search) function. The talk will cover technical details of the special query class, its collectors, how to add a new search operator and how to influence relevance scores. Then you can type with me: friends_of(friends_of(cited_for(keyword:"black holes") AND keyword:"red dwarf"))
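The idea of an operator that consumes the results of another query can be sketched without Lucene at all. The following toy example works on plain Python sets instead of a special query class and collectors, and it ignores relevance scores. The corpus, the operator names `citations` and `references`, and their semantics are all assumptions made for illustration; they are not the operators of the actual system.

```python
# Toy corpus: paper id -> (keywords, ids of the papers it cites). Made-up data.
PAPERS = {
    "A": ({"black holes"}, set()),
    "B": ({"red dwarf"}, {"A"}),
    "C": ({"galaxies"}, {"A", "B"}),
    "D": ({"red dwarf"}, {"C"}),
}


def keyword(term):
    """First-order query: ids of the papers tagged with `term`."""
    return {pid for pid, (keywords, _) in PAPERS.items() if term in keywords}


def citations(results):
    """Second-order operator: papers that cite at least one paper in `results`."""
    return {pid for pid, (_, cited) in PAPERS.items() if cited & results}


def references(results):
    """Second-order operator: papers cited by at least one paper in `results`."""
    found = set()
    for pid in results:
        found |= PAPERS[pid][1]
    return found


# Operators take a result set and return a result set, so they nest freely
# and combine with ordinary boolean logic (here: set intersection for AND).
nested = citations(citations(keyword("black holes"))) & keyword("red dwarf")
```

Because every operator maps a result set to a result set, the nesting in `nested` needs no special handling: the inner query finds the "black holes" paper, each `citations` call walks one step along the citation graph, and the final intersection plays the role of AND.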

Hybrid Archival Collections Using Blacklight and Hydra

Slide presentation