Practical Relevance Ranking for 10 million books

  • Tom Burton-West, University of Michigan Library, tburtonw@umich.edu

HathiTrust Full-text search indexes the full text and metadata of over 10 million books. Tuning relevance ranking for a collection of this size poses many challenges. This talk will discuss some of the underlying issues, some of our experiments to improve relevance ranking, and our ongoing efforts to develop a principled framework for testing changes to relevance ranking.

Some of the topics covered will include:

  • Length normalization for indexing the full text of book-length documents
  • Indexing granularity for books
  • Testing new features in Solr 4.0:
    • New ranking formulas that should work better with book-length documents: BM25 and DFR.
    • Grouping/Field Collapsing. Can we index 3 billion pages and then use Solr's field collapsing feature to rank books according to the most relevant page(s)?
    • Finite State Automata/Block Trees for storing the in-memory index to the index. Will this let us support wildcards/truncation despite over 2 billion unique terms per index?
  • Relevance testing methodologies: query log analysis, click models, interleaving, A/B testing, and test-collection-based evaluation.
  • Testing of a new high-performance storage system to be installed in early 2013. We will report on any tests we are able to run prior to conference time.
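
To illustrate why length normalization matters for book-length documents, here is a minimal sketch of a single-term BM25 score. It is not taken from the talk or from Solr's implementation: the function name, the parameter defaults (k1 = 1.2, b = 0.75), and the example numbers are assumptions chosen for illustration.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one query term against one document with the BM25 formula.

    tf          -- occurrences of the term in the document
    doc_len     -- document length in tokens
    avg_doc_len -- average document length in the collection
    doc_freq    -- number of documents containing the term
    num_docs    -- total number of documents
    k1, b       -- illustrative defaults, not tuned values
    """
    # Inverse document frequency; the "1 +" keeps the value positive.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b = 0 ignores length, b = 1 normalizes fully.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # Term frequency saturates as tf grows, controlled by k1.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


if __name__ == "__main__":
    # Hypothetical numbers: same term count in a short and a very long document.
    common = dict(tf=5, avg_doc_len=100_000, doc_freq=1_000, num_docs=10_000_000)
    print("short document:", bm25_term_score(doc_len=300, **common))
    print("long document: ", bm25_term_score(doc_len=300_000, **common))
```

With the same term count, the longer document receives the lower score, and the `b` parameter controls how strongly length is penalized. Indexing whole books versus individual pages changes `doc_len` and `avg_doc_len` by orders of magnitude, so length normalization and indexing granularity interact.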
Download slides: http://www.hathitrust.org/documents/HathiTrust-Code4LIB-201302.pptx

Download the video