Indexing big data with Tika, Solr & map-reduce

  • Scott Fisher, California Digital Library, scott.fisher AT ucop BORK edu
  • Erik Hetzner, California Digital Library, erik.hetzner AT ucop BORK edu

code4lib 2012, Wednesday, February 8, 2012, 13:00-13:20

The Web Archiving Service at the California Digital Library has crawled 30 TB of data, comprising about 600 million fetched URLs, in every format found on the web. In this talk we will discuss how we parsed this data using Tika and map-reduce, how we indexed it with Solr, and how we tweaked the relevance ranking to provide our users with a better search experience.
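
For readers unfamiliar with the map-reduce shape mentioned above, here is a minimal single-process sketch in Python. It is not the pipeline used by the Web Archiving Service. The record fields, the function names, and the "keep the latest capture per URL" reduce rule are all assumptions made for illustration, and extract_text is a hypothetical stand-in for a real Tika parse.

```python
from collections import defaultdict


def extract_text(body):
    # Hypothetical stand-in for a Tika parse: a real job would hand the raw
    # bytes to Tika; here we only normalize whitespace.
    return " ".join(body.split())


def map_record(record):
    # Map step: turn one crawled record into a (key, document) pair.
    # The field names are assumptions, not a real Solr schema.
    doc = {
        "url": record["url"],
        "content_type": record["content_type"],
        "fetched": record["fetched"],
        "text": extract_text(record["body"]),
    }
    return record["url"], doc


def reduce_captures(url, docs):
    # Reduce step (assumed rule): keep only the most recent capture of a URL.
    # ISO 8601 date strings sort chronologically when compared as text.
    return max(docs, key=lambda d: d["fetched"])


def run_job(records):
    # Shuffle: group mapped documents by key, as a map-reduce framework would.
    grouped = defaultdict(list)
    for record in records:
        key, doc = map_record(record)
        grouped[key].append(doc)
    # One output document per URL, ready to be sent to an indexer.
    return [reduce_captures(url, grouped[url]) for url in sorted(grouped)]
```

In a real deployment the map and reduce functions would run in parallel across many machines, and the resulting documents would be posted to Solr rather than returned as a list.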