Library Text Mining

Rob Sanderson

Using the TeraGrid [1] and the SRB DataGrid [2], we have sufficient computational and storage facilities to run processing tasks that are normally prohibitively expensive. By integrating text and data mining tools [3][4] within the Cheshire3 [5] information architecture, we can parse the natural language present in 20 million MARC records (the University of California's MELVYL collection) and extract information to provide to search/retrieve applications. In this talk, we'll discuss the results of applying new techniques to 'old' data.

[1] http://www.teragrid.org
[2] http://www.sdsc.edu/srb
[3] http://www.ailab.si/orange
[4] http://www-tsujii.is.s.u-tokyo.ac.jp/
[5] http://www.cheshire3.org/

Rob Sanderson (azaroth@liv.ac.uk)