Search Engine Relevancy Tuning - A Static Rank Framework for Solr/Lucene
- Mike Schultz, formerly Summon Search Architect, firstname.lastname@example.org
code4lib 2012, Thursday, February 9 2012, 11:40-12:00
Solr/Lucene provides a lot of flexibility for adjusting relevancy scoring and improving search results. Roughly speaking, there are two areas of concern: first, a 'dynamic rank' calculation that is a function of the user query and document text fields; and second, a 'static rank' that is independent of the query and is generally a function of non-text document metadata. In this talk I will outline an easily understood, hand-tunable static rank system with a minimal number of parameters.
The obvious major feature of a search engine is to return results relevant to a user query. Perhaps less obvious is the huge role query independent document features play in achieving that. Google's PageRank is an example of a static ranking of web pages based on links and other secret sauce. In the Summon service, our 800 million documents have features like publication date, document type, citation count and Boolean features like the-article-is-peer-reviewed. These fields aren't textual and remain 'static' from query to query, but need to influence a document's relevancy score. In our search results, with all query related features being equal, we'd rather have more recent documents above older ones, Journals above Newspapers, and articles that are peer reviewed above those that are not. The static rank system I will describe achieves this and has the following features:
- Query-time only calculation - nothing is baked into the index - with parameters adjustable at query time
- The system is based on a signal metaphor where components are 'wired' together. System components allow multiplexing, amplifying, summing, tunable band-pass filtering, and string-to-value mapping, all with a bare minimum of parameters.
- An intuitive approach for mixing dynamic and static rank that is more effective than simple adding or multiplying.
- A way of equating disparate static metadata types that leads to understandable results ordering.
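The abstract gives no formulas, so the following is only a minimal sketch of what a query-time static rank could look like. Every name, weight, and curve here (`DOC_TYPE_VALUE`, `recency`, `static_rank`, the 20-year window) is an assumption made for illustration and is not the Summon implementation. It shows a string-to-value mapping for document type, a simple recency ramp, and a weighted sum of the resulting signals.

```python
from dataclasses import dataclass

# Hypothetical string-to-value map: document type -> signal in [0, 1].
DOC_TYPE_VALUE = {"journal": 1.0, "book": 0.7, "newspaper": 0.3}


@dataclass
class Doc:
    doc_type: str
    pub_year: int
    peer_reviewed: bool


def recency(pub_year: int, now_year: int = 2012, window: int = 20) -> float:
    """Linear ramp: 1.0 for the current year, falling to 0.0 at `window` years old."""
    age = max(0, now_year - pub_year)
    return max(0.0, 1.0 - age / window)


def static_rank(doc: Doc, weights: dict) -> float:
    """Weighted sum of per-feature signals, computed at query time."""
    signals = {
        "doc_type": DOC_TYPE_VALUE.get(doc.doc_type, 0.0),
        "recency": recency(doc.pub_year),
        "peer_reviewed": 1.0 if doc.peer_reviewed else 0.0,
    }
    return sum(weights[name] * value for name, value in signals.items())


# Assumed weights; because they are plain arguments, they can change per query.
weights = {"doc_type": 0.5, "recency": 0.3, "peer_reviewed": 0.2}
journal = Doc("journal", 2011, True)
newspaper = Doc("newspaper", 1995, False)
```

With these assumed weights, a recent peer-reviewed journal article receives a higher static rank than an older newspaper article, which matches the ordering preference described in the abstract. How the static rank is then mixed with the dynamic rank is the subject of the talk and is not sketched here.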
How people search the library from a single search box
- Cory Lown, North Carolina State University Libraries, email@example.com
code4lib 2012, Wednesday, February 8 2012, 09:35-09:55
Searching the library is complex. There's the catalog, article databases, journal title and database title look-ups, the library website, finding aids, knowledge bases, etc. How would users search if they could get to all of these resources from a single search box? I'll share what we've learned about single search at NCSU Libraries by tracking use of QuickSearch (http://www.lib.ncsu.edu/search/index.php?q=aerospace+engineering), our home-grown unified search application. As part of this talk I will suggest low-cost ways to collect real world use data that can be applied to improve search. I will try to convince you that data collection must be carefully planned and designed to be an effective tool to help you understand what your users are telling you through their behavior. I will talk about how the fragmented library resource environment challenges us to provide useful and understandable search environments. Finally, I will share findings from analyzing millions of user transactions about how people search the library from a production single search box at a large university library.
Discovering Digital Library User Behavior with Google Analytics
- Kirk Hess, Digital Humanities Specialist, University of Illinois Urbana-Champaign, firstname.lastname@example.org
code4lib 2012, Wednesday, February 8 2012, 09:15-09:35
The presentation will review tracking search queries, adding events such as clicking external links or downloading files, and custom variables, to track user behavior that is normally difficult to track. We'll also discuss using jQuery scripts to add tracking code to the page without having to modify the underlying web application. Once you've collected data, you may use the Google Analytics API to extract data and integrate it with data from your digital repository to show granular data about individual items in your Digital Library. Finally, we'll discuss how this information allows you to improve the user experience, and summarize some of the research we are doing with our digital repository and the data gathered from Google Analytics.
The Golden Road (To Unlimited Devotion): Building a Socially Constructed Archive of Grateful Dead Artifacts
- Robin Chandler, University of California (Santa Cruz), chandler [at] ucsc [dot] edu
- Susan Chesley Perry, University of California (Santa Cruz), chesley [at] ucsc [dot] edu
- Kevin S. Clarke, University of California (Santa Cruz), ksclarke [at] ucsc [dot] edu
code4lib 2012, Tuesday 7 February 2012, 14:20-14:40 (slides available online)
The Grateful Dead Archive at the University of California (Santa Cruz) is a collection of over 600 linear feet of material, including: business records, photographs, posters, fan envelopes, tickets, video, audio (oral histories, interviews and music) and 3-d objects such as stage props and band merchandise. In addition, with the release of the Grateful Dead Archive Online website in 2012, the Archive will start actively collecting artifacts from an enthusiastic community of Grateful Dead fans.
This talk will discuss the challenges of merging a traditional archive with a socially constructed one. We will also present the first round of development and explain how we're using tools like Omeka, ContentDM, UC3 Merritt, djatoka, Kaltura, Google Maps, and Solr to lay the foundation for a robust and engaging site. Future directions, like the integration/development of better curation tools and what we hope to learn from opening the archive to contributions from a large community of fans, will also be discussed.
Your UI can make or break the application (to the user, anyway)
- Robin Schaaf, University of Notre Dame, email@example.com
code4lib 2012, Thursday 9 February 2012, 11:00-11:20
UI development is hard and too often ends up as an afterthought for programmers. If you were a CS major in college, I'll bet you didn't have many, if any, design courses. I'll talk about how to involve users up front in design and some common pitfalls of this approach. I'll also make a case for why you should do the screen design before a single line of code is written. And I'll throw in some ideas for increasing the usability and attractiveness of your web applications. I'd like to use the UI development of our open source ERMS as a case study.
Dirty Usability: Rapid Prototyping with Bootstrap
- Shaun Ellis, Princeton University Libraries, firstname.lastname@example.org
code4lib 2012, Thursday 9 February 2012, 11:20-11:40
"The code itself is unimportant; a project is only as useful as people actually find it." - Linus Torvalds 
Usability has been a buzzword for some time now, but what is the process for making the transition toward a better user experience, and hence, better designed library sites? I will discuss one facet of the process my team is using to redesign the Finding Aids site for Princeton University Libraries (still in development). The approach involves the use of rapid prototyping, with Bootstrap, to make sure we are on track with what users and stakeholders expect up front, and throughout the development process.
Design for Developers
- Lisa Kurt, University of Nevada, Reno, email@example.com
code4lib 2012, Tuesday 7 February 2012, 14:00-14:20
Users expect good design. This talk will delve into what makes really great design, what to look for, and how to do it. Learn the principles of great design to take your applications, user interfaces, and projects to a higher level. With years of experience in graphic design and illustration, Lisa will discuss design principles, trends, process, tools, and development. Design examples will be from her own projects as well as a variety from industry. You’ll walk away with design knowledge that you can apply immediately to a variety of applications and a number of top notch go-to resources to get you up and running.
HathiTrust Large Scale Search: Scalability meets Usability
- Tom Burton-West, DLPS, University of Michigan Library, tburtonw AT umich edu
code4lib 2012, Tuesday 7 February 2012, 13:00-13:20
HathiTrust Large-Scale search provides full-text search services over nearly 10 million full-text books using Solr for the back-end. Our index is around 5-6 TB in size and each shard contains over 3 billion unique terms due to content in over 400 languages and dirty OCR.
Searching the full-text of 10 million books often results in very large result sets. By conference time a number of features designed to help users narrow down large result sets and to do exploratory searching will either be in production or in preparation for release. There are often trade-offs between implementing desirable user features and keeping response time reasonable in addition to the traditional search trade-offs of precision versus recall.
We will discuss various scalability and usability issues including:
- Trade-offs between desirable user features and keeping response time reasonable and scalable
- Our solution to providing the ability to search within the 10 million books and also search within each book
- Migrating the personal collection builder application from a separate Solr instance to an app which uses the same back-end as full-text search.
- Design of a scalable multilingual spelling suggester
- Providing advanced search features combining MARC metadata with OCR
- The dismax mm and tie parameters
- Weighting issues and tuning relevance ranking
- Displaying only the most "relevant" facets
- Tuning relevance ranking
- Dirty OCR issues
- CJK tokenizing and other multilingual issues.
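For readers unfamiliar with the dismax `mm` and `tie` parameters listed above, here is a simplified model of what they control. It is a sketch, not Solr's code: the function names are hypothetical, and `mm` is treated as a plain integer even though Solr also accepts percentages and conditional expressions. With `tie`, a term's score across several fields is the best field score plus `tie` times the sum of the other field scores; with `mm`, a document qualifies only if at least that many optional clauses match.

```python
def dismax_score(field_scores, tie=0.0):
    """Score one query term across fields: best field plus tie * the rest."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def satisfies_mm(matched_clauses, mm):
    """True when at least `mm` optional query clauses matched the document."""
    return matched_clauses >= mm
```

A `tie` of 0.0 makes the score a pure maximum over fields, while 1.0 makes it a plain sum; values in between let a strong match in one field (say, MARC metadata) dominate while still giving some credit for matches in another (say, OCR text).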
ALL TEH METADATAS! or How we use RDF to keep all of the digital object metadata formats thrown at us
- Declan Fleming, University of California, San Diego, dfleming AT ucsd DING edu
code4lib 2012, Tuesday 7 February 2012, 11:40-12:00
What's the right metadata standard to use for a digital repository? There isn't just one standard that fits documents, videos, newspapers, audio files, local data, etc. And there is no standard to rule them all. So what do you do? At UC San Diego Libraries, we went down a conceptual level and attempted to hold every piece of metadata and give each holding place some context, hopefully in a common namespace. RDF has proven to be the ideal solution, and allows us to work with MODS, PREMIS, MIX, and just about anything else we've tried. It also opens up the potential for data re-use and authority control as other metadata owners start thinking about and expressing their data in the same way. I'll talk about our workflow, which takes metadata from a stew of various sources (CSV dumps, spreadsheet data of varying richness, MARC data, and MODS data), has our Metadata Specialists create an assembly plan and normalize it into METS, and then ingests it into our digital asset management system. The result is a beautiful graph of RDF triples, with metadata poised to be expressed as HTML, RSS, METS, or XML, which opens linked data possibilities that we are just starting to explore.
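To make the idea of holding mixed metadata as triples concrete, here is a small sketch using plain Python tuples. All URIs are placeholders under example.org and the property names are invented for illustration; they are not the UC San Diego namespaces or real MODS, PREMIS, or MIX vocabulary terms. The sketch stores statements drawn from three different standards about one object and serializes them as N-Triples.

```python
# All URIs below are placeholders under example.org, not real vocabulary terms.
OBJ = "http://example.org/object/1"
NS = "http://example.org/ns/"

triples = [
    (OBJ, NS + "mods/title", "Sample photograph"),
    (OBJ, NS + "premis/fixity", "sha1:placeholder"),
    (OBJ, NS + "mix/imageWidth", "4000"),
]


def to_ntriples(triples):
    """Serialize (subject, predicate, literal) tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        # Escape backslashes and double quotes inside the literal.
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


def values_for(triples, subject, predicate):
    """Look up every object stored for one subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

Because every statement has the same three-part shape, descriptive, preservation, and technical metadata can sit side by side in one graph and be queried the same way, whatever standard they came from.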
HTML5 Microdata and Schema.org
- Jason Ronallo, North Carolina State University Libraries, firstname.lastname@example.org
code4lib 2012, Tuesday 7 February 2012, 11:20-11:40
When the big search engines announced support for HTML5 microdata and the schema.org vocabularies, the balance of power for semantic markup in HTML shifted.
- What is microdata?
- Where does microdata fit with regards to other approaches like RDFa and microformats?
- Where do libraries stand in the worldview of Schema.org and what can they do about it?
- How can implementing microdata and schema.org optimize your sites for search engines?
- What tools are available?
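As a rough illustration of what microdata markup looks like, the sketch below renders a schema.org `Book` as an HTML fragment. The helper name `book_microdata` and the choice of the `name` and `author` properties are assumptions made for this example; the talk itself may cover different types and properties.

```python
from html import escape


def book_microdata(name: str, author: str) -> str:
    """Render a schema.org Book as an HTML fragment with microdata attributes."""
    return (
        '<div itemscope itemtype="http://schema.org/Book">'
        f'<span itemprop="name">{escape(name)}</span> by '
        f'<span itemprop="author">{escape(author)}</span>'
        "</div>"
    )
```

The `itemscope` and `itemtype` attributes mark the element as describing one item of a given type, and each `itemprop` labels a property of that item, so a search engine can read structured data out of ordinary page markup.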
Related Code4Lib Journal article
Video starts at 01:05:30