Planet Code4Lib

DuraSpace News: PROGRAMS available: DLF’s Liberal Arts Colleges Pre-Conf, 2016 DLF Forum, and NDSA’s Digital Preservation 2016

Tue, 2016-08-16 00:00

From Bethany Nowviskie, Director of the Digital Library Federation (DLF) at CLIR, on behalf of organizing committees and local hosts for Liberal Arts Colleges Pre-Conference, 2016 DLF Forum, and NDSA Digital Preservation 2016

DuraSpace News: Fedora Project in Australia this Fall

Tue, 2016-08-16 00:00

Austin, TX  Traveling to Melbourne, Australia for the eResearch Australasia Conference 2016?

Mita Williams: The Hashtag Syllabus: Part Two

Mon, 2016-08-15 19:33

Last week I finally uploaded a bibliography of just under 150 items from the Leddy Library that could be found on the #BlackLivesCDNSyllabus hashtag that has been circulating on Twitter since July 5th. In this post, I will go into some technical detail about why it took me so long to do this.

For the most part, the work took time simply because there were lots of items from the original collection, gathered by Monique Woroniak in a Storify, that needed to be imported into Zotero. I’m not exactly sure how many items are in that list, but my original Zotero library of materials contains 220 items.

Because I’ve made this library public, you can open Zotero while on the page and download all or just some of the citations that I’ve collected.

I transferred the citations into Zotero because I wanted to showcase how citations could be repurposed using its API as well as through its other features. I’m a firm believer in learning by doing because sometimes you only notice the low beam once you’ve hit your head. In this case, it was only when I tried to reformat my bibliography using Zotero’s API that I learned the API has a limit of 150 records.

(This is why I decided to showcase primarily the scholarly works in the “Leddy Library” version of the #BlackLivesCDNSyllabus and cut the list down to below 150 by excluding websites, videos, and musical artists.)

One of the most underappreciated features of Zotero is its API.

To demonstrate its simple power: here’s the link to the Leddy Library #BlackLivesCDNSyllabus using the API in which I’ve set the records to be formatted using the MLA Style: [documentation]

You can embed this code into a website using jQuery like so:

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Leddy Library #BlackLivesCDNSyllabus</title>
  <style>
  body {
    font-size: 12px;
    font-family: Arial;
  }
  </style>
  <script src=""></script>
</head>
<body>
  <h1>Leddy Library #BlackLivesCDNSyllabus</h1>
  <p>
  <div id="a"></div>
  <script>
  $( "#a" ).load("");
  </script>
</body>
</html>

The upshot of using the API is that when you need to update the bibliography, any additions to your Zotero group will automatically be reflected through the API: you don’t need to update the website manually.
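For reference, here is a rough sketch in Python of what building such an API call looks like. The group id below is a placeholder, but `format=bib`, `style`, `start`, and `limit` are real Zotero v3 API parameters, and stepping `start` is one way to page through a library larger than the record cap:

```python
from urllib.parse import urlencode

def zotero_bib_url(group_id, style="modern-language-association", start=0, limit=100):
    """Build a Zotero API request that returns citations formatted in a CSL style."""
    base = "https://api.zotero.org/groups/{}/items".format(group_id)
    params = {"format": "bib", "style": style, "start": start, "limit": limit}
    # Sort for a stable, cache-friendly query string
    return base + "?" + urlencode(sorted(params.items()))

# Page through a larger library by stepping `start`
urls = [zotero_bib_url("123456", start=s) for s in (0, 100)]
```

Any of these URLs can be dropped into the jQuery `.load()` call shown above, so the embedded bibliography always reflects the current state of the Zotero group.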

For my purposes, I didn’t want to use Zotero to generate just a bibliography: I wanted it to generate a list of titles and links so that a user could travel directly from a bibliographic entry to the Leddy Library catalogue to see if and where a book was waiting on a shelf in the Leddy Library.

Now, I know that’s not the purpose of a bibliography – a bibliography presents identifying information about a work and it doesn’t have to tell you where it is located (unless, of course, that item is available online, then, why wouldn’t you?).  Generally you don’t want to embed particular information such as links to your local library catalogue into your bibliography precisely because that information makes your bibliography less useful to everyone else who isn’t local.

The reason why I wanted to include direct links to material is largely because I believe our library catalogue’s OpenURL resolver has been realized so poorly that it is actually harmful to the user experience. You see, if you use our resolver while using Google Scholar to find an article – the resolver works as it should.

But if the reader is looking for a book, the resolver states that there is No full text available, even if the library currently has the book on the shelf (this information is under the holdings tab).

In order to ensure that book material would be found without ambiguity, our library’s co-op student and I manually added URLs pointing directly to each respective record in the library catalogue to each of the 150 or so Zotero entries in our #BlackLivesCDNSyllabus collection. This took some time.

Now all I had to do was create a blog entry that included the bibliography…

I will now explain two ways you can re-purpose the display of Zotero records for your own use.

The first method I investigated was the creation of my own Zotero Citation Style. Essentially, I took an existing citation style and then added the option to include the call number and the URL field using the Visual Citation Style Editor,  a project which was the result of a collaboration of Columbia University Libraries and Mendeley from some years ago.

I took my now customized citation style, uploaded it to a server, and now I can use it as my own style whenever I need it:

I can now copy this text and paste into my library’s website ‘blog form’ and in doing so, all the URLs will automatically turn into active links.

There’s another method to achieve the same ends but in an even easier way. Zotero has an option called Reports that allows you to generate a printer-friendly report of a collection of citations.

Unfortunately, the default view of the report is to show you every single field that has information in it. Luckily there is the Zotero Reports Customizer which allows one to limit what’s shown in the report:

There’s only one more hack left to mention. While the Zotero Report Customizer was invaluable, it doesn’t allow you to remove the link from each item’s title. The only option seemed to be removing the almost 150 links by hand…

Luckily the text editor Sublime Text has an amazing power: Quick Find All — which allows the user to select all the matching text at once.

Then, after I had the beginning of every link selected, I used the ‘Expand selection to quotes’ option that you can add to Sublime Text via Package Control and removed the offending links. MAGIC!

The resulting HTML was dropped into my library’s Drupal-driven blog form and results in a report that looks like this:

Creating and sharing bibliographies and lists of works from our library catalogues should not be this hard.

It should not be so hard for people to share their recommendations of books, poets, and creative works with each other.

It all brings to mind this passage by Paul Ford from his essay The Sixth Stage of Grief Is Retro-computing:

Technology is What We Share

Technology is what we share. I don’t mean “we share the experience of technology.” I mean: By my lights, people very often share technologies with each other when they talk. Strategies. Ideas for living our lives. We do it all the time. Parenting email lists share strategies about breastfeeding and bedtime. Quotes from the Dalai Lama. We talk neckties, etiquette, and Minecraft, and tell stories that give us guidance as to how to live. A tremendous part of daily life regards the exchange of technologies. We are good at it. It’s so simple as to be invisible. Can I borrow your scissors? Do you want tickets? I know guacamole is extra. The world of technology isn’t separate from regular life. It’s made to seem that way because of, well…capitalism. Tribal dynamics. Territoriality. Because there is a need to sell technology, to package it, to recoup the terrible investment. So it becomes this thing that is separate from culture. A product.

Let’s not make sharing just another product that we have to buy from a library vendor. Let’s remember that sharing is not separate from culture.

This is the second part of a series called The Hashtag Syllabus. Part One is a brief examination of the recent phenomenon of generating and capturing crowdsourced syllabi on Twitter, and Part Three looks to Marshall McLuhan and Patrick Wilson for comment on the differences between a library and a bibliography.

LibUX: On ethics in the digital divide

Mon, 2016-08-15 17:02

People without a good understanding of the tech ecosystem are vulnerable to people who want to sell them things and can’t properly evaluate what they are being sold. — Jessamyn West

SearchHub: Parallel Computing in SolrCloud

Mon, 2016-08-15 16:30

As we countdown to the annual Lucene/Solr Revolution conference in Boston this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Joel Bernstein’s session about Parallel Computing in SolrCloud.

This presentation provides a deep dive into SolrCloud’s parallel computing capabilities – breaking down the framework into four main areas: shuffling, worker collections, the Streaming API, and Streaming Expressions. The talk describes how each of these technologies work individually and how they interact with each other to provide a general purpose parallel computing framework.

Also included is a discussion of some of the core use cases for the parallel computing framework. Use cases involving real-time map reduce, parallel relational algebra, and streaming analytics will be covered.

Joel Bernstein is a Solr committer and search engineer for the open source ECM company Alfresco.

Parallel Computing with SolrCloud: Presented by Joel Bernstein, Alfresco from Lucidworks

Join us at Lucene/Solr Revolution 2016, the biggest open source conference dedicated to Apache Lucene/Solr on October 11-14, 2016 in Boston, Massachusetts. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


Islandora: Meet Your New Technical Lead

Mon, 2016-08-15 13:23

Hi, I'm Danny, and I'm the newest hire of the Islandora Foundation. My role within the Foundation is to serve as Technical Lead, and I want to take some time to introduce myself and inform everyone of just exactly what I'll be doing for them.

I guess for starters, I should delve a bit into my background. Since a very young age, I've always considered myself to be pretty nerdy. As soon as I learned how to read, my father had me in front of a 386 and taught me DOS. In high school, I discovered Linux and was pretty well hooked. It was around that time I started working with HTML, and eventually javascript and Flash. I graduated with honors from East Tennessee State University with a B.Sc. in Mathematics and a Physics minor, and was exposed to a lot of C++ and FORTRAN. I specialized in Graph Theory, which I didn't think at the time would lead to a career as a programmer, since I had decided to be an actuary after completing school. Fast forward a few years, and I have a couple actuarial exams under my belt and have become well versed in VBA programming and Microsoft Office. But I didn't really like it, and wanted to do more than spreadsheet automation. So I moved to Canada and went back to school for Computer Science, but quickly found my way into the workforce for pragmatic reasons (read: I had bills to pay). I managed to score a job in the small video game industry that's evolved on PEI. I did more Flash (sadly) but was also exposed to web frameworks like Ruby on Rails and Django. A lot of my time was spent writing servers for Facebook games, and tackled everything from game logic to payment systems. But that career eventually burned me out, as it eventually does to most folks, and I applied for a position at a small company named discoverygarden that I heard was a great place to work.

And that's how I first met Islandora. I was still pretty green for the transition from Drupal 6 to 7, but shortly after the port I was able to take on more meaningful tasks. After learning tons about the stack while working with discoverygarden, I was given the opportunity to work on what would eventually become CLAW. And now I'm fortunate enough to have the opportunity to see that work through as an employee of the Islandora Foundation. So before I start explaining my duties as Tech Lead, I'd really like to thank the Islandora Foundation for hiring me, and discoverygarden for helping me gain the skills I needed to grow into this position.

Now, as is tradition in Islandora, I'll describe my roles as hats.  I'm essentially wearing three of them:

  • Hat #1: 7.x-1.x guy 
    1. We have increasingly well-defined processes and workflows, and I'm committed to making sure those play out the way they should. But if, for whatever reason, a pull request has sat for too long and the community hasn't responded, I will make sure it is addressed: I will try to facilitate a community member who has the time and interest to look at it, and if that's not possible, I will review it myself.
    2. I will take part in and help chair committers' calls every other Thursday.
    3. I will attend a handful of Interest Group meetings.  There's too many for me to attend them all, but I'm starting with the IR and Security interest groups.
    4. Lastly, I will be serving as Release Manager for the next release, and will be working towards continuing to document and standardize the process to the best of my abilities, so that it's easier for other community members to take part in and lead that process from here on out.

  • Hat #2: CLAW guy
    1. We're currently in the process of pivoting from a Drupal 7 to Drupal 8 codebase, and I'm going to be shepherding that process as transparently as possible. This means I will be working with community members to develop a plan for the minimum viable product (or MVP for short). This will help defend against scope creep, and force ourselves as a community to navigate what all these changes mean. Between Fedora 4, PCDM, and Drupal 8, there's a lot that's different, and we need to all be on the same page. For everyone's sake, this work will be conducted as much as possible by conversations through the mailing lists, instead of solely at the tech calls. In the Apache world, if it doesn't happen on the list, it never happened. And I think that's a good approach to making sure people can at least track down the rationale for why we're doing certain things in the codebase.
    2. Using the MVP, I will be breaking down the work into the smallest consumable pieces possible.  In the past few years I've learned a lot about working with volunteer developers, and I fully understand that people have day jobs with other priorities.  By making the units of work as small as possible, we have better chance of receiving more contributions from interested community members.  In practice, I think this means I will be writing a lot of project templates to decrease ramp-up time for people, filling in boilerplate, and ideally even providing tests beforehand.
    3. I will be heavily involved in Sprint planning, and will be running community sprints for CLAW.
    4. I will be chairing and running CLAW tech calls, along with Nick Ruest, the CLAW Project Director.

  • Hat #3: UBC project guy
    • As part of a grant, the Foundation is working with the University of British Columbia Library and Press to integrate Fedora 4 with a front-end called Scalar. They will also be using CLAW as a means of ingesting multi-pathway books. So I will be overseeing contractor work for the integration with Scalar, while also pushing CLAW towards full book support.

I hope that's enough for everyone to understand what I'll be doing for them, and how I can be of help if anyone needs it.  If you need to get in touch with me, I can be found on the lists, in #islandora on IRC as dhlamb, or at  I look forward to working with everyone in the future to help continue all the fantastic work that's been done by everyone out there.


LibUX: Content Style Guide – University of Illinois Library

Mon, 2016-08-15 00:36

University of Illinois Library has made their content style guide available through a creative commons license. I feel like more than anyone I point to Suzanne Chapman‘s work. I saw her credited in the site’s footer and thought, “oh, yeah – of course.” She’s pretty great.

One of the best ways to ensure that our website is user-friendly is to follow industry best practices, keep the content focused on key user tasks, and keep our content up-to-date at all times.

Also, walk through their 9 Principles for Quality Content.

We have many different users with many different needs. They are novice and expert users, desktop and mobile users, people with visual, hearing, motor, or cognitive impairments, non-native English speakers, and users with different cultural expectations. Following these guidelines will help ensure a better experience for all our users. They will also help us create a more sustainable website.

  1. Content is in the right place
  2. Necessary, needed, useful, and focused on patron needs
  3. Unique
  4. Correct and complete
  5. Consistent, clear, and concise
  6. Structured
  7. Discoverable and makes sense out of context
  8. Sustainable (future-friendly)
  9. Accessible

All of these are elaborated and link out to a rabbit-hole of further reading.

Jonathan Rochkind: UC Berkeley Data Science intro to programming textbook online for free

Sat, 2016-08-13 22:36

Looks like a good resource for library/information professionals who don’t know how to program, but want to learn a little bit of programming along with (more importantly) computational and inferential thinking, to understand the technological world we work in. As well as those who want to learn ‘data science’!

Data are descriptions of the world around us, collected through observation and stored on computers. Computers enable us to infer properties of the world from these descriptions. Data science is the discipline of drawing conclusions from data using computation. There are three core aspects of effective data analysis: exploration, prediction, and inference. This text develops a consistent approach to all three, introducing statistical ideas and fundamental ideas in computer science concurrently. We focus on a minimal set of core techniques that apply to a vast range of real-world applications. A foundation in data science requires not only understanding statistical and computational techniques, but also recognizing how they apply to real scenarios.

For whatever aspect of the world we wish to study—whether it’s the Earth’s weather, the world’s markets, political polls, or the human mind—data we collect typically offer an incomplete description of the subject at hand. The central challenge of data science is to make reliable conclusions using this partial information.

In this endeavor, we will combine two essential tools: computation and randomization. For example, we may want to understand climate change trends using temperature observations. Computers will allow us to use all available information to draw conclusions. Rather than focusing only on the average temperature of a region, we will consider the whole range of temperatures together to construct a more nuanced analysis. Randomness will allow us to consider the many different ways in which incomplete information might be completed. Rather than assuming that temperatures vary in a particular way, we will learn to use randomness as a way to imagine many possible scenarios that are all consistent with the data we observe.

Applying this approach requires learning to program a computer, and so this text interleaves a complete introduction to programming that assumes no prior knowledge. Readers with programming experience will find that we cover several topics in computation that do not appear in a typical introductory computer science curriculum. Data science also requires careful reasoning about quantities, but this text does not assume any background in mathematics or statistics beyond basic algebra. You will find very few equations in this text. Instead, techniques are described to readers in the same language in which they are described to the computers that execute them—a programming language.


Jenny Rose Halperin: Hello world!

Fri, 2016-08-12 22:30

Welcome to WordPress. This is your first post. Edit or delete it, then start writing!

LITA: Using Text Editors in Everyday Work

Fri, 2016-08-12 14:35

In the LITA Blog Transmission featuring yours truly I fumbled in trying to explain a time logging feature in the Windows native program Notepad. You can see in the screenshot above that the syntax is .LOG and you put it at the top of the file. Then every time you open the file it adds a time stamp at the end of the file and places the cursor there so you can begin writing. This specific file is where I keep track of my continuing education work. Every time I finish a webinar or class I open this file and write it down (I’ll be honest, I’ve missed a few). At the end of the year I’ll have a nice document with dates and times that I can use to write my annual report.
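For anyone who wants to try the feature, a log file of this sort looks roughly like the following after a couple of openings (the entries below are made up, and the timestamp format follows your system's locale settings):

```
.LOG
10:02 AM 8/12/2016
Finished LITA webinar on user experience basics.
2:15 PM 8/16/2016
Completed online course module on metadata.
```

The `.LOG` line must be the very first line of the file, in capital letters, for Notepad to append a new timestamp on each open.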

I use Microsoft Notepad several times a day. In addition to its cool logging function I find it a dependable, simple text editor. Notepad is great for distraction-free, no-frills writing. If I have to copy and paste something from the web into a document—or from one document to another—and I don’t want to drag in all the formatting from the source I put the text in Notepad first. It cleans all the formatting off the text and lets me use it as I need. I use it as a quick way to create to-do lists or jot down notes while working on something else. It launches quickly and lets me get to work right away. Plus, if you delete all the text you’ve written—after you’re done working of course—you can close Notepad and there’s no dialog box asking if you want to save the file.

Prior to becoming a librarian I worked as a programmer writing code. Every single coder I worked with used Notepad to create, revise, and edit code. Sure, you can work in the program you’re writing and your office’s text editor—and you often do; we used something like the vi text editor—but sometimes you need to think through your code and you can’t do that in an executable. I used to have several Notepad files of handy code so that I could reference it quickly without needing to search through source code for it.

I’ve been thinking about Notepad more and more as I prepare for a coding program at my library. A good text editor is essential to writing code. Once you start using one you’ll find yourself reaching for it all the time. But it isn’t all Notepad all the time. If I actually have to troubleshoot code—which these days is mostly things in WordPress—I use Notepad++:

You can see the color highlighting that Notepad++ uses which is a great visual way to see if there are problems in your code without even reading it. It also features a document map which is a high-level view of your entire document on the right-hand side of the screen that highlights where you are in the code. There’s a function list that lists all the functions called in the file. Notepad++ has some other cool text editor functions like multi-editing (editing in several places in the file at the same time), and column mode editing (where you can select a column of text to edit instead of entire lines of code). It’s a very handy tool when you’re working on code.

These are not the only text editors out there. A quick search for lists of text editors gives you more choices than you need. Notepad++ is at the top of several lists and I have to say that I like it better than others I’ve tried. The best thing is most of these text editors are free so they’re easy to try out and see what works for you. They all have very similar feature sets so it often comes down to the user interface. While these two options are Windows operating system only, there are plenty of good text editors for Mac users, too.

Text editors won’t be the starting point for my coding program. We’ll focus on some non-tech coding exercises and some online tools like Scratch or Tynker and some physical items like Sphero or LEGO Mindstorm. While these are geared towards children, they are great for adults who have never interacted with code. (Sphero and Mindstorm do have a cost associated with them.) When I get to the point in our coding program where I want to talk about text editors I’ll focus on Notepad and Notepad++ but let people know there are other options. If I know my patrons, they’ll have suggestions for me.

Do you have any cool tips for your favorite text editor or perhaps just a recommendation?

SearchHub: Pivoting to the Query: Using Pivot Facets to build a Multi-Field Suggester

Fri, 2016-08-12 13:43

Suggesters, also known as autocomplete, typeahead or “predictive search” are powerful ways to accelerate the conversation between user and search application. Querying a search application is a little like a guessing game – the user formulates a query that they hope will bring back what they want – but sometimes there is an element of “I don’t know what I don’t know” – so the initial query may be a bit vague or ambiguous. Subsequent interactions with the search application are sometimes needed to “drill-in” to the desired information. Faceted navigation and query suggestions are two ways to ameliorate this situation. Facets generally work after the fact – after an initial attempt has been made, whereas suggesters seek to provide feedback in the act of composing the initial query – to improve its precision from the start. Facets also provide a contextual multi-dimensional visualization of the result set that can be very useful in the “discovery” mode of search.

A basic tenet of suggester implementations is to never suggest queries that will not bring back results. To do otherwise is pointless (it also does not inspire confidence in your search application!). Suggestions can come from a number of sources – previous queries that were found to be popular, suggestions that are intended to drive specific business goals and suggestions that are based on the content that has been indexed into the search collection. There are also a number of implementations that are available in Solr/Lucene out-of-the-box.

My focus here is on providing suggestions that go beyond the single term query – that provide more detail on the desired results and combine the benefits of multi-dimensional facets with typeahead. Suggestions derived from query logs can have this context but these are not controlled in terms of their structure. Suggestions from indexed terms or field values can also be used but these only work with one field at a time. Another focus of this and my previous blogs is to inject some semantic intelligence into the search process – the more the better. One way to do that is to formulate suggestions that make grammatical sense – constructed from several metadata fields – that create query phrases that clearly indicate what will be returned.

So what do I mean by “suggestions that make grammatical sense”? Just that we can think of the metadata that we may have in our search index (and if we don’t have, we should try to get it!) as attributes or properties of some items or concepts represented by indexed documents. There are potentially a large number of permutations of these attribute values, most of which make no sense from a grammatical perspective. Some attributes describe the type of thing involved (type attributes), and others describe the properties of the thing. In a linguistic sense, we can think of these as noun and adjective properties respectively.

To provide an example of what I mean, suppose that I have a search index about people and places. We would typically have fields like first_name, last_name, profession, city and state. We would normally think of these fields in this order or maybe last_name, first_name city, state – profession as in:

Jones, Bob Cincinnati, Ohio – Accountant

or

Bob Jones, Accountant, Cincinnati, Ohio

But we would generally not use:

Cincinnati Accountant Jones Ohio Bob

Even though this is a valid mathematical permutation of field value ordering. So if we think of all of the possible ways to order a set of attributes, only some of these “make sense” to us as “human-readable” renderings of the data.

Turning Pivot Facets “Around” – Using Facets to generate query suggestions

While facet values by themselves are a good source of query suggestions because they encapsulate a record’s “aboutness”, they can only do so one attribute at a time. This level of suggestion is already available out-of-the-box with Solr/Lucene Suggester implementations which use the same field value data that facets do in the form of a so-called uninverted index (aka the Lucene FieldCache or indexed Doc Values). But what if we want to combine facet fields as above? Solr pivot facets (see “Pivot Facets Inside And Out” for background on pivot facets) provide one way of combining an arbitrary set of fields to produce cascading or nested sets of field values. Think of it as a way of generating a facet value “taxonomy” – on the fly. How does this help us? Well, we can use pivot facets (at index time) to find all of the permutations for a compound phrase “template” composed of a sequence of field names – i.e. to build what I will call “facet phrases”. Huh? Maybe an example will help.

Suppose that I have a music index, which has records for things like songs, albums, musical genres and the musicians, bands or orchestras that performed them as well as the composers, lyricists and songwriters that wrote them. I would like to search for things like “Jazz drummers”, “Classical violinists”, “progressive rock bands”, “Rolling Stones albums” or “Blues songs” and so on. Each of these phrases is composed of values from two different index fields – for example “drummer”, “violinist” and “band” are musician or performer types. “Rolling Stones” are a band which as a group is a performer (we are dealing with entities here which can be single individuals or groups like the Stones). “Jazz”, “Classical”, “Progressive Rock” and “Blues” are genres and “albums” and “songs” are recording types (“song” is also a composition type). All of these things can be treated as facets. So if I create some phrase patterns for these types of queries like “musician_type, recording_type” or “genre, musician_type” or “performer, recording_type” and submit these as pivot facet queries, I can construct many examples of the above phrases from the returned facet values. So for example, the pivot pattern “genre, musician_type” would return things like, “jazz pianist”, “rock guitarist”, “classical violinist”, “country singer” and so on – as long as I have records in the collection for each of these category combinations.
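As a sketch of how the returned pivot buckets can be flattened into facet phrases: a query like facet.pivot=genre,musician_type returns nested buckets, and walking the nesting produces the two-field phrases. The field names, values and counts below are illustrative, but the response shape mirrors Solr's facet_pivot JSON structure:

```python
def pivot_phrases(pivot_buckets, sep=" "):
    """Flatten one level of a Solr pivot facet response into suggestion phrases.

    Each outer bucket is a value of the first pivot field (e.g. genre);
    its nested "pivot" list holds values of the second field (e.g. musician_type).
    Only combinations that actually occur in the index appear here, so every
    generated phrase is guaranteed to bring back results.
    """
    phrases = []
    for outer in pivot_buckets:
        for inner in outer.get("pivot", []):
            phrases.append("{}{}{}".format(outer["value"], sep, inner["value"]))
    return phrases

# Shape of facet_pivot["genre,musician_type"] from a Solr response (sample data)
sample = [
    {"field": "genre", "value": "jazz", "count": 42,
     "pivot": [{"field": "musician_type", "value": "pianist", "count": 7},
               {"field": "musician_type", "value": "drummer", "count": 5}]},
    {"field": "genre", "value": "classical", "count": 30,
     "pivot": [{"field": "musician_type", "value": "violinist", "count": 9}]},
]
print(pivot_phrases(sample))
# ['jazz pianist', 'jazz drummer', 'classical violinist']
```

Deeper pivot patterns (three or more fields) can be flattened the same way with a recursive walk.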

Once I have these phrases, I can use them as query suggestions by indexing them into a collection that I use for this purpose. It would also be nice if the precision that I am building into my query suggestions was honored at search time. This can be done in several ways. When I build my suggester collection using these pivot patterns, I can capture the source fields and send them back with the suggestions. This would enable precise filter or boost queries to be used when they are submitted by the search front end. One potential problem here is if the user types the exact same query that was suggested – i.e. does not select from the typeahead dropdown list. In this case, they wouldn’t get the feedback from the suggester but we want to ensure that the results would be exactly the same.

The query autofiltering technique that I have been developing and blogging about is another solution to matching the precision of the response with the added precision of these multi-field queries. It would work whether or not the user clicked on a suggestion or typed in the phrase themselves and hit “enter”. Some recent enhancements to this code that enable it to respond to verb, prepositional or adjectives and to adjust the “context” of the generated filter or boost query, provide another layer of precision that we can use in our suggestions. That is, suggestions can be built from templates or patterns in which we can add “filler” terms such as the verbs, prepositions and adjectives that the query autofilter now supports.

Once again, an example may help to clear up the confusion. In my music ontology, I have attributes for “performer” and “composer” on documents about songs or recordings of songs. Many artists whom we refer to as “singer-songwriters”, for example, occur as both composers and performers. So if I want to search for all of their songs, regardless of whether they wrote or performed them, I can search for something like:

Jimi Hendrix songs

If I want to just see the songs that Jimi Hendrix wrote, I would like to search for

“songs Jimi Hendrix wrote” or “songs written by Jimi Hendrix”

which should return titles like “Purple Haze”, “Foxy Lady” and “The Wind Cries Mary”

In contrast, the query:

“songs Jimi Hendrix performed”

should include covers like “All Along the Watchtower”, “Hey Joe” and “Sgt. Pepper’s Lonely Hearts Club Band”.


Similarly, “songs Jimi Hendrix covered”

would not include his original compositions.

In this case, the verb phrases “wrote” or “written by”, “performed” or “covered” are not field values in the index but they tell us that the user wants to constrain the results either to compositions or to performances. The new features in the query autofilter can handle these things now but what if we want to make suggestions like this?

To do this, we write pivot template patterns like these:

${composition_type} ${composer} wrote

${composition_type} written by ${composer}

${composition_type} ${performer} performed
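A minimal sketch of how such templates might be expanded (Python's `string.Template` here stands in for whatever the Java builder actually uses; the facet values are invented examples):

```python
from itertools import product
from string import Template

# Illustrative sketch: expand pivot template patterns such as
# "${composition_type} written by ${composer}" with facet values pulled
# from the content collection. Values here are invented examples.
templates = [
    Template("${composition_type} ${composer} wrote"),
    Template("${composition_type} written by ${composer}"),
]
facet_values = {
    "composition_type": ["songs"],
    "composer": ["Jimi Hendrix", "John Lennon"],
}

def expand(templates, facet_values):
    """Substitute every combination of facet values into each template."""
    fields = sorted(facet_values)
    out = []
    for tmpl in templates:
        for combo in product(*(facet_values[f] for f in fields)):
            out.append(tmpl.substitute(dict(zip(fields, combo))))
    return out

suggestions = expand(templates, facet_values)
# e.g. "songs Jimi Hendrix wrote", "songs written by John Lennon", ...
```

The filler words (“wrote”, “written by”) are baked into the suggestion text so that, at query time, the autofilter's verb/preposition handling can map them back to the composer versus performer constraint.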

Code to do Pivot Facet Mining

The source code to build multi-field suggestions using pivot facets is available on github. The code works as a Java Main client that builds a suggester collection in Solr.

The design of the suggester builder includes one or more “query collectors” that feed query suggestions to a central “suggester builder” that a) validates the suggestions against the target content collection and b) can obtain context information from the content collection using facet queries (see below). One of the implementations of query collector is the PivotFacetQueryCollector. Other implementations can get suggestions from query logs, files, Fusion signals and so on.

The github distribution includes the music ontology dataset that was used for this blog article and a sample configuration file to build a set of suggestions on the music data. The ontology itself is also on github as a set of XML files that can be used to create a Solr collection but note that some preprocessing of the ontology was done to generate these files. The manipulations that I did on the ontology to ‘denormalize’ or flatten it will be the subject of a future blog post as it relates to techniques that can be used to ‘surface’ interesting relationships and make them searchable without the need for complex graph queries.

Using facets to obtain more context about the suggestions

The notion of “aboutness” introduced above can be very powerful. Once we commit to building a special Solr collection (also known as a ‘sidecar’ collection) just for typeahead, there are other powerful search features that we now have to work with. One of them is contextual metadata. We can get this by applying facets to the query that the suggester builder uses to validate the suggestion against the content collection. One application of this is to generate security trimming ACL values for a suggestion by getting the set of ACLs for all of the documents that a query suggestion would hit on – using facets. Once we have this, we can use the same security trimming filter query on the suggester collection that we use on the content collection. That way we never suggest a query to a user that cannot return any results for them – in this case because they don’t have access to any of the documents that the query would return. Another thing we can do when we build the suggester collection is to use facets to obtain context about various suggestions. As discussed in the next section, we can use this context to boost suggestions that share contextual metadata with recently executed queries. 
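A toy sketch of the security-trimming idea (Python, with invented phrases and ACL tokens; the real flow would facet on the ACL field of the content collection when building each suggestion, then apply the same ACL filter query to the suggester collection):

```python
# Illustrative sketch: each suggester record stores the union of ACLs
# faceted from the documents its query would hit, so suggestions can be
# trimmed with the same filter used on the content collection.
# Phrases and ACL tokens here are invented.
suggester_records = [
    {"phrase": "quarterly finance report", "acls": {"finance", "execs"}},
    {"phrase": "jazz drummer",             "acls": {"public"}},
]

def visible_suggestions(records, user_acls):
    """Return only the suggestions whose ACLs intersect the user's."""
    return [r["phrase"] for r in records if r["acls"] & user_acls]

print(visible_suggestions(suggester_records, {"public"}))
```

This guarantees the invariant the article describes: a user is never shown a suggestion whose results they could not see.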

Dynamic or On-The-Fly Predictive Analytics

One of the very powerful and extremely user-friendly things that you can do with typeahead is to make it sensitive to recently issued queries. Typeahead is one of those use cases where getting good relevance is critical because the user can only see a few results and can’t use facets or paging to see more. Relevance is often dynamic in a search session meaning that what the user is looking for can change – even in the course of a single session. Since typeahead starts to work with only a few characters entered, the queries start at a high level of ambiguity. If we can make relevance sensitive to recently searched things we can save the user a lot of a) work and b) grief. Google seems to do just this. When I was building the sample Music Ontology, I was using Google and Wikipedia (yes, I did contribute!) to lookup songs and artists and to learn or verify things like who was/were the songwriter(s) etc. I found that if I was concentrating on a single artist or genre, after a few searches, Google would start suggesting long song titles with just two or three characters entered!! It felt as if it “knew” what my search agenda was! Honestly, it was kinda spooky but very satisfying.

So how can we get a little of Google’s secret sauce in our own typeahead implementations? Well the key here is context. If we can know some things about what the user is looking for we can do a better job of boosting things with similar properties. And we can get this context using facets when we build the suggestion collection! In a nutshell, we can use facet field values to build boost queries to use in future queries in a user session. The basic data flow is shown below:



This requires some coordination between the suggester builder and the front-end (typically JavaScript-based) search application. The suggester builder extracts context metadata for each query suggestion using facets obtained from the source or content collection and stores these values with the query suggestions in the suggester collection. To demonstrate how this contextual metadata can be used in a typeahead app, I have written a simple Angular JS application that uses this facet-based metadata in the suggester collection to boost suggestions that are similar to recently executed queries. When a query is selected from a typeahead list, the metadata associated with that query is cached and used to construct a boost query on subsequent typeahead actions.

So, for example if I type in the letter ‘J’ into the typeahead app, I get

Jai Johnny Johanson Bands
Jai Johnny Johanson Groups
J.J. Johnson
Jai Johnny Johanson
Juke Joint Jezebel
Juke Joint Jimmy
Juke Joint Johnny

But if I have just searched for ‘Paul McCartney’, typing in ‘J’ now brings back:

John Lennon
John Lennon Songs
John Lennon Songs Covered
James P Johnson Songs
John Lennon Originals
Hey Jude

The app has learned something about my search agenda! To make this work, the front end application caches the returned metadata for previously executed suggester results and stores this in a circular queue on the client side. It then uses the most recently cached sets of metadata to construct a boost query for each typeahead submission. So when I executed the search for “Paul McCartney”, the returned metadata was:

genres_ss:Rock,Rock & Roll,Soft Rock,Pop Rock
hasPerformer_ss:Beatles,Paul McCartney,José Feliciano,Jimi Hendrix,Joe Cocker,Aretha Franklin,Bon Jovi,Elvis Presley ( … and many more)
composer_ss:Paul McCartney,John Lennon,Ringo Starr,George Harrison,George Jackson,Michael Jackson,Sonny Bono

From this returned metadata – taking the top results, the cached boost query was:

genres_ss:”Rock”^50 genres_ss:”Rock & Roll”^50 genres_ss:”Soft Rock”^50 genres_ss:”Pop Rock”^50
hasPerformer_ss:”Beatles”^50 hasPerformer_ss:”Paul McCartney”^50 hasPerformer_ss:”José Feliciano”^50 hasPerformer_ss:”Jimi Hendrix”^50
composer_ss:”Paul McCartney”^50 composer_ss:”John Lennon”^50 composer_ss:”Ringo Starr”^50 composer_ss:”George Harrison”^50
memberOfGroup_ss:”Beatles”^50 memberOfGroup_ss:”Wings”^50

And since John Lennon is both a composer and a member of the Beatles, records with John Lennon are boosted twice, which is why these records now top the typeahead list. (I’m not sure why James P Johnson snuck in there, except that there are two ‘J’s in his name.)
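The caching-and-boosting loop can be sketched like this (Python rather than the article's Angular JS code; the field names follow the example above, while the boost weight, queue size and top-N cutoff are arbitrary choices):

```python
from collections import deque

# Illustrative sketch: cache facet metadata from recently executed
# suggestions in a small circular queue, then turn the top values per
# field into a Solr boost query string for subsequent typeahead requests.
recent_metadata = deque(maxlen=3)  # circular queue of recent query metadata
recent_metadata.append({
    "genres_ss": ["Rock", "Rock & Roll"],
    "composer_ss": ["Paul McCartney", "John Lennon"],
})

def boost_query(metadata_queue, boost=50, top_n=2):
    """Build a boost query from the cached metadata of recent queries."""
    clauses = []
    for metadata in metadata_queue:
        for field in sorted(metadata):
            for value in metadata[field][:top_n]:
                clauses.append('%s:"%s"^%d' % (field, value, boost))
    return " ".join(clauses)

print(boost_query(recent_metadata))
```

Because a value that appears under several fields (John Lennon as composer and as group member) contributes a clause per field, suggestions matching it accumulate boost, which is exactly the “double boost” effect seen above.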

This demonstrates how powerful the use of context can be. In this case, the context is based on the user’s current search patterns. Another take-home here is that facets, beyond their traditional role as a UI navigation aid, are a powerful way to build context into a search application. In this case, they were used in several ways – to create the pivot patterns for the suggester, to associate contextual metadata with suggester records and, finally, to use this context in a typeahead app to boost records that are relevant to the user’s most recent search goals. (The source code for the Angular JS app is also included in the github repository.)

We miss you Jimi – thanks for all the great tunes! (You are correct, I listened to some Hendrix – Beatles too – while writing this blog – is it that obvious?)


The post Pivoting to the Query: Using Pivot Facets to build a Multi-Field Suggester appeared first on

LibUX: How to Talk about User Experience — The Webinar!

Fri, 2016-08-12 04:28

Hey there. My writeup (“How to Talk about User Experience“) is now a 90-minute LITA webinar. I have pretty strong ideas about treating the “user experience” as a metric and I am super grateful to my friends at LITA for another opportunity to make the case.


The explosion of new library user experience roles, named and unnamed, the community growing around it, the talks, conferences, and corresponding literature signal a major shift. But the status of library user experience design as a professional field is impacted by the absence of a single consistent definition of the area. While we can workshop card sorts and pick apart library redesigns, even user experience librarians can barely agree about what it is they do – let alone why it’s important. How we talk about the user experience matters. So, in this 90 minute talk, we’ll fix that.

  • September 7, 2016
  • 1 – 2:30 p.m. (Eastern)
  • $45 – LITA Members
  • $105 – Non-members
  • $196 – Groups


Cynthia Ng: A Reflection on Two Years as Content Coordinator

Fri, 2016-08-12 04:26
After a little over two years, my time at the BC Libraries Cooperative (the co-op) working on the National Network for Equitable Library Service (NNELS) project will be coming to an end. As I prepare to leave, I thought I would reflect on my work while at the co-op. As always, I’ll start with my …

LibUX: Jobs to be Done and New Feature Planning – Workshop

Fri, 2016-08-12 04:11

I am teaching a 90 minute (!) online workshop on September 13th on Jobs to be Done and New Feature Planning, where — yep! — I will be talking about the Kano model. Those of you not familiar with the jobs to be done framework might still have heard

People don’t want a quarter-inch drill, they want a quarter-inch hole. Theodore Levitt, paraphrasing

— the observation being that people buy services not for the services themselves but to get jobs done. There is actually a lot written that takes this further, noting that while demographics and characteristics play some minor role, the task or job at hand is largely independent of them, rendering feature-planning around demographics — you might know them as personas — sort of useless. We’ll try to reconcile that, too.


Core to improving the library user experience is identifying need and introducing new and useful services, features, and content, but the risk of failure sometimes trumps our willingness to try anything out of the ordinary. What a shame, right? In this workshop, Michael Schofield — a developer, librarian, and chief #libuxer — will introduce you to methods and models for identifying the tasks patrons want to perform (or, their “jobs to be done”), and whether providing a new service or feature won’t actually have a negative impact on the overall library user experience.

  • September 13, 2016
  • 2 – 3:30 p.m. (Eastern)

The cost is free to library staff in the state of Florida, but you might still try to give it a whirl and let us know in the comments.


LibUX: Service Design for Libraries Workshop

Fri, 2016-08-12 03:47

Hey there. Michael here. I am running an online service design workshop courtesy of my amaaaazing friends at NEFLIN called Service Design for Libraries: From Map to Blueprint.


In this practical workshop, Michael Schofield — a developer, librarian, and chief #libuxer — introduces service design, its place is in the user experience zeitgeist, and its role deconstructing library services to hammer out the kinks. A brief conceptual overview is made-up for by a useful workshop that has attendees creating a customer journey map before morphing it into the practical service blueprint.

  • August 23, 2016
  • 2 – 3:30 p.m. (Eastern)
  • Free to library staff in the state of Florida (although, I think it’s free otherwise, too. Give it a whirl and let us know in the comments).


FOSS4Lib Recent Releases: Sufia - 7.1.0

Thu, 2016-08-11 22:29

Last updated August 11, 2016. Created by Peter Murray on August 11, 2016.

Package: Sufia
Release Date: Thursday, August 11, 2016

SearchHub: Queue Based Indexing &amp; Collection Management at Gannett

Thu, 2016-08-11 18:52

As we count down to the annual Lucene/Solr Revolution conference in Boston this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Devansh Dhutia’s session on how Gannett manages schema changes to large Solr collections.

Deploying schema changes to Solr collections with large volumes of data can be problematic when the reindex activity can take almost a whole day. Keeping in mind that Gannett’s 16 million document index grows by approximately 800,000 documents per month, the status quo isn’t satisfactory. A side effect of the current architecture is that during a Solr outage, not only are all reindex activities paused, but upstream authoring engines suffer from latency issues.

This talk demonstrates how Gannett is switching to a queue based solution with creative use of collections & aliases to dramatically improve the deployment, reindex, and authoring experiences. The solution also incorporates keeping a pair of Solr clouds in geographically dispersed data centers in an eventually synchronized state.

Devansh joined the Gannett family in 2006 and has been an active contributor to Gannett’s search strategy, starting with Lucene and, for the last 2 years, Solr. Devansh was one of the primary developers involved in switching Gannett from the traditional master-slave Solr setup to a geo-replicated Solr Cloud environment. When Devansh isn’t working with Solr, he enjoys spending time with his wife and 3-year-old daughter and trying new recipes.

Queue Based Solr Indexing with Collection Management: Presented by Devansh Dhutia, Gannett Co. from Lucidworks

Join us at Lucene/Solr Revolution 2016, the biggest open source conference dedicated to Apache Lucene/Solr on October 11-14, 2016 in Boston, Massachusetts. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


LITA: LITA online continuing education for September 2016

Thu, 2016-08-11 17:40

Start out the fall with these all new sessions, including a web course and two webinars:

Web Course:

Social Media For My Institution; from “mine” to “ours”
Instructor: Plamen Miltenoff
Starting Wednesday September 21, 2016, running for 4 weeks
Register Online, page arranged by session date (login required)

A course for librarians who want to explore the institutional application of social media. Based on an established academic course at St. Cloud State University, “Social Media in Global Context” (more information at ). A theoretical introduction will assist participants in detecting and differentiating the private use of social media from a structured approach to social media for an educational institution. Legal and ethical issues will be discussed, including future trends and management issues. The course will include hands-on exercises on the creation and dissemination of textual and multimedia content and on patron engagement, plus brainstorming on strategies suitable for the institution regarding resources (human and technological), workload sharing, storytelling, and branding.

This is a blended format web course:

The course will be delivered as 4 separate live webinar lectures, one per week on:

Wednesdays, September 21, 28, October 5 and 12
2:00 – 3:00 pm Central

You do not have to attend the live lectures in order to participate. The webinars will be recorded and distributed through the web course platform, Moodle, for asynchronous participation. The web course space will also contain the exercises and discussions for the course.

Details here and Registration here


How to Talk About User Experience
Presenter: Michael Schofield
Wednesday September 7, 2016
Noon – 1:30 pm Central Time
Register Online, page arranged by session date (login required)

The explosion of new library user experience roles, named and unnamed, the community growing around it, the talks, conferences, and corresponding literature signal a major shift. But the status of library user experience design as a professional field is impacted by the absence of a single consistent definition of the area. While we can workshop card sorts and pick apart library redesigns, even user experience librarians can barely agree about what it is they do – let alone why it’s important. How we talk about the user experience matters. So, in this 90 minute talk, we’ll fix that.

Details here and Registration here

Online Productivity Tools: Smart Shortcuts and Clever Tricks
Presenter: Jaclyn McKewan
Tuesday September 20, 2016
11:00 am – 12:30 pm Central Time
Register Online, page arranged by session date (login required)

Become a lean, mean productivity machine! In this 90 minute webinar we’ll discuss free online tools that can improve your organization and productivity, both at work and home. We’ll look at to-do lists, calendars, and other programs. We’ll also explore ways these tools can be connected, as well as the use of widgets on your desktop and mobile device to keep information at your fingertips.

Details here and Registration here

And don’t miss the other upcoming LITA fall continuing education offerings:


Beyond Usage Statistics: How to use Google Analytics to Improve your Repository, with Hui Zhang
Offered: Tuesday October 11, 2016, 11:00 am – 12:30 pm

Web courses:

Project Management for Success, with Gina Minks
Offered: October 2016, runs for 4 weeks

Contextual Inquiry: Using Ethnographic Research to Impact your Library UX, with Rachel Vacek and Deirdre Costello
Offered: October 2016, running for 6 weeks.

Check the Online Learning web page for more details as they become available.

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty,

Hydra Project: Hydra Connect 2016 Program news!

Thu, 2016-08-11 15:30

The Hydra Connect 2016 Program Committee thought that you might appreciate an update on how planning is going, so…

The list of workshops for Monday has been available on the wiki for some time now.  We shall shortly be asking delegates to indicate which sessions they hope to attend so that we can allocate appropriately sized rooms and so that convenors can send out any pre-workshop materials to them.

The conference proper will start on Tuesday with a plenary session, a mix of key presentations and lightning talks as at previous Connects.  On Tuesday afternoon we shall have the very popular poster session, for which we ask for a poster from every attending institution – please start planning!  As last year, we shall arrange for printing at a FedEx branch near the conference venue for those who prefer not to travel with a poster tube!  Details soon.

We received far more suggestions for Connect sessions than we have had in the past – in particular there were a lot of suggestions for panels and breakouts.  We’re pleased to report that by extending the “traditional” Wednesday morning parallel tracks into the afternoon we have managed to accommodate everyone’s requests.  We’ve timetabled presentations in 30-minute slots (a 20-minute presentation, 5 minutes or so for questions and a bit of time for possible movement between rooms).  Panel and breakout sessions have been timetabled in one hour slots (50-55 minutes plus movement time).  If you are involved in presenting or facilitating any of these sessions you should hear from us with confirmation at the end of next week when we have finished tweaking the timetable.  We have included a number of slots for lightning talks and we’ll start soliciting these at the end of the month.  We anticipate having the Tuesday and Wednesday programs on the wiki in ten days’ time or so and you’ll find there is so much to choose from that, inevitably, you will have to make some hard choices about which sessions to attend.  We are hoping (though this is yet to be confirmed) that we may be able to make, and subsequently post, audio recordings of all the sessions so that you can listen to those that you couldn’t attend once you return home.

Thursday morning has been given over to unconference sessions and we hope to make “Sessionizer” available to delegates in about three weeks’ time so that you can start requesting slots.  Thursday afternoon is available for Interest Groups and Working Groups to have face-time.  We shall make any spare room capacity on Thursday available for booking to allow ad-hoc gatherings, Birds of a Feather sessions, and the like.

Booking is beginning to fill up and if you haven’t yet registered now would be a good time to do so!  Full details of registration and the conference hotel are on the wiki. Please note that the specially negotiated hotel rate is only valid until September 6th and you must register by that same date to receive a Hydra t-shirt!

If you can only make it to one Hydra meeting in 2016/17, this is the one to attend! 

Open Knowledge Foundation: Update on OpenTrialsFDA: finalist for the Open Science Prize

Thu, 2016-08-11 11:59

In May, the OpenTrialsFDA team (a collaboration between Erick Turner, Dr. Ben Goldacre and the  OpenTrials team at Open Knowledge) was selected as a finalist for the Open Science Prize. This global science competition is focused on making both the outputs from science and the research process broadly accessible to the public. Six finalists will present their final prototypes at an Open Science Prize Showcase in early December 2016, with the ultimate winner to be announced in late February or early March 2017.

As the name suggests, OpenTrialsFDA is closely related to OpenTrials, a project funded by The Laura and John Arnold Foundation that is developing an open, online database of information about the world’s clinical research trials. OpenTrialsFDA will work on increasing access, discoverability and opportunities for re-use of a large volume of high quality information currently hidden in user-unfriendly Food and Drug Administration (FDA) drug approval packages (DAPs).

The FDA publishes these DAPs as part of the general information on drugs via its data portal known as Drugs@FDA. These documents contain detailed information about the methods and results of clinical trials, and are unbiased, compared to reports of clinical trials in academic journals. This is because FDA reviewers require adherence to the outcomes and analytic methods prespecified in the original trial protocols, so, in contrast to most journal editors, they are unforgiving of practices such as post hoc switching of outcomes and changes to the planned statistical analyses. These review packages also often report on clinical trials that have never been published.

A more complete picture: contrasting the journal version of antidepressant trials with the FDA information (image: Erick Turner, adapted from

However, despite their high value, these FDA documents are notoriously difficult to access, aggregate, and search. The website itself is not easy to navigate, and much of the information is stored in PDFs or non-searchable image files for older drugs. As a consequence, they are rarely used by clinicians and researchers. OpenTrialsFDA will work on improving this situation, so that valuable information that is currently hidden away can be discovered, presented, and used to properly inform evidence-based treatment decisions.

The team has started to scrape the FDA website, extracting the relevant information from the PDFs through a process of OCR (optical character recognition). A new OpenTrialsFDA interface will be developed to explore and discover the FDA data, with application programming interfaces (APIs) allowing third party platforms to access, search, and present the information, thus maximising discoverability, impact, and interoperability. In addition, the information will be integrated into the OpenTrials database, so that for any trial for which a match exists, users can see the corresponding FDA data.

Future progress will be shared both through this blog and the OpenTrials blog: you can also sign up for the OpenTrials newsletter to receive regular updates and news. More information about the Open Science Prize and the other finalists is available from

Twitter: @opentrials