You are here

planet code4lib

Subscribe to planet code4lib feed
Planet Code4Lib -
Updated: 14 hours 43 sec ago

Karen Coyle: Classification, RDF, and promiscuous vowels

Wed, 2016-08-17 23:23
"[He] provided (i) a classified schedule of things and concepts, (ii) a series of conjunctions or 'particles' whereby the elementary terms can be combined to express composite subjects, (iii) various kinds of notational devices ... as a means of displaying relationships between terms." [1]

"By reducing the complexity of natural language to manageable sets of nouns and verbs that are well-defined and unambiguous, sentence-like statements can be interpreted...."[2]

The "he" in the first quote is John Wilkins, and the date is 1668.[3] His goal was to create a scientifically correct language that would have one and only one term for each thing, and then would have a set of particles that would connect those things to make meaning. His one and only one term is essentially an identifier. His particles are linking elements.

The second quote is from a publication about OCLC's linked data experiements, and is about linked data, or RDF. The goals are so obviously similar that it can't be overlooked. Of course there are huge differences, not the least of which is the technology of the time.*

What I find particularly interesting about Wilkins is that he did not distinguish between classification of knowledge and language. In fact, he was creating a language, a vocabulary, that would be used to talk about the world as classified knowledge. Here we are at a distance of about 350 years, and the language basis of both his work and the abstract grammar of the semantic web share a lot of their DNA. They are probably proof of some Chomskian theory of our brain and language, but I'm really not up to reading Chomsky at this point.

The other interesting note is how similar Wilkins is to Melvil Dewey. He wanted to reform language and spelling. Here's the section where he decries alphabetization because the consonants and vowels are "promiscuously huddled together without distinction." This was a fault of language that I have not yet found noted in Dewey's work. Could he have missed some imperfection?!

*Also, Wilkins was a Bishop in the Anglican church, and so his description of the history of language is based literally on the Bible, which makes for some odd conclusions.

[1]Schulte-Albert, Hans G. Classificatory Thinking from Kinner to Wilkins: Classification and Thesaurus Construction, 1645-1668. Quoting from Vickery, B. C. "The Significance of John Wilkins in the History of Bibliographical Classification." Libri 2 (1953): 326-43.
[2]Godby, Carol J, Shenghui Wang, and Jeffrey Mixter. Library Linked Data in the Cloud: Oclc's Experiments with New Models of Resource Description. , 2015.
[3] Wilkins, John. Essay Towards a Real Character, and a Philosophical Language. S.l: Printed for Sa. Gellibrand, and for John Martyn, 1668.

SearchHub: Learning to Rank in Solr

Wed, 2016-08-17 20:08

As we countdown to the annual Lucene/Solr Revolution conference in Boston this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Bloomberg’s Michael Nilsson and Diego Ceccarelli’s talk, “Learning to Rank in Solr”.

In information retrieval systems, learning to rank is used to re-rank the top X retrieved documents using trained machine learning models. The hope is that sophisticated models can make more nuanced ranking decisions than a standard Solr query. Bloomberg has integrated a reranking component directly into Solr, enabling others to easily build their own learning to rank systems and access the rich matching features readily available in Solr. In this session, Michael and Diego review the internals of how Solr and Lucene score documents and present Bloomberg’s additions to Solr that enable feature engineering, feature extraction, and reranking.

Michael Nilsson is a software engineer working at Bloomberg LP, and has been a part of the company’s Search and Discoverability team for four years. He’s used Solr to build the company’s terminal cross domain search application, searching though millions of people, companies, securities, articles, and more.

Diego Ceccarelli is a software engineer at Bloomberg LP, working in the News R&D team. His work focuses on improving search relevance in the news search functions. Before joining Bloomberg, Diego was a researcher in Information Retrieval at the National Council of Research in Italy, whilst completing his Ph.D. in the same field at the University of Pisa. He is experienced in Lucene and Solr, dating back to his work on the Europeana project in 2010, and since then enjoys diving into these technologies.

Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bloomberg LP from Lucidworks

Join us at Lucene/Solr Revolution 2016, the biggest open source conference dedicated to Apache Lucene/Solr on October 11-14, 2016 in Boston, Massachusetts. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Learning to Rank in Solr appeared first on

LITA: Jobs in Information Technology: August 17, 2016

Wed, 2016-08-17 19:35

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Oak Park Public Library, Web Services Specialist, Oak Park, IL

Denver Public Library, Digital Project Manager, Denver, CO

Denver Public Library, Technology Access and Training Manager, Denver, CO

The Folger Shakespeare Library, Digital Strategist, Washington, DC

Darien Library, Senior Technology Assistant, Norwalk, CT

Champlain College, Technology Librarian, Burlington, VT

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Equinox Software: 4,208 Days, 22,653 Commits, 1,883,352 Lines of Code

Wed, 2016-08-17 16:09

Ten years ago, something remarkable happened. A brand new open source ILS went live in over 200 libraries across the state of Georgia. While the migration happened in a matter of days, it was the culmination of two years’ worth of work by a small team.

Today, that same open source ILS is utilized by more than 1,500 libraries all over the United States, Canada, and across the world. The small team has grown into an active community, supporting and improving the software each year. That software is Evergreen and Equinox is beyond proud to be the leading provider of support and development services for it.

As we approach Evergreen’s tenth birthday–Labor Day weekend–we’ll look at each year of Evergreen’s life. Equinox Team Members will be posting a blog post each day leading up to Labor Day, beginning on Thursday, August 18 (That’s tomorrow! Yay!).  Join us as we take a closer look at the software that has brought so many people together.

Evergreen’s Baby Picture handwritten by Mike Rylander

Open Knowledge Foundation: An interview with Rufus Pollock – Why I am Excited about MyData 2016 in Finland

Wed, 2016-08-17 10:50

A few weeks ago I sat down for a virtual interview with Molly Schwartz from Open Knowledge Finland about my thoughts on open data and mydata and why I am so excited about the MyData 2016 conference. The three-day conference is taking place from August 31 to September 2 in Helsinki and is being organized by Open Knowledge Finland in partnership with Aalto University and Fing.

You can register for MyData 2016 here. Discount price for the members of the Open Knowledge Network is just 220 eur / 3 day conference. Ask for the discount code from ( before registering at the MyData 2016 Holvi store. You can also still apply to be a volunteer for the conference.

This event shares many of the same organizers as the 2012 Open Knowledge Festival in Helsinki so you can expect the same spirit of fun, creativity and quality that made that such an incredible experience.


Molly Schwartz: So hi everybody, this is Molly Schwartz here, one of the team members helping to put on the MyData conference in Helsinki from August 31 to September 2. And I’m sitting here with one of our plenary speakers, Dr. Rufus Pollock, who is one of founders and the president of Open Knowledge, a worldwide network working to provide access to more open and broad datasets. So we’re very excited to have him here. So, Rufus, something that not a lot of people know is that MyData is actually an initiative that was born out the Finnish chapter of Open Knowledge (OKFFI), how do you feel about things that were kind of started by your idea springing up of their own accord?

Rufus Pollock: Well, it’s inspirational and obviously really satisfying. And not just in a personal way: it’s just wonderful to see how things flourish. Open Knowledge Finland have been an incredibly active chapter. I first went to Finland in I think it was 2010, and I was inspired then. Finland is just a place where you have a feeling you are in a wise society. The way they approach things, they’re very engaged but they have that non-attachment, a rigor of looking at things, and also trying things out. Somehow there’s not a lot of ego, people are very curious to learn and also to try things out, and I think deep down are incredibly innovative.

And I think this event is really in that tradition. I think the area of personal data and MyData is a huge issue, and one with a lot of connections to open data, even if it’s distinct. So I think it’s a very natural thing for a chapter from Open Knowledge to be taking on and looking at because it’s central to how we look at the information society, the knowledge society, of the 21st Century.

MS: Definitely. I totally agree. I like that you brought up that this concept of personal data is somewhat distinct, but it’s inevitably tied to this concept of opening data. Oftentimes opening datasets, you’re dealing with personal datasets as well. So, what are the kind of things you’re planning to speak about, loosely, at the conference, and what do you look forward to hearing from other people who will be at the MyData?

RP: Yes, that’s a great question. So, what am I looking to talk about and engage with and what am I looking forward to hearing about? Well, maybe I’ll take the second first.

What I am looking forward to

I think one of the reasons I’m really excited to participate and come is it’s the area where – even though I obviously know a lot about data and open data – this area of personal data is one where I am not as much an expert – by a long way. So I’m really curious to hear about it and especially about things like: what is the policy landscape? What do people think are the big things that are coming up? I’m really interested to see what the business sector is looking at.

There’s been quite a lot of discussion about how one could innovate in this space in a way that is both a wider opportunity for people to use data, personal data in usable ways, maybe in health care, maybe in giving people credit, I mean in all kinds of areas. But how do you do that in a way that respects and preserves people’s privacy, and so on. So, I think that’s really interesting as well, and again I’m not so up on that space. I’m looking forward to meeting and hearing from some of the people in that area.

And similarly on the policy, on the business side, and also on the civil society side and on the research side. I’ve heard about things like differential privacy and some of the breakthroughs we’ve had over the last years about how one might be able to allow people like researchers to analyse information, like genetics, like healthcare without getting direct access to the individual data and creating privacy issues. And there’s clearly a lot of value one could have from researchers being able to look at, for example, at genomic data from individuals across a bunch of them. But it’s also crucial to be able to preserve privacy there, and what are the kind of things going on there? And the research side I think would also touch on the policy side of matters as well.

What I would like to contribute

That brings me to what, for my part, I would like to contribute. I think Open Knowledge and we generally are on a journey at a policy level. We’ve got this incredible information revolution, this digital revolution, which means we’re living in a world of bits, and we need to make sure that world works for everyone. And that it works, in the sense that, rather than delivering more inequality – which it could easily do – and more exploitation, it give us fairness and empowerment, it brings freedom rather than manipulation or oppression. And I think openness is just key there.

And this vision of openness isn’t limited to just government – we can do it for all public datasets. By public datasets I don’t just mean government datasets, I mean datasets that you can legitimately share with anyone.

Now private, personal data you can’t legitimately give to anyone, or share with anyone — or you shouldn’t be able to!

So I think an interesting question is how those two things go together — the public datasets and the private, personal data. How they go together both in overall policy, but also in the mind of the public and of citizens and so on — how are they linked?

And this issue of how we manage information in the 21st century doesn’t just stop at some line where it’s like, oh, it’s public data, you know, and therefore we can look at it this way. Those of us working to make a world of open information have to look at private data too.

At Open Knowledge we have always had this metaphor of a coin. And one side of this coin is public data, e.g. government data. Now that you can open to everyone, everyone is empowered to have access. Now the flip side of that coin is YOUR data, your personal data. And your data is yours: you should get to choose how it’s shared and how it’s used.

Now while Open Knowledge is generally focused on, if you like, the public side, and will continue to do so, overall I think across the network this issue of personal data is just huge, it has this huge linkage. And I think the same principles can be applied. Just as for open data what we say is that people have freedom to access, share, use, government, whatever data is being opened, so with YOUR data, YOU should be empowered to access, share, and use that, as YOU see fit. And right now that is just not the case. And that’s what leads to the abuses we get concerned about, but it’s also what stops some of the innovation and stops people from being empowered and able to understand and take action on their own lives — what might you learn from having my last five years of, say, your shopping receipts or mobile phone location data.

Ultimately what happens to public data and what happens to personal data, they’re interconnected, both in people’s minds and, in a sense, they don’t just care about one thing or another, they care about, how is digital information going to work, how’s my data going to be managed, how’s the world’s data going to be managed.

I also think MyData is some of the most relevant issues for ordinary people. For example, just recently I had to check if someone paid me and it was just a nightmare. I had to scroll back through endless screens on my online banking account to find ways to download different files to piece it all together. Why didn’t they let me download all the data in a convenient way rather than having to dig forever and then only get the last three months. They’ve got that data on their servers, why can’t I have it? And, you know, maybe not only do I want it, but maybe there’s some part I would share anonymized, it could be aggregated and we could discover patterns that might be important — just, as one example we might be able to estimate inflation better. Or take energy use: I would happily share my house’s energy use data with people, even if it does tell you when I go to bed, I’d be happy to share if that let’s us discover how to make things environmentally better.

The word I think at the heart of it is empowerment. We at Open Knowledge want to see people empowered in the information age to understand, to make choices, to hold power to account, and one of the fundamental things is you being empowered with the information about you that companies or governments have, and we think you should be given access to that, and you should be choosing who else has access to it, and not the company, and not the government, per se.

MS: Yes. And that’s exactly the principle of why MyData came out of Open Knowledge, as you mentioned earlier, the idea of why can not these principles of Open Knowledge, of the datasets we want to be receiving, also apply to our data that we would like to be open in the same way back to us?

RP: Absolutely correct Molly, I mean just yes, absolutely.

MS: And that’s why it’s also so interesting, so many people have been talking about this kind of inherent tension between openness and privacy, and kind of, changing how we’re thinking about that, and seeing it actually as the same principles just being applied to individual people.

RP: Exactly, back in 2013 I wrote a post with my co-CEO Laura James about this idea and even used the term MyData. There’s an underlying unity that you’re pointing out that actually is a deep principle.

Remember openness isn’t an end in itself, right, it’s a means to an end – like money! And the purpose of having information and opening it up is to empower human beings to do something, to understand, to innovate, to learn, to discover, to earn a living, whatever it is. And that idea of empowerment, fundamentally, is common in both threads, both to MyData and personal data and access to that, and the access to public data for everyone. So I think you are totally right.

MS: Yes. So, thank you so much Rufus for joining us today, we are so looking forward to having you at the conference. You mention that you’ve been to Finland before. How long ago was that?

RP: I was there in 2012 for Open Knowledge Festival which was amazing. And then in 2010. Finland is an amazing place, Helsinki is an amazing place, and it will be an amazing event, so I really invite you to come along to the conference.

MS: I second that, and it’s many of the same people who are involved in organizing the Open Knowledge Festival who are involved in organizing MyData, so we can expect much of the same.

RP: A brilliant programme, high quality people. An incredible kind of combination of kind of joy and reliability, so you’ll have an amazing time, come join us.

MS: Yes. Ok, so thank you Rufus, and we will see you in August!

RP: See you in August!

LibUX: A practical security guide for web developers

Wed, 2016-08-17 03:50

Lisa Haitz in slack pointed out this gem of a repo, intended to be a practical security guide for web developers.

Security issues happen for two reasons –

1. Developers who have just started and cannot really tell a difference between using MD5 or bcrypt.
2. Developers who know stuff but forget/ignore them.

Our detailed explanations should help the first type while we hope our checklist helps the second one create more secure systems. This is by no means a comprehensive guide, it just covers stuff based on the most common issues we have discovered in the past.

Their security checklist — I think — demonstrates just how involved web security can be in that, first, no wonder so many mega-sites have been hacked in the last year, and second, libraries probably aren’t ready for anticipatory design.

A practical security guide for developers

LibUX: UI Content Resources

Wed, 2016-08-17 03:38

In our slack, Penelope Singer shared this mega-list of articles, books, and examples for making good content, establishing a content strategy, and the like.

You’ll find this list useful if:
* You’re a writer working directly with an interface
* You’re a designer that is often tasked with writing user interface copy
* You’re a content strategist working on a product and want to learn more about the words used in an interface
* You’re a copywriter and want to learn more about user experience

UI Content Resources

LibUX: Circulating Ideas #99: Cecily Walker

Wed, 2016-08-17 02:22

We — Amanda and Michael — were honored to guest-host an episode of Circulating Ideas, interviewing Cecily Walker about design thinking and project management. Steve Thomas was nice enough to let us re-broadcast our interview.

Cecily Walker is a librarian at Vancouver Public Library, where she focuses on user experience, community digital projects, digital collections, and the intersection of social justice, technology, and public librarianship. It was her frustration with the way that software was designed to meet the needs of highly technical users rather than the general public that led her to user experience, but it was her love of information, intellectual freedom, and commitment to social justice that led her back to librarianship. Cecily can be found on Twitter (@skeskali) where she frequently holds court on any number of subjects, but especially lipstick.

Show notes

This Vancouver
“UX, consideration, and a CMMI-based model” [Coral Sheldon-Hess]
“Mindspring’s 14 Deadly Sins”
Cecily on Twitter


If you like, you can download the MP3 or subscribe to LibUX on StitcheriTunes, YouTube, Soundcloud, Google Play Music, or just plug our feed straight into your podcatcher of choice.

William Denton: Bad-Ass Librarians

Wed, 2016-08-17 01:14

I saw this at the bookstore today and bought it immediately: The Bad-Ass Librarians of Timbuktu and Their Race to Save the World’s Most Precious Manuscripts, by Joshua Hammer.

I’ll try to do a review when I’ve read it, but in the meantime, anything about bad-ass librarians needs to be shared with all the other bad-ass librarians out there.

Karen Coyle: The case of the disappearing classification

Wed, 2016-08-17 00:14
I'm starting some research into classification in libraries (now that I have more time due to having had to drop social media from my life; see previous post). The main question I want to answer is: why did research into classification drop off at around the same time that library catalogs computerized? This timing may just be coincidence, but I'm suspecting that it isn't.

 I was in library school in 1971-72, and then again in 1978-80. In 1971 I took the required classes of cataloging (two semesters), reference, children's librarianship, library management, and an elective in law librarianship. Those are the ones I remember. There was not a computer in the place, nor do I remember anyone mentioning them in relation to libraries. I was interested in classification theory, but not much was happening around that topic in the US. In England, the Classification Research Group was very active, with folks like D.J. Foskett and Brian Vickery as mainstays of thinking about faceted classification. I wrote my first published article about a faceted classification being used by a UN agency.[1]

 In 1978 the same school had only a few traditional classes. I'd been out of the country, so the change to me was abrupt. Students learned to catalog on OCLC. (We had typed cards!) I was hired as a TA to teach people how to use DIALOG for article searching, even though I'd never seen it used, myself. (I'd already had a job as a computer programmer, so it was easy to learn the rules of DIALOG searching.) The school was now teaching "information science". Here's what that consisted of at the time: research into term frequency of texts; recall and precision; relevance ranking; database development.

I didn't appreciate it at the time, but the school had some of the bigger names in these areas, including William Cooper and M. E. "Bill" Maron. (I only just today discovered why he called himself Bill - the M. E., which is what he wrote under in academia, stands for "Melvin Earl". Even for a nerdy computer scientist, that was too much nerdity.) 1978 was still the early days of computing, at least unless you were on a military project grant or worked for the US Census Bureau. The University of California, Berkeley, did not have visible Internet access. Access to OCLC or DIALOG was via dial-up to their proprietary networks. (I hope someone has or will write that early history of the OCLC network. For its time it must have been amazing.)

The idea that one could search actual text was exciting, but how best to do it was (and still is, to a large extent) unclear. There was one paper, although I so far have not found it, that was about relevance ranking, and was filled with mathematical formulas for calculating relevance. I was determined to understand it, and so I spent countless hours on that paper with a cheat sheet beside me so I could remember what uppercase italic R was as opposed to lower case script r. I made it through the paper to the very end, where the last paragraph read (as I recall): "Of course, there is no way to obtain a value for R[elevance], so this theory cannot be tested." I could have strangled the author (one of my profs) with my bare hands.

Looking at the articles, now, though, I see that they were prescient; or at least that they were working on the beginnings of things we now take for granted. One statement by Maron especially strikes me today:
A second objective of this paper is to show that about is, in fact, not the central concept in a theory of document retrieval. A document retrieval system ought to provide a ranked output (in response to a search query) not according to the degree that they are about the topic sought by the inquiring patron, but rather according to the probability that they will satisfy that person‘s information need. This paper shows how aboutness is related to probability of satisfaction.[2] This is from 1977, and it essentially describes the basic theory behind Google ranking. It doesn't anticipate hyperlinking, of course, but it does anticipate that "about" is not the main measure of what will satisfy a searcher's need. Classification, in the traditional sense, is the quintessence of about. Is this the crux of the issue? As yet, I don't know. More to come.

[1]Coyle, Karen (1975). "A Faceted Classification for Occupational Safety and Health". Special Libraries. 66 (5-6): 256–9.
[2]Maron, M. E. (1977) "On Indexing, Retrieval, and the Meaning of About". Journal of the American Society for Information Science, January, 1977, pp. 38-43

DuraSpace News: VIVO Updates for August 14–VIVO 1.9 Cheat Sheet, Conference, Survey

Wed, 2016-08-17 00:00

From Mike Conlon, VIVO project director

Karen Coyle: This is what sexism looks like: Wikipedia

Tue, 2016-08-16 21:00
We've all heard that there are gender problems on Wikipedia. Honestly there are a lot of problems on Wikipedia, but gender disparity is one of them. Like other areas of online life, on Wikipedia there are thinly disguised and not-so thinly disguised attacks on women. I am at the moment the victim of one of those attacks.

Wikipedia runs on a set of policies that are used to help make decisions about content and to govern behavior. In a sense, this is already a very male approach, as we know from studies of boys and girls at play: boys like a sturdy set of rules, and will spend considerable time arguing whether or not rules are being followed; girls begin play without establishing a set of rules, develop agreed rules as play goes on if needed, but spend little time on discussion of rules.

If you've been on Wikipedia and have read discussions around various articles, you know that there are members of the community that like to "wiki-lawyer" - who will spend hours arguing whether something is or is not within the rules. Clearly, coming to a conclusion is not what matters; this is blunt force, nearly content-less arguing. It eats up hours of time, and yet that is how some folks choose to spend their time. There are huge screaming fights that have virtually no real meaning; it's a kind of fantasy sport.

Wiki-lawyering is frequently used to harass. It is currently going on to an amazing extent in harassment of me, although since I'm not participating, it's even emptier. The trigger was that I sent back for editing two articles about men that two wikipedians thought should not have been sent back. Given that I have reviewed nearly 4000 articles, sending back 75% of those for more work, these two are obviously not significant. What is significant, of course, is that a woman has looked at an article about a man and said: "this doesn't cut it". And that is the crux of the matter, although the only person to see that is me. It is all being discussed as violations of policy, although there are none. But sexism, as with racism, homophobia, transphobia, etc., is almost never direct (and even when it is, it is often denied). Regulating what bathrooms a person can use, or denying same sex couples marriage, is a kind of lawyering around what the real problem is. The haters don't say "I hate transexuals" they just try to make them as miserable as possible by denying them basic comforts. In the past, and even the present, no one said "I don't want to hire women because I consider them inferior" they said "I can't hire women because they just get pregnant and leave."

Because wiki-lawyering is allowed, this kind of harassment is allowed. It's now gone on for two days and the level of discourse has gotten increasingly hysterical. Other than one statement in which I said I would not engage because the issue is not policy but sexism (which no one can engage with), it has all been between the wiki-lawyers, who are working up to a lynch mob. This is gamer-gate, in action, on Wikipedia.

It's too bad. I had hopes for Wikipedia. I may have to leave. But that means one less woman editing, and we were starting to gain some ground.

The best read on this topic, mainly about how hard it is to get information that is threatening to men (aka about women) into Wikipedia: WP:THREATENING2MEN: Misogynist Infopolitics and the Hegemony of the Asshole Consensus on English Wikipedia
I have left Wikipedia, and I also had to delete my Twitter account because they started up there. I may not be very responsive on other media for a while. Thanks to everyone who has shown support, but if by any chance you come across a kinder, gentler planet available for habitation, do let me know. This one's desirability quotient is dropping fast.

SearchHub: Lessons from Sharding Solr at Etsy

Tue, 2016-08-16 20:57

As we countdown to the annual Lucene/Solr Revolution conference in Boston this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Gregg Donovan’s session, “Lessons from Sharding Solr at Etsy”.

Gregg covers the following lessons learned at Etsy while sharding Solr: How to enable SolrJ to handle distributed search fanout and merge; How to instrument Solr for distributed tracing so that distributed searches may be better understood, analyzed, and debugged; Strategies for managing latency in distributed search, including tolerating partial results and issuing backup requests in the presence of lagging shards.

Gregg Donovan is a Senior Software Engineer at in Brooklyn, NY, working on the Solr and Lucene infrastructure that powers more than 120 million queries per day. Gregg spoke at Lucene/Solr Revolution 2015 in Austin, Lucene Revolution 2011 in San Francisco, Lucene Revolution 2013 in San Diego, and previously worked with Solr and Lucene at

Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy from Lucidworks

Join us at Lucene/Solr Revolution 2016, the biggest open source conference dedicated to Apache Lucene/Solr on October 11-14, 2016 in Boston, Massachusetts. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Lessons from Sharding Solr at Etsy appeared first on

Mita Williams: The Hashtag Syllabus: Part Three

Tue, 2016-08-16 19:25

In The Future of the Library: From Electric Media to Digital Media
Robert K. Logan and Marshall McLuhan, you can find this passage from Chapter 9: The Compact Library and Human Scale:

As an undergraduate at the University of Cambridge, I (McLuhan) encountered a library in the English Department that had immense advantages. I never have seen one like it since. It consisted of no more than 1,500 or 2,000 books. These books, however, were chosen from many fields of history and aesthetics, philosophy, anthropology, mathematics, and the sciences in general. The one criterion, which determined the presence of any book in this collection, was the immediate and top relevance for twentieth-century awareness.  The shelf-browser could tell at a glance exactly which poets, novelists, critics, painters, and which of their individual writings were indispensable for knowing “where it’s at….”

… The library of which I spoke existed in a corner of the English Faculty Library at Cambridge, but it enabled hundreds of students to share all the relevant poets, painters, critics, musicians, and scientists of that time as a basis for an ongoing dialog. Would it not be possible to have similar libraries created by other departments in the university? Could not the History Department indicate those areas of anthropology and sociology that were indispensable to the most advanced historical studies of the hour? Could not the Department of Philosophy pool its awareness of many fields in order to create a composite image of all the relevant of many fields in order to create a composite image of all the relevant speculation and discovery of our time? Only now have I begun to realize that this unique library represented the meeting of both a written and oral tradition at an ancient university. It is this figure-ground pattern of the written and the oral that completes the meaning of the book and the library.

McLuhan isn’t the first scholar to recognize that there is something feels fundamentally different between a library collection of material selected by librarians and a working collection of material selected by practitioners. While the ideal academic library is close at hand and contains a vast amount of material relevant to one’s interests, the ideal working library is compact and at ‘human-scale.’

It is as if there are two kinds of power at hand.

From Karen Coyle’s FRBR Before and After‘s chapter The Model [pdf]

Patrick Wilson’s Two Kinds of Power, published in 1968, and introduced in chapter 1, is a book that is often mentioned in library literature but whose message does not seem to have disseminated through library and cataloging thinking. If it had, our catalogs today might have a very different character. A professor of Library Science at the University of California at Berkeley, Wilson’s background was in philosophy, and his book took a distinctly philosophical approach to the question he posed, which most likely limited its effect on the practical world of librarianship. Because he approached his argument from all points of view, argued for and against, and did not derive any conclusions that could be implemented, there would need to be a rather long road from Wilson’s philosophy to actual cataloging code.

Wilson takes up the question of the goals of what he calls “bibliography,” albeit applied to the bibliographical function of the library catalog. The message in the book, as I read it, is fairly straightforward once all of Wilson’s points and counterpoints are contemplated. He begins by stating something that seems obvious but is also generally missing from cataloging theory, which is that people read for a purpose, and that they come to the library looking for the best text (Wilson limits his argument to texts) for their purpose. This user need was not included in Cutter’s description of the catalog as an “efficient instrument.” By Wilson’s definition, Cutter (and the international principles that followed) dealt only with one catalog function: “bibliographic control.” Wilson suggests that in fact there are two such functions, which he calls “powers”: the first is the evaluatively neutral description of books, which was first defined by Cutter and is the role of descriptive cataloging, called “bibliographic control”; the second is the appraisal of texts, which facilitates the exploitation of the texts by the reader. This has traditionally been limited to the realm of scholarly bibliography or of “recommender” services.

This definition pits the library catalog against the tradition of bibliography, the latter being an analysis of the resources on a topic, organized in terms of the potential exploitation of the text: general works, foundational works, or works organized by school of thought. These address what he sees as the user’s goal, which is “the ability to make the best use of a body of writings.” The second power is, in Wilson’s view, the superior capability. He describes descriptive control somewhat sarcastically as “an ability to line up a population of writings in any arbitrary order, and make the population march to one’s command” (Wilson 1968)

Karen goes on to write…

If one accepts Wilson’s statement that users wish to find the text that best suits their need, it would be hard to argue that libraries should not be trying to present the best texts to users. This, however, goes counter to the stated goal of the library catalog as that of bibliographic control, and when the topic of “best” is broached, one finds an element of neutrality fundamentalism that pervades some library thinking. This is of course irreconcilable with the fact that some of these same institutions pride themselves on their “readers’ services” that help readers find exactly the right book for them. The popularity of the readers’ advisory books of Nancy Pearl and social networks like Goodreads, where users share their evaluations of texts, show that there is a great interest on the part of library users and other readers to be pointed to “good books.” How users or reference librarians are supposed to identify the right books for them in a catalog that treats all resources neutrally is not addressed by cataloging theory.

I’m going copy and past that last sentence again for re-emphasis:

How users or reference librarians are supposed to identify the right books for them in a catalog that treats all resources neutrally is not addressed by cataloging theory.

As you can probably tell from my more recent posts and from my recent more readings, I’ve been delving deeper into the relationship between libraries and readers. To explain why this is necessary, I’ll end with another quotation from McLuhan:

The content of a library, paradoxically is not its books but its users, as a recent study of the use of campus libraries by university faculty revealed. It was found that the dominant criterion for selection of a library was the geographical proximity of the library to the professor’s office. The depth of the collection in the researcher’s field was not as important a criterion as convenience (Dougherty & Blomquist, 1971, pp. 64-65). The researcher was able to convert the nearest library into a research facility that met his needs. In other words, the content of this conveniently located facility was its user. Any library can be converted from the facility it was designed to be, into the facility the user wishes it to become. A library designed for research can be used for entertainment, and vice-versa. As we move into greater use of electronic media, the user of the library will change even more. As the user changes, so will the library’s content or the use to which the content of the library will be subjected. In other words, as the ground in which the library exists changes, so will the figure of the library. The nineteenth-century notion of the library storing basically twentieth-century material will have to cope with the needs of twenty-first century users.

This is the third part series called The Hashtag Syllabus. Part One is a brief examination of the recent phenomenon of generating and capturing crowdsourced syllabi on Twitter and Part Two is a technical description of how to use Zotero to collect and re-use bibliographies online.

District Dispatch: Video series makes the case for libraries

Tue, 2016-08-16 18:23

U.S. libraries—120,000 strong—represent a robust national infrastructure for advancing economic and educational opportunity for all. From pre-K early learning to computer coding to advanced research, our nation’s libraries develop and deliver impactful programs and services that meet community needs and advance national policy goals.

This message is one that our Washington Office staff bring to federal policymakers and legislators every day, and we know it’s one that library directors and trustees also are hitting home in communities across the country. With Library Card Sign-up Month almost upon us, a new series of short videos (1-2 minutes) can help make the case for libraries, including one featuring school principal Gwen Abraham highlighting the important role of public libraries in supporting education. “Keep the innovation coming. Our kids benefit from it, this will affect their futures, and this is really what we need to make sure our kids are prepared with 21st century skills.”

As the nation considers our vision for the future this election year and begins to plot actionable steps to achieve that vision, we offer The E’s of Libraries® as part of the solution. Education, Employment, Entrepreneurship, Empowerment and Engagement are hallmarks of America’s libraries—but may not be as obvious to decision makers, influencers, and potential partners.

“Cleveland Public Library, like many of our colleagues, is using video more and more to share our services with more people in an increasingly visual world,” said Public Library Association (PLA) President Felton Thomas. “I know this is a catalog we need to build, and I hope these diverse videos will be used in our social media, public presentations and outreach to better reflect today’s library services and resources.”

For Employment: “The library was not a place I thought of right away, but it turned out to be the best place for my job search,” says Mike Munoz about how library programs helped him secure a job in a new city after only four months.

For Entrepreneurship: “”Before I walked into the public library, I knew nothing about 3D printing,” says brewery owner John Fuduric, who used library resources to print unique beer taps for his business. “The library is a great resource, but with the technology, the possibilities are endless.”

And Kristin Warzocha, CEO of the Cleveland Food Bank, speaks to the power of partnerships to address community needs: “Hunger is everywhere, and families across our country are struggling. Libraries are ideal partners because libraries are everywhere, too. Being able to partner with libraries…is a wonderful win-win situation for us.” In dozens of communities nationwide, libraries are partnering to address food security concerns for youth as part of summer learning programs. In Cleveland, this partnership has expanded to afterschool programming and even “checking out” groceries at the library.

All of the videos are freely available from the PLA YouTube page, and I’d love to hear how you’re using the videos—or even developing videos of your own. I’m at Thanks!

The post Video series makes the case for libraries appeared first on District Dispatch.

Jez Cope: What happened to the original Software Carpentry?

Tue, 2016-08-16 17:03

“Software Carpentry was originally a competition to design new software tools, not a training course. The fact that you didn’t know that tells you how well it worked.”

When I read this in a recent post on Greg Wilson’s blog, I took it as a challenge. I actually do remember the competition, although looking at the dates it was long over by the time I found it.

I believe it did have impact; in fact, I still occasionally use one of the tools it produced, so Greg’s comment got me thinking: what happened to the other competition entries?

Working out what happened will need a bit of digging, as most of the relevant information is now only available on the Internet Archive. It certainly seems that by November 2008 the domain name had been allowed to lapse and had been replaced with a holding page by the registrar.

There were four categories in the competition, each representing a category of tool that the organisers thought could be improved:

  • SC Build: a build tool to replace make
  • SC Conf: a configuration management tool to replace autoconf and automake
  • SC Track: a bug tracking tool
  • SC Test: an easy to use testing framework

I’m hoping to be able to show that this work had a lot more impact than Greg is admitting here. I’ll keep you posted on what I find!

Islandora: Islandora 7.x-1.8 Release Team: You're Up!

Tue, 2016-08-16 15:42

It's that time again. Islandora has a twice-yearly release schedule, shooting to get a new version out at the end of April and October. We are now looking for volunteers to join the team for the October release of Islandora 7.x-1.8, under the guidance of Release Manager Danny Lamb.

Given how fortunate we have been to have so many volunteers on our last few releases, we are changing things up a little bit to improve the experience, both through consolidating our documentation and by adding a few new roles to the release:

  • Communications Manager - Works with the Release Manager to announce release timeline milestones to the community. Reminds volunteers of upcoming deadlines and unfinished tasks. Reports to the Release Manager.
  • Testing Manager - Oversees testing of the release and reports back to the Release Manager. Advises Testers on how to complete their tasks. Monitors testing status and reminds Testers to complete their tasks on time. Helps the Release Manager to assign testing tickets to Testers during the release.
  • Documentation Manager - Oversees documenting the release and reports back to the Release Manager. Advises Documenters on how to complete their tasks. Monitors testing status and reminds Documenters to complete their tasks on time.
  • Auditing Manager - Oversees audit of the release and reports back to the Release Manager. Advises Auditors on how to complete their tasks. Monitors testing status and reminds Auditors to complete their tasks on time.

If you have been a Tester, Documenter, or Auditor for a previous Islandora Release, please consider taking on a little more responsibility and being a mentor to new volunteers by managing a role!

These are in addition to our existing release roles:

  • Component Manager - Component Managers take responsibility for a single module or collection of modules, reviewing open pull requests and referring the results to the Release Manager. Outside of a release cycle, Component Managers act as Maintainer on their modules until the next release. Components with no Component Manager will not be included in the release.
  • Tester - Installing and running the latest Islandora release candidate and testing for bugs. No programming experience required! We are looking for people with a general familiarity with Islandora to try out the latest releases and put them through their paces to look for bugs and make suggestions. Any JIRA tickets marked “Ready for Test” for a given component will also be assigned to the designated tester for a component, along with instructions on how to test.
  • Documenter - Checking modules readme files and updating the Islandora Documentation Wiki to reflect new releases.
  • Auditor -  Each release we audit our README and LICENSE files. Auditors will be responsible for auditing a given component by verifying that these document are current and fit into their proper templates.

All of these roles are outlined here and details on exactly how to Audit, Test, and Document an Islandora release are listed here.


Why join the 7.x-1.8 Release Team?
  • Give back to Islandora. This project survives because of our volunteers. If you've been using Islandora and want to contribute back to the project, being a part of a Release Team is one of the most helpful commitments you can make.
  • There's a commitment to fit your skills and time. Do you have a strong grasp of the inner workings of a module and want to make sure bugs, improvements, and features are properly managed in its newest version? Be a Component Manager. Do you work with a module a lot as an end user and think you can break it? Be a Tester! Do you want to learn more about a module and need an excuse to take a deep dive? Be a Documenter! Do you have a busy few months coming up and can't give a lot of time to the Islandora release?  Be an Auditor (small time commitment - big help!). You can take on a single module or sign up for several. 
  • Credit. Part of my job as inaugural Communication Manager is to create Release Team pages on our documentation so that future users can know who helped to make the release a reality.
  • T-Shirts. Each member of an Islandora Release Team gets a t-shirt unique to that release. They really are quite nifty:

Tenative schedule for the release:

  • Code Freeze: Tuesday, September 5, 2016
  • Release Candidate: Monday, September 19, 2016
  • Release: Monday October 31, 2016

SearchHub: Solr as SparkSQL DataSource, Part II

Tue, 2016-08-16 15:13
Solr as a SparkSQL DataSource Part II

Co-authored with Kiran Chitturi, Lucidworks Data Engineer

Last August, we introduced you to Lucidworks’ spark-solr open source project for integrating Apache Spark and Apache Solr, see: Part I. To recap, we introduced Solr as a SparkSQL Data Source and focused mainly on read / query operations. In this posting, we show how to write data to Solr using the SparkSQL DataFrame API and also dig a little deeper into the advanced features of the library, such as data locality and streaming expression support.

Writing Data to Solr

For this posting, we’re going to use the Movielens 100K dataset found at: After downloading the zip file, extract it locally and take note of the directory, such as /tmp/ml-100k.

Setup Solr and Spark

Download Solr 6.x (6.1 is the latest at this time) and extract the archive to a directory, referred to as $SOLR_INSTALL hereafter. Start it in cloud mode by doing:

cd $SOLR_INSTALL bin/solr -cloud

Create some collections to host our movielens data:

bin/solr create -c movielens_ratings bin/solr create -c movielens_movies bin/solr create -c movielens_users

Also, make sure you’ve installed Apache Spark 1.6.2; see Spark’s getting started instructions for more details. Spark Documentation

Load Data using spark-shell

Start the spark-shell with the spark-solr JAR added to the classpath:

cd $SPARK_HOME ./bin/spark-shell --packages "com.lucidworks.spark:spark-solr:2.1.0"

Let’s load the movielens data into Solr using SparkSQL’s built-in support for reading CSV files. We provide the bulk of the loading code you need below, but you’ll need to specify a few environmental specific variables first. Specifically, declare the path to the directory where you extracted the movielens data, such as:

val dataDir = "/tmp/ml-100k"

Also, verify the zkhost val is set to the correct value for your Solr server.

val zkhost = "localhost:9983"

Next, type :paste into the spark shell so that you can paste in the following block of Scala:

sqlContext.udf.register("toInt", (str: String) => str.toInt) var userDF ="com.databricks.spark.csv") .option("delimiter","|").option("header", "false").load(s"${dataDir}/u.user") userDF.registerTempTable("user") userDF = sqlContext.sql("select C0 as user_id,toInt(C1) as age,C2 as gender,C3 as occupation,C4 as zip_code from user") var writeToSolrOpts = Map("zkhost" -> zkhost, "collection" -> "movielens_users") userDF.write.format("solr").options(writeToSolrOpts).save var itemDF ="com.databricks.spark.csv") .option("delimiter","|").option("header", "false").load(s"${dataDir}/u.item") itemDF.registerTempTable("item") val selectMoviesSQL = """ | SELECT C0 as movie_id, C1 as title, C1 as title_txt_en, | C2 as release_date, C3 as video_release_date, C4 as imdb_url, | C5 as genre_unknown, C6 as genre_action, C7 as genre_adventure, | C8 as genre_animation, C9 as genre_children, C10 as genre_comedy, | C11 as genre_crime, C12 as genre_documentary, C13 as genre_drama, | C14 as genre_fantasy, C15 as genre_filmnoir, C16 as genre_horror, | C17 as genre_musical, C18 as genre_mystery, C19 as genre_romance, | C20 as genre_scifi, C21 as genre_thriller, C22 as genre_war, | C23 as genre_western | FROM item """.stripMargin itemDF = sqlContext.sql(selectMoviesSQL) itemDF.registerTempTable("item") val concatGenreListSQL = """ | SELECT *, | concat(genre_unknown,genre_action,genre_adventure,genre_animation, | genre_children,genre_comedy,genre_crime,genre_documentary, | genre_drama,genre_fantasy,genre_filmnoir,genre_horror, | genre_musical,genre_mystery,genre_romance,genre_scifi, | genre_thriller,genre_war,genre_western) as genre_list | FROM item """.stripMargin itemDF = sqlContext.sql(concatGenreListSQL) // build a multi-valued string field of genres for each movie sqlContext.udf.register("genres", (genres: String) => { var list = scala.collection.mutable.ListBuffer.empty[String] var arr = genres.toCharArray val g = List("unknown","action","adventure","animation","children", "comedy","crime","documentary","drama","fantasy", "filmnoir","horror","musical","mystery","romance", "scifi","thriller","war","western") for (i <- arr.indices) { if (arr(i) == '1') list += g(i) } list }) itemDF.registerTempTable("item") itemDF = sqlContext.sql("select *, genres(genre_list) as genre from item") itemDF = itemDF.drop("genre_list") writeToSolrOpts = Map("zkhost" -> zkhost, "collection" -> "movielens_movies") itemDF.write.format("solr").options(writeToSolrOpts).save sqlContext.udf.register("secs2ts", (secs: Long) => new java.sql.Timestamp(secs*1000)) var ratingDF ="com.databricks.spark.csv") .option("delimiter","\t").option("header", "false").load(s"${dataDir}/") ratingDF.registerTempTable("rating") ratingDF = sqlContext.sql("select C0 as user_id, C1 as movie_id, toInt(C2) as rating, secs2ts(C3) as rating_timestamp from rating") ratingDF.printSchema writeToSolrOpts = Map("zkhost" -> zkhost, "collection" -> "movielens_ratings") ratingDF.write.format("solr").options(writeToSolrOpts).save

Hit ctrl-d to execute the Scala code in the paste block. There are a couple of interesting aspects of this code to notice. First, I’m using SQL to select and name the fields I want to insert into Solr from the DataFrame created from the CSV files. Moreover, I can use common SQL functions, such as CONCAT, to perform basic transformations on the data before inserting into Solr. I also use some user-defined functions (UDF) to perform custom transformations, such as collapsing the genre fields into a multi-valued string field that is more appropriate for faceting using a UDF named “genres”. In a nutshell, you have the full power for Scala and SQL to prepare data for indexing.

Also notice that I’m saving the data into three separate collections and not de-normalizing all this data into a single collection on the Solr side, as is common practice when building a search index. With SparkSQL and streaming expressions in Solr, we can quickly join across multiple collections, so we don’t have to de-normalize to support analytical questions we want to answer with this data set. Of course, it may still make sense to de-normalize to support fast Top-N queries where you can’t afford to perform joins in real-time, but for this blog post, it’s not needed. The key take-away here is that you now have more flexibility in joining across collections in Solr, as well as joining with other data sources using SparkSQL.

Notice how we’re writing the resulting DataFrames to Solr using code such as:

var writeToSolrOpts = Map("zkhost" -> zkhost, "collection" -> "movielens_users") userDF.write.format("solr").options(writeToSolrOpts).save

Behind the scenes, the spark-solr project uses the schema of the source DataFrame to define fields in Solr using the Schema API. Of course, if you have special needs for specific fields (i.e. custom text analysis), then you’ll need to predefine them before using Spark to insert rows into Solr.

This also assumes that auto soft-commits are enabled for your Solr collections. If auto soft-commits are not enabled, you can do that using the Solr Config API or just include the soft_commit_secs option when writing to Solr, such as:

var writeToSolrOpts = Map("zkhost" -> zkhost, "collection" -> "movielens_users", "soft_commit_secs" -> "10")

One caveat is if the schema of the DataFrame you’re indexing is not correct, then the spark-solr code will create the field in Solr with incorrect field type. For instance, I didn’t convert the rating field into a numeric type on my first iteration, which resulted in it getting indexed as a string. As a result, I was not able to perform any aggregations on the Solr side, such as computing the average rating of action movies for female reviewers in Boston. After correcting the issue on the Spark side, the field was already incorrectly defined in Solr, so I had to use the Solr Schema API to drop and re-create the field definition with the correct data type. The key take-away here is that seemingly minor data type issues in the source data can lead to confusing issues when working with the data in Solr.

In this example, we’re using Spark’s CSV DataSource, but you can write any DataFrame to Solr. This means that you can read data from any SparkSQL DataSource, such as Cassandra or MongoDB, and write to Solr using the same approach as what is shown here. You can even use SparkSQL as a more performant replacement of Solr’s Data Import Handler (DIH) for indexing data from an RDBMS; we show an example of this in the Performance section below.

Ok, so now you have some data loaded into Solr and everything setup correctly to query from Spark. Now let’s dig into some of the additional features of the spark-solr library that we didn’t cover in the previous blog post.

Analyzing Solr Data with Spark

Before you can analyze data in Solr, you need to load it into Spark as a DataFrame, which was covered in the first blog post in this series. Run the following code in the spark-shell to read the movielens data from Solr:

var ratings ="solr").options(Map("zkhost" -> zkhost, "collection" -> "movielens_ratings")).load ratings.printSchema ratings.registerTempTable("ratings") var users ="solr").options(Map("zkhost" -> zkhost, "collection" -> "movielens_users")).load users.printSchema users.registerTempTable("users") sqlContext.cacheTable("users") var movies ="solr").options(Map("zkhost" -> zkhost, "collection" -> "movielens_movies")).load movies.printSchema movies.registerTempTable("movies") sqlContext.cacheTable("movies") Joining Solr Data with SQL

Here is an example query you can send to Solr from the spark-shell to explore the dataset:

sqlContext.sql(""" | SELECT u.gender as gender, COUNT(*) as num_ratings, avg(r.rating) as avg_rating | FROM ratings r, users u, movies m | WHERE m.movie_id = r.movie_id | AND r.user_id = u.user_id | AND m.genre='romance' AND u.age > 30 | GROUP BY gender | ORDER BY num_ratings desc """.stripMargin).show

NOTE: You may notice a slight delay in executing this query as Spark needs to distribute the spark-solr library to the executor process(es).

In this query, we’re joining data from three different Solr collections and performing an aggregation on the result. To be clear, we’re loading the rows of all three Solr collections into Spark and then relying on Spark to perform the join and aggregation on the raw rows.

Solr 6.x also has the ability to execute basic SQL. But at the time of this writing, it doesn’t support a broad enough feature set to be generally useful as an analytics tool. However, you should think of SparkSQL and Solr’s parallel SQL engine as complementary technologies in that it is usually more efficient to push aggregation requests down into the engine where the data lives, especially when the aggregation can be computed using Solr’s faceting engine. For instance, consider the following SQL query that performs a join on the results of a sub-query that returns aggregated rows.

SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON solr.movie_id = m.movie_id ORDER BY aggCount DESC

It turns out that the sub-query aliased here as “solr” can be evaluated on the Solr side using the facet engine, which as we all know is one of the most powerful and mature features in Solr. The sub-query:

SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating='[4 TO *]' GROUP BY movie_id ORDER BY aggCount desc LIMIT 10

Is effectively the same as doing:

/select?q=*:* &fq=rating_i:[4 TO *] &facet=true &facet.limit=10 &facet.mincount=1 &facet.field=movie_id

Consequently, what would be nice is if the spark-solr library could detect when aggregations can be pushed down into Solr to avoid loading the raw rows into Spark. Unfortunately, this functionality is not yet supported by Spark, see: SPARK-12449. As that feature set evolves in Spark, we’ll add it to spark-solr. However, we’re also investigating using some of Spark’s experimental APIs to weave push-down optimizations into the query planning process, so stay tuned for updates on this soon. In the meantime, you can perform this optimization in your client application by detecting when sub-queries can be pushed down into Solr’s parallel SQL engine and then re-writing your queries to use the results of the push-down operation. We’ll leave it as an exercise for the user for now and move on to using streaming expressions from Spark.

Streaming Expressions

Streaming expressions are one of the more exciting features in Solr 6.x. We’ll refer you to the Solr Reference Guide for details about streaming expressions, but let’s take a look at an example showing how to use streaming expressions with Spark:

val streamingExpr = """ parallel(movielens_ratings, hashJoin( search(movielens_ratings, q="*:*", fl="movie_id,user_id,rating", sort="movie_id asc", qt="/export", partitionKeys="movie_id"), hashed=search(movielens_movies, q="*:*", fl="movie_id,title", sort="movie_id asc", qt="/export", partitionKeys="movie_id"), on="movie_id" ), workers="1", sort="movie_id asc" ) """ var opts = Map( "zkhost" -> zkhost, "collection" -> "movielens_ratings", "expr" -> streamingExpr ) var ratings ="solr").options(opts).load ratings.printSchema

Notice that instead of just reading all rows from the movielens_ratings collection, we’re asking the spark-solr framework to execute a streaming expression and then expose the results as a DataFrame. Specifically in this case, we’re asking Solr to perform a hashJoin of the movies collection with the ratings collection to generate a new relation that includes movie_id, title, user_id, and rating. Recall that a DataFrame is an RDD[Row] and a schema. The spark-solr framework handles turning a streaming expression into a SparkSQL schema automatically. Here’s another example the uses Solr’s facet/stats engine to compute the average rating per genre:

val facetExpr = """ facet(movielens_movies, q="*:*", buckets="genre", bucketSorts="count(*) desc", bucketSizeLimit=100, count(*)) """ val opts = Map( "zkhost" -> zkhost, "collection" -> "movielens_movies", "expr" -> facetExpr ) var genres ="solr").options(opts).load genres.printSchema

Unlike the previous SQL example, the aggregation is pushed down into Solr’s aggregation engine and only a small set of aggregated rows are returned to Spark. Smaller RDDs can be cached and broadcast around the Spark cluster to perform in-memory computations, such as joining to a larger dataset.

There are a few caveats to be aware of when using streaming expressions and spark-solr. First, until Solr 6.2 is released, you cannot use the export handler to retrieve timestamp or boolean fields, see SOLR-9187. In addition, we don’t currently support the gatherNodes stream source as it’s unclear how to map the graph-oriented results into a DataFrame, but we’re always interested in use cases where gatherNodes might be useful.

So now you have the full power of Solr’s query, facet, and streaming expression engines available to Spark. Next, let’s look at one more cool feature that opens up analytics on your Solr data to any JDBC compliant BI / dashboard tool.

Accessing Solr from Spark’s distributed SQL Engine and JDBC

Spark provides a thrift-based distributed SQL engine (built on HiveServer2) to allow client applications to execute SQL against Spark using JDBC. Since the spark-solr framework exposes Solr as a SparkSQL data source, you can easily execute queries using JDBC against Solr. Of course we’re aware that Solr provides its own JDBC driver now, but it’s based on the Solr SQL implementation, which as we’ve discussed is still maturing and does not provide the data type and analytic support needed by most applications.

First, you’ll need to start the thrift server with the –jars option to add the spark-solr shaded JAR to the classpath. In addition, we recommend running the thrift server with the following configuration option to allow multiple JDBC connections (such as those served from a connection pool) to share cached data and temporary tables:

--conf spark.sql.hive.thriftServer.singleSession=true

For example, here’s how I started the thrift server on my Mac.

sbin/ --master local[4] \ --jars spark-solr/target/spark-solr-2.1.0-shaded.jar \ --executor-memory 2g --conf spark.sql.hive.thriftServer.singleSession=true \ --conf spark.driver.extraJavaOptions="-Dsolr.zkhost=localhost:2181/solr610"

Notice I’m also using the spark.driver.extraJavaOptions config property to set the zkhost as a Java system property for the thrift server. This alleviates the need for client applications to pass in the zkhost as part of the options when loading the Solr data source.

Use the following SQL command to initialize the Solr data source to query the movielens_ratings collection:

CREATE TEMPORARY TABLE ratings USING solr OPTIONS ( collection "movielens_ratings" )

Note that the required zkhost property will be resolved from the Java System property I set when starting the thrift server above. We feel this is a better design in that your client application only needs to know the JDBC URL and not the Solr ZooKeeper connection string. Now you have a temporary table backed by the movielens_ratings collection in Solr that you can execute SQL statements against using Spark’s JDBC driver. Here’s some Java code that uses the JDBC API to connect to Spark’s distributed SQL engine and execute the same query we ran above from the spark-shell:

import java.sql.*; public class SparkJdbc { public static void main(String[] args) throws Exception { String driverName = "org.apache.hive.jdbc.HiveDriver"; String jdbcUrl = "jdbc:hive2://localhost:10000/default"; String jdbcUser = "???"; String jdbcPass = "???"; Class.forName(driverName); Connection conn = DriverManager.getConnection(jdbcUrl, jdbcUser, jdbcPass); Statement stmt = null; ResultSet rs = null; try { stmt = conn.createStatement(); stmt.execute("CREATE TEMPORARY TABLE movies USING solr OPTIONS (collection \"movielens_movies\")"); stmt.execute("CREATE TEMPORARY TABLE users USING solr OPTIONS (collection \"movielens_users\")"); stmt.execute("CREATE TEMPORARY TABLE ratings USING solr OPTIONS (collection \"movielens_ratings\")"); rs = stmt.executeQuery("SELECT u.gender as gender, COUNT(*) as num_ratings, avg(r.rating) as avg_rating "+ "FROM ratings r, users u, movies m WHERE m.movie_id = r.movie_id AND r.user_id = u.user_id AND m.genre='romance' "+ " AND u.age > 30 GROUP BY gender ORDER BY num_ratings desc"); int rows = 0; while ( { ++rows; // TODO: do something with each row } } finally { if (rs != null) rs.close(); if (stmt != null) stmt.close(); if (conn != null) conn.close(); } } } Data Locality

If the Spark executor and Solr replica live on the same physical host, SolrRDD provides faster query execution time using the Data Locality feature. During the partition creation, SolrRDD provides the placement preference option of running on the same node where the replica exists. This saves the overhead of sending the data across the network between different nodes.


Before we wrap up this blog post, we wanted to share our results from running a performance experiment to see how well this solution scales. Specifically, we wanted to measure the time taken to index data from Spark to Solr and also the time taken to query Solr from Spark using the NYC green taxi trip dataset between 2013-2015. The data is loaded onto an Postgres RDS instance in AWS. We used the Solr scale toolkit (solr-scale-tk) to deploy a 3-node Lucidworks Fusion cluster, which includes Apache Spark and Solr. More details are available at

  • 3 EC2 nodes of r3.2xlarge instances running Amazon Linux and deployed in the same placement group
  • Solr nodes and Spark worker processes are co-located together on the same host
  • Solr collection ‘nyc-taxi’ created with 6 shards (no replication)
  • Total number of rows ‘91748362’ in the database
Writing to Solr

The documents are loaded from the RDS instance and indexed to Solr using the spark-shell script. 91.49M rows are indexed to Solr in 49 minutes.

  • Docs per second: 31.1K
  • JDBC batch size: 5000
  • Solr indexing batch size: 50000
  • Partitions: 200
Reading from Solr

The full collection dump from Solr to Spark is performed in two ways. To be able to test the streaming expressions, we chose a simple query that only uses fields with docValues. The result set includes all the documents present in the ‘nyc-taxi’ collection (91.49M)

Deep paging with split queries using Cursor Marks
  • Docs per second (per task): 6350
  • Total time taken: 20 minutes
  • Partitions: 120
Streaming using the export handler
  • Docs per second (per task): 108.9k
  • Total time taken: 2.3 minutes
  • Partitions: 6

Full data dumps from Spark to Solr using the JDBC datasource is faster than the traditional DIH approach. Streaming using the export handler is ~10 times faster than the traditional deep paging. Using DocValues gives us this performance benefit.

We hope this post gave you some insights into using Apache Spark and Apache Solr for analytics. There are a number of other interesting features of the library that we could not cover in this post, so we encourage you to explore the code base in more detail: Also, if you’re interested in learning more about how to use Spark and Solr to address your big data needs, please attend Timothy Potter’s talk at this year’s Lucene Revolution: Your Big Data Stack is Too Big.

The post Solr as SparkSQL DataSource, Part II appeared first on