Planet Code4Lib


Terry Reese: MarcEdit Mac Update — OCLC Integrations

Tue, 2016-09-06 13:04

Happy belated Labor Day (to those in the US) and happy Tuesday to everyone else.  On my off day Monday, I did some walking (with the new puppy), a little bit of yard work, and some coding.  Over the past week, I’ve been putting the plumbing into the Mac version of MarcEdit to allow the tool to support the OCLC integrations — and today, I’m putting out the first version.

A couple of notes — the implementation is a little quirky at this point.  The process works much like the Windows/Linux version — once you get your API information from OCLC, you enter the info and validate it in the MarcEdit preferences.  Once validated, the OCLC options will become enabled and you can use the tools.  At this point, the integration work is limited to the OCLC records downloader and holdings processing tools.  Screenshots are below.

Now, the quirks – when you validate, the Get Codes button (to get the location codes) isn’t working for some reason.  I’ll correct this in the next couple of days.  The second quirk — after you validate your keys, you’ll need to close MarcEdit and open it again to enable the OCLC menu options.  The menu state isn’t refreshing after validation — again, I’ll fix this in the next couple of days.  However, I wanted to get this out to folks.

Longer term, the direct integration between WorldCat and the MarcEditor isn’t implemented yet.  I have most of the code written, but it isn’t complete.  Again, over this week, I hope to have time to get this done.

You can get the Mac update from:




SearchHub: Data Science for Dummies

Tue, 2016-09-06 12:04

Don’t get me wrong, Data Science is really cool – trust me. I say this (tongue in cheek of course) because most of the information that you see online about Data Science is (go figure) written by Data “Scientists”. And you can usually tell – they give you verbiage and mathematical formulas that only they understand so you have to trust them when they say things like – “it can easily be proven that …” – that a) it has been proven and b) that it is easy – for them, not for you. That’s not an explanation, it’s a cop out. But to quote Bill Murray’s line from Ghostbusters: “Back off man, I’m a Scientist”.

Science is complex and is difficult to write about in a way that is easy for non-scientists to understand. I have a lot of experience here. As a former scientist, I would have difficulty at parties with my wife’s friends because when they asked me what I do, I couldn’t really tell them the gory details without boring them to death (or so I thought) so I would give them a very high level version. My wife used to get angry with me for condescending to her friends when they wanted more details and I would demur – so OK, it’s hard and I punted – I told her that I was just trying to be nice – not curmudgeonly. But if we are smart enough to conceive of these techniques, we should be smart enough to explain them to dummies. Because if you can explain stuff to dummies, then you really understand it too.

Lucy – you’ve got lots of ‘splainin’ to do

Don’t you love it when an article starts out with seemingly English-like explanations (the promise) and quickly degenerates into data science mumbo jumbo? Your first clue that you are in for a rough ride is when you hit something like this – with lots of Greek letters and other funny-looking symbols:

and paragraphs like this:

“In ordinary language, the principle of maximum entropy can be said to express a claim of epistemic modesty, or of maximum ignorance. The selected distribution is the one that makes the least claim to being informed beyond the stated prior data, that is to say the one that admits the most ignorance beyond the stated prior data.” (from the Wikipedia article on the “Principle of Maximum Entropy”)

Huh? Phrases like “epistemic modesty” and “stated prior data” are not what I would call “ordinary language”. I’ll take a shot at putting this in plain English later when I discuss Information Theory. “Maximum Ignorance” is a good theme for this blog though – just kidding. But to be fair to the Wikipedia author(s) of the above, this article is really pretty good and if you can get through it – with maybe just a Tylenol or two or a beverage – you will have a good grasp of this important concept. Information Theory is really cool and is important for lots of things besides text analysis. NASA uses it to clean up satellite images, for example. Compression algorithms use it too. It is also important in Neuroscience – our sensory systems adhere to it big time by encoding only things that change – our brains fill in the rest, i.e. what can be predicted or inferred from the incoming data. More on this later.

After hitting stuff like above – since the authors assume that you are still with them – the rest of the article usually leaves you thinking more about pain killers than the subject matter that you are trying to understand. There is a gap between what they know and what you do and the article has not helped much because it is a big Catch-22 – you have to understand the math and their jargon before you can understand the article – and if you already had that understanding, you wouldn’t need to read the article. Jargon and complex formulas are appropriate when they are talking to their peers because communication is more efficient if you can assume prior subject mastery and you want to move the needle forward. When you try that with dummies you crash and burn. (OK, not dummies per se, just ignorant if that helps you feel better. As my friend and colleague Erick Erickson would say, we are all ignorant about something – and you guessed it, he’s a curmudgeon too.) The best way to read these articles is to ignore the math that you would need a refresher course in advanced college math to understand (assuming that you ever took one) and just trust that they know what they’re talking about. The problem is that this is really hard for programmers to do because they don’t like to feel like dummies (even though they’re not).

It reminds me of a joke that my dad used to tell about Professor Norbert Wiener of MIT. My dad went to MIT during WWII (he was 4-F because of a chemistry experiment gone wrong when he was 12 that took his left hand) – and during that time, many of the junior faculty had been drafted into the armed forces to work on things like radar, nuclear bombs and proximity fuses for anti-aircraft munitions. Wiener was a famous mathematician noted for his work in Cybernetics and non-linear mathematics. He was also the quintessential absent-minded professor. The story goes that he was recruited (most likely ordered by his department chairman) to teach a freshman math course. One day, a student raised his hand and said “Professor Wiener, I was having trouble with problem 4 on last night’s assignment. Could you go over problem 4 for us?” Wiener says, “Sure, can somebody give me the textbook?” So he reads the problem, thinks about it for a minute or two, turns around and writes a number on the board and says, “Is that the right answer?” The student replies, “Yes, Professor, that is what is in the answer section in the book, but how did you get it? How did you solve the problem?” Wiener replies, “Oh sorry, right.” He erases the number, looks at the book again, thinks for another minute or so, turns back to the board and writes the same number on the board again. “See”, he says triumphantly, “I did it a different way.”

The frustration that the students must have felt at that point is the same frustration that non-data scientists feel when encountering one of these “explanations”. So let’s do some ‘splainin’. I promise not to use jargon or complex formulas – just English.

OK dummies, so what is Data Science?

Generally speaking, data science is deriving some kind of meaning or insight from large amounts of data. Data can be textual, numerical, spatial, temporal or some combination of these. Two branches of mathematics that are used to do this magic are Probability Theory and Linear Algebra. Probability is about the chance or likelihood that some observation (like a word in a document) is related to some prediction, such as a document topic or a product recommendation, based on prior observations of huge numbers of user choices. Linear algebra is the field of mathematics that deals with systems of linear equations (remember in algebra class the problem of solving multiple equations with multiple unknowns? Yeah – much more of that). Linear algebra deals with things called Vectors and Matrices. A vector is a list of numbers. A vector in 2 dimensions is a point on a plane – a pair of numbers – it has an X and a Y coordinate and can be characterized by a length and angle from the origin – if that is too much math so far, then you really are an ignorant dummy. A matrix is a set of vectors – for example, a simultaneous equation problem can be expressed in matrix form by lining up the coefficients – math-eze for the constant numbers – so if one of the equations is 3x – 2y + z = 5, the coefficients are 3, -2, and 1 – which becomes a row in the coefficient matrix, or 3, -2, 1 and 5 for a row in the augmented or solution matrix. Each column in the matrix of equations is a vector consisting of the sets of x, y and z coefficients.
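
To make the vector and matrix idea concrete, here is a rough sketch in Python with NumPy (assuming NumPy is installed). The equation 3x – 2y + z = 5 comes from the paragraph above; the other two equations are invented purely so the system has something to solve:

    import numpy as np

    # Each row is one equation's coefficients; each column is the vector of
    # x, y, or z coefficients across all of the equations.
    A = np.array([[3.0, -2.0,  1.0],   # 3x - 2y + z = 5  (from the text)
                  [1.0,  1.0,  1.0],   # x + y + z = 6    (made up)
                  [2.0,  0.0, -1.0]])  # 2x - z = 1       (made up)
    b = np.array([5.0, 6.0, 1.0])

    # The library solves this the same way whether A is 3x3 or enormous,
    # which is the whole point of the next paragraph.
    print(np.linalg.solve(A, b))       # the [x, y, z] that satisfies all three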

Linear algebra typically deals with many dimensions that are difficult or impossible to visualize, but the cool thing is that the techniques that have been worked out can be used on any size vector or matrix. It gets much worse than this of course, but the bottom line is that it can deal with very large matrices and has techniques that can reduce these sets to equivalent forms that work no matter how many dimensions you throw at it – allowing you to solve systems of equations with thousands or millions of variables (or more). The types of data sets that we are dealing with are in this class. Pencil and paper won’t cut it here – or even a single computer – so this is where distributed or parallel analytic frameworks like Hadoop and Spark come in.

Mathematics only deals with numbers, so if a problem can be expressed numerically, you can apply these powerful techniques. This means that the same methods can be used to solve seemingly disparate problems, from semantic topic mapping and sentiment analysis of documents to recommendations of music recordings or movies based on similarity to tunes or flicks that the user likes – as exemplified by sites like Pandora and Netflix.

So for the text analytics problem, the first head scratcher is how to translate text into numbers. This one is pretty simple – just count words in documents and determine how often a given word occurs in each document. This is known as term frequency or TF. The number of documents in a group that contain the word is its document frequency or DF. The ratio of these two, TF/DF, or term frequency multiplied by inverse document frequency (1/DF), is a standard number known affectionately to search wonks as TF-IDF (yeah, jargon has a way of just weaseling its way in, don’t it?). I mention this because even dummies coming to this particular blog have probably heard this one before – Solr used to use it as its default way of calculating relevance. So now you know what TF-IDF means, but why is it used for relevance? (Note that as of Solr 6.0, TF-IDF is no longer the default relevance algorithm; it has been replaced by a more sophisticated method called BM25, which stands for “Best Match 25” – a name that gives absolutely no information at all – it still uses TF-IDF but adds some additional smarts.) The key is to understand why TF-IDF was used, so I’ll try to ‘splain that.
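
For the code-curious, here is a toy TF-IDF calculation in plain Python. It is only a sketch of the idea (Solr's real scoring has more moving parts), and the three little "documents" are invented:

    import math
    from collections import Counter

    docs = ["the cat sat on the mat".split(),
            "the dog chased the cat".split(),
            "the solr engine scores the documents".split()]

    def tf_idf(term, doc, docs):
        tf = Counter(doc)[term]                     # how often the term is in this doc
        df = sum(1 for d in docs if term in d)      # how many docs contain the term
        idf = math.log(len(docs) / df) if df else 0.0
        return tf * idf

    print(tf_idf("the", docs[0], docs))   # in every doc, so IDF is 0 and the score is 0
    print(tf_idf("solr", docs[2], docs))  # rare keyword, so it gets the highest score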

Some terms have more meaning or give you more information about what a document is about than others – in other words, they are better predictors of what the topic is. If a term is important to a topic, it stands to reason that it will be used more than once in documents about that topic, maybe dozens of times (high TF) but it won’t be often used in documents that are not about the subject or topic to which it is important – in other words, it’s a keyword for that subject. Because it is relatively rare in the overall set of documents it will have low DF so the inverse 1/DF or IDF will be high. Multiplying these two (and maybe taking the logarithm just for kicks ‘cause that’s what mathematicians like to do) – will yield a high value for TF-IDF. There are other words that will have high TF too, but these tend to be common words that will also have high DF (hence low IDF). Very common words are manually pushed out of the way by making them stop words. So the relevance formula tends to favor more important subject words over common or noise words. A classic wheat vs chaff problem.

So getting back to our data science problem, the reasoning above is the core concept that these methods use to associate topics or subjects to documents. What these methods try to do is to use probability theory and pattern detection using linear algebraic methods to ferret out the salient words in documents that can be used to best predict their subject areas. Once we know what these keywords are, we can use them to detect or predict the subject areas of new documents (our test set). In order to keep up with changing lingo or jargon, this process needs to be repeated from time to time.

There are two main ways that this is done. The first way, called “supervised learning”, uses statistical correlation – a subject matter expert (SME) selects some documents called “training documents” and labels them with a subject. The software then learns to associate or correlate the terms in the training set with the topic that the human expert has assigned to them. Other topic training sets are also in the mix here so we can differentiate their keywords. The second way is called “unsupervised learning” because there are no training sets. The software must do all the work. These methods are also called “clustering” methods because they attempt to find similarities between documents and then label them based on the shared verbiage that caused them to cluster together.
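
Here is a hedged sketch of both flavors using scikit-learn (assuming it is installed); the tiny training texts and labels are invented, and the vectorizing step it leans on is exactly the document-term matrix described in the next paragraph:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.cluster import KMeans

    # Supervised: an SME has already labeled a few training documents.
    train_docs = ["stocks fell on weak earnings",
                  "the team won the championship game",
                  "the central bank raised interest rates",
                  "the star striker scored twice"]
    labels = ["finance", "sports", "finance", "sports"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(train_docs)
    model = MultinomialNB().fit(X, labels)
    print(model.predict(vec.transform(["earnings and interest forecasts"])))

    # Unsupervised: no labels at all, just let similar documents cluster.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(clusters)   # which cluster each training document landed in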

In both cases, the documents are first turned into a set of vectors or a matrix in which each word in the entire document set is replaced by its term frequency in each document, or zero if the document does not have the word. Each document is a row in the matrix and each column has the TF’s for all of the documents for a given word, or to be more precise, token (because some terms like BM25 are not words). Now we have numbers that we can apply linear algebraic techniques to. In supervised learning, the math is designed to find the terms that have a higher probability of being found in the training set docs for that topic than in other training sets. Two pieces of probability theory that come into play are Bayes’ Theorem, which deals with conditional probabilities, and Information Theory. A conditional probability is like the probability that you are a moron if you text while driving (pretty high it turns out – and would be a good source of Darwin awards except for the innocent people that also suffer from this lunacy). In our case, the game is to compute the conditional probabilities of finding a given word or token in a document and the probability that the document is about the topic we are interested in – so we are looking for terms with a high conditional probability for a given topic – aka the key terms.
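
If you want to see what that matrix actually looks like, here is a quick sketch with scikit-learn (a recent version is assumed for get_feature_names_out; the documents are the same toy ones as before):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat",
            "the dog chased the cat",
            "the solr engine scores the documents"]

    vec = CountVectorizer()
    dtm = vec.fit_transform(docs)        # rows = documents, columns = tokens

    print(vec.get_feature_names_out())   # which token each column stands for
    print(dtm.toarray())                 # raw term frequencies, zeros and all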

Bayes’ Theorem states that the probability of event A given that event B has occurred – p(A|B) – is equal to the probability that B has occurred given A – p(B|A) – times the overall probability that A can happen – p(A) – divided by the overall probability that B can happen – p(B). So if A is the probability that a document is about some topic and B is the probability that a term occurs in the document, terms that are common in documents about topic A but rare otherwise will be good predictors. So if we have a random sampling of documents that SMEs have classified (our training set), the probability of topic A is the fraction of documents classified as A. p(B|A) is the frequency of a term in these documents and p(B) is the term’s frequency in the entire training set. Keywords tend to occur together, so even if any one keyword does not occur in all documents classified the same way (maybe because of synonyms), it may be that documents about topic A are the only ones (or nearly) that contain two or three of these keywords. So, the more keywords that we can identify, the better our coverage of that topic will be and the better chance we have of correctly classifying all of them (what we call recall). If this explanation is not enough, check out this really good article on Bayes’ Theorem – it’s also written for dummies.
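
Here is the arithmetic with completely made-up numbers, just to show the shape of the calculation:

    # A = "the document is about topic A", B = "the document contains the term".
    p_A = 0.2          # 20% of the training set was labeled topic A by the SMEs
    p_B_given_A = 0.6  # the term shows up in 60% of the topic-A documents
    p_B = 0.15         # but in only 15% of the training set overall

    p_A_given_B = p_B_given_A * p_A / p_B   # Bayes' Theorem
    print(p_A_given_B)                      # 0.8: seeing the term makes topic A very likely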

Information Theory looks at it a little differently. It says that the more rare an event is, the more information it conveys – keywords have higher information than stop words in our case. But other words are rare too – esoteric words that authors like to use to impress you with their command of the vocabulary but are not subject keywords. So there are noise words that are rare and thus harder to filter out. But noise words of any kind tend to have what Information Theorists call high Entropy – they go one way and then the other – whereas keywords have low entropy – they consistently point in the same direction – i.e., give the same information. You may remember Entropy from chemistry or physics class – or maybe not if you were an Arts major who was in college to “find” yourself. In Physics, entropy is the measure of disorder or randomness – the Second Law of Thermodynamics states that all systems tend to a maximum state of disorder – like my daughter’s room. So noise is randomness, both in messy rooms and in text analytics. Noise words can occur in any document regardless of context, and in both cases, they make stuff harder to find (like a matching pair of socks in my daughter’s case). Another source of noise is what linguists call polysemous words – words with multiple meanings in different contexts – like ‘apple’ – is it a fruit or a computer? (That’s an example of the author showing off that I was talking about earlier – ‘polysemous’ has way high IDF, so you probably don’t need to add it to your everyday vocabulary. Just use it when you want people to think that you are smart.) Polysemous words also have higher entropy because they simultaneously belong to multiple topics and therefore convey less information about each one – i.e., they are ambiguous.
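
Shannon's entropy formula is short enough to show in full. The probabilities below are invented, but they illustrate why a consistent keyword carries a clearer signal than an ambiguous word like 'apple':

    import math

    def entropy(probs):
        # Shannon entropy in bits: low = consistent signal, high = noise/ambiguity.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # A keyword that nearly always points at one topic: low entropy.
    print(entropy([0.95, 0.05]))   # about 0.29 bits
    # A polysemous word split evenly between two topics: maximum entropy.
    print(entropy([0.5, 0.5]))     # 1.0 bit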

The more documents that you have to work with, the better chance you have to detect the words with the most information and the least entropy. Mathematicians like Claude Shannon, who invented Information Theory, have worked out precise mathematical methods to calculate these values. But there is always uncertainty left after doing this, so the Principle of Maximum Entropy says that you should choose a classification model that constrains the predictions by what you know, but hedges its bets, so to speak, to account for the remaining uncertainties. The more you know – i.e., the more evidence you have – the lower the maximum entropy you need to allow to be fair to your “prior stated data”, and the more accurate the model becomes. In other words, the “maximum ignorance” that you should admit to decreases (of course, you don’t have to admit it – especially if you are a politician – you just should). OK, these are topics for advanced dummies – this is a beginners’ class. The key here is that complex math has awesome power – you don’t need to fully understand it to appreciate what it can do – just hire Data Scientists to do the heavy lifting – hopefully ones that also minored in English Lit so you can talk to them (yes, a not so subtle plug for a well rounded Liberal Arts Education!). And on your end of the communication channel, as Shannon would say – knowing a little math and science can’t hurt.

In unsupervised learning, since both the topics and keywords are not known up front, they have to be guessed at, so we need some kind of random or ‘stochastic’ process to choose them (like Dirichlet Allocation, for example, or a Markov Chain). After each random choice or “delta”, we see if it gives us a better result – a subset of words that produce good clustering – that is, small sets of documents that are similar to each other but noticeably different from other sets. Similarity between the documents is determined by some distance measure. Distance in this case is like the distance between points in the 2-D example I gave above, but this time in the high-dimensional space of the matrix – but it’s the same math and is known as Euclidean distance. Another similarity measure just focuses on the angles between the vectors and is known as Cosine Similarity. Using linear algebra, when we have a vector of any dimension we can compute distances or angles in what we call a “hyper-dimensional vector space” (cool huh? – just take away ‘dimensional’ and ‘vector’ and you get ‘hyperspace’. Now all we need is Warp Drive). So what we need to do is to separate out the words that maximize the distance or angle between clusters (the keywords) from the words that tend to pull the clusters closer together or obfuscate the angles (the noise words).
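
Both measures are a few lines of NumPy. The two "documents" here are just made-up term-frequency vectors over a four-word vocabulary, but the same code works unchanged in thousands of dimensions:

    import numpy as np

    doc_a = np.array([3.0, 0.0, 1.0, 2.0])
    doc_b = np.array([2.0, 1.0, 0.0, 2.0])

    euclidean = np.linalg.norm(doc_a - doc_b)   # straight-line distance
    cosine = doc_a.dot(doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))

    print(euclidean)   # smaller = closer together in the vector space
    print(cosine)      # 1.0 = pointing the same way, 0.0 = nothing in common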

The next problem is that since we are guessing, we will get some answers that are better than others: as we take words in and out of our guessed keywords list, the results get better and then get worse again, so we go back to the better alternative. (This is what Newton’s method does to find things like square roots.) The rub here is that if we had kept going, the answer might have gotten better again, so what we stumbled on is what is known as a local maximum. So we have to shake it up and keep going till we can find the maximum maximum out there – the global maximum. So we do this process a number of times until we can’t get any better results (the answers “converge”), then we stop. As you can imagine, all of this takes a lot of number crunching, which is why we need powerful distributed processing frameworks such as Spark or Hadoop to get it down to HTT – human tolerable time.
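
In practice the "shake it up and try again" step is often just a parameter. Here is a sketch with scikit-learn's KMeans (the blobs of fake document vectors are generated purely for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Fake "document vectors": two fuzzy blobs standing in for two topics.
    X = np.vstack([rng.normal(0.0, 1.0, (50, 20)),
                   rng.normal(3.0, 1.0, (50, 20))])

    # Each run stops when the cluster assignments converge; n_init=25 repeats
    # the whole process from 25 random starts and keeps the best result,
    # which is how we avoid settling for a merely local maximum.
    km = KMeans(n_clusters=2, n_init=25, random_state=0).fit(X)
    print(km.inertia_)   # the score of the winning run (lower is better)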

The end results are pretty good – often exceeding 95% accuracy as judged by human experts (but can we assume that the experts are 100% accurate – what about human error? Sorry about that, I’m a curmudgeon you know – computers as of yet have no sense of humor so you can’t tease them very effectively). But because of this pesky noise problem, machine learning algorithms sometimes make silly mistakes – so we can bet that the Data Scientists will keep their jobs for a while. And seriously, I hope that this attempt at ‘splainin’ what they do entices you dummies to hire more of them. It’s really good stuff – trust me.


Cynthia Ng: Presentation: Making Web Services Accessible for Everyone

Tue, 2016-09-06 04:25
This morning, I did another presentation for the Florida Libraries Webinars group. The first time was focused on web content, while this time was focused on overall design and structure. To be honest, there isn’t really anything new in this presentation. It’s very much a compilation of my LibTechConf presentation, Access presentation, and a …

John Miedema: The beginning is near …

Mon, 2016-09-05 16:06

After Reading will begin soon.

District Dispatch: ALA makes recommendations to FCC on digital inclusion plan

Mon, 2016-09-05 07:37

FCC Commissioners

As part of its modernization of the Lifeline program in March, the Federal Communications Commission (FCC) charged its Consumer and Government Affairs Bureau (CGB) with developing a digital inclusion plan that addresses broadband adoption issues. Yesterday the ALA filed a letter with the Commission with recommendations for the plan.

ALA called on the Commission to address non-price barriers to broadband adoption by:

  • Using its bully pulpit to increase public awareness about the need for and economic value of broadband adoption; highlight effective adoption efforts; and recognize and promote digital literacy providers like libraries to funders and state and local government authorities that can help sustain and grow efforts by these providers.
  • Expanding consumer information, outreach and education that support broadband adoption—both through the FCC’s own website and materials and by effectively leveraging aligned government (e.g., the National Telecommunications and Information Administration’s Broadband Adoption Toolkit) and trusted noncommercial resources (e.g., EveryoneOn, Digital IQ).
  • Encouraging and guiding Eligible Telecommunications Carriers (ETCs) in the Lifeline program to support broadband adoption efforts through libraries, schools and other trusted local entities.
  • Building and strengthening collaborations with other federal agencies, including the Institute of Museum and Library Services, the National Telecommunications and Information Administration, and the Department of Education.
  • Convening diverse stakeholders (non-profit, private and government agencies, representatives from underserved communities, ISPs, and funders) to review and activate the Commission’s digital inclusion plan.
  • Regional field hearings should also be held to extend the conversation and connect digital inclusion partners beyond the Beltway. There should be mechanisms for public comment and refinement of the plan (e.g., public notice or notice of inquiry).
  • Improving data gathering and research to better understand gaps and measure progress over time.
  • Exploring how the Universal Service Fund and/or merger obligations can be leveraged to address non-price barriers to broadband adoption. Sustainable funding to support and expand broadband adoption efforts and digital literacy training is a challenge, particularly in light of the need for one-on-one help and customized learning in many cases.

The letter builds on past conversations with CGB staff at the 2016 ALA Annual Conference and the last Schools, Health and Libraries Broadband (SHLB) Coalition Conference. The CGB is due to submit the plan to the Commission before the end of the year.

The post ALA makes recommendations to FCC on digital inclusion plan appeared first on District Dispatch.

Mark Matienzo: A Push-to-Talk Conference Call Foot Pedal

Sun, 2016-09-04 18:16

My current position at DPLA, especially since we are a remote-first organization, requires me to be on lots of conference calls, both video and audio. While I’ve learned the value of staying muted while I’m not talking, there are a couple of things that make this challenging. First, I usually need the window for the call to have focus to unmute myself with the platform’s designated keystroke. Forget about that working well if you need to bring something up in another window, or switch to another application. Secondly, while we have our own preferred platform internally (Google Hangouts), I have to use countless others, too; each of those platforms has its own separate keystroke to mute.

This all leads to a less than ideal situation, and naturally, I figured there must be a better way. I knew that some folks have used inexpensive USB footpedals for things like Teamspeak, but that ran into the issue where a keystroke would only be bound to a specific application. Nonetheless, I went ahead and bought a cheap PCSensor footswitch sold under another label from an online retailer. The PCSensor footswitches are programmable, but the software that ships with them is Windows-only. However, I also found a command-line tool for programming the switches.

After doing some digging, I came across an application for Mac OS X called Shush, which provides both push-to-talk and push-to-silence modes, which are activated by a keystroke. Once installed, I bound Shush to the Fn keystroke, which would allow me to activate push-to-talk even if I didn’t have the pedal plugged in. However, I couldn’t get the pedal to send the Fn keystroke alone since it’s a modifier key. As a workaround, I put together a device-specific configuration for Karabiner, a flexible and powerful tool to configure input devices for Mac OS. By default, the pedal sends the keycode for b, and the configuration rebinds b for an input device matching the USB vendor and product IDs for the pedal to Fn.

Since I bought and set up my first pedal, I’ve gotten used to using it to quickly mute and unmute myself, making my participation in conference calls much smoother than it was previously. I’ve also just replaced my first pedal, which broke suddenly, with a nearly identical one, but I might make the switch to a more durable version. My Karabiner configuration is available as a gist for your use - I hope this helps you as much as it helped me!

Equinox Software: Then you win

Fri, 2016-09-02 18:44

Here we are, ten years after the world started using Evergreen, twelve years after the first line of code was written.  Gandhi may not have coined the phrase, but whoever did was right: first they ignore you, then they laugh at you, then they fight you, then you win.

Evergreen won, in many senses, a long time ago.  If nothing else, these recent reflections on the last ten years have reminded us that winning takes many forms.  Growing from a single-site effort into a self-sustaining project with a positive, inviting community is certainly one type of win.  Outlasting the doubt, the growing pains, the FUD, which Evergreen has, is another.  Then, when you get past all that, most of your detractors become your imitators.  It’s interesting how that happens.  And, we can call that a win as well.

There has been, in just the last couple years, much talk in the ILS industry of openness.  Essentially every ILS available in the last two decades has been built on the shoulders of Open Source projects to a greater or lesser degree: the Apache web server; the Perl and PHP languages; the Postgres and MySQL databases.  The truth is that while most always have been, it’s only recently become fashionable to admit the fact.  Then, of course, there are the claims of “open APIs” that aren’t.  Perhaps most interesting, though not at all surprising, is the recent claim by a proprietary vendor that it was taking “the unprecedented step” of trying to “make open source a viable alternative for libraries.”  Some seem to have missed the last ten or fifteen years of technology in libraries.

And so here we are, ten years later.  Evergreen is in use by some 1,500 libraries, serving millions of patrons.  Other ILS’s, new and old, want to do what Evergreen has managed to do.  We say more power to them.  We hope they succeed.  The Evergreen community, and our colleagues here at Equinox, have helped pave the way for libraries to define their own future.  That was the promise of Evergreen twelve years ago, and that was the goal of the first, small team on Labor Day weekend ten years ago.  We’ve won, and delivered on that promise.

Now it’s time to see what the next ten years will bring.  We’re ready, are you?

— Mike Rylander and Grace Dunbar

This is the twelfth and final post in our series leading up to Evergreen’s Tenth birthday.  Thank you to all those who have been following along!

SearchHub: Solr at Scale for Time-Oriented Data

Fri, 2016-09-02 17:58

As we countdown to the annual Lucene/Solr Revolution conference in Boston this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Brett Hoerner’s talk, “Solr at Scale for Time-Oriented Data”.

This talk will go over tricks used to index, prune, and search over large (>10 billion docs) collections of time-oriented data, and how to migrate collections when inevitable changes are required.

Brett Hoerner lives in Austin, TX and is an Infrastructure Engineer at Rocana where they are helping clients control their global-scale modern infrastructure using big data and machine learning techniques. He began using SolrCloud in 2012 to index the Twitter firehose at Spredfast, where the collection eventually grew to contain over 150 billion documents. He is primarily interested in the performance and operation of distributed systems at scale.

Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana from Lucidworks

Join us at Lucene/Solr Revolution 2016, the biggest open source conference dedicated to Apache Lucene/Solr on October 11-14, 2016 in Boston, Massachusetts. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


Dan Scott: Google Scholar's broken Recaptcha hurts libraries and their users

Fri, 2016-09-02 17:35

For libraries, proxying user requests is how we provide authenticated access--and some level of anonymized access--to almost all of our licensed resources. Proxying Google Scholar in the past would direct traffic through a campus IP address, which prompted Scholar to automatically include links to the licensed content that we had told it about. It seemed like a win-win situation: we would drive traffic en masse to Google Scholar, while anonymizing our user's individual queries, and enabling them swift access to our library's licensed content as well as all the open access content that Google knows about.

However, in the past few months things changed. Now when Google Scholar detects proxied access it tries to throw up a Recaptcha test--which would be an okay-ish speed bump, except it uses a key for a domain ( of course) which doesn't match the proxied domain and thus dies with a JS exception, preventing any access. That doesn't help our users at all, and it hurts Google too because those users don't get to search and generate anonymized academic search data for them.

Folks on the EZProxy mailing list have tried a few different recipes to try to evade the Recaptcha but that seems doomed to failure.

If we don't proxy these requests, then every user would need to set their preferred library (via the Library Links setting) to include convenient access to all of our licensed content. But that setting can be hard to find, and relies on cookies, so behaviour can be inconsistent as they move from browser to browser (as happens in universities with computer labs and loaner laptops). And then the whole privacy thing is lost.

On the bright side, I think a link like makes it a tiny bit easier to help users set their preferred library in the unproxied world. So we can include that in our documentation about Google Scholar and get our users a little closer to off-campus functionality.

But I really wish that Google would either fix their Recaptcha API key domain-based authentication so it could handle proxied requests, or recognize that the proxy is part of the same set of campus IP addresses that we've identified as having access to our licensed resources in Library Links and just turn off the Recaptcha altogether.

Open Knowledge Foundation: What skills do you need to become a data-driven storyteller? Join a week-long data journalism training #ddjcamp

Fri, 2016-09-02 12:05

This post is written by Anastasia Valeeva, data journalist and member of Open Knowledge Russia, one of the local groups in the Open Knowledge Network.

European Youth Press is organising a week-long intensive training on data journalism funded by Erasmus+. It is aimed at young journalists, developers and human rights activists from 11 countries: Czech Republic, Germany, Belgium, Italy, Sweden, Armenia, Ukraine, Montenegro, Slovakia, Denmark or Latvia.

Together we will explore how to create a data-driven story step-by-step.

If you have always wanted to learn more about what it means to be a data-driven storyteller, then this is an opportunity not to miss! Our course was designed with wanna-be data journalists in mind and for people who have been following others’ work in this area but are looking to learn more about actually making a story themselves.

Modern journalism now requires considerable cross-border communication and collaboration. Examples like TheMigrantsFiles and the Panama Papers are impressive; but how do you become part of such teams?

In this course, we are combining best practices of learning and doing.

You will have classes and workshops along the data pipeline: where to get the data, what to do to make it ‘clean’, and how to find a story in the data. In parallel to the training, you will work in teams and produce a real story that will be published in the national media of one of the participating countries.

The general topic chosen for all the stories produced is migration/refugees. Data journalism has a reputation for being a more objective kind of journalism, as opposed to ‘he said – she said’ narratives. However, there is still great potential to explore data-driven stories about migrants and the effects of migration around the world.

Praising refugee hunters as national heroes; violence targeting international journalists and migrants; sentimental pleas with a short-lived effect – those are a few examples of media coverage of the refugee crisis. The backlash so far to these narratives has mostly been further distrust in the media. What are the ways out of it?

We want to produce more data-driven balanced stories on migrants. For this training, we are inviting prominent researchers and experts in the field of migration. They will help us with relevant datasets and knowledge. We will not fix the world, but we can make a little change together.

So, if you are between 18 and 30 years old and come from Czech Republic, Germany, Belgium, Italy, Sweden, Armenia, Ukraine, Montenegro, Slovakia, Denmark or Latvia, don’t wait – apply now (deadline is 11 Sept):

David Rosenthal: CrossRef on "fake DOIs"

Thu, 2016-09-01 21:33
I should have pointed to Geoff Bilder's post to the CrossRef blog, DOI-like strings and fake DOIs when it appeared at the end of June. It responds to the phenomena described in Eric Hellman's Wiley's Fake Journal of Constructive Metaphysics and the War on Automated Downloading, to which I linked in the comments on Improving e-Journal Ingest (among other things). Details below the fold.

Geoff distinguishes between "DOI-like strings" and "fake DOIs", presenting three ways DOI-like strings have been (ab)used:
  • As internal identifiers. Many publishing platforms use the DOI they're eventually going to register as their internal identifier for the article. Typically it appears in the URL at which it is eventually published. The problem is that: the unregistered DOI-like strings for unpublished (e.g. under review or rejected manuscripts) content ‘escape’ into the public as well. People attempting to use these DOI-like strings get understandably confused and angry when they don’t resolve or otherwise work as DOIs. Platforms should use internal IDs that can't be mistaken for external IDs, because they can't guarantee that the internal ones won't leak.
  • As spider- or crawler-traps. This is the usage that Eric Hellman identified. Strings that look like DOIs but are not even intended to eventually be registered but which have bad effects when resolved: When a spider/bot trap includes a DOI-like string, then we have seen some particularly pernicious problems as they can trip-up legitimate tools and activities as well. For example, a bibliographic management browser plugin might automatically extract DOIs and retrieve metadata on pages visited by a researcher. If the plugin were to pick up one of these spider traps DOI-like strings, it might inadvertently trigger the researcher being blocked- or worse- the researcher’s entire university being blocked. In the past, this has even been a problem for Crossref itself. We periodically run tools to test DOI resolution and to ensure that our members are properly displaying DOIs, CrossMarks, and metadata as per their member obligations. We’ve occasionally been blocked when we ran across the spider traps as well. Sites using these kinds of crawler traps should expect a lot of annoyed customers whose legitimate operations caused them to be blocked.
  • As proxy bait. These unregistered DOI-like strings can be fed to systems such as Sci-Hub in an attempt to detect proxies. If they are generated afresh on each attempt, the attacker knows that Sci-Hub does not have the content. So it will try to fetch it using a proxy or other technique. The fetch request will be routed via the proxy to the publisher, who will recognize the DOI-like string, know where the proxy is located and can take action, such as blocking the institution: In theory this technique never exposes the DOI-like strings to the public and automated tools should not be able to stumble upon them. However, recently one of our members had some of these DOI-like strings “escape” into the public and at least one of them was indexed by Google. The problem was compounded because people clicking on these DOI-like strings sometimes ended up having their university’s IP address banned from the member’s web site. ... We think this just underscores how hard it is to ensure DOI-like strings remain private and why we recommend our members not use them. As we see every day, designing computer systems that in the real world never leak information is way beyond the state of the art.
And Bilder explains why what CrossRef means by "fake DOIs" isn't what the general public means by "fake DOIs", how to spot them, and why they are harmless.
The following is what we have sometimes called a “fake DOI”


It is registered with Crossref, resolves to a fake article in a fake journal called The Journal of Psychoceramics (the study of Cracked Pots) run by a fictitious author (Josiah Carberry) who has a fake ORCID ( but who is affiliated with a real university (Brown University).

Again, you can try it.

And you can even look up metadata for it.

Our dirty little secret is that this “fake DOI” was registered and is controlled by Crossref. These “starting with 5” DOIs are used by Crossref to test their systems. They too can confuse legitimate software, but the bad effects of the confusion are limited. And now that the secret is out, legitimate software can know to ignore them, and thus avoid the confusion.

FOSS4Lib Upcoming Events: Islandoracon 2017

Thu, 2016-09-01 20:49
Date: Sunday, May 14, 2017 - 08:45 to Friday, May 19, 2017 - 16:45
Supports: Islandora

Last updated September 1, 2016. Created by Peter Murray on September 1, 2016.

The Islandora Foundation is thrilled to announce the second Islandoracon, to be held at the lovely LIUNA Station in Hamilton, Ontario from May 15 - 19, 2017. The conference schedule will take place over five days, including a day of post-conference sessions and a full-day Hackfest.

FOSS4Lib Recent Releases: Fedora Repository - 4.6.0

Thu, 2016-09-01 20:38

Last updated September 1, 2016. Created by Peter Murray on September 1, 2016.

Package: Fedora Repository
Release Date: Wednesday, August 31, 2016

Open Knowledge Foundation: Freedom to control MyData: Access to personal data as a step towards solving wider social issues.

Thu, 2016-09-01 17:30

This piece is part of a series of posts from MyData 2016 – an international conference that focuses on human centric personal information management. The conference is co-hosted by the Open Knowledge Finland chapter of the Open Knowledge International Network.

Song lyrics: Pharrell Williams “Freedom”; Image Pixabay CC0


Indeed, the theme of MyData so far is freedom. The freedom to own our data. Freedom however, is a very complicated subject that has been subjected to so many arguments, interpretations and even wars. I will avoid the more complicated philosophy and dive instead into a more daily life example. In the pop song quoted above, freedom can be understood as being carefree – “Who cares what they see and what they know?” Taking it to the MyData context, are we granting freedom to others to do whatever they want with our data and information because we trust them, or just because we don’t care?

MyData speakers have looked at the issue of freedom from a different angle. Taavi Kotka, Estonia’s CIO, claims that the fifth freedom of the EU should be the freedom of data. People, explains Kotka, should have the choice of what can be done with their data. They should know and understand the possibilities that sharing the data can bring (for example, better and easier services across the EU countries), and the threats that this can entail, like misuses of their data. For that we need pioneer regulators. For that we need the private sector and civil society to apply pressure and showcase what we can do with data, and to drive change accordingly.

…thinking outside of the box can help governments to move forward and at the end of the day, to supply better services for citizens

This shift in regulations and thinking should also be accepted by government. It was refreshing to hear the Finnish Minister of Transport and Communications, Anne Berner, say that governments should not be afraid of disruption, but should accept disruption and be disruptive themselves. MyData is disruptive in the sense that it is challenging the norms of current data storage and use, and thinking outside of the box can help governments to move forward and, at the end of the day, to supply better services for citizens.

Another topic that has been raised repeatedly is the digital self and the idea that data is a stepping stone to a better society. The question, then, is whether we need to understand our private data in order to build a good society. Maybe understanding data is not a good enough end goal? Maybe a better framing would be to create information and knowledge from the data? I was excited to see a project that can help consumers to evaluate and decide who to trust: Ranking Digital Rights. Ranking Digital Rights looks at big tech corporations and ranks their public commitments and disclosed policies affecting users’ freedom of expression and privacy. This is a very good tool for discussion and advocacy on these topics.

Ranking Digital Rights looks at big tech corporations and ranks their public commitments and disclosed policies affecting users’ freedom of expression and privacy.

To return to the question of openness: does freedom of data mean open data? The closed system does not allow us to access our own data. We can’t get insights. How do we create different models to get there?

And I think this is where I enjoy this conference the most – the variety of people. In the last two years I have been in many open data conferences, but the business community side of these events has been very limited, or at least for me, not appealing. Here at MyData, there are tracks for many different stakeholders – from insurance firms to banks, from health to education. I have met people who see the MyData initiative not only as a moral thing to do, but also as an opportunity to innovate and create trust with users. Trust, as I am rediscovering, is key for growth. Ignoring the mistrust of users can lead to a broken market. More than trust, I was happy to see people who are trying to influence their companies not only to go the MyData way, but also to open relevant data from their companies to the public, so we can work on and maybe solve social issues. Seeing the two go hand-in-hand is great, and I am looking forward to more conversations like these.

Tomorrow, Rufus Pollock, our president and Open Knowledge International’s co-founder, is going to speak about how we can collaborate with others for a better future. You can catch him at 9.30 Helsinki time on Here is a preview for his talk tomorrow: We want openness for public data – public information that could be made available to anyone. And we want access for every person to their own personal data…both are about empowering people to access information.

-Rufus Pollock

Terry Reese: MarcEdit Update Notes

Thu, 2016-09-01 13:46

Posted Sept. 1, this update resolves a couple of issues.  Particularly:


* Bug Fix: Custom Field Sorting: Fields without the sort field may drop the LDR.  This has been corrected.
* Bug Fix: OCLC Integration: regression introduced with the engine changes when dealing with diacritics.  This has been corrected.
* Bug Fix: MSI Installer: AUTOUPDATE switch wasn’t being respected.  This has been corrected.
* Enhancement: MARCEngine: Tweaked the transformation code to provide better support for older processing statements.


* Bug Fix: Custom Field Sorting: Fields without the sort field may drop the LDR.  This has been corrected.
* Bug Fix: MARCEngine: Regression introduced with the last update that caused one of the streaming functions to lose encoding information.  This has been corrected.
* Enhancement: MARCEngine: Tweaked the transformation code to provide better support for older processing statements.

Special Notes:

I’ll be adding a knowledge-base article, but I updated the windows MSI to fix the admin command-line added to allow administrators to turn off the auto-update feature.  Here’s an example of how this works: >>MarcEdit_Setup64.msi /qn AUTOUPDATE=no

I don’t believe the AUTOUPDATE key is case sensitive – but the documented use pattern is upper-case, and that is what I’ll test against going forward.

Downloads are available via the downloads page:


Equinox Software: Evergreen 2016: Symbiosis

Thu, 2016-09-01 13:18

Photo by Erica Rohlfs

When I hear the term “Evergreen,” it immediately evokes images of nature’s symbiotic relationships – Bald eagles nesting in coniferous trees, lady slipper orchids thriving in soil nutrients typically found beneath conifers and hemlocks, pollinators and mammals relying on evergreens for food and, in return, helping to redistribute seeds. There is also a complex network of dialogues being exchanged throughout these evergreen forests.

During the past decade, I have been very blessed to hold multiple discussions with people about Evergreen, and it’s not surprising that the continued theme from my fellow coworkers’ blog posts is the emphasis on community. Community grants opportunities and a feeling of personal ownership (how awesome is it that non-proprietary software helps to promote a sense of ownership?). Community also helps to foster symbiotic and sustainable relationships. Relationships that are rooted in dialog.

In February 2007, as a reference and genealogy librarian at a rural public library, I held my first conversations with both librarians and patrons about their Evergreen user experiences. Fast forwarding to August 2016, I still treasure every conversation that I have with librarians about their needs, expectations, and experiences. With each library migration, it is an honor and a humbling experience to hear about the librarians’ current workflows and needs. These user needs are constantly being met with each passing version of Evergreen.

For some, those needs may appear simple – I was so excited by the Update Expire Date button! – or more complex, like the intricate gears that make meta-record level holds possible. One of the strongest examples of community dialog and symbiosis is the continued refinement of the Acquisitions module.

I couldn’t possibly describe all of the awesomeness that I have observed over the past 10 years or narrow it down to a single special moment; there’s just too much. Each patron, library staff member, consortia member, volunteer, contributor, developer, support person, and data analyst (did I forget anyone?) contributes to Evergreen’s complex web of communication and overall sustainability. I can say that I know how fortunate I am, as a Project Manager, to see the forest for the trees and to know that the Evergreen Community’s roots are growing stronger with each passing year.

This is the eleventh in our series of posts leading up to Evergreen’s Tenth birthday.

Dan Scott: PHP's File_MARC gets a new release (1.1.3)

Thu, 2016-09-01 04:00

Yesterday, just one day before the anniversary of the 1.1.2 release, I published the 1.1.3 release of the PEAR File_MARC library. The only change is the addition of a convenience method for fields called getContents() that simply concatenates all of the subfields together in order, with an optional separator string. Many thanks to Carsten Klee for contributing the code!

You can install File_MARC through the usual channels: PEAR or composer. Have fun!