
Terry Reese: MarcEdit 7 is Here!

planet code4lib - Tue, 2017-11-28 09:00

After 9 months of development, hundreds of thousands of lines of changed code, and 3 months of beta testing during which tens of millions of records were processed using MarcEdit 7, the tool is finally ready. Will you occasionally run into issues? Possibly; any time this much code has changed, I’d say there is a distinct possibility. But I believe (and hope) that the program has been extensively vetted and is ready to move into production. So, what’s changed? A lot. Here’s a short list of the highlights:

  • Native Clustering – MarcEdit implements the Levenshtein Distance and Composite Coefficient matching equations to provide built-in clustering functionality. This will let you group fields and perform batch edits across like items. In many ways, it’s a lightweight implementation of OpenRefine’s clustering functionality designed specifically for MARC data. Already, I’ve used this tool to cluster data sets of over 700,000 records, and I believe 1 million to 1.5 million records could be processed with acceptable performance using this method.
  • Smart XML Profiling – A new XML/JSON profiler has been added to MarcEdit that removes the need to know XSLT, XQuery or any other Xlanguage. The tool uses an internal markup language that you create through a GUI based mapper that looks and functions like the Delimited Text Translator. The tool was designed to lower barriers and make data transformations more accessible to users.
  • Speaking of accessibility, I spent over 3 months researching fonts, sizes, and color options – leading to the development of a new UI engine. This enabled the creation of themes (and a theme creator), the identification of free fonts (and a way to download and embed them for use directly in MarcEdit without the need for administrator rights), and a wide range of other accessibility and keyboard options.
  • New versions – MarcEdit is now available as 4 downloads: two that require administrative access and two that can be installed by anyone. This should greatly simplify management of the application.
  • Tasks have been supercharged. Tasks that could take close to 8 hours in MarcEdit 6.x can now process in 10-20 minutes. New task functions have been added, tasks have been extended, and more functions can be added to tasks.
  • Linked data tools have been expanded. From the new SPARQL tools to the updated linked data platform, the resource has been updated to support better and faster linked data work. Coming in the near future will be direct support for HDT and linked data fragments.
  • A new installation wizard was implemented to make installation easier (and more fun). Users follow Hazel, the setup agent, as she guides them through the setup process.
  • Languages – MarcEdit’s interface has been translated into 26+ languages
  • .NET Language update – this seems like a small thing, but it enabled many of the design changes
  • MarcEdit 7 no longer supports Windows XP
  • Consolidated and improved Z39.50/SRU Client
  • Enhanced COM support, with legacy COM namespaces preserved for backward compatibility
  • RDA Refinements
  • Improved Error Handling and expanded knowledge-base
  • A new Search box feature to help users find help
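For a sense of what the clustering feature above is computing: Levenshtein Distance is the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one string into another. A minimal illustrative implementation in Python (a sketch of the metric itself, not MarcEdit's actual code):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to transform string a into string b."""
    prev = list(range(len(b) + 1))          # distances for empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```

A clustering tool groups field values whose pairwise distance falls under some threshold, so near-duplicate headings (off by a typo or punctuation mark) land in the same bucket for batch editing.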

With these new updates, I’ve updated the MarcEdit Website and am in the process of bringing new documentation online. Presently, the biggest changes to the website can be seen on the downloads page. Rather than offering users four downloads, the webpage provides a guided user experience. Go to the downloads page, and you will find:

When a user clicks on the link for the 64-bit version, for example, the following modal window is presented:

Hopefully this will help, because for the lion’s share of MarcEdit’s user community, the non-Administrator download is the version they should use. This version simplifies program management, sandboxes the application, and can be managed by any user. The goal of this new downloads page is to make the process of selecting your version of MarcEdit easier to understand and to empower users to make the best decision for their needs.

Additionally, as part of the update process, I needed to update the MarcEdit MSI Cleaner. This file was updated to support MarcEdit 7’s GUID keys created on installation. And finally, the program was developed so that it could be installed and used side by side with MarcEdit 6.x. The hope is that users will be able to move to MarcEdit 7 as their schedules allow, while still keeping MarcEdit 6.x until they are comfortable with the process and able to uninstall the application.

Lastly, this update is seeing the largest single creation of new documentation in the application’s history. It will start showing up throughout the week as I continue to wrap up documentation and add new information about the program. This update has been a long time coming, and I will be posting a number of tidbits throughout the week as I complete the documentation. My hope is that the wait will have been worth it, and that users will find the new version, its new features, and the improved performance useful within their workflows.

The new version of MarcEdit can be downloaded from:

As always, if you have questions, let me know.


Terry Reese: MarcEdit MSI Cleaner Changes for MarcEdit 7

planet code4lib - Tue, 2017-11-28 05:35

The MarcEdit MSI cleaner was created to help fix problems that would occasionally happen when using the Windows Installer. Sometimes problems happen, and when they do, it becomes impossible to install or update MarcEdit. From a programming perspective, MarcEdit 7 is much easier to manage (I’ve removed all data from the GAC (global assembly cache) and limited data outside of the user data space), but problems could still occur that prevent the program from being updated. When that happens, this tool can be used to remove the registry keys that are preventing the program from updating/reinstalling.

In working on the update for this tool, there were a couple significant changes made:

  1. I removed the requirement that you be an administrator in order to run the tool. You will still need to be an administrator to make changes, but users can now run the application to see if the cleaner would likely solve their problem.
  2. Updated UI – I updated the UI so that you will know that this tool has been changed to support MarcEdit 7.
  3. The application has been signed with a security certificate and is now identified as a trusted program.

I’ve posted information about the update here:

If you have questions, let me know.


Open Knowledge Foundation: Paradise Lost: a data-driven report into who should be on the EU’s tax haven blacklist

planet code4lib - Mon, 2017-11-27 23:15

Open Knowledge International coordinates the Open Data for Tax Justice project with the Tax Justice Network, working to create a global network of people and organisations using open data to improve advocacy, journalism and public policy around tax justice. Today, in partnership with the Tax Justice Network, we are publishing Paradise lost, a data-driven investigation looking at who should feature on the forthcoming European Union blacklist of non-cooperative tax jurisdictions.

This blogpost has been reposted from the Tax Justice Network site.

UK financial services could face heavy penalties after Brexit unless the UK turns its back on its aggressive tax haven policies, a new study from the Tax Justice Network has found.

The study looks at the new proposals for a EU blacklist of tax havens. The EU has said that it will impose sanctions on countries which find themselves on the list.

The EU is due to publish its list of tax havens on 5 December, but analysis by the Tax Justice Network, using the EU’s own criteria, suggests that the UK and 5 other EU countries would find themselves on that list if the criteria were applied equally to them.

The EU has stated that the list will only apply to non-EU member states and with the UK set to leave the Union in 2019, this could mean sanctions being applied to the UK unless it changes course and ends policies which make the country a tax haven. This includes the UK’s controlled foreign companies rules which allow UK based multinationals to hoard foreign profits in zero-tax countries. The UK’s CFC rules are currently the subject of a ‘state aid’ investigation by the European Commission.

The study also finds that up to 60 jurisdictions could be listed under the EU’s blacklist procedure if the criteria are strictly applied. But there are fears that intense lobbying by the tax avoidance industry and some countries could see the list severely watered down. In addition to the UK, researchers have identified Ireland, the Netherlands, Cyprus, Luxembourg and Malta as countries that should be included in the blacklist if the EU were to apply its own criteria to member states.

Alex Cobham, chief executive of the Tax Justice Network, said: “The EU list is a step forward from previous efforts, in that the criteria are at least partially transparent and so partially objectively verifiable. It’s clear that the UK is at real risk of counter-measures from the EU. The government should forget any Brexit tax haven threat, which could easily open up EU counter-measures like the loss of financial passporting rights, and instead address the flaws in current policies.

“But it’s also true that the EU list lacks legitimacy as well as transparency, because it will not apply to its own members. For global progress on a truly level playing field, we need the G77 and G20 to engage in negotiating an international convention that would set an ambitious and fair set of minimum standards, with powerful sanctions, to which all jurisdictions would be held equally to account.”

Wouter Lips of Ghent University, who led the study, said: “The EU blacklist represents an important step forward from empty OECD lists – but the criteria need to be completely transparent to be objectively verifiable by independent researchers. Otherwise the risk remains that political power will be a factor in whether countries like the UK or indeed the USA are listed. Many of the 60 countries that are listed are lower-income countries which had no say in the OECD standards upon which much of the EU’s criteria depend, and that raises further questions of fairness.”

Please email if you’d like to be added to the project mailing list or want to join the Open Data for Tax Justice network. You can also follow the #OD4TJ hashtag on Twitter for updates.

Lucidworks: Caching, and Filters, and Post-Filters, Oh My!

planet code4lib - Mon, 2017-11-27 17:10

A while back, I joined the #solr IRC channel in the middle of a conversation about Solr’s queryResultCache & filterCache. The first message I saw was…

< victori:#solr> anyway, are filter queries applied independently on the full dataset or one after another on a shrinking resultset?

As with many things in life, the answer is “It Depends”

In particular, the answer to this question largely depends on:

… but further nuances come into play depending on:

  • The effective cost param specified on each fq (defaults to ‘0’ for most queries)
  • The type of the underlying Query object created for each fq: Do any implement the PostFilter API?

As I explained some of these nuances on IRC (and what they change about the behavior) I realized 2 things:

  • This would make a really great blog post!
  • I wish there were a way to demonstrate how all this happens, rather than just describe it.

That led me to wonder if it would be possible to create a “Tracing Wrapper Query Parser” people could use to get log messages showing exactly when a given Query (or more specifically, the “Scorer” for that Query) was asked to evaluate each document. With something like this, people could experiment (on small datasets) with different q & fq params and different cache and cost local params and see how the execution changes. I made a brief attempt at building this kind of QParser wrapper, and quickly got bogged down in lots of headaches with how complex the general purpose Query, Weight, and Scorer APIs can be at the lower level.

On the other hand: the ValueSource (aka Function) API is much simpler, and easily facilitates composing functions around other functions. Solr also already makes it easy to use any ValueSource as a Query via the {!frange} QParser — which just so happens to also support the PostFilter API!

A few hours later, the “TraceValueSource” and trace() function syntax were born, and now I can use it to walk you through the various nuances of how Solr executes different Queries & Filters.


In this article, we’re going to assume that the underlying logic Lucene uses to execute a simple Query is essentially: loop over all docIds in the index (starting at 0) testing each one against the Query; if a document matches, record its score and continue with the next docId in the index.

Likewise we’re going to assume that when Lucene is computing the Conjunction (X ^ Y) of two Queries, the logic is essentially:

  • Loop over all docIds in the index (starting at 0) testing each one against X until we find a matching document
  • if that document also matches Y then record its score and continue with the next docId
  • If the document does not match Y, swap X & Y, and start the process over with the next docId

These are both extreme oversimplifications of how most queries are actually executed — many Term & Points based queries are much more optimized to “skip ahead” in the list of documents based on the term/points metadata — but it is a “close enough” approximation to what happens when all Queries are ValueSource based for our purposes today.
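The conjunction loop described above can be sketched in a few lines. This is a toy Python model of that logic only (Lucene's real conjunction scoring is considerably more optimized), but it reproduces the lead-confirm-swap behavior just described:

```python
def conjunction_matches(num_docs, x, y):
    """Toy model of conjunction (X ^ Y) matching: scan docIds in order,
    lead with clause x, confirm with clause y, and swap the clauses
    whenever the confirming clause rules a candidate out."""
    hits = []
    doc = 0
    while doc < num_docs:
        if not x(doc):          # leading clause keeps scanning
            doc += 1
            continue
        if y(doc):              # confirmed by the other clause
            hits.append(doc)
        else:                   # y ruled it out; let y lead next time
            x, y = y, x
        doc += 1
    return hits
```

With `x` matching even docIds and `y` matching multiples of three over ten documents, only docs 0 and 6 satisfy both clauses.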

{!frange} Queries and the trace() Function

Let’s start with a really simple warm up to introduce you to the {!frange} QParser and the trace() function I added, beginning with some trivial sample data…

$ bin/solr -e schemaless -noprompt
...
curl -H 'Content-Type: application/json' 'http://localhost:8983/solr/gettingstarted/update?commit=true' --data-binary '
[{"id": "A", "foo_i": 42, "bar_i": 99},
 {"id": "B", "foo_i": -42, "bar_i": 75},
 {"id": "C", "foo_i": -7, "bar_i": 1000},
 {"id": "D", "foo_i": 7, "bar_i": 50}]'
...
tail -f example/schemaless/logs/solr.log
...

For most of this blog I’ll be executing queries against these 4 documents, while showing you:

  • The full request URL
  • Key url-decoded request params in the request for easier reading
  • All log messages written to solr.log as a result of the request

The {!frange} parser allows users to specify an arbitrary function (aka: ValueSource) that will be wrapped up into a query that will match documents if and only if the results of that function fall in a specified range. For example: with the 4 sample documents we’ve indexed above, the query below does not match document ‘A’ or ‘C’ because the sum of the foo_i + bar_i fields (42 + 99 = 141 and -7 + 1000 = 993 respectively) does not fall between the lower & upper range limits of the query (0 <= sum(foo_i,bar_i) <= 100) …

http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}sum%28foo_i,bar_i%29

// q = {!frange l=0 u=100}sum(foo_i,bar_i)

{ "response":{"numFound":2,"start":0,"docs":[
      { "id":"B"},
      { "id":"D"}]
}}

INFO - 2017-11-14 20:27:06.897; [ x:gettingstarted] org.apache.solr.core.SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}sum(foo_i,bar_i)&omitHeader=true&fl=id} hits=2 status=0 QTime=29
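Stripped of the Solr machinery, an {!frange} query reduces to an inclusive range test over a value computed per document. A sketch of that predicate in Python (illustrative only, not Solr's implementation), applied to the sums for the four sample documents:

```python
def frange_matches(value, lower=None, upper=None):
    """True when the computed function value exists (is not None) and
    falls within the inclusive l/u bounds of an {!frange} query."""
    if value is None:
        return False
    return ((lower is None or lower <= value) and
            (upper is None or value <= upper))

# sum(foo_i, bar_i) for the four sample documents indexed above
sums = {"A": 42 + 99, "B": -42 + 75, "C": -7 + 1000, "D": 7 + 50}
hits = [doc for doc, s in sums.items() if frange_matches(s, lower=0, upper=100)]
```

With the sample data, `hits` comes out as `["B", "D"]`, agreeing with the numFound=2 response above.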

Under the covers, the Scorer for the FunctionRangeQuery produced by this parser loops over each document in the index and asks the ValueSource if it “exists” for that document (ie: do the underlying fields exist) and if so then it asks for the computed value for that document.

Generally speaking, the trace() function we’re going to use, implements the ValueSource API in such a way that any time it’s asked for the “value” of a document, it delegates to another ValueSource, and logs a message about the input (document id) and the result — along with a configurable label.
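Conceptually, the wrapper is nothing more than delegation plus logging. A stripped-down Python analogue (the real TraceValueSource is Java code written against Lucene's ValueSource API; the class and method names here are invented for illustration):

```python
class TracingValueSource:
    """Delegate every value lookup to a wrapped function, recording a
    log line with a configurable label, the document, and the result."""

    def __init__(self, label, delegate, log):
        self.label = label          # e.g. "simple_sum"
        self.delegate = delegate    # the wrapped value function
        self.log = log              # list standing in for solr.log

    def float_val(self, doc_id, unique_key):
        result = self.delegate(doc_id)
        self.log.append('%s: floatVal(#%d: "%s") -> %s'
                        % (self.label, doc_id, unique_key, result))
        return result
```

Wrapping the sum function this way and asking for doc #0 ("A") would both return 141.0 and append a `simple_sum: floatVal(#0: "A") -> 141.0` line to the log, mirroring the trace output shown below.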

If we change the function used in our previous query to be trace(simple_sum,sum(foo_i,bar_i)) and re-execute it, we can see the individual methods called on the “sum” ValueSource in this process (along with the internal id + uniqueKey of the document, and the “simple_sum” label we’ve chosen) and the result of the wrapped function …

http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29

// q = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))

TraceValueSource$TracerValues; simple_sum: exists(#0: "A") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#0: "A") -> 141.0
TraceValueSource$TracerValues; simple_sum: exists(#1: "B") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#1: "B") -> 33.0
TraceValueSource$TracerValues; simple_sum: exists(#2: "C") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#2: "C") -> 993.0
TraceValueSource$TracerValues; simple_sum: exists(#3: "D") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#3: "D") -> 57.0
SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id} hits=2 status=0 QTime=6

Because we’re using the _default Solr configs, this query has now been cached in the queryResultCache. If we re-execute it no new “tracing” information will be logged, because Solr doesn’t need to evaluate the ValueSource against each of the documents in the index in order to respond to the request…

http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29

// q = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))

SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id} hits=2 status=0 QTime=0

Normal fq Processing

Now let’s use multiple {!frange} & trace() combinations to look at what happens when we have some filter queries in our request…

http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29&fq={!frange%20l=0}trace%28pos_foo,foo_i%29&fq={!frange%20u=90}trace%28low_bar,bar_i%29

// q = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))
// fq = {!frange l=0}trace(pos_foo,foo_i)
// fq = {!frange u=90}trace(low_bar,bar_i)

TraceValueSource$TracerValues; pos_foo: exists(#0: "A") -> true
TraceValueSource$TracerValues; pos_foo: floatVal(#0: "A") -> 42.0
TraceValueSource$TracerValues; pos_foo: exists(#1: "B") -> true
TraceValueSource$TracerValues; pos_foo: floatVal(#1: "B") -> -42.0
TraceValueSource$TracerValues; pos_foo: exists(#2: "C") -> true
TraceValueSource$TracerValues; pos_foo: floatVal(#2: "C") -> -7.0
TraceValueSource$TracerValues; pos_foo: exists(#3: "D") -> true
TraceValueSource$TracerValues; pos_foo: floatVal(#3: "D") -> 7.0
TraceValueSource$TracerValues; low_bar: exists(#0: "A") -> true
TraceValueSource$TracerValues; low_bar: floatVal(#0: "A") -> 99.0
TraceValueSource$TracerValues; low_bar: exists(#1: "B") -> true
TraceValueSource$TracerValues; low_bar: floatVal(#1: "B") -> 75.0
TraceValueSource$TracerValues; low_bar: exists(#2: "C") -> true
TraceValueSource$TracerValues; low_bar: floatVal(#2: "C") -> 1000.0
TraceValueSource$TracerValues; low_bar: exists(#3: "D") -> true
TraceValueSource$TracerValues; low_bar: floatVal(#3: "D") -> 50.0
TraceValueSource$TracerValues; simple_sum: exists(#3: "D") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#3: "D") -> 57.0
SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id&fq={!frange+l%3D0}trace(pos_foo,foo_i)&fq={!frange+u%3D90}trace(low_bar,bar_i)} hits=1 status=0 QTime=23

There’s a lot of information here to consider, so let’s break it down and discuss in the order of the log messages…

  • In order to cache the individual fq queries for maximum possible re-use, Solr executes each fq query independently against the entire index:
    • First the “pos_foo” function is run against all 4 documents to identify if 0 <= foo_i
      • this resulting DocSet is put into the filterCache for this fq
    • then the “low_bar” function is run against all 4 documents to see if bar_i <= 90
      • this resulting DocSet is put into the filterCache for this fq
  • Now the main query (simple_sum) is ready to be run:
    • Instead of executing the main query against all documents in the index, it only needs to be run against the intersection of the DocSets from each of the individual (cached) filters
    • Since document ‘A’ did not match the “low_bar” fq, the “simple_sum” function is never asked to evaluate it as a possible match for the overall request
    • Likewise: since ‘B’ did not match the “pos_foo” fq, it is also never considered.
    • Likewise: since ‘C’ did not match the “low_bar” fq, it is also never considered.
    • Only document “D” matched both fq filters, so it is checked against the main query — and it is a match, so we have hits=1
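The cached-filter flow just described can be modeled in a few lines of Python (a sketch only; Solr's filterCache holds optimized DocSet structures, not Python sets, and the names here are invented for illustration):

```python
def cached_filter_search(num_docs, main_q, filters, filter_cache):
    """filters: list of (name, predicate) pairs. Each fq is run over
    the whole index (or fetched from the cache) as a set of docIds;
    the main query is then evaluated only against the intersection."""
    docsets = []
    for name, predicate in filters:
        if name not in filter_cache:                       # cache miss
            filter_cache[name] = {d for d in range(num_docs)
                                  if predicate(d)}
        docsets.append(filter_cache[name])
    candidates = (set.intersection(*docsets) if docsets
                  else set(range(num_docs)))
    return sorted(d for d in candidates if main_q(d))
```

Run against the four sample documents with the pos_foo and low_bar predicates, only doc #3 ("D") ever reaches the main query, and a second request with a different main query reuses both cached sets without re-evaluating the filter predicates.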

In future requests, even if the main q param changes and may potentially match a different set of values/documents, the cached filter queries can still be re-used to limit the set of documents the main query has to check — as we can see in this next request using the same fq params…

http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20u=999}trace%28max_foo,foo_i%29&fq={!frange%20l=0}trace%28pos_foo,foo_i%29&fq={!frange%20u=90}trace%28low_bar,bar_i%29

// q = {!frange u=999}trace(max_foo,foo_i)
// fq = {!frange l=0}trace(pos_foo,foo_i)
// fq = {!frange u=90}trace(low_bar,bar_i)

TraceValueSource$TracerValues; max_foo: exists(#3: "D") -> true
TraceValueSource$TracerValues; max_foo: floatVal(#3: "D") -> 7.0
SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+u%3D999}trace(max_foo,foo_i)&omitHeader=true&fl=id&fq={!frange+l%3D0}trace(pos_foo,foo_i)&fq={!frange+u%3D90}trace(low_bar,bar_i)} hits=1 status=0 QTime=1

Non-cached Filters

Now let’s consider what happens if we add 2 optional local params to our filter queries:

  • cache=false – Tells Solr that we don’t need/want this filter to be cached independently for re-use.
    • This will allow Solr to evaluate these filters at the same time it’s processing the main query
  • cost=X – Specifies an integer “hint” to Solr regarding how expensive it is to execute this filter.
    • Solr provides special treatment to some types of filters when 100 <= cost (more on this later)
    • By default Solr assumes most filters have a cost of 0 (but beginning with Solr 7.2, {!frange} queries default to cost=100)
    • For these examples, we’ll explicitly specify a cost on each fq such that: 0 < cost < 100.
http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29&fq={!frange%20cache=false%20cost=50%20l=0}trace%28pos_foo_nocache_50,foo_i%29&fq={!frange%20cache=false%20cost=25%20u=100}trace%28low_bar_nocache_25,bar_i%29

// q = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))
// fq = {!frange cache=false cost=50 l=0}trace(pos_foo_nocache_50,foo_i)
// fq = {!frange cache=false cost=25 u=100}trace(low_bar_nocache_25,bar_i)

TraceValueSource$TracerValues; low_bar_nocache_25: exists(#0: "A") -> true
TraceValueSource$TracerValues; low_bar_nocache_25: floatVal(#0: "A") -> 99.0
TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#0: "A") -> true
TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#0: "A") -> 42.0
TraceValueSource$TracerValues; simple_sum: exists(#0: "A") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#0: "A") -> 141.0
TraceValueSource$TracerValues; low_bar_nocache_25: exists(#1: "B") -> true
TraceValueSource$TracerValues; low_bar_nocache_25: floatVal(#1: "B") -> 75.0
TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#1: "B") -> true
TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#1: "B") -> -42.0
TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#2: "C") -> true
TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#2: "C") -> -7.0
TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#3: "D") -> true
TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#3: "D") -> 7.0
TraceValueSource$TracerValues; low_bar_nocache_25: exists(#3: "D") -> true
TraceValueSource$TracerValues; low_bar_nocache_25: floatVal(#3: "D") -> 50.0
TraceValueSource$TracerValues; simple_sum: exists(#3: "D") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#3: "D") -> 57.0
SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id&fq={!frange+cache%3Dfalse+cost%3D50+l%3D0}trace(pos_foo_nocache_50,foo_i)&fq={!frange+cache%3Dfalse+cost%3D25+u%3D100}trace(low_bar_nocache_25,bar_i)} hits=1 status=0 QTime=8

Let’s again step through this in sequence and talk about what’s happening at each point:

  • Because the filters are not cached, Solr can combine them with the main q query and execute all three in one pass over the index
  • The filters are sorted according to their cost, and the lowest cost filter (low_bar_nocache_25) is asked to find the “first” document it matches:
    • Document “A” is a match for low_bar_nocache_25 (bar_i <= 100) so then the next filter is consulted…
    • Document “A” is also a match for pos_foo_nocache_50 (0 <= foo_i) so all filters match — the main query can be consulted…
    • Document “A” is not a match for the main query (simple_sum)
  • The filters are then asked to find their “next” match after “A”, beginning with the lowest cost filter: low_bar_nocache_25
    • Document “B” is a match for ‘low_bar_nocache_25’, so the next filter is consulted…
    • Document “B” is not a match for the ‘pos_foo_nocache_50’ filter, so that filter keeps checking until it finds its “next” match (after “B”)
    • Document “C” is not a match for the ‘pos_foo_nocache_50’ filter, so that filter keeps checking until it finds its “next” match (after “C”)
    • Document “D” is the “next” match for the ‘pos_foo_nocache_50’ filter, so the remaining filter(s) are consulted regarding that document…
    • Document “D” is also a match for the ‘low_bar_nocache_25’ filter, so all filters match — the main query can be consulted again.
    • Document “D” is a match for the main query (simple_sum), and we have our first (and only) hit for the request
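The walkthrough above can be condensed into a toy model: sort the non-cached filters by cost, let the cheapest clause propose the next candidate document, and have each remaining clause either confirm the candidate or advance past it. This Python sketch (invented names; Lucene's real advance()-based iteration is far more sophisticated) reproduces the behavior just traced:

```python
def filtered_search(num_docs, main_q, filters):
    """filters: list of (cost, predicate) pairs. Clauses are sorted by
    cost; the cheapest leads, the others confirm or advance the target,
    and the main query is only consulted when every clause agrees."""
    clauses = [p for _, p in sorted(filters, key=lambda f: f[0])]

    def next_match(pred, doc):            # linear scan stands in for
        while doc < num_docs and not pred(doc):   # Lucene's advance()
            doc += 1
        return doc

    hits, target = [], 0
    while target < num_docs:
        target = next_match(clauses[0], target)   # cheapest clause leads
        confirmed = True
        for pred in clauses[1:]:
            cand = next_match(pred, target)
            if cand != target:            # ruled out; cand is the new target
                target, confirmed = cand, False
                break
        if confirmed and target < num_docs and main_q(target):
            hits.append(target)
        if confirmed:
            target += 1
    return hits
```

Either cost ordering of the two example filters yields the same single hit (doc #3, "D"); what the costs change is which predicate gets probed against which intermediate documents, exactly as the trace logs show.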

There are two very important things to note here that may not be immediately obvious:

  1. Just because the individual fq params indicate cache=false does not mean that nothing about their results will be cached. The results of the main q in conjunction with the (non-cached) filters can still wind up in the queryResultCache, as you can see if the exact same query is re-executed…

     http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29&fq={!frange%20cache=false%20cost=50%20l=0}trace%28pos_foo_nocache_50,foo_i%29&fq={!frange%20cache=false%20cost=25%20u=100}trace%28low_bar_nocache_25,bar_i%29

     // q = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))
     // fq = {!frange cache=false cost=50 l=0}trace(pos_foo_nocache_50,foo_i)
     // fq = {!frange cache=false cost=25 u=100}trace(low_bar_nocache_25,bar_i)

     SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id&fq={!frange+cache%3Dfalse+cost%3D50+l%3D0}trace(pos_foo_nocache_50,foo_i)&fq={!frange+cache%3Dfalse+cost%3D25+u%3D100}trace(low_bar_nocache_25,bar_i)} hits=1 status=0 QTime=1

    …we don’t get any trace() messages, because the entire “q + fqs + sort + pagination” combination was in the queryResultCache.

    (NOTE: Just as specifying cache=false in the local params of the fq params prevents them from being put in the filterCache, specifying cache=false on the q param can also prevent an entry for this query from being added to the queryResultCache, if desired.)

  2. The relative cost value of each filter does not dictate the order that they are evaluated against every document.
    • In the example above, the higher cost=50 specified on the ‘pos_foo_nocache_50’ filter did not ensure it would be executed against fewer documents than the lower cost ‘low_bar_nocache_25’ filter
      • Document “C” was checked against (and ruled out by) the (higher cost) ‘pos_foo_nocache_50’ filter without ever being checked against the lower cost ‘low_bar_nocache_25’ filter
    • The cost only indicates in what order each filter should be consulted to find its “next” matching document after each previously found match against the entire request
      • Relative cost values ensure that a higher cost filter will not be asked to check for the “next” match against any document that a lower cost filter has already definitively ruled out as a non-match.

    Compare the results above with the following example, where the same functions use new ‘cost’ values:

    http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29&fq={!frange%20cache=false%20cost=10%20l=0}trace%28pos_foo_nocache_10,foo_i%29&fq={!frange%20cache=false%20cost=80%20u=100}trace%28low_bar_nocache_80,bar_i%29

    // q = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))
    // fq = {!frange cache=false cost=10 l=0}trace(pos_foo_nocache_10,foo_i)
    // fq = {!frange cache=false cost=80 u=100}trace(low_bar_nocache_80,bar_i)

    TraceValueSource$TracerValues; pos_foo_nocache_10: exists(#0: "A") -> true
    TraceValueSource$TracerValues; pos_foo_nocache_10: floatVal(#0: "A") -> 42.0
    TraceValueSource$TracerValues; low_bar_nocache_80: exists(#0: "A") -> true
    TraceValueSource$TracerValues; low_bar_nocache_80: floatVal(#0: "A") -> 99.0
    TraceValueSource$TracerValues; simple_sum: exists(#0: "A") -> true
    TraceValueSource$TracerValues; simple_sum: floatVal(#0: "A") -> 141.0
    TraceValueSource$TracerValues; pos_foo_nocache_10: exists(#1: "B") -> true
    TraceValueSource$TracerValues; pos_foo_nocache_10: floatVal(#1: "B") -> -42.0
    TraceValueSource$TracerValues; pos_foo_nocache_10: exists(#2: "C") -> true
    TraceValueSource$TracerValues; pos_foo_nocache_10: floatVal(#2: "C") -> -7.0
    TraceValueSource$TracerValues; pos_foo_nocache_10: exists(#3: "D") -> true
    TraceValueSource$TracerValues; pos_foo_nocache_10: floatVal(#3: "D") -> 7.0
    TraceValueSource$TracerValues; low_bar_nocache_80: exists(#3: "D") -> true
    TraceValueSource$TracerValues; low_bar_nocache_80: floatVal(#3: "D") -> 50.0
    TraceValueSource$TracerValues; simple_sum: exists(#3: "D") -> true
    TraceValueSource$TracerValues; simple_sum: floatVal(#3: "D") -> 57.0
    SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id&fq={!frange+cache%3Dfalse+cost%3D10+l%3D0}trace(pos_foo_nocache_10,foo_i)&fq={!frange+cache%3Dfalse+cost%3D80+u%3D100}trace(low_bar_nocache_80,bar_i)} hits=1 status=0 QTime=3

    The overall flow is fairly similar to the last example:

    • Because the filters are not cached, Solr can combine them with the main query and execute all three in one pass over the index
    • The filters are sorted according to their cost, and the lowest cost filter (pos_foo_nocache_10) is asked to find the “first” document it matches:
      • Document “A” is a match for pos_foo_nocache_10 (0 <= foo_i) — so the next filter is consulted…
      • Document “A” is a match for low_bar_nocache_80 (bar_i <= 100) — so all filters match, and so the main query can be consulted…
      • Document “A” is not a match for the main query (simple_sum)
    • The filters are then asked to find their “next” match after “A”, beginning with the lowest cost filter: (pos_foo_nocache_10)
      • Document “B” is not a match for the ‘pos_foo_nocache_10’ filter, so that filter keeps checking until it finds its “next” match (after “B”)
      • Document “C” is not a match for the ‘pos_foo_nocache_10’ filter, so that filter keeps checking until it finds its “next” match (after “C”)
      • Document “D” is the “next” match for the ‘pos_foo_nocache_10’ filter, so the remaining filter(s) are consulted regarding that document…
      • Document “D” is also a match for the ‘low_bar_nocache_80’ filter, so all filters match — the main query can be consulted again.
      • Document “D” is a match for the main query, and we have our first (and only) hit for the request

The key thing to note in these examples is that even though we’ve given Solr a “hint” about the relative cost of these filters, the underlying scoring APIs in Lucene depend on being able to ask each Query to find the “next match after doc#X”. Once a “low cost” filter has been asked to do this, the document it identifies will be used as the input when asking a “higher cost” filter to find its “next match”, and if the higher cost filter matches very few documents, it may have to “scan over” more total documents in the segment than the lower cost filter.
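The hand-off between filters can be sketched with a minimal, illustrative Python example. This is not Solr code; the documents, field values, and predicates simply mirror the foo_i/bar_i walkthrough above, and the bar values for “B” and “C” are placeholders since they are never consulted:

```python
# Minimal, illustrative sketch (NOT Solr code) of the "next match after
# doc#X" hand-off between cost-sorted filters and the main query.
docs = {"A": {"foo": 42, "bar": 99},
        "B": {"foo": -42, "bar": 7},   # bar values for B/C are placeholders;
        "C": {"foo": -7, "bar": 50},   # they are never consulted below
        "D": {"foo": 7, "bar": 50}}
order = ["A", "B", "C", "D"]

def next_match(predicate, start):
    """Find the first doc id at or after `start` that matches `predicate`."""
    for doc_id in order[order.index(start):]:
        if predicate(docs[doc_id]):
            return doc_id
    return None

# Filters sorted by declared cost, lowest first.
filters = [lambda d: d["foo"] >= 0,        # pos_foo_nocache_10 (cost=10)
           lambda d: d["bar"] <= 100]      # low_bar_nocache_80 (cost=80)
main_q = lambda d: 0 <= d["foo"] + d["bar"] <= 100   # simple_sum

hits, candidate = [], "A"
while candidate is not None:
    # The lowest-cost filter proposes the next candidate...
    candidate = next_match(filters[0], candidate)
    if candidate is None:
        break
    # ...and the remaining filters, then the main query, must all agree.
    if all(f(docs[candidate]) for f in filters[1:]) and main_q(docs[candidate]):
        hits.append(candidate)
    nxt = order.index(candidate) + 1
    candidate = order[nxt] if nxt < len(order) else None

print(hits)  # -> ['D']
```

As in the trace above, the low-cost filter proposes “A” (which the main query rejects) and then leapfrogs over “B” and “C” to propose “D”, the lone hit.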

Post Filtering

There are a small handful of Queries available in Solr (notably {!frange} and {!collapse}) which — in addition to supporting the normal Lucene iterative scoring APIs — also implement a special “PostFilter” API.

When a Solr request includes a filter that is cache=false and has a cost >= 100, Solr will check whether the underlying Query implementation supports the PostFilter API; if it does, Solr will automatically use it, ensuring that these post filters will only be consulted about a potential matching document after:

  • It has already been confirmed to be a match for all regular (non-post) fq filters
  • It has already been confirmed to be a match for the main q Query
  • It has already been confirmed to be a match for any lower cost post-filters

(This overall user experience (and the special treatment of cost >= 100, rather than any sort of special postFilter=true syntax) is focused on letting users indicate how “expensive” they expect the various filters to be, while letting Solr worry about the best way to handle those various expensive filters depending on how they are implemented internally, without the user being required to know in advance “Does this query support post filtering?”)

For Advanced Solr users who want to write custom filtering plugins (particularly security related filtering that may need to consult external data sources or enforce complex rules) the PostFilter API can be a great way to ensure that expensive operations are only executed if absolutely necessary.
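As a rough illustration of that guarantee, here is a small, hypothetical Python sketch (not the real Solr PostFilter API, which is implemented via delegating collectors in Java): the “post filter” predicate is only reached once every regular filter and the main query have already matched.

```python
# Illustrative sketch (NOT the real Solr PostFilter API): a post-filter
# predicate is only reached after the regular filters and the main query
# have all matched, mimicking a delegating-collector chain.
consulted = []  # records which predicate looked at which document

def tracer(name, predicate):
    def traced(doc_id, fields):
        consulted.append((name, doc_id))
        return predicate(fields)
    return traced

docs = {"A": {"foo": 42, "bar": 99},
        "B": {"foo": -42, "bar": 7},   # bar values for B/C are placeholders
        "C": {"foo": -7, "bar": 50},
        "D": {"foo": 7, "bar": 50}}

regular_fq = tracer("pos_foo", lambda d: d["foo"] >= 0)
main_q = tracer("simple_sum", lambda d: 0 <= d["foo"] + d["bar"] <= 100)
post_fq = tracer("low_bar_post", lambda d: d["bar"] <= 100)

# `and` short-circuits, so post_fq runs only on full matches of the rest.
hits = [doc_id for doc_id, f in docs.items()
        if regular_fq(doc_id, f) and main_q(doc_id, f) and post_fq(doc_id, f)]

print(hits)  # -> ['D']
# The post filter was never consulted about "A" (rejected by the main query):
print(("low_bar_post", "A") in consulted)  # -> False
```

This is exactly why post filtering suits expensive checks (external ACL lookups, complex security rules): the costly predicate runs only on documents that would otherwise be returned.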

Let’s reconsider our earlier example of non-cached filter queries, but this time we’ll use cost=200 on the bar < 100 filter condition so it will be used as a post filter…

    http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29&fq={!frange%20cache=false%20cost=50%20l=0}trace%28pos_foo_nocache_50,foo_i%29&fq={!frange%20cache=false%20cost=200%20u=100}trace%28low_bar_postfq_200,bar_i%29
    // q  = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))
    // fq = {!frange cache=false cost=50 l=0}trace(pos_foo_nocache_50,foo_i)
    // fq = {!frange cache=false cost=200 u=100}trace(low_bar_postfq_200,bar_i)
    TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#0: "A") -> true
    TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#0: "A") -> 42.0
    TraceValueSource$TracerValues; simple_sum: exists(#0: "A") -> true
    TraceValueSource$TracerValues; simple_sum: floatVal(#0: "A") -> 141.0
    TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#1: "B") -> true
    TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#1: "B") -> -42.0
    TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#2: "C") -> true
    TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#2: "C") -> -7.0
    TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#3: "D") -> true
    TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#3: "D") -> 7.0
    TraceValueSource$TracerValues; simple_sum: exists(#3: "D") -> true
    TraceValueSource$TracerValues; simple_sum: floatVal(#3: "D") -> 57.0
    TraceValueSource$TracerValues; low_bar_postfq_200: exists(#3: "D") -> true
    TraceValueSource$TracerValues; low_bar_postfq_200: floatVal(#3: "D") -> 50.0
    SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id&fq={!frange+cache%3Dfalse+cost%3D50+l%3D0}trace(pos_foo_nocache_50,foo_i)&fq={!frange+cache%3Dfalse+cost%3D200+u%3D100}trace(low_bar_postfq_200,bar_i)} hits=1 status=0 QTime=4

Here we see a much different execution flow from the previous examples:

  • The lone non-cached (non-post) filter (pos_foo_nocache_50) is initially consulted to find the “first” document it matches
    • Document “A” is a match for pos_foo_nocache_50 (0 <= foo) — so all “regular” filters match, and the main query can be consulted…
    • Document “A” is not a match for the main query (simple_sum) so we stop considering “A”
    • The post-filter (low_bar_postfq_200) is never consulted regarding “A”
  • The lone non-post filter is again asked to find its “next” match after “A”
    • Document “B” is not a match for the ‘pos_foo_nocache_50’ filter, so that filter keeps checking until it finds its “next” match (after “B”)
    • Document “C” is not a match for the ‘pos_foo_nocache_50’ filter, so that filter keeps checking until it finds its “next” match (after “C”)
    • Document “D” is the “next” match for the ‘pos_foo_nocache_50’ filter — since there are no other “regular” filters, the main query is consulted again
    • Document “D” is also a match for the main query
    • After all other conditions have been satisfied Document “D” is then checked against the post filter (low_bar_postfq_200) — since it matches we have our first (and only) hit for the request

In these examples, the functions we’ve used in our filters have been relatively simple, but if you wanted to filter on multiple complex math functions over many fields, you can see how specifying a “cost” relative to the complexity of the function could be advantageous to ensure that the “simpler” functions are checked first.
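To make the payoff concrete, here is a tiny, hypothetical Python sketch counting how often each predicate runs when a cheap check is chained ahead of an expensive one (the two functions are stand-ins, not Solr functions):

```python
# Hypothetical illustration: chaining a cheap predicate ahead of an
# expensive one means the expensive one only runs on documents that
# already passed the cheap check (thanks to short-circuit evaluation).
calls = {"cheap": 0, "expensive": 0}

def cheap(x):               # stands in for a simple, low-cost function
    calls["cheap"] += 1
    return x % 2 == 0

def expensive(x):           # stands in for a complex multi-field function
    calls["expensive"] += 1
    return sum(i * x for i in range(100)) >= 0

data = range(1000)
hits = [x for x in data if cheap(x) and expensive(x)]

print(calls)  # -> {'cheap': 1000, 'expensive': 500}
```

Flipping the order would run the expensive predicate 1000 times for the same result, which is the intuition behind assigning higher cost values to more complex filter functions.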

In Conclusion…

Hopefully these examples I’ve walked through are helpful for folks trying to wrap their heads around how and why filter queries behave in various situations, and specifically how {!frange} queries work, so you can consider some of the trade-offs of tweaking the cache and cost params of your various filters.

Even for me, with ~12 years of Solr experience, running through these examples made me realize I had a misconception about how/when FunctionRangeQuery could be optimized (ultimately leading to SOLR-11641, which should make {!frange cache=false ...} much faster by default in future Solr versions).

The post Caching, and Filters, and Post-Filters, Oh My! appeared first on Lucidworks.

ACRL TechConnect: Memory Labs and audiovisual digitization workflows with Lorena Ramírez-López

planet code4lib - Mon, 2017-11-27 16:00

Hello! I’m Ashley Blewer, and I’ve recently joined the ACRL TechConnect blogging team. For my first post, I wanted to interview Lorena Ramírez-López. Lorena is working (among other places) at the D.C. Public Library on their Memory Lab initiative, which we will discuss below. Although this upcoming project targets public libraries, Lorena has a history of dedication to providing open technical workflows and documentation to support any library’s mission to set up similar “digitization stations.”

Hi Lorena! Can you please introduce yourself?

Hi! I’m Lorena Ramírez-López. I am a born and raised New Yorker from Queens. I went to New York University for Cinema Studies and Spanish, where I did an honors thesis on Paraguayan cinema in regards to sound theory. I continued my education at NYU and graduated from the Moving Image Archiving and Preservation program, where I concentrated on video and digital preservation. I was one of the National Digital Stewardship Residents for the American Archive of Public Broadcasting. I did my residency at Howard University’s television station (WHUT) in Washington, D.C. from 2016 until June 2017. Along with being the project manager for the Memory Lab Network, I do contracting work for the National Portrait Gallery on their time-based media artworks, have joined the Women Who Code community, and teach Spanish at Fluent City!


Tell us a little bit about DCPL’s Memory Lab and your role in it.

The DC Public Library’s Memory Lab was a National Digital Stewardship Project back in 2014 through 2015. This was the baby of DCPL’s National Digital Stewardship Resident, Jaime Mears, back in the day. A lot of my knowledge of how it started comes from reading the original project proposal, which you can find on the Library of Congress’s website, as well as Jaime Mears’s final report on the Memory Lab, which is on the DC Library website. But to summarize its origin story, the Memory Lab was created as a local response to the fact that communities are generating a lot of digital content while still keeping many of their physical materials like VHS, miniDVs, and photos, but might not necessarily have the equipment or knowledge to preserve their content. It has been widely accepted in the archival and preservation fields that we have an approximate 15- to 20-year window of opportunity to digitally preserve legacy audio and video recordings on magnetic tape because of the rate of degradation and the obsolescence of playback equipment. The term “video at risk” might ring a bell to some people. There are also photographs and film, particularly color slides and negatives and moving image film formats, that will fade and degrade over time. People want to save their memories as well as share them on a digital platform.

There are well-established best practices for digital preservation in archival practice, but these guidelines and documentation are generally written for a professional audience. And while there are various personal digital archiving resources for a public audience, they aren’t really easy to find on the web, and a lot of these resources aren’t updated to reflect the changes in our technology, software, and habits.

That being the case, our communities risk massive loss of history and culture! And to quote Gabriela Redwine’s Digital Preservation Coalition report,  “personal digital archives are important not just because of their potential value to future scholars, but because they are important to the people who created them.”

So the Memory Lab was the library’s local response in the Washington D.C. area to bridge this gap of digital archiving knowledge and provide the tools and resources for library patrons to personally digitize their own personal content.

My role is maintaining this memory lab (digitization rack). When hardware gets worn down or breaks, I fix it. When software for our computers upgrade to newer systems, I update our workflows.

I am currently re-doing the website to reflect the new wiring I did and updating the instructions with more explanations and images. You can expect gifs!


You recently received funding from IMLS to create a Memory Lab Network. Can you tell us more about that?

Yes! The DC Public Library in partnership with the Public Library Association received a national leadership grant to expand the memory lab model.

During this project, the Memory Lab Network will partner with seven public libraries across the United States. Our partners will receive training, mentoring, and financial support to develop their own memory lab as well as programs for their library patrons and community to digitize and preserve their personal and family collections. A lot of focus is put on the digitization rack, mostly because it’s cool, but the memory lab model is not just creating a digitization rack. It’s also developing classes and online resources for the community to understand that digital preservation doesn’t just end with digitizing analog formats.

By creating these memory labs, these libraries will help bridge the digital preservation divide between the professional archival community and the public community. But first we have to train and help the libraries set up the memory lab, which is why we are providing travel grants to Washington, D.C. for an in-depth digital preservation bootcamp and training for these seven partners.

If anyone wants to read the proposal, the Institute of Museum and Library Sciences has it here.


What are the goals of the Memory Lab Network, and how do you see this making an impact on the overall library field (outside of just the selected libraries)?

One of the main goals is to see how well the memory lab model holds up. The memory lab was a local response to a need but it was meant to be replicated. This funding is our chance to see how we can adapt and improve the memory lab model for other public libraries and not just our own urban library in Washington D.C.

There are actually many institutions and organizations that have digitization stations and/or the knowledge and resources, but we just don’t realize who they are. Sometimes it feels like we keep reinventing the wheel with digital preservation. There are plenty of websites that at one time had contemporary information on digital preservation and links to articles and other explanations. Then those websites weren’t sustained and remained stagnant, housing a series of broken links and lost PDFs. We could (and should) be better about not just creating new resources, but updating the ones we have.

The reasons why some organizations aren’t transparent or don’t update their information, or why we aren’t searching in certain areas, vary, but we should be better at documenting and sharing our information with our archival and public communities. This is why the other goal is to create a network to better communicate and share.


What advice do you have for librarians thinking of setting up their own digitization stations? How can someone learn more about aspects of audiovisual preservation on the job?

If you are thinking of setting up your own digitization station, announce that not only to your local community but also the larger archival community. Tell us about this amazing adventure you’re about to tackle. Let us know if you need help! Circulate and cite that article you thought was super helpful. Try to communicate not only your successes but also your problems and failures.

We need to be better at documenting and sharing what we’re doing, especially when dealing with how to handle and repair playback decks for magnetic media. Besides the fact that the companies just stopped supporting this equipment, a lot of this information on how to support and repair equipment could have been shared or passed down by really knowledgeable experts, but it wasn’t. Now we’re all holding our breath and pulling our hair out because this one dude who repairs U-matic tapes is thinking about retiring. This lack of information and communication shouldn’t be the case in our environment, when we can email and call.

We tend to freak out about audiovisual preservation because we see how other professional institutions set up their workflows and the amount of equipment they have. The great advantage libraries have is that not only can they act practically with their resources but also they have the best type of feedback to learn from: library patrons. We’re creating these memory lab models for the general public so getting practical experience, feedback, and concerns are great ways to learn more on what aspects of audiovisual preservation really need to be fleshed out and researched.

And for fun, try creating and archiving your own audiovisual media! You technically already do with taking photos and videos on your phone. Getting to know your equipment and where your media goes is very helpful.


Thanks very much, Lorena!

For more information on how to set up a “digitization station” at your library, I recommend Dorothea Salo’s robust website detailing how to build an “audio/video or digital data rescue kit”, available here.


HangingTogether: Equity, Diversity, and Inclusion (EDI)

planet code4lib - Mon, 2017-11-27 14:00

My colleagues Rebecca Bryant and Merrilee Proffitt have summarized discussions at our 1 November 2017 regional OCLC Research Library Partnership meeting in Baltimore on evolving scholarly services and workflows and on moving forward with unique and distinctive collections. This summarizes our third discussion thread, on equity, diversity, and inclusion (EDI).

We conducted an “Equity, Diversity and Inclusion Survey of the OCLC Research Library Partnership” between 12 September and 13 October of this year.  We wanted to obtain a snapshot of the EDI efforts within the Partnership that could help identify specific follow-up activities which could provide assistance and better inform practice. We were pleased that 63 Partners in nine countries responded to the survey. A summary of the results served as the framework for discussions with Partners at our meeting in Baltimore.

Some highlights from the survey results:

  • 59% of the respondents had set up or plan to set up an EDI committee or working group.
  • 72% were using or planned to use EDI goals and principles to inform their collections’ workflows, practices, or services.
  • Of those who responded they were using or planned to use EDI goals, 79% were working with other institutions, organizations, or community groups on EDI to improve representation of marginalized groups in collections, practices, or services.
  • The top three areas that 80% or more of the respondents had already changed due to their institutions’ EDI goals and principles were:
    • Activities and events
    • Recruitment and retention
    • Outreach to marginalized communities
  • The top two areas where 70% or more of the respondents planned to change but haven’t yet were:
    • Search and discovery interfaces
    • Metadata descriptions in library catalogs

Biggest institutional challenges in Partners’ EDI efforts included building relationships with marginalized communities and creating a positive work climate, which, in turn, would help recruit and retain diverse staff.

OCLC Research staff will be looking at ways to follow up on the suggestions from our Partnership meeting discussions, which included sharing a list of resources from the survey responses and researching the landscape of EDI efforts being conducted by other professional organizations to identify gaps so we can better leverage community activities. Meanwhile, feel free to post your own EDI efforts in the comments below!

The OCLC Research Library Partnership provides a unique transnational collaborative network of peers to address common issues as well as the opportunity to engage directly with OCLC Research.

OCLC Dev Network: Alexa Meets WorldCat

planet code4lib - Mon, 2017-11-27 14:00

The provocative statement “voice is the new UI” came into vogue in 2016. I was skeptical at first, in the way many of us have become when topics trend like this. I thought to myself, “Isn’t a UI something on a screen?” However, as 2017 now comes to a close, it is abundantly clear that voice UIs aren’t just a fad.

District Dispatch: Net neutrality protections eliminated in draft FCC order

planet code4lib - Mon, 2017-11-27 13:55

Last week, we highlighted a disturbing policy change that we had been anticipating for a while: Federal Communications Commission (FCC) Chairman Pai’s plan to roll back the net neutrality rules that require internet service providers to treat all internet traffic and services equally.

Between Thanksgiving preparations and leftovers, we have had some time to review this big turkey (220 pages worth). Below are some first impressions.

Before we dive in, now is the time to raise the volume on outcry as members of Congress return from the holiday. We have set up an email tool so you can make your voice heard in advance of the FCC’s December 14 vote. Visit our action center and contact your elected officials now.

First, the existing regulations the FCC passed in 2015 established clear, bright-line rules prohibiting harmful behavior by commercial internet service providers (ISPs)—for both mobile and fixed broadband. Chairman Pai’s draft order eliminates all those rules and only requires bare bones transparency disclosures—that is, ISPs could degrade service or block access to certain sites for libraries and their patrons but would need to tell them first.

The ALA has argued early and often for more transparency related to broadband offerings and ISP network management practices, and we are glad to see a nod to this in the draft order. But it’s not enough. When you have little or no choice of ISP (as of December 2016, nearly half of U.S. homes lack access to more than one broadband offering at the level of 25Mbps download/3Mbps upload), knowing that you may be paying more and getting less is no comfort. The original 2015 Order provided strong transparency, as well as enforceable “rules of the road” for broadband providers to make sure the internet we enjoy today will continue to flourish.

Second, the FCC’s draft order undermines the court-affirmed legal foundation for protecting the open internet by reversing the 2015 reclassification of broadband as a Title II service under the Communications Act. In doing so, it also undermines its own authority—giving it to the Federal Trade Commission (FTC). While the FCC touts this as a feature, others have noted the FTC would be more limited in its ability to protect net neutrality.

Third, in classic “black is white” doublespeak, the draft order states that “through these actions, we advance our critical work to promote broadband deployment in rural America and infrastructure investment throughout the nation, brighten the future of innovation both within networks and at their edge, and move closer to the goal of eliminating the digital divide.” If only the ISPs could charge content providers to reach consumers, the draft order argues, they would use this windfall to build broadband capacity for geographically isolated communities. And then those residents would be able to “choose” how much they would like to pay for whatever package is offered to them in this bright and innovative new world order. (Of course, consumers also likely will be paying higher prices to content providers, as well, as those costs get passed along.)

ALA has consistently argued that once the consumer—including libraries—pays for whatever broadband speed they can afford, s/he should be able to choose any and all legal content on the web. And, in a Web 2.0 world, we know that many consumers also are creators— that is the true power of the internet. The internet is not—and has never been—a broadcast medium in which increasingly consolidated power lies in the hands of a few, but a place where people create and share and launch new enterprises. It’s no wonder the small business community has been among the strongest proponents of net neutrality protections. ALA and libraries stand with all of our patrons and our neighbors who depend on net neutrality not only for equitable access but equitable collaboration and equitable distribution.

Finally (for now), this new FCC order would create a world where ISPs are allowed to block, slow down and limit quality access to any websites or applications they want. ALA stands vehemently opposed to these actions; the draft order violates all the principles we believe are necessary for a free and open internet as well as fundamental library values.

The FCC is scheduled to vote on this dangerous proposal at its meeting on December 14, and every indication suggests Chairman Pai has the three votes he needs to pass it. The independent regulatory agency received a record number of comments in the public record, including from the ALA and many library professionals, and the majority of comments favored maintaining the net neutrality protections already in place. As in past advocacy to preserve net neutrality protections, the fight ahead is more of a marathon than a sprint.

That said, Congress can play a role in at least two important ways:

  • First, strong disapproval from members of Congress (especially from Republicans and even more importantly from those on the committees that oversee the FCC) could persuade the FCC to pause its planned vote. We have set up an email tool so you can make your voice heard with your member of Congress in advance of the December 14 vote.
  • Second, if the FCC moves ahead with the vote and it passes along party lines as expected, members of Congress could use the Congressional Review Act to reject this destructive policy move. (More on this later.)

An even more unlikely third option is that Congress could develop new legislation that would codify net neutrality protections in law. In light of an already complex and heated legislative agenda, this seems improbable.

The more likely (and previously more successful) path of resistance is through the courts. We will talk more about legal challenges in a future blog.

The post Net neutrality protections eliminated in draft FCC order appeared first on District Dispatch.

Terry Reese: MarcEdit 7 staging

planet code4lib - Mon, 2017-11-27 08:41

As of 12 am, Nov. 27 – I’ve staged all the content for MarcEdit 7. Technically, if you download the current build from the Release Candidate page – you’ll get the new code. However – there’s a couple things I need to test and finish prepping, so I’m just staging content tonight. Things left to test:

  1. Automated update – I need to make sure that the update mechanism has switched gracefully to the new code-base, and I can’t test that without staging an update. So, that’s what I’m going to do tomorrow. Currently, I have the code running; tomorrow, I’ll update the build number and stage an update for testing purposes.
  2. I need to update the Cleaner program – while MarcEdit is easier to clean when installed only in the user space, the problem is that if an update becomes corrupted, you still have to remove a registry key. Those keys are hard to find, and the cleaner just needs to be updated to automatically find them and remove them when necessary.
  3. I want to update the delivery mechanism on the website. With MarcEdit 7, there are 4 Windows installers – 2 that install without administrative permissions, 2 that do. I’d recommend that users install the versions that do not require administrative permissions, but there may be times when the other version is more appropriate (like if you have more than one user signing in on a machine). I’m working on a mechanism that will enable users to select 32 or 64 bit, and then get the 2 appropriate download links, with information about which version would be recommended and the use cases each version is designed for.

Questions, let me know.


HangingTogether: Announcing a Spanish language version of the Survey on Research Information Management Practices

planet code4lib - Mon, 2017-11-27 05:22

We are pleased to announce that thanks to translation support from CONCYTEC, the Peruvian National Council for Science, Technology and Technological Innovation, the international Survey of Research Information Management Practices, collaboratively developed by OCLC Research and euroCRIS, is now available in a Spanish-language version.

CONCYTEC will be encouraging the completion of the survey by research organizations across Peru. Andres Melgar, Director for Evaluation and Knowledge Management of CONCYTEC, states:

CONCYTEC’s Direction of Evaluation and Knowledge Management is currently assessing RIM practices and capacities in Peruvian universities and research institutes, and finds that the OCLC & euroCRIS RIM Survey is a most valuable instrument which will be of great assistance in this endeavor towards the establishment of a national RIM infrastructure based on global standards.

CONCYTEC is a decentralized public body under the authority of the Peruvian Presidency of the Council of Ministers. Its mission is to formulate national science, technology and innovation (STI) policies and plans, and to promote and manage actions to create and transfer scientific knowledge and technology on behalf of the social and economic development of the country. Through its Direction of Evaluation and Knowledge Management, CONCYTEC is responsible for fostering the establishment and development of a national network of scientific information and interoperability with the goal of enabling timely and efficient management of STI statistics and the gathering of the information required for STI planning, operation, and promotion in the country.

We believe that this survey will provide valuable information to other stakeholders worldwide, including librarians, research administrators, institutional researchers, policy makers, and university and research institutional leaders. We encourage you to share information about this survey and to help ensure that your institution participates.

Open Knowledge Foundation: Pin it in the Parks: Crowdsourcing park facilities information in Dublin

planet code4lib - Fri, 2017-11-24 09:28

Since 2015 Open Knowledge International has been part of the consortium of RouteToPA, a European innovation project aimed at improving citizen engagement by enabling meaningful interaction between open data users, open data publishers and open data. This month, the project is running the Pin it in the Parks competition together with Smart Dublin to encourage, inform and engage citizens and residents on services and issues in their local areas.

This blog has been reposted from the RouteToPA website

Ever wondered if there are tennis courts or exercise machines in the parks near you? What about playgrounds or skate parks or even historical monuments? A four-week competition, Pin it in the Parks, encourages citizens to share information on the facilities available in parks near them in the city of Dublin. By using the RouteToPA Android app, users will have the ability to take photos of facilities they encounter and provide their exact locations, hence making this information available to everyone.

Here is what we know about some of Dublin’s park facilities, through information collected by SmartDublin & RouteToPA project:

This Leaflet map datalet was created via the RouteToPA newsfeed from this dataset.

Why is this competition important?

Local authorities all over Ireland are coming together to improve access to information and to make data more accessible and easy to use. Participating in this competition will strengthen the role played by citizens and residents, firstly by pushing for more open data about issues that touch on their daily lives and secondly by raising awareness about the current state of the parks.

Citizens and residents who are aware of their needs and their surroundings form a stronger pressure group. The competition encourages teamwork, as participants are strongly encouraged to apply in teams, which will increase their chances of getting more points.

The information collected throughout this competition will push citizens to start discussions around their needs and the current states of the parks amongst other topics. RouteToPA, through its SPOD platform, allows participants to start and participate in the conversation around the data collected. Obtaining credible data is the first step towards finding solutions to the challenges a society might face.

Finally, keep an eye on the Pin it in the Parks website where the competition updates and final results will be made available.


Evergreen ILS: Hack-A-Way 2018

planet code4lib - Thu, 2017-11-23 23:54

I’m pleased to announce that the selection committee for the 2018 Hack-A-Way has made its choice. We had several competing offers from wonderful community members to host the event in 2018. Having just wrapped up two years in Indiana, hosted by Evergreen Indiana, I can say that the bar has been set pretty high for future events. This year Kathy Lussier from MassLNC, Dan Wells from Calvin College and Jason Boyer from Evergreen Indiana, all past hosts, acted as our selection committee.

And the choice was …. we’re going back to Atlanta! The host is Equinox Open Library Initiative. Equinox is now the second institution to be a two-time host and was the host of the original Hack-A-Way in 2012. In a way it seems fitting: back in 2012 we had the first discussions of what became the web-based staff client, and we return as we approach the next wave of challenges.

I would like to thank everyone who submitted proposals, everyone who served on the selection committee, and everyone who otherwise does so much to make this happen for the community. More communication will be forthcoming in the next week, as a follow-up survey to the 2017 Hack-A-Way comes out, both to look back at this year’s event and to start planning for 2018.



Open Knowledge Foundation: How do open data measurements help water advocates to advance their mission?

planet code4lib - Thu, 2017-11-23 11:53

This blogpost was jointly written by Danny Lämmerhirt and Nisha Thompson (DataMeet).

Since its creation, the open data community has been at the heart of the Global Open Data Index (GODI). By teaming up with expert civil society organisations we define key datasets that should be opened by government to align with civil society’s priorities. We assumed that GODI also teaches our community to become more literate about government institutions, regulatory systems and management procedures that create data in the first place – making GODI an engagement tool with government.

Tracing the publics of water data

Over the past few months we have reevaluated these assumptions. How do different members of civil society perceive the data assessed by GODI? Is the data usable to advance their mission? How can GODI be improved to accommodate and reflect the needs of civil society? How would we go about developing user-centric open data measurements, and would it be worthwhile to run more local and contextual assessments?

As part of this user research, OKI and DataMeet (a community of data science and open data enthusiasts in India) teamed up to investigate the needs of civic organisations in the water, sanitation and health (WASH) sector. GODI assesses whether governments release information on water quality, that is, pollution levels, per water source. In detail, this means we check whether water data is available at potentially a household level, or at each untreated public water source such as a lake or river. The research was conducted by DataMeet and supervised by OKI, and included interviews and workshops with fifteen different organisations.

In this blogpost we share insights on how law firms, NGOs, academic institutions, and funding and research organisations perceive the usefulness of GODI for their work. Our research focussed on the South Asian countries India, Pakistan, Nepal, and Bangladesh. All four countries face similar issues in ensuring safe water for their populations, because of an over-reliance on groundwater, geogenic pollutants like arsenic, and heavy pollution from industry, urbanisation, farming, and poor sanitation.

According to the latest GODI results, openness of water quality data remains low worldwide.

What kinds of water data matter to organisations in the water sector?

Whilst all interviewed organisations have a stake in access to clean water for citizens, they have very different motivations to use water quality data. Governmental water quality data is needed to

  1. Monitor government activities and highlight general issues with water management (for advocacy groups).
  2. Create a baseline to compare against civil society data (for organisations implementing water management systems).
  3. Detect geographic target areas of under-provision, as well as specific water management problems, to guide investment choices (for funding agencies and decision-makers).

Each use case requires data of different quality. Some advocacy interviewees told us that government data, despite its potentially poor reliability, is enough to make the case that water quality is severely affected across their country. In contrast, researchers need data that is provided continuously and at short updating cycles; such data may not be provided by government. They see government data as support for their background research, but not as a primary source of information. Funders and other decision-makers use water quality data largely for monitoring and evaluation, mostly to make sure their money is being used and is impactful. They will sometimes use their own water quality data to make the point that government data is not adequate. Funders push for data collection at the project level rather than continuous monitoring, which can lead to gaps in understanding.

GODI’s definition of water quality data is output-oriented and of general usefulness. It enables finding an answer to whether the water that people can access is clean or not. Yet organisations on the ground need other data, some of which is process-oriented, to understand how water management services are regulated and governed, or which laboratory is tasked with collecting data. A major issue for meaningful engagement with water-related data is the complexity of water management systems.

In the context of South Asia, managing, tracking, and safeguarding water resources for use today and in the future is complex. Water management systems, from domestic to industrial to agricultural ones, are diverse and hard to examine and keep accountable. Where water is coming from, how much of it is being used and for what, and then how waste is being disposed of are all crucial questions to these systems. Yet there is very little data available to address all these questions.

How do organisations in the WASH sector perceive the GODI interface?

GODI has an obvious drawback for the interviewed organisations: transparency is not a goal for organisations working on the ground and does not in itself provoke an increase in access to safe water or environmental conservation. GODI measures the publication of water quality data, but is not seen to stimulate improved outcomes. It also does not interact with the corresponding government agency.

One part of GODI’s theory of change is that civil society becomes literate about government institutions and can engage with government via the publication of government data. Our interviews suggest that our theory of change needs to be reconsidered or new survey tools need to be developed that can enhance engagement between civil society and government. Below we share some ideas for future scenarios.

Our learnings and the road ahead

Adding questions to GODI

Interviews show that GODI’s current definition of water quality data does not always align with the needs of organisations on the ground. If GODI wants to be useful to organisations in the WASH sector, new questions could be added to the survey and used as a jumping-off point for outreach to these groups. Some examples include:

  1. Add a question regarding metadata and methodology documentation, to capture the quality and provenance of water data as well as where we found and selected the data.
  2. Add a question regarding who collected the data: government or a partner organisation. This allows community members to trace the data producers and engage with them.
  3. Assess the transparency of water reports. Reports should be considered since they are an important source of information for civil society.

Customising the Open Data Survey for regional and local assessments

Many interviewees showed an interest in assessing water quality data at the regional and hyperlocal level. DataMeet is planning to customise the Open Data Survey and to team up with local WASH organisations to develop and maintain a prototype for a regional assessment of water quality. India will be our test case, since local data is available for the whole country, to varying degrees across states. This may also include assessing the quality of the data and access to metadata.

The highest level of transparency would mean having water data from each individual lab where the samples are sent. Another use case for the Open Data Survey would be to measure the transparency of water laboratories. Bringing more transparency and accountability to labs would be most valuable for groups on the ground sending samples to labs across the country.

Map of high (> 30 mg/l) fluoride values from 2013–14. From: The Gonda Water Data story

Storytelling through data

Whilst some interviewees saw little use in governmental water quality data, its usefulness can be greatly enhanced when combined with other information. As discussed earlier, governmental water data gives snapshots and may provide baseline values that serve NGOs as rough orientation for their work. Data visualisations could present river and water basin quality and tell stories about the ecological and health effects.

Behaviour change is a big issue when adopting sanitation and hygiene interventions. Water quality and health data can be combined to educate people: if you got sick, have you checked your water? Do you use a public toilet? Are you washing your hands? This type of narration does not require granular, highly accurate data.

Comparing water quality standards

Different countries and organisations have different standards for what counts as a high water pollution level. Another project could assess how the needs of South Asian countries are being served by comparing pollution levels with different standards. For instance, fluorosis is an issue in certain parts of India, not just from high fluoride levels but also because of poor nutrition in those areas. Should fluoride-affected areas in poorer countries have lower permissible amounts? These questions could be used to make water quality data actionable for advocacy groups.
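As a rough illustration of such a comparison, the sketch below flags monitoring sites whose fluoride readings exceed different standards. The threshold values reflect the widely cited WHO guideline value and the Indian BIS 10500 acceptable limit, but should be verified against the current editions of those standards; the site names and readings are made up.

```python
# Illustrative fluoride limits in mg/l. Verify against the current
# editions of the standards before relying on them in advocacy work.
STANDARDS = {
    "WHO guideline": 1.5,
    "BIS 10500 (acceptable)": 1.0,
}

def flag_readings(readings, standards=STANDARDS):
    """Return, per standard, the sites whose fluoride level exceeds it."""
    return {
        name: [site for site, value in readings.items() if value > limit]
        for name, limit in standards.items()
    }

# Hypothetical sample values, one reading per monitoring site.
readings = {"site_a": 0.8, "site_b": 1.2, "site_c": 2.4}
print(flag_readings(readings))
```

Running this with real governmental data would immediately show where the two standards disagree: site_b, for example, passes the WHO guideline but fails the stricter BIS acceptable limit.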

Tim Ribaric: Checking EZproxy URLs in bulk

planet code4lib - Wed, 2017-11-22 21:10

I wrote a bit of a utility to help out with EZProxy URL checking...
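The full write-up sits behind the link, so the following is not Tim's actual utility, just a minimal Python sketch of the general idea: build an EZproxy starting-point URL (the `login?url=` form) for each target in a list and report the HTTP status it returns. The proxy prefix shown is a placeholder for your institution's own.

```python
from urllib.parse import quote
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

# Hypothetical EZproxy login prefix; replace with your institution's.
EZPROXY_PREFIX = "https://ezproxy.example.edu/login?url="

def proxied(url):
    """Build an EZproxy starting-point URL for a target resource."""
    return EZPROXY_PREFIX + quote(url, safe=":/?&=")

def check(url, timeout=10.0):
    """Fetch a proxied URL and return (url, HTTP status or error reason)."""
    try:
        with urlopen(proxied(url), timeout=timeout) as resp:
            return url, resp.status
    except HTTPError as e:
        return url, e.code
    except URLError as e:
        return url, str(e.reason)

if __name__ == "__main__":
    import sys
    # Expects a plain-text file with one target URL per line.
    with open(sys.argv[1]) as fh:
        for line in fh:
            target = line.strip()
            if target:
                print("%s\t%s" % check(target))
```

A real checker would also want to detect EZproxy's own error pages (which can come back as HTTP 200) and throttle requests, but the shape of the loop is the same.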


LITA: Jobs in Information Technology: November 22, 2017

planet code4lib - Wed, 2017-11-22 18:09

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

University at Albany, State University of New York, Web Developer/Designer, Albany, NY

Yale University, Data Librarian, New Haven, CT

University of Denver, Information Technologies Librarian, Denver, CO

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

