
Feed aggregator

FOSS4Lib Recent Releases: veraPDF - v1.10

planet code4lib - Thu, 2017-11-30 15:47

Last updated November 30, 2017. Created by Peter Murray on November 30, 2017.

Package: veraPDF
Release Date: Thursday, November 30, 2017

FOSS4Lib Recent Releases: JHOVE - 1.18

planet code4lib - Thu, 2017-11-30 14:48

Last updated November 30, 2017. Created by Peter Murray on November 30, 2017.

Package: JHOVE
Release Date: Thursday, November 30, 2017

LITA: Double Your Impact to Build A More Inclusive LITA

planet code4lib - Wed, 2017-11-29 00:03

Last year we ran our first ever campaigns specifically to raise money to sponsor more diversity in LITA leadership through an additional ALA Emerging Leader, 11 participants in our first AvramCamp for female-identifying individuals, and 6 new attendees at the LITA Forum.

This year, we’re asking you to help continue this investment in future library IT leaders by contributing to #GiveLITA. Even better, this year any amount you donate will automatically be matched so that you double your impact. Together, we can sponsor twice the number of scholarship recipients to support a continued focus on diversity in library technology positions.

Every LITA Board and staff member has donated, because we believe so strongly in this goal. Donate today to work with us to invest in a more inclusive future for our profession.

DuraSpace News: Research and Cultural Heritage Ecosystems Better Supported by 4Science

planet code4lib - Wed, 2017-11-29 00:00

From Claudio Cortese, 4Science

4Science is pleased to announce that we have joined, as the first participant, the Certified DuraSpace Partner Program to deliver DuraCloud services. This partnership will facilitate delivery of DuraCloud digital preservation and content storage services in Europe, offering cloud-hosted services and application support in additional time zones and languages. At 4Science, we are very pleased to help DuraSpace offer this service outside the U.S.A.

DuraSpace News: 4Science Offering DuraCloud Services in Europe as Certified DuraSpace Partner

planet code4lib - Wed, 2017-11-29 00:00

DuraSpace is pleased to announce 4Science will begin offering DuraCloud services as a Certified DuraSpace Partner. This partnership will facilitate delivery of DuraCloud preservation and content storage services in Europe, complying with the European Commission General Data Protection Regulation (GDPR) and offering application support in additional time zones and languages.

LITA: Get Your Digital Life Decoded – a LITA webinar

planet code4lib - Tue, 2017-11-28 18:48

Sign up Now for

Digital Life Decoded: A user-centered approach to cyber-security and privacy
Instructors: Hannah Rainey, Libraries Fellow; Sonoe Nakasone, Lead Librarian for Metadata Technologies; and Will Cross, Director, Copyright & Digital Scholarship Center, all at North Carolina State University.
December 12, 2017, 1:00 pm – 2:30 pm Central time

        

The current technological and political landscapes have re-ignited conversations and concerns around digital security, privacy, and media literacy. Staff at NCSU developed a project branded as “Digital Life Decoded,” grounded in substantial user research done in the spring of 2017 that identified three specific issues students were concerned about:

  • Hacking of personal information
  • Consent for use of information
  • Understanding how their information would be shared

Register here; courses are listed by date.

This 90-minute webinar will cover the process of development, including user research methods and project management. The bulk of the session will be spent walking through the three interactive activities from the pop-up programs that were developed. In addition to sharing methods and lessons learned, this webinar aims to heighten the conversation about our professional and personal roles in leading cyber-security and privacy.

View details and Register here.

Discover upcoming LITA webinars and web courses

Diversity, Inclusion, and Empowerment in Library Makerspaces
Offered: December 6, 2017

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty, mbeatty@ala.org

David Rosenthal: Intel's "Management Engine"

planet code4lib - Tue, 2017-11-28 16:00
Back in May Erica Portnoy and Peter Eckersley, writing for the EFF's Deep Links blog, summed up the situation in a paragraph:
Since 2008, most of Intel’s chipsets have contained a tiny homunculus computer called the “Management Engine” (ME). The ME is a largely undocumented master controller for your CPU: it works with system firmware during boot and has direct access to system memory, the screen, keyboard, and network. All of the code inside the ME is secret, signed, and tightly controlled by Intel. ... there is presently no way to disable or limit the Management Engine in general. Intel urgently needs to provide one.

Recent events have pulled back the curtain somewhat and revealed that things are worse than we knew in May. Below the fold, some details.

Concern about the ME goes back further. Sparked by a talk given at the Chaos Communication Congress by Joanna Rutkowska of the Qubes OS project, back in January 2016 Brian Benchoff at Hackaday wrote:
Extremely little is known about the ME, except for some of its capabilities. The ME has complete access to all of a computer’s memory, its network connections, and every peripheral connected to a computer. It runs when the computer is hibernating, and can intercept TCP/IP traffic. Own the ME and you own the computer.

There are no known vulnerabilities in the ME to exploit right now: we’re all locked out of the ME. But that is security through obscurity. Once the ME falls, everything with an Intel chip will fall. It is, by far, the scariest security threat today, and it’s one that’s made even worse by our own ignorance of how the ME works.

The EFF's post was a reaction to the discovery of a vulnerability in one of the modules that run on the ME, Intel's Active Management Technology (AMT) admin tool. Chris Williams at The Register explains:
Intel provides a remote management toolkit called AMT for its business and enterprise-friendly processors; this software is part of Chipzilla's vPro suite and runs at the firmware level, below and out of sight of Windows, Linux, or whatever operating system you're using. The code runs on Intel's Management Engine, a tiny secret computer within your computer that has full control of the hardware and talks directly to the network port, allowing a device to be remotely controlled regardless of whatever OS and applications are running, or not, above it.

Thus, AMT is designed to allow IT admins to remotely log into the guts of computers so they can reboot a knackered machine, repair and tweak the operating system, install a new OS, access a virtual serial console, or gain full-blown remote desktop access via VNC. It is, essentially, god mode.

Normally, AMT is password protected. This week it emerged this authentication can be bypassed, potentially allowing miscreants to take over systems from afar or once inside a corporate network. This critical security bug was designated CVE-2017-5689. While Intel has patched its code, people have to pester their hardware suppliers for the necessary updates before they can be installed.

The vulnerability was embarrassing:
AMT is accessed over the network via a bog-standard web interface: the service listens on ports 16992 and 16993. Visiting this with a browser brings up a prompt for a password, and this passphrase is sent using standard HTTP Digest authentication: the username and password are hashed using a nonce from the AMT firmware plus a few other bits of metadata. This scrambled response is checked by the AMT software to be valid, and if so, access is granted to the management interface.

But if you send an empty response, the firmware is fooled into thinking this is correct and lets you through. Intel patched it, but it took a while for the patch to filter through to the system vendors and to get installed on the millions of vulnerable CPUs in the field. Meanwhile, an incredible number of systems were vulnerable to being remotely pwned.
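To make the authentication flow concrete, here is a minimal sketch of how an ordinary HTTP Digest response is computed (per RFC 2617, without qop); the realm, nonce, and credential values below are made up for illustration and are not real AMT values. The AMT flaw was that an empty response string was accepted as if it were this valid hash.

import hashlib

def md5_hex(s):
    # hex MD5 digest of a string, as used by classic HTTP Digest auth
    return hashlib.md5(s.encode()).hexdigest()

# Hypothetical example values; not taken from AMT firmware.
username, realm, password = "admin", "ExampleRealm", "secret"
method, uri, nonce = "GET", "/index.htm", "0123456789abcdef"

ha1 = md5_hex(f"{username}:{realm}:{password}")   # hash of the credentials
ha2 = md5_hex(f"{method}:{uri}")                  # hash of the request
response = md5_hex(f"{ha1}:{nonce}:{ha2}")        # what the client should send back

print(response)  # the bug: an empty "response" was treated as if it matched this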

Then, in late September Richard Chirgwin at The Register reported that:
Positive Technologies researchers say the exploit “allows an attacker of the machine to run unsigned code in the Platform Controller Hub on any motherboard via Skylake+”.
...
For those whose vendors haven't pushed a firmware patch for AMT, in August Positive Technologies discovered how to switch off Management Engine.
...
What the company's researchers Mark Ermolov and Maxim Goryachy discovered is that when Intel switched the Management Engine to a modified Minix operating system, it introduced a vulnerability in an unspecified subsystem.

Because ME runs independently of the operating system, a victim's got no way to know they were compromised, and infection is “resistant” to an OS re-install and BIOS update, Ermolov and Goryachy say.

More details emerged two weeks ago:
Positive has confirmed that recent revisions of Intel's Management Engine (IME) feature Joint Test Action Group (JTAG) debugging ports that can be reached over USB. JTAG grants you pretty low-level access to code running on a chip, and thus we can now delve into the firmware driving the Management Engine. ... There have been long-running fears IME is insecure, which is not great as it's built right into the chipset: it's a black box of exploitable bugs, as was confirmed in May when researchers noticed you could administer the Active Management Technology software suite running on the microcontroller with an empty credential string over a network.

Positive discovered that:
since Skylake, Intel's Platform Controller Hub, which manages external interfaces and communications, has offered USB access to the engine's JTAG interfaces. The new capability is DCI, aka Direct Connect Interface.

Aside from any remote holes found in the engine's firmware code, any attack against IME needs physical access to a machine's USB ports, which as we know is really difficult.

Google's Ronald Minnich reported that running on the ME was a well-known open source operating system, MINIX:
And it turns out that while Intel talked to MINIX's creator about using it, the tech giant never got around to saying it had put it into recent CPU chipsets it makes.

Which has the permissively licensed software's granddaddy, Professor Andrew S. Tanenbaum, just a bit miffed. As Tanenbaum wrote this week in an open letter to Intel CEO Brian Krzanich:
The only thing that would have been nice is that after the project had been finished and the chip deployed, that someone from Intel would have told me, just as a courtesy, that MINIX was now probably the most widely used operating system in the world on x86 computers. That certainly wasn't required in any way, but I think it would have been polite to give me a heads up, that's all.

Google isn't happy about this:
What’s concerning Google is the complexity of the ME. ... The real focus, though, is what’s in it and the consequences. According to Minnich, that list includes web server capabilities, a file system, drivers for disk and USB access, and, possibly, some hardware DRM-related capabilities. ... An OS full of latent capabilities to access hardware is just giving those people more room to be creative. The possibilities of what could happen if attackers figure out how to load their own software onto the ME’s OS are endless. Minnich and his team (and a number of others) are interested in removing ME to limit potential attackers’ capabilities.

And, as one should have expected, once Intel took a look at the problem they found it was much worse than initially reported:
Intel has issued a security alert that management firmware on a number of recent PC, server, and Internet-of-Things processor platforms are vulnerable to remote attack. Using the vulnerabilities, the most severe of which was uncovered by Mark Ermolov and Maxim Goryachy of Positive Technologies Research, remote attackers could launch commands on a host of Intel-based computers, including laptops and desktops shipped with Intel Core processors since 2015. They could gain access to privileged system information, and millions of computers could essentially be taken over as a result of the bug. Most of the vulnerabilities require physical access to the targeted device, but one allows remote attacks with administrative access.

Google, and anyone running a data center, clearly needs an equivalent of the remote access capabilities AMT provides. For the rest of us, Purism Librem Laptops Completely Disable Intel’s Management Engine:
“Disabling the Management Engine, long believed to be impossible, is now possible and available in all current Librem laptops, it is also available as a software update for previously shipped recent Librem laptops.” says Todd Weaver, Founder & CEO of Purism.
...
Disabling the Management Engine is no easy task, and it has taken security researchers years to find a way to properly and verifiably disable it. Purism, because it runs coreboot and maintains its own BIOS firmware update process, has been able to release and ship coreboot that disables the Management Engine from running, directly halting the ME CPU without the ability of recovery.

What does all this mean? It means physical security of "Intel inside" computers is really important, since they are all vulnerable to a really hard to detect version of the "Evil Maid Attack":

"Evil maid" attacks can be anything that is done to a machine via physical access while it is turned off, even though it's encrypted. The name comes from the idea that an attacker could infiltrate or pay off the cleaning staff wherever you're staying to compromise your laptop while you're out.

Since effective physical security for laptops is impossible, this means that any network to which laptops can be connected has to assume that one of them may be infected at a level that cannot be detected by any software running on the CPU, and this infection may be a threat to other machines on the network.
Although I didn't know about the ME issues when I crowdfunded ORWL, it is a good reason for doing so.

Eric Hellman: Inter Partes Review is Improving the Patent System

planet code4lib - Tue, 2017-11-28 15:35
Today (Monday, November 27), the Supreme Court is hearing a case, Oil States Energy Services, LLC v. Greene’s Energy Group, LLC, that seeks to end a newish  procedure called inter partes review (IPR). The arguments in Oil States will likely focus on arcane constitutional principles and crusty precedents from the Privy Council of England; go read the SCOTUSblog overview if that sort of thing interests you. Whatever the arguments, if the Court decides against IPR proceedings, it will be a big win for patent trolls, so it's worth understanding what these proceedings are and how they are changing the patent system. I've testified as an expert witness in some IPR proceedings, so I've had a front row seat for this battle for technology and innovation.

A bit of background: the inter partes review was introduced by the "America Invents Act" of 2011,  which was the first major update of the US patent system since the dawn of the internet. To understand how it works, you first have to understand some of the existing patent system's perverse incentives.

When an inventor brings an idea to a patent attorney, the attorney will draft a set of "claims" describing the invention. The claims are worded as broadly as possible, often using incomprehensible language. If the invention was a clever shelving system for color-coded magazines, the invention might be titled "System and apparatus for optical wavelength keyed information retrieval". This makes it difficult for the patent examiner to find "prior art" that would render the idea unpatentable. The broad language is designed to prevent a copycat from evading the core patent claims via trivial modifications.

The examination proceeds like this: The patent examiner typically rejects the broadest claims, citing some prior art. The inventor's attorney then narrows the patent claims to exclude prior art cited by the examiner, and the process repeats itself until the patent office runs out of objections. The inventor ends up with a patent, the attorney runs up the billable hours, and the examiner has whittled the patent down to something reasonable.

As technology has become more complicated and the number of patents has increased, this examination process has broken down. Patents with very broad claims slip through, often because the addition of the internet means that prior art was either unpatented or unrecognized because of obsolete terminology. These bad patents are bought up by "non-practicing entities" or "patent trolls" who extort royalty payments from companies unwilling or unable to challenge the patents. The old system for challenging patents didn't allow the challengers to participate in the reexamination. So the patent system needed a better way to correct the inevitable mistakes in patent issuance.

In an inter partes review, the challenger participates in the challenge. The first step in drafting a petition is proposing a "claim construction". For example, if the patent claims "an alphanumeric database key allowing the retrieval of information-package subject indications", the challenger might "construct" the claim as "a call number in a library catalog", and point out that call numbers in library catalogs predated the patent by several decades. The patent owner might respond that the patent was never meant to cover call numbers in library catalogs. (Ironically, in an infringement suit, the same patent owner might have pointed to the broad language of the claim asserting that of course the patent applies to call numbers in library catalogs!) The administrative judge would then have the option of accepting the challenger's construction and opening the claim to invalidation, or accepting the patent owner's construction and letting the patent stand (but with the patent owner having agreed to a narrow claim construction!)
Disposition of IPR Petitions in the first 5 years. From USPTO.
In the 5 years that IPR proceedings have been available, 1,153 patents have been completely invalidated and 287 others have had some claims cancelled. 331 patents that have been challenged have been found to be completely valid. (See this statistical summary.) This is a tiny percentage of patents; it's likely that only the worst patents have been challenged; in the same period, about one and a half million patents have been granted.

It was hoped that the IPR process would be more efficient and less costly than the old process; I don't know if this has been true, but patent litigation is still very costly. At least the cases I worked on had correct outcomes.

Some companies in the technology space have been using the IPR process to oppose the patent trolls. One notable effort has been Cloudflare's Project Jengo. Full disclosure: They sent me a T-shirt!


Update (November 28): Read Adam Liptak's news story about the argument at the New York Times
  • Apparently Justices Gorsuch and Roberts were worried about patent property being taken away by administrative proceedings. This seems odd to me, since in the case of bad patents, the initial grant of a patent amounts to a taking of property away from the public, including companies who rely on prior art to assure their right to use public property.
  • Some news stories are characterizing the IPR process as lopsided against patent owners. (Reuters: "In about 1,800 final decisions up to October, the agency’s patent board canceled all or part of a patent around 80 percent of the time.") Apparently the news media has difficulty with sampling bias - given the expense of an IPR filing, of course only the worst of the worst patents are being challenged; more than 99.9% of patents are untouched by challenges!


Terry Reese: MarcEdit 7 is Here!

planet code4lib - Tue, 2017-11-28 09:00

After 9 months of development, hundreds of thousands of lines of changed code, and 3 months of beta testing during which tens of millions of records were processed using MarcEdit 7, the tool is finally ready. Will you occasionally run into issues…possibly – any time that this much code has changed, I’d say that there is a distinct possibility. But I believe (hope) that the program has been extensively vetted and is ready to move into production. So, what’s changed? A lot. Here’s a short list of the highlights:

  • Native Clustering – MarcEdit implements the Levenshtein Distance and Composite Coefficient matching equations to provide built-in clustering functionality. This will let you group fields and perform batch edits across like items. In many ways, it’s a lightweight implementation of OpenRefine’s clustering functionality designed specifically for MARC data. Already, I’ve used this tool to cluster data sets of over 700,000 records. For performance’s sake, I believe 1 million to 1.5 million records could be processed with acceptable performance using this method.
  • Smart XML Profiling – A new XML/JSON profiler has been added to MarcEdit that removes the need to know XSLT, XQuery, or any other X language. The tool uses an internal markup language that you create through a GUI-based mapper that looks and functions like the Delimited Text Translator. The tool was designed to lower barriers and make data transformations more accessible to users.
  • Speaking of accessibility, I spent over 3 months researching fonts, sizes, and color options – leading to the development of a new UI engine. This enabled the creation of themes (and a theme creator), identification of free fonts (and a way to download and embed them for use directly in MarcEdit without the need for administrator rights), and a wide range of other accessibility and keyboard options.
  • New versions – MarcEdit is now available as 4 downloads: two that require administrative access and two that can be installed by anyone. This should greatly simplify management of the application.
  • Tasks have been supercharged. Tasks that in MarcEdit 6.x could take close to 8 hours can now process in under 10-20 minutes. New task functions have been added, tasks have been extended, and more functions can be added to tasks.
  • Linked data tools have been expanded. From the new SPARQL tools to the updated linked data platform, the resource has been updated to support better and faster linked data work. Coming in the near future will be direct support for HDT and linked data fragments.
  • A new installation wizard was implemented to make installation easier and more fun. Users follow Hazel, the setup agent, as she guides them through the setup process.
  • Languages – MarcEdit’s interface has been translated into 26+ languages
  • .NET Language update – this seems like a small thing, but it enabled many of the design changes
  • MarcEdit 7 *no* longer supports Windows XP
  • Consolidated and improved Z39.50/SRU Client
  • Enhanced COM support, with legacy COM namespaces preserved for backward compatibility
  • RDA Refinements
  • Improved Error Handling and expanded knowledge-base
  • The new Search box feature to help users find help

With these new updates, I’ve updated the MarcEdit Website and am in the process of bringing new documentation online. Presently, the biggest changes to the website can be seen on the downloads page. Rather than simply offering users four downloads, the webpage provides a guided user experience, which you will find when you go to the downloads page.

If you want to download the 64-bit version, clicking on the link presents a modal window with further guidance.

Hopefully this will help users, because I think that for the lion’s share of MarcEdit’s user community, the non-Administrator download is the version they should use. This version simplifies program management, sandboxes the application, and can be managed by any user. But the goal of this new downloads page is to make the process of selecting your version of MarcEdit easier to understand and empower users to make the best decision for their needs.

Additionally, as part of the update process, I needed to update the MarcEdit MSI Cleaner. This file was updated to support MarcEdit 7’s GUID keys created on installation. And finally, the program was developed so that it could be installed and used side by side with MarcEdit 6.x. The hope is that users will be able to move to MarcEdit 7 as their schedules allow, while still keeping MarcEdit 6.x until they are comfortable with the process and able to uninstall the application.

Lastly, this update is seeing the largest single creation of new documentation in the application’s history. This will start showing up throughout the week as I continue to wrap up documentation and add new information about the program. This update has been a long time coming, and I will be posting a number of tidbits throughout the week as I complete updating the documentation. My hope is that the wait will have been worth it, and that users will find the new version, its new features, and the improved performance useful within their workflows.

The new version of MarcEdit can be downloaded from: http://marcedit.reeset.net/downloads

As always, if you have questions, let me know.

–tr

Terry Reese: MarcEdit MSI Cleaner Changes for MarcEdit 7

planet code4lib - Tue, 2017-11-28 05:35

The MarcEdit MSI cleaner was created to help fix problems that would occasionally happen when using the Windows Installer. Sometimes, problems happen, and when they do, it becomes impossible to install or update MarcEdit. MarcEdit 7, from a programming perspective, is much easier to manage (I’ve removed all data from the GAC (global assembly cache) and limited data outside of the user data space), but things could occur that might cause the program to be unable to be updated. When that happens, this tool can be used to remove the registry keys that are preventing the program from updating/reinstalling.

In working on the update for this tool, there were a couple significant changes made:

  1. I removed the requirement that you had to be an administrator in order to run the tool. You will need to be an administrator to make changes, but I’ve enabled the tool so users can now run the application to see if the cleaner would likely solve their problem.
  2. Updated UI – I updated the UI so that you will know that this tool has been changed to support MarcEdit 7.
  3. I’ve signed the application…it has been signed with a security certificate and is now identified as a trusted program.

I’ve posted information about the update here: https://youtu.be/HLnG8bczypQ.

If you have questions, let me know.

–tr

Open Knowledge Foundation: Paradise Lost: a data-driven report into who should be on the EU’s tax haven blacklist

planet code4lib - Mon, 2017-11-27 23:15

Open Knowledge International coordinates the Open Data for Tax Justice project with the Tax Justice Network, working to create a global network of people and organisations using open data to improve advocacy, journalism and public policy around tax justice. Today, in partnership with the Tax Justice Network, we are publishing Paradise lost, a data-driven investigation looking at who should feature on the forthcoming European Union blacklist of non-cooperative tax jurisdictions.

This blogpost has been reposted from the Tax Justice Network site.

UK financial services could face heavy penalties after Brexit unless the UK turns its back on its aggressive tax haven policies, a new study from the Tax Justice Network has found.

The study looks at the new proposals for an EU blacklist of tax havens. The EU has said that it will impose sanctions on countries which find themselves on the list.

The EU is due to publish its list of tax havens on 5 December, but analysis by the Tax Justice Network, using the EU’s own criteria, suggests that the UK and 5 other EU countries would find themselves on that list if the criteria were applied equally to them.

The EU has stated that the list will only apply to non-EU member states and with the UK set to leave the Union in 2019, this could mean sanctions being applied to the UK unless it changes course and ends policies which make the country a tax haven. This includes the UK’s controlled foreign companies rules which allow UK based multinationals to hoard foreign profits in zero-tax countries. The UK’s CFC rules are currently the subject of a ‘state aid’ investigation by the European Commission.

The study also finds that up to 60 jurisdictions could be listed under the EU’s blacklist procedure, if the criteria are strictly applied. But there are fears that intense lobbying by the tax avoidance industry and some countries could see the list severely watered down. In addition to the UK, researchers have identified Ireland, the Netherlands, Cyprus, Luxembourg, and Malta as countries that should be included in the blacklist if the EU were to apply its own criteria to member states.

Alex Cobham, chief executive of the Tax Justice Network, said: “The EU list is a step forward from previous efforts, in that the criteria are at least partially transparent and so partially objectively verifiable. It’s clear that the UK is at real risk of counter-measures from the EU. The government should forget any Brexit tax haven threat, which could easily open up EU counter-measures like the loss of financial passporting rights, and instead address the flaws in current policies.

“But it’s also true that the EU list lacks legitimacy as well as transparency, because it will not apply to its own members. For global progress on a truly level playing field, we need the G77 and G20 to engage in negotiating an international convention that would set an ambitious and fair set of minimum standards, with powerful sanctions, to which all jurisdictions would be held equally to account.”

Wouter Lips of Ghent University, who led the study, said: “The EU blacklist represents an important step forward from empty OECD lists – but the criteria need to be completely transparent to be objectively verifiable by independent researchers. Otherwise the risk remains that political power will be a factor in whether countries like the UK or indeed the USA are listed. Many of the 60 countries that are listed are lower-income countries which had no say in the OECD standards upon which much of the EU’s criteria depend, and that raises further questions of fairness.”

Please email contact@datafortaxjustice.net if you’d like to be added to the project mailing list or want to join the Open Data for Tax Justice network. You can also follow the #OD4TJ hashtag on Twitter for updates.

Lucidworks: Caching, and Filters, and Post-Filters, Oh My!

planet code4lib - Mon, 2017-11-27 17:10

A while back, I joined the #solr IRC channel in the middle of a conversation about Solr’s queryResultCache & filterCache. The first message I saw was…

< victori:#solr> anyway, are filter queries applied independently on the full dataset or one after another on a shrinking resultset?

As with many things in life, the answer is “It Depends”

In particular, the answer to this question largely depends on:

… but further nuances come into play depending on:

  • The effective cost param specified on each fq (defaults to ‘0’ for most queries)
  • The type of the underlying Query object created for each fq: Do any implement the PostFilter API?

As I explained some of these nuances on IRC (and what they change about the behavior) I realized 2 things:

  • This would make a really great blog post!
  • I wish there was a way to demonstrate how all this happens, rather than just describe it.

That led me to wonder if it would be possible to create a “Tracing Wrapper Query Parser” people could use to get log messages showing exactly when a given Query (or more specifically the “Scorer” for that Query) was asked to evaluate each document. With something like this, people could experiment (on small datasets) with different q & fq params and different cache and cost local params and see how the execution changes. I made a brief attempt at building this kind of QParser wrapper, and quickly got bogged down in lots of headaches with how complex the general purpose Query, Weight, and Scorer APIs can be at the lower level.

On the other hand: the ValueSource (aka Function) API is much simpler, and easily facilitates composing functions around other functions. Solr also already makes it easy to use any ValueSource as a Query via the {!frange} QParser — which just so happens to also support the PostFilter API!

A few hours later, the “TraceValueSource” and trace() function syntax were born, and now I can use it to walk you through the various nuances of how Solr executes different Queries & Filters.

IMPORTANT NOTE:

In this article, we’re going to assume that the underlying logic Lucene uses to execute a simple Query is essentially: Loop over all docIds in the index (starting at 0) testing each one against the Query; if a document matches, record its score and continue with the next docId in the index.

Likewise we’re going to assume that when Lucene is computing the Conjunction (X ^ Y) of two Queries, the logic is essentially:

  • Loop over all docIds in the index (starting at 0) testing each one against X until we find a matching document
  • if that document also matches Y then record its score and continue with the next docId
  • If the document does not match Y, swap X & Y, and start the process over with the next docId

These are both extreme oversimplifications of how most queries are actually executed — many Term & Points based queries are much more optimized to “skip ahead” in the list of documents based on the term/points metadata — but it is a “close enough” approximation to what happens when all Queries are ValueSource based for our purposes today.
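For reference, here is a tiny Python sketch of that simplified conjunction loop. It is only a model of the mental picture above (two predicates standing in for two Scorers), not Lucene code:

def conjunction_hits(x, y, max_doc):
    """Toy model of the conjunction (X ^ Y) loop described above.
    x and y are stand-ins for two Scorers: functions answering "does doc N match?"."""
    hits = []
    lead, other = x, y               # the "lead" clause drives the iteration
    for doc in range(max_doc):
        if not lead(doc):            # keep scanning until the lead clause matches
            continue
        if other(doc):               # both clauses match: record the hit
            hits.append(doc)
        else:                        # miss on the other clause: swap roles, move on
            lead, other = other, lead
    return hits

For example, with x = lambda d: d % 2 == 0 and y = lambda d: d > 2, conjunction_hits(x, y, 10) returns [4, 6, 8].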

{!frange} Queries and the trace() Function

Let’s start with a really simple warm up to introduce you to the {!frange} QParser and the trace() function I added, beginning with some trivial sample data…

$ bin/solr -e schemaless -noprompt
...
curl -H 'Content-Type: application/json' 'http://localhost:8983/solr/gettingstarted/update?commit=true' --data-binary '
[{"id": "A", "foo_i": 42, "bar_i": 99},
 {"id": "B", "foo_i": -42, "bar_i": 75},
 {"id": "C", "foo_i": -7, "bar_i": 1000},
 {"id": "D", "foo_i": 7, "bar_i": 50}]'
...
tail -f example/schemaless/logs/solr.log
...

For most of this blog I’ll be executing queries against these 4 documents, while showing you:

  • The full request URL
  • Key url-decoded request params in the request for easier reading
  • All log messages written to solr.log as a result of the request

The {!frange} parser allows users to specify an arbitrary function (aka: ValueSource) that will be wrapped up into a query that will match documents if and only if the results of that function fall in a specified range. For example: With the 4 sample documents we’ve indexed above, the query below does not match document ‘A’ or ‘C’ because the sum of the foo_i + bar_i fields (42 + 99 = 141 and -7 + 1000 = 993 respectively) does not fall in between the lower & upper range limits of the query (0 <= sum(foo_i,bar_i) <= 100) …

http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}sum%28foo_i,bar_i%29
// q = {!frange l=0 u=100}sum(foo_i,bar_i)

{ "response":{"numFound":2,"start":0,"docs":[
    { "id":"B"},
    { "id":"D"}]
}}

INFO - 2017-11-14 20:27:06.897; [ x:gettingstarted] org.apache.solr.core.SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}sum(foo_i,bar_i)&omitHeader=true&fl=id} hits=2 status=0 QTime=29

Under the covers, the Scorer for the FunctionRangeQuery produced by this parser loops over each document in the index and asks the ValueSource if it “exists” for that document (ie: do the underlying fields exist) and if so then it asks for the computed value for that document.

Generally speaking, the trace() function we’re going to use implements the ValueSource API in such a way that any time it’s asked for the “value” of a document, it delegates to another ValueSource, and logs a message about the input (document id) and the result — along with a configurable label.
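In Python terms (purely illustrative; the real implementation is a Java ValueSource), the idea is just a wrapper that delegates and logs:

# A minimal sketch of the "delegate and log" idea behind trace();
# the real TraceValueSource is Java and also logs exists() calls.
def trace(label, wrapped):
    def traced(doc_id):
        value = wrapped(doc_id)
        print(f'{label}: floatVal({doc_id}) -> {value}')
        return value
    return traced

# e.g. trace("simple_sum", lambda doc: foo_i[doc] + bar_i[doc])
# where foo_i/bar_i are hypothetical per-document field lookups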

If we change the function used in our previous query to be trace(simple_sum,sum(foo_i,bar_i)) and re-execute it, we can see the individual methods called on the “sum” ValueSource in this process (along with the internal id + uniqueKey of the document, and the “simple_sum” label we’ve chosen) and the result of the wrapped function …

http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29
// q = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))

TraceValueSource$TracerValues; simple_sum: exists(#0: "A") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#0: "A") -> 141.0
TraceValueSource$TracerValues; simple_sum: exists(#1: "B") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#1: "B") -> 33.0
TraceValueSource$TracerValues; simple_sum: exists(#2: "C") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#2: "C") -> 993.0
TraceValueSource$TracerValues; simple_sum: exists(#3: "D") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#3: "D") -> 57.0
SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id} hits=2 status=0 QTime=6

Because we’re using the _default Solr configs, this query has now been cached in the queryResultCache. If we re-execute it no new “tracing” information will be logged, because Solr doesn’t need to evaluate the ValueSource against each of the documents in the index in order to respond to the request…

http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29
// q = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))

SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id} hits=2 status=0 QTime=0

Normal fq Processing

Now let’s use multiple {!frange} & trace() combinations to look at what happens when we have some filter queries in our request…

http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29&fq={!frange%20l=0}trace%28pos_foo,foo_i%29&fq={!frange%20u=90}trace%28low_bar,bar_i%29
// q = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))
// fq = {!frange l=0}trace(pos_foo,foo_i)
// fq = {!frange u=90}trace(low_bar,bar_i)

TraceValueSource$TracerValues; pos_foo: exists(#0: "A") -> true
TraceValueSource$TracerValues; pos_foo: floatVal(#0: "A") -> 42.0
TraceValueSource$TracerValues; pos_foo: exists(#1: "B") -> true
TraceValueSource$TracerValues; pos_foo: floatVal(#1: "B") -> -42.0
TraceValueSource$TracerValues; pos_foo: exists(#2: "C") -> true
TraceValueSource$TracerValues; pos_foo: floatVal(#2: "C") -> -7.0
TraceValueSource$TracerValues; pos_foo: exists(#3: "D") -> true
TraceValueSource$TracerValues; pos_foo: floatVal(#3: "D") -> 7.0
TraceValueSource$TracerValues; low_bar: exists(#0: "A") -> true
TraceValueSource$TracerValues; low_bar: floatVal(#0: "A") -> 99.0
TraceValueSource$TracerValues; low_bar: exists(#1: "B") -> true
TraceValueSource$TracerValues; low_bar: floatVal(#1: "B") -> 75.0
TraceValueSource$TracerValues; low_bar: exists(#2: "C") -> true
TraceValueSource$TracerValues; low_bar: floatVal(#2: "C") -> 1000.0
TraceValueSource$TracerValues; low_bar: exists(#3: "D") -> true
TraceValueSource$TracerValues; low_bar: floatVal(#3: "D") -> 50.0
TraceValueSource$TracerValues; simple_sum: exists(#3: "D") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#3: "D") -> 57.0
SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id&fq={!frange+l%3D0}trace(pos_foo,foo_i)&fq={!frange+u%3D90}trace(low_bar,bar_i)} hits=1 status=0 QTime=23

There’s a lot of information here to consider, so let’s break it down and discuss in the order of the log messages…

  • In order to cache the individual fq queries for maximum possible re-use, Solr executes each fq query independently against the entire index:
    • First the “pos_foo” function is run against all 4 documents to identify if 0 <= foo_i
      • this resulting DocSet is put into the filterCache for this fq
    • then the “low_bar” function is run against all 4 documents to see if bar_i <= 90
      • this resulting DocSet is put into the filterCache for this fq
  • Now the main query (simple_sum) is ready to be run:
    • Instead of executing the main query against all documents in the index, it only needs to be run against the intersection of the DocSets from each of the individual (cached) filters
    • Since document ‘A’ did not match the “low_bar” fq, the “simple_sum” function is never asked to evaluate it as a possible match for the overall request
    • Likewise: since ‘B’ did not match the “pos_foo” fq, it is also never considered.
    • Likewise: since ‘C’ did not match the “low_bar” fq, it is also never considered.
    • Only document “D” matched both fq filters, so it is checked against the main query — and it is a match, so we have hits=1
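A toy sketch of the cached-filter flow just described, using plain Python sets as stand-ins for Solr's DocSets and filterCache (this is a model of the behavior, not actual Solr code):

filter_cache = {}   # label -> set of matching doc ids (stand-in for the filterCache)

def cached_docset(label, predicate, all_docs):
    # each cached fq is evaluated once against the whole index, then reused
    if label not in filter_cache:
        filter_cache[label] = {d for d in all_docs if predicate(d)}
    return filter_cache[label]

def search(q_predicate, fqs, all_docs):
    allowed = set(all_docs)
    for label, predicate in fqs:              # intersect all cached filter DocSets
        allowed &= cached_docset(label, predicate, all_docs)
    # the main query is only asked about documents that survived every filter
    return sorted(d for d in allowed if q_predicate(d))

Calling search again with a different q_predicate but the same fqs reuses the cached sets, which is exactly the behavior the next request demonstrates.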

In future requests, even if the main q param changes and may potentially match a different set of values/documents, the cached filter queries can still be re-used to limit the set of documents the main query has to check — as we can see in this next request using the same fq params…

http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20u=999}trace%28max_foo,foo_i%29&fq={!frange%20l=0}trace%28pos_foo,foo_i%29&fq={!frange%20u=90}trace%28low_bar,bar_i%29
// q = {!frange u=999}trace(max_foo,foo_i)
// fq = {!frange l=0}trace(pos_foo,foo_i)
// fq = {!frange u=90}trace(low_bar,bar_i)

TraceValueSource$TracerValues; max_foo: exists(#3: "D") -> true
TraceValueSource$TracerValues; max_foo: floatVal(#3: "D") -> 7.0
SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+u%3D999}trace(max_foo,foo_i)&omitHeader=true&fl=id&fq={!frange+l%3D0}trace(pos_foo,foo_i)&fq={!frange+u%3D90}trace(low_bar,bar_i)} hits=1 status=0 QTime=1

Non-cached Filters

Now let’s consider what happens if we add 2 optional local params to our filter queries:

  • cache=false – Tells Solr that we don’t need/want this filter to be cached independently for re-use.
    • This will allow Solr to evaluate these filters at the same time it’s processing the main query
  • cost=X – Specifies an integer “hint” to Solr regarding how expensive it is to execute this filter.
    • Solr provides special treatment to some types of filters when 100 <= cost (more on this later)
    • By default Solr assumes most filters have a default of cost=0 (but beginning with Solr 7.2, {!frange} queries default to cost=100)
    • For these examples, we’ll explicitly specify a cost on each fq such that: 0 < cost < 100.
http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29&fq={!frange%20cache=false%20cost=50%20l=0}trace%28pos_foo_nocache_50,foo_i%29&fq={!frange%20cache=false%20cost=25%20u=100}trace%28low_bar_nocache_25,bar_i%29
// q = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))
// fq = {!frange cache=false cost=50 l=0}trace(pos_foo_nocache_50,foo_i)
// fq = {!frange cache=false cost=25 u=100}trace(low_bar_nocache_25,bar_i)

TraceValueSource$TracerValues; low_bar_nocache_25: exists(#0: "A") -> true
TraceValueSource$TracerValues; low_bar_nocache_25: floatVal(#0: "A") -> 99.0
TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#0: "A") -> true
TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#0: "A") -> 42.0
TraceValueSource$TracerValues; simple_sum: exists(#0: "A") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#0: "A") -> 141.0
TraceValueSource$TracerValues; low_bar_nocache_25: exists(#1: "B") -> true
TraceValueSource$TracerValues; low_bar_nocache_25: floatVal(#1: "B") -> 75.0
TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#1: "B") -> true
TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#1: "B") -> -42.0
TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#2: "C") -> true
TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#2: "C") -> -7.0
TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#3: "D") -> true
TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#3: "D") -> 7.0
TraceValueSource$TracerValues; low_bar_nocache_25: exists(#3: "D") -> true
TraceValueSource$TracerValues; low_bar_nocache_25: floatVal(#3: "D") -> 50.0
TraceValueSource$TracerValues; simple_sum: exists(#3: "D") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#3: "D") -> 57.0
SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id&fq={!frange+cache%3Dfalse+cost%3D50+l%3D0}trace(pos_foo_nocache_50,foo_i)&fq={!frange+cache%3Dfalse+cost%3D25+u%3D100}trace(low_bar_nocache_25,bar_i)} hits=1 status=0 QTime=8

Let’s again step through this in sequence and talk about what’s happening at each point:

  • Because the filters are not cached, Solr can combine them with the main q query and execute all three in one pass over the index
  • The filters are sorted according to their cost, and the lowest cost filter (low_bar_nocache_25) is asked to find the “first” document it matches:
    • Document “A” is a match for low_bar_nocache_25 (bar_i <= 100) so then the next filter is consulted…
    • Document “A” is also a match for pos_foo_nocache_50 (0 <= foo_i) so all filters match — the main query can be consulted…
    • Document “A” is not a match for the main query (simple_sum)
  • The filters are then asked to find their “next” match after “A”, beginning with the lowest cost filter: low_bar_nocache_25
    • Document “B” is a match for ‘low_bar_nocache_25’, so the next filter is consulted…
    • Document “B” is not a match for the ‘pos_foo_nocache_50’ filter, so that filter keeps checking until it finds its “next” match (after “B”)
    • Document “C” is not a match for the ‘pos_foo_nocache_50’ filter, so that filter keeps checking until it finds its “next” match (after “C”)
    • Document “D” is the “next” match for the ‘pos_foo_nocache_50’ filter, so the remaining filter(s) are consulted regarding that document…
    • Document “D” is also a match for the ‘low_bar_nocache_25’ filter, so all filters match — the main query can be consulted again.
    • Document “D” is a match for the main query (simple_sum), and we have our first (and only) hit for the request

There are two very important things to note here that may not be immediately obvious:

  1. Just because the individual fq params indicate cache=false does not mean that nothing about their results will be cached. The results of the main q in conjunction with the (non-cached) filters can still wind up in the queryResultCache, as you can see if the exact same query is re-executed…

    http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29&fq={!frange%20cache=false%20cost=50%20l=0}trace%28pos_foo_nocache_50,foo_i%29&fq={!frange%20cache=false%20cost=25%20u=100}trace%28low_bar_nocache_25,bar_i%29
    // q = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))
    // fq = {!frange cache=false cost=50 l=0}trace(pos_foo_nocache_50,foo_i)
    // fq = {!frange cache=false cost=25 u=100}trace(low_bar_nocache_25,bar_i)

    SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id&fq={!frange+cache%3Dfalse+cost%3D50+l%3D0}trace(pos_foo_nocache_50,foo_i)&fq={!frange+cache%3Dfalse+cost%3D25+u%3D100}trace(low_bar_nocache_25,bar_i)} hits=1 status=0 QTime=1

    …we don’t get any trace() messages, because the entire “q + fqs + sort + pagination” combination was in the queryResultCache.

    (NOTE: Just as using cache=false in the local params of the fq params prevent them from being put in the filterCache, specifying cache=false on the q param can also prevent an entry for this query being added to the queryResultCache if desired)

  2. The relative cost value of each filter does not dictate the order in which they are evaluated against every document.
    • In the example above, the higher cost=50 specified on the ‘pos_foo_nocache_50’ filter did not ensure it would be executed against fewer documents than the lower cost ‘low_bar_nocache_25’ filter
      • Document “C” was checked against (and ruled out by) the (higher cost) ‘pos_foo_nocache_50’ filter without ever checking that document against the lower cost ‘low_bar_nocache_25’ filter
    • The cost only indicates in what order each filter should be consulted to find its “next” matching document after each previously found match against the entire request
      • Relative cost values ensure that a higher cost filter will not be asked to check for the “next” match against any document that a lower cost filter has already definitively ruled out as a non-match.

    Compare the results above with the following example, where the same functions use new ‘cost’ values:

    http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29&fq={!frange%20cache=false%20cost=10%20l=0}trace%28pos_foo_nocache_10,foo_i%29&fq={!frange%20cache=false%20cost=80%20u=100}trace%28low_bar_nocache_80,bar_i%29
    // q = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))
    // fq = {!frange cache=false cost=10 l=0}trace(pos_foo_nocache_10,foo_i)
    // fq = {!frange cache=false cost=80 u=100}trace(low_bar_nocache_80,bar_i)

    TraceValueSource$TracerValues; pos_foo_nocache_10: exists(#0: "A") -> true
    TraceValueSource$TracerValues; pos_foo_nocache_10: floatVal(#0: "A") -> 42.0
    TraceValueSource$TracerValues; low_bar_nocache_80: exists(#0: "A") -> true
    TraceValueSource$TracerValues; low_bar_nocache_80: floatVal(#0: "A") -> 99.0
    TraceValueSource$TracerValues; simple_sum: exists(#0: "A") -> true
    TraceValueSource$TracerValues; simple_sum: floatVal(#0: "A") -> 141.0
    TraceValueSource$TracerValues; pos_foo_nocache_10: exists(#1: "B") -> true
    TraceValueSource$TracerValues; pos_foo_nocache_10: floatVal(#1: "B") -> -42.0
    TraceValueSource$TracerValues; pos_foo_nocache_10: exists(#2: "C") -> true
    TraceValueSource$TracerValues; pos_foo_nocache_10: floatVal(#2: "C") -> -7.0
    TraceValueSource$TracerValues; pos_foo_nocache_10: exists(#3: "D") -> true
    TraceValueSource$TracerValues; pos_foo_nocache_10: floatVal(#3: "D") -> 7.0
    TraceValueSource$TracerValues; low_bar_nocache_80: exists(#3: "D") -> true
    TraceValueSource$TracerValues; low_bar_nocache_80: floatVal(#3: "D") -> 50.0
    TraceValueSource$TracerValues; simple_sum: exists(#3: "D") -> true
    TraceValueSource$TracerValues; simple_sum: floatVal(#3: "D") -> 57.0
    SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id&fq={!frange+cache%3Dfalse+cost%3D10+l%3D0}trace(pos_foo_nocache_10,foo_i)&fq={!frange+cache%3Dfalse+cost%3D80+u%3D100}trace(low_bar_nocache_80,bar_i)} hits=1 status=0 QTime=3

    The overall flow is fairly similar to the last example:

    • Because the filters are not cached, Solr can combine them with the main query and execute all three in one pass over the index
    • The filters are sorted according to their cost, and the lowest cost filter (pos_foo_nocache_10) is asked to find the “first” document it matches:
      • Document “A” is a match for pos_foo_nocache_10 (0 <= foo) — so the next filter is consulted…
      • Document “A” is a match for low_bar_nocache_80 (bar <= 100) — so all filters match, and so the main query can be consulted…
      • Document “A” is not a match for the main query (simple_sum)
    • The filters are then asked to find their “next” match after “A”, beginning with the lowest cost filter: (pos_foo_nocache_10)
      • Document “B” is not a match for the ‘pos_foo_nocache_10’ filter, so that filter keeps checking until it finds its “next” match (after “B”)
      • Document “C” is not a match for the ‘pos_foo_nocache_10’ filter, so that filter keeps checking until it finds its “next” match (after “C”)
      • Document “D” is the “next” match for the ‘pos_foo_nocache_10’ filter, so the remaining filter(s) are consulted regarding that document…
      • Document “D” is also a match for the ‘low_bar_nocache_80’ filter, so all filters match — the main query can be consulted again.
      • Document “D” is a match for the main query, and we have our first (and only) hit for the request

The key thing to note in these examples is that even though we’ve given Solr a “hint” at the relative cost of these filters, the underlying Scoring APIs in Lucene depend on being able to ask each Query to find the “next match after doc#X”. Once a “low cost” filter has been asked to do this, the document it identifies will be used as the input when asking a “higher cost” filter to find its “next match”, and if the higher cost filter matches very few documents, it may have to “scan over” more total documents in the segment than the lower cost filter.

Post Filtering

There are a small handful of Queries available in Solr (notably {!frange} and {!collapse}) which — in addition to supporting the normal Lucene iterative scoring APIs — also implement a special “PostFilter” API.

When a Solr request includes a filter that is cache=false and has a cost >= 100, Solr will check if the underlying Query implementation supports the PostFilter API; if it does, Solr will automatically use this API, ensuring that these post filters will only be consulted about a potential matching document after:

  • It has already been confirmed to be a match for all regular (non-post) fq filters
  • It has already been confirmed to be a match for the main q Query
  • It has already been confirmed to be a match for any lower cost post-filters

(This overall user experience (and special treatment of cost >= 100, rather than any sort of special postFilter=true syntax) is focused on letting users indicate how “expensive” they expect the various filters to be, while letting Solr worry about the best way to handle those various expensive filters depending on how they are implemented internally, without the user being required to know in advance “Does this query support post filtering?”)

For Advanced Solr users who want to write custom filtering plugins (particularly security related filtering that may need to consult external data sources or enforce complex rules) the PostFilter API can be a great way to ensure that expensive operations are only executed if absolutely necessary.
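To restate that evaluation order as a sketch, here is a toy Python model of the flow (not Solr's actual PostFilter API): regular filters and the main query are checked first, and post filters only ever see documents that already matched everything else, cheapest post filter first.

def search_with_post_filters(q, regular_fqs, post_fqs, all_docs):
    """Toy model of the post-filter ordering described above.
    regular_fqs: list of predicates; post_fqs: list of (cost, predicate) pairs."""
    post_fqs = sorted(post_fqs, key=lambda pair: pair[0])    # cheapest post filter first
    hits = []
    for doc in all_docs:
        if not all(fq(doc) for fq in regular_fqs):           # regular filters first
            continue
        if not q(doc):                                       # then the main query
            continue
        if all(pred(doc) for _cost, pred in post_fqs):       # post filters only see survivors
            hits.append(doc)
    return hits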

Let’s reconsider our earlier example of non-cached filter queries, but this time we’ll use cost=200 on the bar_i <= 100 filter condition so it will be used as a post filter…

http://localhost:8983/solr/gettingstarted/select?omitHeader=true&fl=id&q={!frange%20l=0%20u=100}trace%28simple_sum,sum%28foo_i,bar_i%29%29&fq={!frange%20cache=false%20cost=50%20l=0}trace%28pos_foo_nocache_50,foo_i%29&fq={!frange%20cache=false%20cost=200%20u=100}trace%28low_bar_postfq_200,bar_i%29
// q = {!frange l=0 u=100}trace(simple_sum,sum(foo_i,bar_i))
// fq = {!frange cache=false cost=50 l=0}trace(pos_foo_nocache_50,foo_i)
// fq = {!frange cache=false cost=200 u=100}trace(low_bar_postfq_200,bar_i)

TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#0: "A") -> true
TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#0: "A") -> 42.0
TraceValueSource$TracerValues; simple_sum: exists(#0: "A") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#0: "A") -> 141.0
TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#1: "B") -> true
TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#1: "B") -> -42.0
TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#2: "C") -> true
TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#2: "C") -> -7.0
TraceValueSource$TracerValues; pos_foo_nocache_50: exists(#3: "D") -> true
TraceValueSource$TracerValues; pos_foo_nocache_50: floatVal(#3: "D") -> 7.0
TraceValueSource$TracerValues; simple_sum: exists(#3: "D") -> true
TraceValueSource$TracerValues; simple_sum: floatVal(#3: "D") -> 57.0
TraceValueSource$TracerValues; low_bar_postfq_200: exists(#3: "D") -> true
TraceValueSource$TracerValues; low_bar_postfq_200: floatVal(#3: "D") -> 50.0
SolrCore; [gettingstarted] webapp=/solr path=/select params={q={!frange+l%3D0+u%3D100}trace(simple_sum,sum(foo_i,bar_i))&omitHeader=true&fl=id&fq={!frange+cache%3Dfalse+cost%3D50+l%3D0}trace(pos_foo_nocache_50,foo_i)&fq={!frange+cache%3Dfalse+cost%3D200+u%3D100}trace(low_bar_postfq_200,bar_i)} hits=1 status=0 QTime=4

Here we see a much different execution flow from the previous examples:

  • The lone non-cached (non-post) filter (pos_foo_nocache_50) is initially consulted to find the “first” document it matches
    • Document “A” is a match for pos_foo_nocache_50 (0 <= foo) — so all “regular” filters match, and the main query can be consulted…
    • Document “A” is not a match for the main query (simple_sum) so we stop considering “A”
    • The post-filter (low_bar_postfq_200) is never consulted regarding “A”
  • The lone non-post filter is again asked to find its “next” match after “A”
    • Document “B” is not a match for the ‘pos_foo_nocache_50’ filter, so that filter keeps checking until it finds its “next” match (after “B”)
    • Document “C” is not a match for the ‘pos_foo_nocache_50’ filter, so that filter keeps checking until it finds its “next” match (after “C”)
    • Document “D” is the “next” match for the ‘pos_foo_nocache_50’ filter — since there are no other “regular” filters, the main query is consulted again
    • Document “D” is also a match for the main query
    • After all other conditions have been satisfied, Document “D” is then checked against the post filter (low_bar_postfq_200) — since it matches we have our first (and only) hit for the request

In these examples, the functions we’ve used in our filters have been relatively simple, but if you wanted to filter on multiple complex math functions over many fields, you can see how specifying a “cost” relative to the complexity of the function could be advantageous to ensure that the “simpler” functions are checked first.
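
For instance (a purely hypothetical request, reusing the trace() wrapper and the foo_i/bar_i fields from these examples, with made-up trace labels), you could stage three function filters so the cheapest is evaluated first and the most expensive only runs as a post filter against documents that matched everything else:

fq={!frange cache=false cost=10 l=0}trace(cheap_check,foo_i)
fq={!frange cache=false cost=50 l=0}trace(medium_check,sum(foo_i,bar_i))
fq={!frange cache=false cost=200 u=100}trace(expensive_check,product(foo_i,bar_i))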

In Conclusion…

Hopefully these examples I’ve walked through are helpful for folks trying to wrap their heads around how/why filter queries behave in various situations, and specifically how {!frange} queries work, so you can consider some of the trade-offs of tweaking the cache and cost params of your various filters.

Even for me, with ~12 years of Solr experience, running through these examples made me realize I had a misconception about how/when FunctionRangeQuery could be optimized (ultimately leading to SOLR-11641, which should make {!frange cache=false ...} much faster by default in future Solr versions).

The post Caching, and Filters, and Post-Filters, Oh My! appeared first on Lucidworks.

ACRL TechConnect: Memory Labs and audiovisual digitization workflows with Lorena Ramírez-López

planet code4lib - Mon, 2017-11-27 16:00

Hello! I’m Ashley Blewer, and I’ve recently joined the ACRL TechConnect blogging team. For my first post, I wanted to interview Lorena Ramírez-López. Lorena is working (among other places) at the D.C. Public Library on their Memory Lab initiative, which we will discuss below. Although this upcoming project targets public libraries, Lorena has a history of dedication to providing open technical workflows and documentation to support any library’s mission to set up similar “digitization stations.”

Hi Lorena! Can you please introduce yourself?

Hi! I’m Lorena Ramírez-López. I am a born and raised New Yorker from Queens. I went to New York University for Cinema Studies and Spanish, where I did an honors thesis on Paraguayan cinema in regard to sound theory. I continued my education at NYU and graduated from the Moving Image Archiving and Preservation program, where I concentrated on video and digital preservation. I was one of the National Digital Stewardship Residents for the American Archive of Public Broadcasting. I did my residency at Howard University television station (WHUT) in Washington, D.C., from 2016 until June 2017. Along with being the project manager for the Memory Lab Network, I do contracting work for the National Portrait Gallery on their time-based media artworks, joined the Women Who Code community, and teach Spanish at Fluent City!


Tell us a little bit about DCPL’s Memory Lab and your role in it.

The DC Public Library’s Memory Lab was a National Digital Stewardship Project from 2014 through 2015. It was the baby of DCPL’s National Digital Stewardship Resident at the time, Jaime Mears. A lot of my knowledge of how it started comes from reading the original project proposal, which you can find on the Library of Congress’s website, as well as Jaime Mears’s final report on the Memory Lab, which is on the DC Library website. But to summarize its origin story, the Memory Lab was created as a local response to the fact that communities are generating a lot of digital content while still keeping many of their physical materials, like VHS, miniDVs, and photos, but might not necessarily have the equipment or knowledge to preserve their content. It has been widely accepted in the archival and preservation fields that we have an approximate 15- to 20-year window of opportunity to digitally preserve legacy audio and video recordings on magnetic tape because of the rate of degradation and the obsolescence of playback equipment. The term “video at risk” might ring a bell for some people. There are also photographs and film, particularly color slides and negatives and moving image film formats, that will fade and degrade over time. People want to save their memories as well as share them on a digital platform.

There are well-established best practices for digital preservation in archival practice, but these guidelines and documentation are generally written for a professional audience. And while there are various personal digital archiving resources for a public audience, they aren’t really easy to find on the web, and a lot of these resources aren’t updated to reflect the changes in our technology, software, and habits.

That being the case, our communities risk massive loss of history and culture! And to quote Gabriela Redwine’s Digital Preservation Coalition report,  “personal digital archives are important not just because of their potential value to future scholars, but because they are important to the people who created them.”

So the Memory Lab was the library’s local response in the Washington D.C. area to bridge this gap of digital archiving knowledge and provide the tools and resources for library patrons to personally digitize their own personal content.

My role is maintaining this memory lab (digitization rack). When hardware gets worn down or breaks, I fix it. When the software on our computers upgrades to newer systems, I update our workflows.

I am currently re-doing the website to reflect the new wiring I did and updating the instructions with more explanations and images. You can expect gifs!


You recently received funding from IMLS to create a Memory Lab Network. Can you tell us more about that?

Yes! The DC Public Library, in partnership with the Public Library Association, received a national leadership grant to expand the memory lab model.

During this project, the Memory Lab Network will partner with seven public libraries across the United States. Our partners will receive training, mentoring, and financial support to develop their own memory lab as well as programs for their library patrons and community to digitize and preserve their personal and family collections. A lot of focus is put on the digitization rack, mostly because it’s cool, but the memory lab model is not just creating a digitization rack. It’s also developing classes and online resources for the community to understand that digital preservation doesn’t just end with digitizing analog formats.

By creating these memory labs, these libraries will help bridge the digital preservation divide between the professional archival community and the public community. But first we have to train and help the libraries set up the memory lab, which is why we are providing travel grants to Washington, D.C. for an in-depth digital preservation bootcamp and training for these seven partners.

If anyone wants to read the proposal, the Institute of Museum and Library Sciences has it here.


What are the goals of the Memory Lab Network, and how do you see this making an impact on the overall library field (outside of just the selected libraries)?

One of the main goals is to see how well the memory lab model holds up. The memory lab was a local response to a need, but it was meant to be replicated. This funding is our chance to see how we can adapt and improve the memory lab model for other public libraries and not just our own urban library in Washington D.C.

There are actually many institutions and organizations that have digitization stations and/or the knowledge and resources, but we just don’t realize who they are. Sometimes it feels like we keep reinventing the wheel with digital preservation. There are plenty of websites that at one time had current information on digital preservation and links to articles and other explanations. Then those websites weren’t sustained and remained stagnant while housing a series of broken links and lost PDFs. We could (and should) be better about not just creating new resources, but updating the ones we have.

The reasons why some organizations aren’t transparent or aren’t updating their information, or why we aren’t searching in certain areas, vary, but we should be better at documenting and sharing our information with our archival and public communities. This is why the other goal is to create a network to better communicate and share.


What advice do you have for librarians thinking of setting up their own digitization stations? How can someone learn more about aspects of audiovisual preservation on the job?

If you are thinking of setting up your own digitization station, announce that not only to your local community but also to the larger archival community. Tell us about this amazing adventure you’re about to tackle. Let us know if you need help! Circulate and cite that article you thought was super helpful. Try to communicate not only your successes but also your problems and failures.

We need to be better at documenting and sharing what we’re doing, especially when dealing with how to handle and repair playback decks for magnetic media. Besides the fact that the companies just stopped supporting this equipment, a lot of the information on how to support and repair equipment could have been shared or passed down by really knowledgeable experts, but it wasn’t. Now we’re all holding our breath and pulling our hair out because this one dude who repairs U-matic tapes is thinking about retiring. This lack of information and communication shouldn’t be the case in our environment, when we can email and call.

We tend to freak out about audiovisual preservation because we see how other professional institutions set up their workflows and the amount of equipment they have. The great advantage libraries have is that not only can they act practically with their resources, but they also have the best type of feedback to learn from: library patrons. We’re creating these memory lab models for the general public, so getting practical experience, feedback, and concerns is a great way to learn more about what aspects of audiovisual preservation really need to be fleshed out and researched.

And for fun, try creating and archiving your own audiovisual media! You technically already do this when you take photos and videos on your phone. Getting to know your equipment and where your media goes is very helpful.


Thanks very much, Lorena!

For more information on how to set up a “digitization station” at your library, I recommend Dorothea Salo’s robust website detailing how to build an “audio/video or digital data rescue kit”, available here.


HangingTogether: Equity, Diversity, and Inclusion (EDI)

planet code4lib - Mon, 2017-11-27 14:00

My colleagues Rebecca Bryant and Merrilee Proffitt have summarized discussions at our 1 November 2017 regional OCLC Research Library Partnership meeting in Baltimore on evolving scholarly services and workflows and on moving forward with unique and distinctive collections. This summarizes our third discussion thread, on equity, diversity, and inclusion (EDI).

We conducted an “Equity, Diversity and Inclusion Survey of the OCLC Research Library Partnership” between 12 September and 13 October of this year. We wanted to obtain a snapshot of EDI efforts within the Partnership that could help identify specific follow-up activities to provide assistance and better inform practice. We were pleased that 63 Partners in nine countries responded to the survey. A summary of the results served as the framework for discussions with Partners at our meeting in Baltimore.

Some highlights from the survey results:

  • 59% of the respondents had set up or plan to set up an EDI committee or working group.
  • 72% were using or planned to use EDI goals and principles to inform their collections’ workflows, practices, or services.
  • Of those who responded they were using or planned to use EDI goals, 79% were working with other institutions, organizations, or community groups on EDI to improve representation of marginalized groups in collections, practices, or services.
  • The top three areas that 80% or more of the respondents had already changed due to their institutions’ EDI goals and principles were:
    • Activities and events
    • Recruitment and retention
    • Outreach to marginalized communities
  • The top two areas where 70% or more of the respondents planned to change but haven’t yet were:
    • Search and discovery interfaces
    • Metadata descriptions in library catalogs

The biggest institutional challenges in Partners’ EDI efforts included building relationships with marginalized communities and creating a positive work climate, which, in turn, would help recruit and retain diverse staff.

OCLC Research staff will be looking at ways to follow up on the suggestions from our Partnership meeting discussions, which included sharing a list of resources from the survey responses and researching the landscape of EDI efforts being conducted by other professional organizations to identify gaps so we can better leverage community activities. Meanwhile, feel free to post your own EDI efforts in the comments below!

The OCLC Research Library Partnership provides a unique transnational collaborative network of peers to address common issues as well as the opportunity to engage directly with OCLC Research.

OCLC Dev Network: Alexa Meets WorldCat

planet code4lib - Mon, 2017-11-27 14:00

The provocative statement “voice is the new UI” came into vogue in 2016. I was skeptical at first, in the way many of us have become when topics trend like this. I thought to myself, “Isn’t a UI something on a screen?” However, as 2017 now comes to a close, it is abundantly clear that voice UIs aren’t just a fad.
