As we count down to the annual Lucene/Solr Revolution conference in Boston this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Brett Hoerner’s talk, “Solr at Scale for Time-Oriented Data”.
This talk will go over tricks used to index, prune, and search over large (>10 billion docs) collections of time-oriented data, and how to migrate collections when inevitable changes are required.
Brett Hoerner lives in Austin, TX, and is an Infrastructure Engineer at Rocana, where he helps clients control their global-scale modern infrastructure using big data and machine learning techniques. He began using SolrCloud in 2012 to index the Twitter firehose at Spredfast, where the collection eventually grew to contain over 150 billion documents. He is primarily interested in the performance and operation of distributed systems at scale.
Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana from Lucidworks
Join us at Lucene/Solr Revolution 2016, the biggest open source conference dedicated to Apache Lucene/Solr on October 11-14, 2016 in Boston, Massachusetts. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…
For libraries, proxying user requests is how we provide authenticated access--and some level of anonymized access--to almost all of our licensed resources. Proxying Google Scholar in the past would direct traffic through a campus IP address, which prompted Scholar to automatically include links to the licensed content that we had told it about. It seemed like a win-win situation: we would drive traffic en masse to Google Scholar, while anonymizing our users' individual queries, and giving them swift access to our library's licensed content as well as all the open access content that Google knows about.
However, in the past few months things changed. Now when Google Scholar detects proxied access it tries to throw up a Recaptcha test--which would be an okay-ish speed bump, except that it uses a key for a domain (google.ca, of course) that doesn't match the proxied domain and thus dies with a JavaScript exception, preventing any access. That doesn't help our users at all, and it hurts Google too, because those users don't get to search and generate anonymized academic search data for them.
Folks on the EZProxy mailing list have tried a few different recipes to evade the Recaptcha, but that seems doomed to failure.
If we don't proxy these requests, then every user would need to set their preferred library (via the Library Links setting) to include convenient access to all of our licensed content. But that setting can be hard to find, and relies on cookies, so behaviour can be inconsistent as users move from browser to browser (as happens in universities with computer labs and loaner laptops). And then the whole privacy thing is lost.
On the bright side, I think a link like https://scholar.google.ca/scholar_setprefs?instq=Laurentian+University+Get+full+text&inst=15149000113179683052 makes it a tiny bit easier to help users set their preferred library in the unproxied world. So we can include that in our documentation about Google Scholar and get our users a little closer to off-campus functionality.
But I really wish that Google would either fix their Recaptcha API key domain-based authentication so it could handle proxied requests, or recognize that the proxy is part of the same set of campus IP addresses that we've identified as having access to our licensed resources in Library Links and just turn off the Recaptcha altogether.
Open Knowledge Foundation: What skills do you need to become a data-driven storyteller? Join a week-long data journalism training #ddjcamp
European Youth Press is organising a week-long intensive training on data journalism funded by Erasmus+. It is aimed at young journalists, developers and human rights activists from 11 countries: Czech Republic, Germany, Belgium, Italy, Sweden, Armenia, Ukraine, Montenegro, Slovakia, Denmark and Latvia.
If you have always wanted to learn more about what it means to be a data-driven storyteller, then this is an opportunity not to miss! Our course was designed with wanna-be data journalists in mind and for people who have been following others’ work in this area but are looking to learn more about actually making a story themselves.
You will have classes and workshops along the data pipeline: where to get the data, what to do to make it ‘clean’, and how to find a story in the data. In parallel to the training, you will work in teams and produce a real story that will be published in the national media of one of the participating countries.
The general topic chosen for all the stories produced is migration/refugees. Data journalism has a reputation for being a more objective kind of journalism, as opposed to ‘he said – she said’ narratives. However, there is still great potential to explore data-driven stories about migrants and the effects of migration around the world.
Praising the refugee hunters as national heroes; violence targeting international journalists and migrants; sentimental pleas with a short-term effect – these are a few examples of media coverage of the refugee crisis. The backlash so far to these narratives has mostly been further distrust in the media. What are the ways out of it?
We want to produce more data-driven balanced stories on migrants. For this training, we are inviting prominent researchers and experts in the field of migration. They will help us with relevant datasets and knowledge. We will not fix the world, but we can make a little change together.
So, if you are between 18 and 30 years old and come from the Czech Republic, Germany, Belgium, Italy, Sweden, Armenia, Ukraine, Montenegro, Slovakia, Denmark or Latvia, don’t wait – apply now (deadline is 11 Sept):
Geoff distinguishes between "DOI-like strings" and "fake DOIs", presenting three ways DOI-like strings have been (ab)used:
- As internal identifiers. Many publishing platforms use the DOI they're eventually going to register as their internal identifier for the article. Typically it appears in the URL at which the article is eventually published. The problem is that the unregistered DOI-like strings for unpublished content (e.g. manuscripts under review or rejected) ‘escape’ into the public as well. People attempting to use these DOI-like strings get understandably confused and angry when they don’t resolve or otherwise work as DOIs. Platforms should use internal IDs that can't be mistaken for external IDs, because they can't guarantee that the internal ones won't leak.
- As spider- or crawler-traps. This is the usage that Eric Hellman identified: strings that look like DOIs, are not even intended to eventually be registered, but have bad effects when resolved. When a spider/bot trap includes a DOI-like string, we have seen some particularly pernicious problems, as these traps can trip up legitimate tools and activities as well. For example, a bibliographic management browser plugin might automatically extract DOIs and retrieve metadata on pages visited by a researcher. If the plugin were to pick up one of these spider-trap DOI-like strings, it might inadvertently get the researcher blocked, or worse, the researcher’s entire university blocked. In the past, this has even been a problem for Crossref itself. We periodically run tools to test DOI resolution and to ensure that our members are properly displaying DOIs, CrossMarks, and metadata as per their member obligations. We’ve occasionally been blocked when we ran across the spider traps as well. Sites using these kinds of crawler traps should expect a lot of annoyed customers whose legitimate operations caused them to be blocked.
- As proxy bait. These unregistered DOI-like strings can be fed to systems such as Sci-Hub in an attempt to detect proxies. If they are generated afresh on each attempt, the attacker knows that Sci-Hub does not have the content, so it will try to fetch it using a proxy or other technique. The fetch request will be routed via the proxy to the publisher, who will recognize the DOI-like string, know where the proxy is located, and can take action, such as blocking the institution. In theory this technique never exposes the DOI-like strings to the public, and automated tools should not be able to stumble upon them. However, recently one of our members had some of these DOI-like strings “escape” into the public, and at least one of them was indexed by Google. The problem was compounded because people clicking on these DOI-like strings sometimes ended up having their university’s IP address banned from the member’s web site. ... We think this just underscores how hard it is to ensure DOI-like strings remain private and why we recommend our members not use them. As we see every day, designing computer systems that never leak information in the real world is well beyond the state of the art.
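To make the plugin scenario above concrete, here is a minimal sketch in Python of how a tool might harvest DOI-like strings from page text, and why a spider-trap string embedded in a page would be swept up along with legitimate DOIs. The pattern is an assumed simplification for illustration, not Crossref's official DOI grammar, and the example strings are invented:

```python
import re

# Simplified, assumed pattern for DOI-like strings: "10." followed by a
# numeric registrant prefix, a slash, and a suffix. An illustration only,
# not the official DOI syntax.
DOI_LIKE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

page_text = (
    'Cited works: doi:10.1234/real-article-42 and '
    'https://doi.org/10.1000/182. '
    'Hidden trap link: /fetch?id=10.9999/trap-d41d8cd98f'
)

# Strip trailing punctuation that the greedy suffix match picks up.
matches = [m.rstrip('.,;') for m in DOI_LIKE.findall(page_text)]
print(matches)
# -> ['10.1234/real-article-42', '10.1000/182', '10.9999/trap-d41d8cd98f']
```

A tool with no way to distinguish registered DOIs from DOI-like bait will try to resolve all three strings, and resolving the third is exactly what trips the trap.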
The following is what we have sometimes called a “fake DOI”:
It is registered with Crossref, resolves to a fake article in a fake journal called The Journal of Psychoceramics (the study of Cracked Pots) run by a fictitious author (Josiah Carberry) who has a fake ORCID (http://orcid.org/0000-0002-1825-0097) but who is affiliated with a real university (Brown University).
Again, you can try it.
And you can even look up metadata for it.
Our dirty little secret is that this “fake DOI” was registered and is controlled by Crossref. These “starting with 5” DOIs are used by Crossref to test their systems. They too can confuse legitimate software, but the bad effects of the confusion are limited. And now that the secret is out, legitimate software can know to ignore them, and thus avoid the confusion.
Last updated September 1, 2016. Created by Peter Murray on September 1, 2016.
The Islandora Foundation is thrilled to announce the second Islandoracon, to be held at the lovely LIUNA Station in Hamilton, Ontario from May 15 - 19, 2017. The conference schedule will take place over five days, including a day of post-conference sessions and a full-day Hackfest.
New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.
New This Week
Visit the LITA Job Site for more available jobs and for information on submitting a job posting.
Open Knowledge Foundation: Freedom to control MyData: Access to personal data as a step towards solving wider social issues.
This piece is part of a series of posts from MyData 2016 – an international conference that focuses on human centric personal information management. The conference is co-hosted by the Open Knowledge Finland chapter of the Open Knowledge International Network. (Song lyrics: Pharrell Williams, “Freedom”; image: Pixabay, CC0)
Indeed, the theme of MyData so far is freedom. The freedom to own our data. Freedom however, is a very complicated subject that has been subjected to so many arguments, interpretations and even wars. I will avoid the more complicated philosophy and dive instead into a more daily life example. In the pop song quoted above, freedom can be understood as being carefree – “Who cares what they see and what they know?” Taking it to the MyData context, are we granting freedom to others to do whatever they want with our data and information because we trust them, or just because we don’t care?
MyData speakers have looked at the issue of freedom from a different angle. Taavi Kotka, Estonia’s CIO, claims that the fifth freedom of the EU should be the freedom of data. People, explains Kotka, should have the choice of what can be done with their data. They should know and understand the possibilities that sharing the data can bring (for example, better and easier services across the EU countries), and the threats that this can entail, like misuses of their data. For that we need pioneer regulators. For that we need the private sector and civil society to apply pressure, showcase what we can do with data, and drive change accordingly.
This shifting in regulations and thinking should also be accepted by government. It was refreshing to hear the Finnish Minister of Transport and Communications, Anne Berner saying that government should not be afraid of disruption, but accept disruption and be disruptive themselves. MyData is disruptive in the sense that it is challenging the norms of the current data storage and use, and thinking outside of the box can help governments to move forward and at the end of the day, to supply better services for citizens.
Another topic that has been raised repeatedly is the digital self and the idea that data is a stepping stone to a better society. The question, then, is: in order to build a good society, do we need to understand our private data? Maybe understanding data is not a good enough end goal? Maybe a better framing would be to create information and knowledge from the data? I was excited to see a project that can help consumers evaluate and decide who to trust: Ranking Digital Rights. Ranking Digital Rights looks at big tech corporations and ranks their public commitments and disclosed policies affecting users’ freedom of expression and privacy. This is a very good tool for discussion and advocacy on these topics.
To return to the question of openness: does freedom of data mean open data? The closed system does not allow us to access our own data. We can’t get insights. How do we create different models to get there?
And I think this is where I enjoy this conference the most – the variety of people. In the last two years I have been in many open data conferences, but the business community side of these events has been very limited, or at least for me, not appealing. Here at MyData, there are tracks for many different stakeholders – from insurance firms to banks, from health to education. I have met people who see the MyData initiative not only as a moral thing to do, but also as an opportunity to innovate and create trust with users. Trust, as I am rediscovering, is key for growth. Ignoring the mistrust of users can lead to a broken market. More than trust, I was happy to see people who are trying to influence their companies not only to go the MyData way, but also to open relevant data from their companies to the public, so we can work on and maybe solve social issues. Seeing the two go hand-in-hand is great, and I am looking forward to more conversations like these.
Tomorrow, Rufus Pollock, our president and Open Knowledge International’s co-founder, is going to speak about how we can collaborate with others for a better future. You can catch him at 9.30 Helsinki time on screen.io/mydata. Here is a preview of his talk tomorrow: http://blog.okfn.org/files/2016/09/MyDataRufus.mp3 We want openness for public data – public information that could be made available to anyone. And we want access for every person to their own personal data…both are about empowering people to access information.
Posted Sept. 1, this update resolves a couple of issues. In particular:
* Bug Fix: Custom Field Sorting: Fields without the sort field may drop the LDR. This has been corrected.
* Bug Fix: OCLC Integration: regression introduced with the engine changes when dealing with diacritics. This has been corrected.
* Bug Fix: MSI Installer: AUTOUPDATE switch wasn’t being respected. This has been corrected.
* Enhancement: MARCEngine: Tweaked the transformation code to provide better support for older processing statements.
* Bug Fix: MARCEngine: Regression introduced with the last update that caused one of the streaming functions to lose encoding information. This has been corrected.
I’ll be adding a knowledge-base article, but I updated the Windows MSI to fix the command-line switch that allows administrators to turn off the auto-update feature. Here’s an example of how this works: MarcEdit_Setup64.msi /qn AUTOUPDATE=no
I don’t believe the AUTOUPDATE key is case sensitive, but the documented use pattern is upper-case, and that is what I’ll test against going forward.
Downloads are available via the downloads page: http://marcedit.reeset.net/downloads/
When I hear the term “Evergreen,” it immediately invokes images of nature’s symbiotic relationships – Bald eagles nesting in coniferous trees, lady slipper orchids thriving in soil nutrients typically found beneath conifers and hemlocks, pollinators and mammals relying on evergreens for food and, in return, help to redistribute seeds. There is also a complex network of dialogues being exchanged throughout these evergreen forests.
During the past decade, I have been very blessed to hold multiple discussions with people about Evergreen, and it’s not surprising that the continued theme from my fellow coworkers’ blog posts is the emphasis on community. Community grants opportunities and a feeling of personal ownership (how awesome is it that non-proprietary software helps to promote a sense of ownership). Community also helps to foster symbiotic and sustainable relationships. Relationships that are rooted in dialog.
In February 2007, as a reference and genealogy librarian at a rural public library, I held my first conversations with both librarians and patrons about their Evergreen user experiences. Fast-forwarding to August 2016, I still treasure every conversation that I have with librarians about their needs, expectations, and experiences. With each library migration, it is an honor, and humbling, to hear about the librarians’ current workflows and needs. These user needs are constantly being met with each passing version of Evergreen.
For some, those needs may appear simple (I was so excited by the Update Expire Date button!) or more complex, like the intricate gears that make meta-record-level holds possible. One of the strongest examples of community dialog and symbiosis is the continued refinement of the Acquisitions module.
I couldn’t possibly describe all of the awesomeness that I have observed over the past 10 years or single it down to one special moment; there’s just too much. Each patron, library staff member, consortium member, volunteer, contributor, developer, support specialist, and data analyst (did I forget anyone?) contributes to Evergreen’s complex web of communication and overall sustainability. I can say that I know how fortunate I am, as a Project Manager, to see the forest for the trees and to know that the Evergreen Community’s roots are growing stronger with each passing year.
This is the eleventh in our series of posts leading up to Evergreen’s Tenth birthday.
Yesterday, just one day before the anniversary of the 1.1.2 release, I published the 1.1.3 release of the PEAR File_MARC library. The only change is the addition of a convenience method for fields called getContents() that simply concatenates all of the subfields together in order, with an optional separator string. Many thanks to Carsten Klee for contributing the code!
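The idea behind getContents() is easy to sketch. The following is a hypothetical Python analogue (File_MARC itself is PHP, and the names and data layout here are illustrative, not the library's actual code): concatenate a field's subfield values in order, joined by an optional separator string:

```python
# Hypothetical Python analogue of File_MARC's new getContents() method:
# join a field's subfield values in order, with an optional separator.
def get_contents(subfields, separator=""):
    """subfields: an ordered list of (code, value) pairs from a MARC data field."""
    return separator.join(value for _code, value in subfields)

# A 245 (title statement) field split into subfields $a and $c:
title_field = [("a", "Moby Dick /"), ("c", "Herman Melville.")]
print(get_contents(title_field, " "))  # -> Moby Dick / Herman Melville.
```

The convenience is that callers who just want the field's full text no longer have to loop over subfields themselves.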
You can install File_MARC through the usual channels: PEAR or composer. Have fun!
Today I was privileged to present to the 6th International Congress of Technological Innovation, Innovatics 2016, organized by Duoc UC Libraries, Library of Santiago, and University of Chile Libraries. The conference was simultaneously translated in English and Spanish. To aid the translators, I wrote out the text of my presentation for them to review. Below is the text as it was intended to be presented; I did diverge in a few places, mostly based on what others said earlier in the conference.

Evolution of Open Source in Libraries
Thank you for the opportunity to talk with you today. My name is Peter Murray, and I’ve been involved in open source projects in libraries for at least 20 years. After receiving my undergraduate degree in computer science, I went to work for my university library as they were bringing up their first library automation system. This was early in the days of information retrieval on the internet, and I adapted the Gopher program to offer a front end to the library’s resources. Gopher came from the University of Minnesota in the United States, and they released their code on the internet for free. There wasn’t a notion of organized open source at the time – it was just free software that anyone could download and adapt. There wasn’t a sense of community around the software, and the tools to share changes with each other were very rudimentary. Stretching back to the 1950s and 1960s, this was the era of “free software”.
During the mid-1990s I worked for Case Western Reserve University in Cleveland, Ohio. I was part of a team that saw the early possibilities of the World Wide Web and aggressively pursued them. We worked to reorganize the library services onto the web and to try to add personalization to the library’s website using a new programming language called Personal Home Page. We know it today as PHP. It was also at this time that the phrase “open source” was coined. “Open source” meant more than just having the source code available. It was also a recognition that organizations had a business case for structuring a community around the source code. In this case, it was the release of the Netscape browser code that was the spark that ignited the use of the phrase “open source”.
In the early 2000s I worked for the University of Connecticut. During this time, we saw the formation of open source projects in libraries. Two projects that are still successful today are the DSpace and Fedora repository projects. These two projects started with grants from foundations to create software that allowed academic libraries to efficiently and durably store the growing amount of digital files being produced by scholars. Both projects followed paths where the software was created for the needs of their parent organizations, was seen as valuable by other organizations, and new developers were added as contributors to the project.
Also in the early 2000s the Koha integrated library system started to build its community. The core code was written by a small team of developers for a public library in New Zealand in the last few months of 1999 to solve a year-2000 issue with their existing integrated library system. Within a year, Koha publicly released its code, created a community of users on SourceForge – a website popular in the 2000s for hosting code, mailing lists, documentation, and bug reports. The tools for managing the activity of open source communities were just starting to be formed. There is a direct line between SourceForge – the most popular open source community of its time – and GitHub – arguably the most important source code hosting community today.
In the late 2000s I was working for a consortium of academic libraries in Ohio called OhioLINK. In this part of my career, I was with an organization that began actively using library open source software to deliver services to our users. Up until this point, I had – like many organizations – made use of open source tools – the HTTPd web server from Apache, MySQL as a database, PHP and Perl as programming languages, and so forth. Now we saw library-specific open source make headway into front-line library services. And now that our libraries were relying on this software for services to our patrons, we began looking for supporting organizations. DSpace and Fedora each created foundations to hold the source code intellectual property and hire staff to help guide the software. The DSpace and Fedora foundations then merged to become DuraSpace. Foundations are important because they become a focal point of governance around the software and a place where money can be sent to ensure the ongoing health of the open source project.
In the early 2010s I went to work for a larger library consortium called LYRASIS. LYRASIS has about 1,400 member libraries across much of the United States. I went to work at LYRASIS on a project funded by the Andrew W. Mellon Foundation on helping libraries make decisions about using open source software. The most visible part of the project was the FOSS4LIB.org website. It hosts decision support tools, case studies, and a repository of library-oriented open source software. We proposed the project to the Mellon Foundation because LYRASIS member libraries were asking questions about how they could make use of open source software themselves. Throughout the 2000s it was the libraries with developers that were creating and contributing to open source projects. Libraries without developers were using open source through service provider companies, and now they wanted to know more about what it meant to get involved in open source communities. FOSS4LIB.org was one place where library professionals could learn about open source.
The early 2010s also saw growth in service providers for open source software in libraries. The best example of that is this list of service providers for the Koha integrated library system. As we let this scroll through the continents, I hope this gives you a sense that large, well-supported, multinational projects are alive and well in the library field. Koha is the most impressive example of a large service provider community. Other communities, such as DSpace, also have worldwide support for the software. What is not represented here is the number of library consortia that have started supporting open source software for their members. Where it makes sense for libraries to pool their resources to support a shared installation of an open source system, those libraries can reap the benefits of open source.
Now here in the mid-2010s I’m working for a small, privately-held software development company called Index Data. Index Data got its start 20 years ago when its two founders left the National Library of Denmark to create software tools that they saw libraries needed. Index Data’s open source Z39.50 toolkit is widely used in commercial and open source library systems, as is its MasterKey metasearch framework. The project I’m working on now is called FOLIO, an acronym for the English phrase “Future of Libraries is Open”. I’ll be talking more about FOLIO this afternoon, but by way of introduction I want to say now that the FOLIO project is a community of library professionals and an open source project that is rethinking the role of the integrated library system in library services.

Revisit the Theme
With that brief review of the evolution of open source software in libraries, let’s return to the topic of this talk – Free Software in Libraries: Success Stories and Their Impact on Today’s Libraries. As you might have guessed, open source software can have a significant impact on how services are delivered to our patrons. In fact, open source software – in its best form – is significantly different from buying software from a supplier. On the one hand, when you buy software from a supplier you are at the mercy of that supplier for implementing new features and for fixing bugs in the software. You also have an organization that you can hold accountable. On the other hand, open source software is as much about the community surrounding the software as it is the code itself. And to be a part of the community means that you have rights and responsibilities. I’d like to start first with rights.

Rights
The rights that come along with open source are, I think, somewhat well understood. These are encoded in the open source licenses that projects adopt. You have the right to view, adapt, and redistribute the source code. You can use the software for any purpose you desire, even for purposes not originally intended by the author. And the one that comes most to mind, you have the right to run the software without making payment to someone else. Let’s look at these rights.

Use
In the open source license, the creator of the software is giving you permission to use the software without needing to contact the author. This right cannot be revoked, so you have the assurance that the creator cannot suddenly interrupt your use, even if you decide to use the software for something the creator didn’t intend. This also means that you can bring together software from different sources and create a system that meets the needs of your users.

Copy
You have the right to make a copy of the software. You can copy the software for your own use or to give to a friend or colleague. You can create backup copies and run the software in as many places as you need. Most importantly, you have this right without having to pay a royalty or fee to the creator. One example of this is the Fenway Libraries Online in the Boston, Massachusetts area. Ten libraries within the Fenway consortium needed a system to manage their electronic resource licenses and entitlements. After an exhaustive search for a system that met their requirements, they selected the CORAL project originally built at the University of Notre Dame. There is a case study on Fenway’s adoption of CORAL on the FOSS4LIB.org website.

Inspect
A key aspect of open source is the openness of the code itself. You can look at the source code and figure out how it works. This is especially crucial if you want to move away from the system to something else; you can figure out, for instance, how the data is stored in a database and write programs that will translate that data from one system to another. Have you ever needed to migrate from one system to another? Even if you didn’t have to do the data migration yourself, can you see where it would be helpful to have a view of the data structures?

Modify
Hand-in-hand with the right to inspect open source code is the right to modify it to suit your needs. In almost all cases, the modifications to the code use the same open source license as the original work. What is interesting about modifications, though, is that sometimes the open source license may specify conditions for sharing modifications.

Fork
Ultimately, if the open source project you are using is moving in a different direction, you have the right to take the source code and start in your own direction. Much like a fork in the road, users of the open source project decide which branch of the fork to take. Forks can sometimes remain relatively close together, which makes it somewhat easy to move back and forth between them. Over time, though, forks usually diverge and go separate ways, or one will turn out not to be the one chosen by many, and it will die off. With this right to fork the code, it is ultimately the community that decides the best direction for the software. There was an example a few years ago within the Koha community where one service provider wanted to control the process of creating and releasing the Koha software in ways that a broader cross-section of the community didn’t appreciate. That segment of the community took the code and re-formed new community structures around it.

Responsibilities
These rights – Use, Copy, Inspect, Modify, and Fork – form the basis of the open source license statement that is a part of the package. Some form of each of these rights is spelled out in the statement. What is left unsaid is the responsibilities of users of the open source software. These responsibilities are not specified in the open source license that accompanies the software, but they do form the core values of the community that grows around the development and enhancement of the project. Each community is different, just like each software package has its own features and quirks, but they generally have some or all of the following characteristics. And depending on each adopter's needs and capacity, there will be varying levels of commitments each organization can make to these responsibilities. As you work with open source code, I encourage you to keep these responsibilities in mind.

Participate
The first responsibility is to participate in the community. This can be as simple as joining web forums or mailing lists to get plugged into what the community is doing and how it goes about doing it. By joining and lurking for a while, you can get familiar with the community norms and find out what roles others are playing. The larger automation companies in libraries typically have users groups, and joining the community around an open source project is usually no different from joining one of those. One of the key aspects of an open source project is how welcoming it is to new participants. The Evergreen community has what I think is a great web page for encouraging libraries to try out the software, read the documentation, and get involved in the community.

Report Bugs
Library software has bugs – the systems we use are just too complicated to account for every possible variation as the software is developed and tested. If you find something that doesn't work, report it! Find the bug reporting process and create a good, comprehensive description of your problem. Chances are you are not the only one seeing the problem, and your report can help others triangulate the issue. This is an area where open source is, I think, distinctly different from proprietary software. With proprietary systems, the list of bugs is hidden from the customers. You may not know if you are the only one seeing a problem or if the issue is common to many other libraries. In open source projects, the bug list is open, and you can see if you are one among many people seeing the issue. In most open source projects, bug reports also include a running description of how the issue is being addressed – so you can see when to anticipate a fix coming in the software. As an aside for open source projects in the audience: make it easy for new community members to find and use your bug reporting process. This is typically a low barrier of entry into the community, and a positive experience in reporting a bug and getting an answer will encourage that new community member to stick around and help more. We'll talk about triaging bug reports in a minute.

Financially Support
Thinking back to our open source software rights, we know we can use the software without paying a license fee. That doesn't mean, though, that open source is free to write and maintain. Some of the most mature software projects are backed by foundations or community homes, and these organizations hire developers and outreach staff, fund user group meetings, and do other things that grow the software and the community surrounding it. If your organization's operations rely on a piece of open source, use some of the budget savings from not paying for a vendor's proprietary system to contribute to the organization that supports the software. DuraSpace is a success story here. Since its founding, it has attracted memberships from libraries all around the world. Libraries don’t have to pay to use the DSpace or Fedora software. Those that do pay recognize that their membership dues are going to fund community and technical managers as well as server infrastructure that everyone counts on to keep the projects running smoothly.

Help
As staff become more familiar with the open source system, they can share that expertise with others around them. This is not only personally rewarding, but it also improves the reputation of your organization. A healthy and growing open source community will have new adopters coming in all the time, and your experience can help someone else get started too. EIFL, an acronym for the English “Electronic Information for Libraries”, is a not-for-profit organization that works to enable access to knowledge in 47 countries in Africa, Asia, and Europe. One of their programs is to help libraries in developing countries adopt and use open source software for their own institutions. They gather groups of new users and match them with experienced users so they can all learn about a new open source software package at about the same pace. Through this mentoring program, these libraries now have capabilities that they previously didn’t have or couldn’t afford.

Triage
A few slides earlier, I encouraged new adopters to report bugs. That can quickly overwhelm a bug tracking system with issues that are not really issues, issues that have already been solved where the fix is waiting for the next release, issues with a workaround in a frequently asked questions document, and real bugs where more detail is needed for the developers to solve the problem. As a community member triaging bugs, you look for new reports that match your experience and where you can add more detail or point the submitter to a known solution or workaround. Sometimes this points to a need for better documentation (discussed in the next slide); other times it points to a fix or an enhancement needed in the software, and the report moves on to the development group. Another note for projects: provide a clear way for reported issues to move through the system. This can be as informal as a shared understanding in the community, or as formal as a state diagram published as part of the project's documentation that describes how an issue is tagged and moved through various queues until it reaches some resolution.

Documentation
Open source software is often criticized — rightly so — for poor documentation. It is often the last thing created as part of the development process, and is sometimes created by developers who feel more comfortable using a writing voice full of jargon rather than a voice that is clear, concise, and targeted to end-users. Contributing to documentation is a perfect place for expert users but inexperienced coders to make the deepest impact on a project. You don't need to understand how the code was written; you just need to describe clearly how the software is used. Documentation can come in the form of user manuals, frequently-asked-questions lists, and requests to the developers to add help language or change the display of a feature to make its use clearer.

Translate
Translation is also something that an experienced user can do to support a project. One sign of a mature open source project is that the developers have taken the time to extract the strings of field labels, help text, and user messages into a language file so that these strings can be translated into another language. Translating doesn't mean being able to write code; it just means being able to take these strings and convert them into another language. If you find an open source package that has all of the functionality you need but the native language of the system is not the native language of your users, providing translations can be a great way to make the software more useful to you while also opening up a whole new group of organizations that can use the project as well.

Test
This one gets a little more involved, but if you are running the software locally and can set up a test environment, try out release candidates as they come from the developers. Run a copy of your data and your workflows through the release candidate to make sure that nothing breaks and new features work as advertised. Some projects will create downloadable "virtual machines" or put the release candidate version of the software on a sandbox for everyone to test, which lowers the barrier for testing to just about anyone.

Request
How feature requests are made is another distinguishing characteristic between open source and proprietary systems. All proprietary systems have some way of registering feature requests and various processes for working through them. In an open source community, there is a lot more transparency about what is being worked on. All of the bug reports and feature requests are listed for everyone to see and comment on. There might even be a formal voting mechanism that guides developers on what to work on next. Volunteer developers from different organizations with similar needs can more easily find each other and tackle a problem together. Developers hired by the software's foundation or community home have a better understanding of what activity will have the biggest impact for the users. This all starts, though, with you making your requests and commenting on the requests of others.

Challenge
Healthy projects need forward tension to keep moving, and one way to do that is with eyes-wide-open constructive criticism. It is easy and common for communities to get bogged down in doing things the same way when it might make sense to try a different technique or adopt a different tool. It is also, unfortunately, common for communities to become insular and unwelcoming to new people or people unlike themselves. Open source works best when a wide variety of people are all driving towards the same goal. Be aware, though, that the good will within communities can be destroyed by unkind and insulting behavior. Just as meetings have adopted codes of conduct, I think it is appropriate for project communities to develop codes of conduct and to have action and enforcement mechanisms in place to step in when needed. This can be tough — in the cultural heritage field, participants in open source are typically volunteers, and it can be difficult to confront or offend a popular or prolific volunteer. The long-term health of the community requires it, though.

Code
Only at the very end do we get to coding. The software can't exist without developers, but it is too easy to put the developers first and forget the other pillars that the community relies on — bug reporters and triage volunteers, documentation writers and translators, software testers and community helpers. If you are a developer, try fixing a bug. Pick something that is small but annoying — something that scratches your own itch but that the more mainstream developers don't have time to tackle. A heads-up to open source project leaders: create a mentorship pathway for new developers to join the project, and provide a mechanism to list "easy" bugs that would be useful for developers new to a project to work on.

Public Libraries
Throughout the presentation I’ve mentioned academic libraries and library organizations that are making use of open source now. Open source adoption is not limited to academic libraries, though, and I wanted to mention the work of the Meadville Public Library in rural Pennsylvania in the United States. There is a case study on the FOSS4LIB.org website where they describe their library and how they came to choose the Koha integrated library system. The Meadville Public Library has a small staff and an even smaller technology budget. When they decided to migrate to a new system in the mid-2000s, they realized they had a choice: pay the commercial software licensing fees to a traditional library vendor, or put that money towards building skills in the staff to host a system locally. The case study describes their decision-making process, including site visits, fiscal analysis, and even joining a “hackfest” developer event in France to help build new functionality that their installation would need. I invite you to read through the case study to learn about their path to open source software. This library uses open source almost exclusively throughout their systems – from their desktop computers and word processing software to their servers.

Conclusion
Making use of open source software is more often about the journey than the destination. In the end, our libraries need systems that enable patrons to find the information they are seeking and to solve the problems that they face. If nothing else, open source software in libraries is a different path for meeting those needs. Open source software, though, can be more. It can be about engaging the library and its staff in the process of designing, building, and maintaining those systems. It can be about the peer-to-peer exchange of ideas with colleagues from other institutions and from service providers on ways to address our patrons’ needs. And sometimes open source software can be about reducing the total cost of ownership as compared with solutions from proprietary software providers. Libraries across the world have successfully adopted open source software. There have been a few unsuccessful projects as well. From each of these successful and unsuccessful projects, we learn a little more about the process and grow a little more as a profession. I encourage you, if you haven’t done so already, to learn about how open source software can help in reaching your library’s goals.
Thank you for your attention, and I am happy to take questions, observations, or to hear about your stories of using open source software.
From A. Soroka, the University of Virginia
DuraSpace News: NOW AVAILABLE–TRAC Certified Long-term Digital Preservation: DuraCloud and Chronopolis for Institutional Treasures
Austin, TX: An institution’s identity is often formed by what it saves for current and future access. Digital collections curated by the academy can include research data, images, texts, reports, artworks, books, and historic documents that help define an academic institution’s identity.
Sixteen years is long enough, surely, to get to know a cat.
Amelia had always been her mother’s child. She had father and sister too, but LaZorra was the one Mellie always cuddled up to and followed around. Humans were of dubious purpose, save for our feet: from the scent we trod back home Mellie seemed to learn all she needed of the outside world.
Her father, Erasmus, left us several years ago; while Mellie’s sister mourned, I’m not sure Rasi’s absence made much of an impression on our clown princess — after all, LaZorra remained, to provide orders and guidance and a mattress.
Where Zorri went, Mellie followed — and thus a cat who had little use for humans slept on our bed anyway.
Recently, we lost both LaZorra and Sophia, and we were afraid: afraid that Amelia’s world would close in on her. We were afraid that she would become a lost cat, waiting alone for comfort that would never return.
The first couple days after LaZorra’s passing seemed to bear our fears out. Amelia kept to her routine and food, but was isolated. Then, some things became evident.
Our bed was, in fact, hers. Hers to stretch out in, space for my legs be damned.
Our feet turned out not to suffice; our hands were required too. For that matter, for the first time in her life, she started letting us brush her.
And she enjoyed it!
Then she decided that we needed correction — so she began vocalizing, loudly and often.
And now we have a cat anew: talkative and demanding of our time and attention, confident in our love.
Sixteen years is not long enough to get to know a cat.
- Identifying the subject of the search.
- Locating this subject in a guide which refers the searcher to one or more documents.
- Locating the documents.
- Locating the required information in the documents.
These overlap somewhat with FRBR's user tasks (find, identify, select, obtain), but the first step in Vickery's group is my focus here: identifying the subject of the search. It is a step that I do not perceive as implied in the FRBR "find", and it is all too often missing from library/user interactions today.
A person walks into a library... Presumably, libraries are an organized knowledge space. If they weren't, the books would just be thrown onto the nearest shelf, and subject cataloging would not exist. However, if this organization isn't both visible to and comprehended by users, then, first, we are not getting the return on our cataloging investment, and second, users are not getting the full benefit of the library.
In Part V of my series on Catalogs and Context, I had two salient quotes. One by Bill Katz: "Be skeptical of the information the patron presents"; the other by Pauline Cochrane: "Why should a user ever enter a search term that does not provide a link to the syndetic apparatus and a suggestion about how to proceed?". Both of these address the obvious, yet often overlooked, primary point of failure for library users, which is the disconnect between how the user expresses his information need vis-a-vis the terms assigned by the library to the items that may satisfy that need.
Vickery's Three Issues for Stage 1
Issue 1: Formulating the topic

Vickery talks about three issues that must be addressed in his first stage, identifying the subject on which to search in a library catalog or indexing database. The first one is "...the inability even of specialist enquirers always to state their requirements exactly..." [1 p.1] That's the "reference interview" problem that Katz writes about: the user comes to the library with an ill-formed expression of what they need. We generally consider this to be outside the boundaries of the catalog, which means that it only exists for users who have an interaction with reference staff. Given that most users of the library today are not in the physical library, and that online services (from Google to Amazon to automated courseware) have trained users that successful finding does not require human interaction, these encounters with reference staff are a minority of the user-library sessions.
In online catalogs, we take what the user types into the search box as an appropriate entry point for a search, even though another branch of our profession is based on the premise that users do not enter the library with a perfectly formulated question, and need an intelligent intervention to have a successful interaction with the library. Formulating a precise question may not be easy, even for experienced researchers. For example, in a search about serving persons who have been infected with HIV, you may need to decide whether the research requires you to consider whether the person who is HIV positive has moved along the spectrum to be medically diagnosed as having AIDS. This decision is directly related to the search that will need to be done:
HIV-positive persons--Counseling of
AIDS (Disease)--Patients--Counseling of
Issue 2: From topic to query

The second of Vickery's caveats is that "[The researcher] may have chosen the correct concepts to express the subject, but may not have used the standard words of the index."[1 p.4] This is the "entry vocabulary" issue. What user would guess that the question "Where all did Dickens live?" would be answered with a search using "Dickens, Charles -- Homes and haunts"? And that all of the terms listed as "use for" below would translate to the term "HIV (Viruses)" in the catalog? (h/t Netanel Ganin):
As Pauline Cochrane points out, beginning in the latter part of the 20th century, libraries found themselves unable to include the necessary cross-reference information in their card catalogs, due to the cost of producing the cards. Instead, they asked users to look up terms in the subject heading reference books used by catalog librarians to create the headings. These books are not available to users of online catalogs, and although some current online catalogs include authorized alternate entry points in their searches, many do not.* This means that we have multiple generations of users who have not encountered "term switching" in their library catalog usage, and who probably do not understand its utility.
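A minimal sketch of how such "term switching" could work, using hypothetical cross-reference data rather than any real catalog's API or the actual LCSH authority file: a table of "use for" entry terms maps what a user might type onto the authorized heading before the search is run.

```python
# Hypothetical "use for" cross-reference table: entry vocabulary a user
# might type, mapped to the authorized heading used in the catalog.
# (Illustrative data only, not the actual LCSH authority records.)
USE_FOR = {
    "carcinoma": "Cancer",
    "neoplasms": "Cancer",
    "aids virus": "HIV (Viruses)",
}

def authorized_term(query: str) -> str:
    """Switch a user's term to the authorized heading, if one is known."""
    return USE_FOR.get(query.strip().lower(), query)

# A search on "carcinoma" would actually be run against "Cancer";
# a term with no cross-reference passes through unchanged.
print(authorized_term("Carcinoma"))   # Cancer
print(authorized_term("gardening"))   # gardening
```

A catalog that performs this lookup behind the search box (as BiblioCommons appears to, per the note below) spares the user from ever needing to know the authorized form; one that does not silently returns only the records that happen to contain the typed term.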
Even with such a terminology-switching mechanism, finding the proper entry in the catalog is not at all simple. The article by Thomas Mann (of Library of Congress, not the German author) on “The Peloponnesian War and the Future of Reference, Cataloging, and Scholarship in Research Libraries”  shows not only how complex that process might be, but it also indicates that the translation can only be accomplished by a library-trained expert. This presents us with a great difficulty because there are not enough such experts available to guide users, and not all users are willing to avail themselves of those services. How would a user discover that literature is French, but performing arts are in France?:
Performing arts -- France -- History
Or, using the example in Mann's piece, the searcher looking for information on tribute payments in the Peloponnesian War needed to look under "Finance, public–Greece–Athens". This type of search failure fuels the argument that full text search is a better solution, and a search of Google Books on "tribute payments Peloponnesian war" does yield some results. The other side of the argument is that full text searches fail to retrieve documents not in the search language, while library subject headings apply to all materials in all languages. Somehow, this latter argument, in my experience, doesn't convince.
Issue 3: Term order

The third point by Vickery is one that keyword indexing has solved, which is "...the searcher may use the correct words to express the subject, but may not choose the correct combination order."[1 p.4] In 1959, when Vickery was writing this particular piece, having the wrong order of terms resulted in a failed search. Mann, however, would say that with keyword searching the user does not encounter the context that the pre-coordinated headings provide; thus keyword searching is not a solution at all. I'm with him part way, because I think keyword searching as an entry to a vocabulary can be useful if the syndetic structure is visible with such a beginning. Keyword searching directly against bibliographic records, less so.
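Why keyword indexing is insensitive to term order can be shown with a toy sketch (not any real catalog's implementation): both the heading and the query reduce to unordered sets of terms, and sets have no order to get wrong.

```python
def keywords(text: str) -> set[str]:
    """Reduce a string to an unordered set of lowercase terms,
    dropping subdivision punctuation like '--'."""
    return {t.strip(",-") for t in text.lower().split() if t.strip(",-")}

heading = keywords("Performing arts -- France -- History")

# Every ordering of the query terms matches the same indexed heading.
print(keywords("france performing arts") <= heading)   # True
print(keywords("arts performing france") <= heading)   # True
```

What the set representation discards, of course, is exactly the pre-coordinated context Mann values: "Performing arts -- France" and a hypothetical "France -- Performing arts" would index identically.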
Comparison to FRBR "find"

FRBR's "find" is described as "to find entities that correspond to the user’s stated search criteria". [6 p. 79] We could presume that in FRBR the "user's stated search criteria" has either been modified through a prior process (although I hardly know what that would be, other than a reference interview), or that the library system has the capability to interact with the user in such a way that the user's search is optimized to meet the terminology of the library's knowledge organization system. This latter would require some kind of artificial intelligence and seems unlikely. The former simply does not happen often today, with most users being at a computer rather than a reference desk. FRBR's find seems to carry the same assumption as has been made functional in online catalogs, which is that the appropriateness of the search string is not questioned.
Summary

There are two take-aways from this set of observations:
- We are failing to help users refine their query, which means that they may actually be basing their searches on concepts that will not fulfill their information need in the library catalog.
- We are failing to help users translate their query into the language of the catalog(s).
I would add that the language of the catalog should show users how the catalog is organized and how the knowledge universe is addressed by the library. This is implied in the second take-away, but I wanted to bring it out specifically, because it is a failure that particularly bothers me.
Notes

* I did a search in various catalogs on "cancer" and "carcinoma". Cancer is the form used in LCSH-cataloged bibliographic records, and carcinoma is a cross-reference. I found a local public library whose BiblioCommons catalog did retrieve all of the records with "cancer" in them when the search was on "carcinoma", and that the same search in the Harvard Hollis system did not (carcinoma: 1,889 retrievals; cancer: 21,311). These are just two catalogs, and not a representative sample, to say the least, but they illustrate the point.
References

1. Vickery, B. C. Classification and Indexing in Science. New York: Academic Press, 1959.
2. Katz, Bill. Introduction to Reference Work: Reference Services and Reference Processes. New York: McGraw-Hill, 1992. p. 82. http://www.worldcat.org/oclc/928951754. Cited in: Brown, Stephanie Willen. "The Reference Interview: Theories and Practice." Library Philosophy and Practice, 2008. ISSN 1522-0222.
3. Cochrane, Pauline A., Marcia J. Bates, Margaret Beckman, Hans H. Wellisch, Sanford Berman, Toni Petersen, and Stephen E. Wiberley, Jr. "Modern Subject Access in the Online Age: Lesson 3." American Libraries, Vol. 15, No. 4 (Apr. 1984), pp. 250-252, 254-255. Stable URL: http://www.jstor.org/stable/25626708
4. Cochrane, Pauline A. "Modern Subject Access in the Online Age: Lesson 2." American Libraries, Vol. 15, No. 3 (Mar. 1984), pp. 145-148, 150. Stable URL: http://www.jstor.org/stable/25626647
5. Mann, Thomas. "The Peloponnesian War and the Future of Reference, Cataloging, and Scholarship in Research Libraries" (June 13, 2007). PDF, 41 pp. http://guild2910.org/Pelopponesian%20War%20June%2013%202007.pdf
6. IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records, 2009. http://archive.ifla.org/VII/s13/frbr/frbr_2008.pdf
Library of Congress: The Signal: Nominations Sought for the U.S. Federal Government End of Term Web Archive
This is a guest post by Abbie Grotke, lead information technology specialist of the Library of Congress Web Archiving Team
Readers of The Signal may recall prior efforts to archive United States Federal Government websites during the end of presidential terms. I last wrote about this in 2012 when we were working on preserving the government domain during the end of President Obama’s first term. To see the results of our 2008 and 2012 efforts, visit the End of Term Archive.
As the Obama administration comes to a close, the End of Term project team has formed again and we need help from you.
For the End of Term 2016 archive, the Library of Congress, California Digital Library, University of North Texas Libraries, Internet Archive, George Washington University Libraries, Stanford University Libraries and the U.S. Government Publishing Office have joined together for a collaborative project to preserve public United States Government websites at the end of the current presidential administration ending January 20, 2017. Partners are joining together to select, collect, preserve, and make the web archives available for research use.
This web harvest — like its predecessors in 2008 and 2012 — is intended to document the federal government’s presence on the web during the transition of Presidential administrations and to enhance the existing collections of the partner institutions. This broad comprehensive crawl of the .gov domain will include as many federal .gov sites as we can find, plus federal content in other domains (such as .mil, .com and social media content).
And that’s where you come in. You can help the project immensely by nominating your favorite .gov websites, other federal government websites, or governmental social media accounts with the End of Term Nomination Tool. Please nominate as many sites as you want. Nominate early and often. Tell your friends, family, and colleagues to do the same. Help us preserve the .gov domain for posterity, public access, and long-term preservation.
I’ve never actually read Fred Brooks’ Mythical Man-Month, but I have picked up many of its ideas by cultural osmosis. I think I’m not alone: it’s a book that’s very popular by reputation, but perhaps not actually very influential in terms of its ideas being internalized by project managers and architects.
Or as Brooks himself said:
Some people have called the book the “bible of software engineering.” I would agree with that in one respect: that is, everybody quotes it, some people read it, and a few people go by it.
Ha. I should really get around to reading it; I routinely run into things that remind me of the ideas I understand from it, which I’ve just sort of absorbed (perhaps inaccurately).
In the meantime, here’s another good quote from Brooks to stew upon:
The ratio of function to conceptual complexity is the ultimate test of system design.
Quite profound, really. Software packages that are terribly frustrating to work with can, I think, almost always be described in those terms: the ratio of function to conceptual complexity is far, far too low. That is nearly(?) the definition of a frustrating software package.
Filed under: General
Open Knowledge Foundation: What does personal data have to do with open data? Initial thoughts from #MyData2016
This piece is part of a series of posts from MyData 2016 – an international conference that focuses on human centric personal information management. The conference is co-hosted by the Open Knowledge Finland chapter of the Open Knowledge International Network.
What does personal data have to do with open data? We usually preach NOT to open personal data, and to be responsible about it. So why should an open knowledge organisation devote a whole conference to topics related to personal data management? I will explore these questions in a series of blog posts written straight from the MyData16 conference in Helsinki, Finland.
MyData is a very abstract concept that is still in the process of refinement. In its essence, MyData is about giving users control of the personal data trail that we leave on the internet. Under the MyData framework, users decide where to store their data and can control and guide how this data is used. In most applications today, our data is closed and owned by big corporations, where it is primarily used to make money. The MyData concept looks to bring control back to the user, but also tries to develop the commercial use of the data, making everyone happy.
Here is Mika Honkanen, vice chairman of the OK Finland board, explaining MyData: http://blog.okfn.org/files/2016/08/MyDataMika.mp3
For those of you who missed the Open Knowledge Festival in 2012 (like me), Open Knowledge Finland know how to produce events. Besides the conference program (and super exciting evening program!), you can also find the Ultrahack, a 72-hour hackathon that will try to answer my questions above and will be involved in creating practical applications of the MyData concept. I am excited to see how it will turn out and what uses, both social and fiscal, people can find.
For the following three days, keep following us on the OKI Twitter account for updates from the conference. Check the MyData website, and let us know if you want us to go to a session for you!