Feed aggregator

DuraSpace News: From Ethiopia to Portugal: Say Goodbye to 2017 and Welcome to 2018

planet code4lib - Tue, 2017-12-19 00:00

From Claudio Cortese, 4Science  4Science is in Africa to support VLIR-UOS. VLIR-UOS is a Flemish initiative that supports partnerships between universities in Flanders (Belgium) and in developing countries, looking for innovative responses to global and local challenges.

DuraSpace News: Migrating to DSpace

planet code4lib - Tue, 2017-12-19 00:00

From Atmire   Are you considering migrating your current repository to DSpace but wondering what the options are? Over the years, Atmire has carried out migrations from a variety of platforms such as EPrints, Fedora, Digital Commons, CONTENTdm, DigiTool, Equella and homegrown software. Because of its wide array of commonly used import facilities and formats, migrating to DSpace may require less work and introduce less risk than you may think.

District Dispatch: House committee passes Higher Education Act

planet code4lib - Mon, 2017-12-18 21:45

As anticipated in a District Dispatch post earlier this fall, the House Education and Workforce Committee this week approved (or marked up) legislation to reauthorize the Higher Education Act on a party line 23-17 vote. This legislation, among other things, threatens a popular student loan program for LIS program graduates.

The Promoting Real Opportunity, Success, and Prosperity Through Education (PROSPER) Act, introduced earlier this month by Committee Chairwoman Virginia Foxx (R-NC), is an authorization bill (not a funding measure) that sets policies for federal student financial aid programs. The PROSPER Act would largely keep the overall structure of the original 1965 Higher Education Act and authorize the Act through 2024 (the most recent authorization expired in 2013). The legislation streamlines student loan programs and expands job training opportunities, among other changes.

The PROSPER Act also threatens the Public Service Loan Forgiveness (PSLF) Program, which supports graduates working in the public sector, often in lower-paying jobs. PSLF allows graduates to erase debt after 10 years of working in public service careers and meeting loan payments. Librarians are among those who currently qualify for benefits under the program. (Under the PROSPER Act, current participants in the program would not be impacted by the proposed elimination.) In addition, demand in rural communities for PSLF students remains high. ALA is alarmed by the proposed elimination of the PSLF program and has worked throughout the fall to advocate for the retention of the program.

Also threatened by the PROSPER Act are programs for teacher prep under Title II of the Act.

In a 14-hour-long marathon markup of the PROSPER Act, the Committee considered more than 60 amendments, including one introduced by Rep. Joe Courtney (D-CT) to support PSLF that narrowly failed on a 20-19 vote.

The PROSPER Act now heads to the full House of Representatives for consideration. The timing is uncertain as Congress faces a number of budget and tax priorities that must be addressed. In addition, some members from rural communities have raised concern with key provisions such as the elimination of PSLF, which is popular for rural communities seeking to attract college graduates. We expect the PROSPER Act to come up in the House early next year.

The Senate has not announced a timetable for its Higher Education Act legislation, but Senator Lamar Alexander (R-TN), who chairs the Senate Health, Education, Labor, and Pensions Committee, has expressed optimism that he can craft a bill with the committee’s ranking member, Senator Patty Murray (D-WA). PSLF has generally received stronger support in the Senate, and ALA will continue to work with members of Congress to defend PSLF in the coming months.


Eric Lease Morgan: LexisNexis hacks

planet code4lib - Mon, 2017-12-18 17:37

This blog posting briefly describes and makes available two Python scripts I call my LexisNexis hacks.

The purpose of the scripts is to enable the researcher to reformulate LexisNexis full text downloads into tabular form. To accomplish this goal, the researcher is expected to first search LexisNexis for items of interest. They are then expected to do a full text download of the results as a plain text file. Attached ought to be an example that includes about five records. The first of my scripts — — parses the search results into individual records. The second script — — reads the output of the first script and parses each file into individual but selected fields. The output of the second script is a tab-delimited file suitable for further analysis in any number of applications.

These two scripts can work for a number of people, but there are a few caveats. First, saves its results as a set of files with randomly generated file names. It is possible, albeit unlikely, that files will get overwritten because the same randomly generated file name was… generated. Second, the output of only includes fields required for a specific research question. It is left up to the reader to edit for additional fields.

In short, your mileage may vary.

District Dispatch: Deadlock on Section 702 surveillance bills

planet code4lib - Mon, 2017-12-18 14:28

Section 702 of the Foreign Intelligence Surveillance Act (FISA) is due to expire on Dec. 31 of this year. Currently three Senate and two House bills to reauthorize it are on the table, but the House and Senate Judiciary and Intelligence Committees are not close to resolving differences. Experts say that the reauthorization deadline likely will be extended by a short-term continuing resolution, pushing the legislation into next year.

The ALA Washington Office continues to support legislation that would advance the privacy protections of the public, such as the USA Rights Act (S. 1997). The Washington Office has joined coalition partners by signing a letter in support of S. 1997 to close the “backdoor search loophole” by requiring that surveillance of Americans only be done with a warrant. While no legislation pending would specifically impact libraries, the right to privacy is central to the ability of individuals to seek and communicate information freely and anonymously, a core value of libraries.

For more information on FISA and existing legislation see the Electronic Frontier Foundation’s “decoding FISA” resources.


Terry Reese: MarcEdit 7 Updates: Focusing on the Task Broker

planet code4lib - Mon, 2017-12-18 14:04

This past weekend, I spent a bit of time doing some additional work on the task broker. Since the release in November, I’ve gotten a lot of feedback from folks who are doing all kinds of interesting things in tasks. And I’ll be honest, many of these task processes are things I could never have imagined. These processes have often tripped the broker, which evaluates each task prior to running to determine whether the task actually needs to be run. So, I spent a lot of time this weekend with a number of very specific examples, updating the task broker to identify the outliers and let them through. This may mean running a few additional cycles that return zero results, but I think it makes more sense for the broker to pass these edge cases through rather than attempting to accommodate every case (some of which would be really hard to accommodate).

I’ve also updated the task broker so that it no longer looks only at the task to decide how to process the file. The tool now reads the file to be processed and, based on file size, auto-scales the buffer of records processed on each pass. This way, smaller files are processed closer to 1 record at a time, while larger files are buffered in record groups of 1500.
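The file-size-based buffering described above can be pictured with a small sketch. It is written in Ruby purely for illustration and is not MarcEdit's code: the byte cutoffs, the linear scaling between them, and the names `records_per_pass` and `each_batch` are all hypothetical. Only the two endpoints (about 1 record per pass for small files, 1500 for large ones) come from the post.

```ruby
# Hypothetical cutoffs; the post only gives the endpoints (about 1 and 1500).
SMALL_FILE_BYTES = 1_000_000     # assumed upper bound for a "small" file
LARGE_FILE_BYTES = 100_000_000   # assumed lower bound for a "large" file
MIN_BUFFER = 1
MAX_BUFFER = 1500

# Pick how many records to process per pass, scaling linearly with file size
# between the two assumed cutoffs.
def records_per_pass(file_size_bytes)
  return MIN_BUFFER if file_size_bytes <= SMALL_FILE_BYTES
  return MAX_BUFFER if file_size_bytes >= LARGE_FILE_BYTES

  span = (LARGE_FILE_BYTES - SMALL_FILE_BYTES).to_f
  fraction = (file_size_bytes - SMALL_FILE_BYTES) / span
  (MIN_BUFFER + fraction * (MAX_BUFFER - MIN_BUFFER)).round
end

# Yield the records in groups of the chosen buffer size.
def each_batch(records, file_size_bytes, &block)
  records.each_slice(records_per_pass(file_size_bytes), &block)
end
```

With these assumed cutoffs, a file of a few kilobytes is handled one record at a time, while a file of several hundred megabytes is handed to the block in groups of 1500.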

Anyway, a number of changes were made this weekend; the full list is below:

  • Enhancement: Task Broker — added additional preprocessing functions to improve processing (specifically, NOT elements in the Replace task, updates to the Edit Field)
  • Enhancement: Task Broker — updated the process that selects the by record or by file approach to utilize file size in selecting the record buffer to process.
  • Enhancement: New Option — added an option to offer preview mode if the file is too large.
  • Enhancement: Results Window — added an option to turn on word wrapping within the window.
  • Enhancement: Main Window — More content allowed to be added to the most recent programs run list
  • Enhancement: Added a shortcut (CTRL+P) to immediately open the Recent Programs List
  • Bug Fix: Constant Data Elements — if the source file wasn’t present, an error was thrown rather than automatically generating a new file. This has been corrected.
  • Update: Task Debugger — Updated the task debugger to make stepping through tasks easier.
  • Bug Fix: Task Processor — commented tasks were sometimes running — this has been corrected in both the MarcEditor and console program.
  • Enhancement: Status messages have been added to many of the batch edit functions in the MarcEditor to provide more user feedback.
  • Enhancement: Added a check so that if you use “X” in places where the tool allows field selection with “x” or “*”, the selection is case-insensitive (this has not been the default; it worked that way in MarcEdit 6, but it was technically not a supported format).
  • Updated Installer: .NET Requirements set to 4.6 as the minimum. This was done because there are a few problems when running only against 4.5.



DuraSpace News: iPRES 2018 Preliminary Call for Contributions

planet code4lib - Mon, 2017-12-18 00:00

From Sibyl Schaefer, Chronopolis Program Manager and Digital Preservation Analyst, University of California, San Diego

iPRES 2018 BOSTON—Where Art and Science Meet —The Art In the Science & The Science In the Art of Digital Preservation—will be co-hosted by MIT Libraries and Harvard Library on September 24-27, 2018. Help us celebrate the first 15 Years of iPRES, the premier international conference on the preservation and long-term management of digital materials.

DuraSpace News: VIVO Updates for December 17

planet code4lib - Mon, 2017-12-18 00:00

From Mike Conlon, VIVO Project Director

Congratulations Cambridge and Cornell!  The University of Cambridge launched its long awaited VIVO this week.  Congratulations to the team at Cambridge, and to Symplectic, who assisted with the implementation.  Congratulations to the team at Cornell who launched an upgrade to Scholars at Cornell this week.

District Dispatch: ALA invites nominations for 2018 James Madison Award

planet code4lib - Fri, 2017-12-15 16:40

The American Library Association’s (ALA) Washington Office invites nominations for its James Madison Award. The award recognizes an individual or group who has championed, protected, or promoted public access to government information and the public’s right to know at the national level. 

The award is named in honor of President James Madison, known as “the Father of the Constitution.” Madison was a noted advocate for the importance of an informed citizenry, famously writing in 1822:

“A popular Government, without popular information, or the means of acquiring it, is but a prologue to a Farce or a Tragedy; or perhaps both. Knowlege will for ever govern ignorance: and a people who mean to be their own Governours, must arm themselves with the power which knowledge gives.”

Winner of the 2017 Madison Award, Sen. Jon Tester of Montana, was nominated by ALA member Ann Ewbank for his longstanding bipartisan commitment to increasing public access to information. You can watch the presentation and Sen. Tester’s  acceptance speech here.

ALA will present the 2018 award during or near Sunshine Week, the week that includes March 16, Madison’s birthday.

Submit your nominations to the ALA Washington Office, care of Gavin Baker, no later than January 19, 2018. Please include a brief statement (maximum one page) about the nominee’s contribution to public access to government information. If possible, please also include a brief biography and contact information for the nominee.


Brown University Library Digital Technologies Projects: Understanding scoring of documents in Solr

planet code4lib - Fri, 2017-12-15 14:03

During the development of our new Researchers@Brown front-end I spent a fair amount of time looking at the results that Solr gives when users execute searches. Although I have always known that Solr uses a sophisticated algorithm to determine why a particular document matches a search and why a document ranks higher in the search results than others, I had never looked very closely at the details of how this works.

This post is an introduction to interpreting the ranking (scoring) that Solr reports for documents returned in a search.

Requesting Solr “explain” information

When submitting a search request to Solr, it is possible to request debug information that clarifies how Solr interpreted the client request and how the score for each document was calculated for the given search terms.

To request this information you just need to pass debugQuery=true as a query string parameter to a normal Solr search request. For example:

$ curl "http://someserver/solr/collection1/select?q=alcohol&wt=json&debugQuery=true"

The response for this query will include debug information with a property called explain where the ranking of each of the documents is explained.
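As a rough sketch of the same request from code, the Ruby below builds the query string used in the curl example and pulls the explain section out of a JSON response body. The method names `debug_query_url` and `explain_by_document` are made up for this illustration, and the server and collection names are the placeholders from the curl example, not a real endpoint.

```ruby
require "json"
require "uri"

# Build a select URL that asks Solr for debug information.
# base_url is a placeholder such as "http://someserver/solr/collection1".
def debug_query_url(base_url, query)
  params = { "q" => query, "wt" => "json", "debugQuery" => "true" }
  "#{base_url}/select?#{URI.encode_www_form(params)}"
end

# Return the explain hash (document key => explain text) from a Solr JSON
# response body, or an empty hash when the response has no debug section.
def explain_by_document(response_body)
  JSON.parse(response_body).dig("debug", "explain") || {}
end
```

Fetching the URL is left out on purpose; any HTTP client will do, and the body it returns can be passed straight to `explain_by_document`.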

"debug": {
  ...
  "explain": {
    "id:1": "a lot of text here",
    "id:2": "a lot of text here",
    "id:3": "a lot of text here",
    "id:4": "a lot of text here",
    ...
  }
}

Raw “explain” output

Although Solr gives information about how the score for each document was calculated, the format that it uses to provide this information is horrendous. This is an example of what the explain information for a given document looks like:

"id:":"\n 1.1542457 = (MATCH) max of:\n 4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcohol in 831) [DefaultSimilarity], result of:\n 4.409502E-5 = score(doc=831,freq=6.0 = termFreq=6.0\n ), product of:\n 2.2283042E-4 = queryWeight, product of:\n 5.170344 = idf(docFreq=60, maxDocs=3949)\n 4.3097796E-5 = queryNorm\n 0.197886 = fieldWeight in 831, product of:\n 2.4494898 = tf(freq=6.0), with freq of:\n 6.0 = termFreq=6.0\n 5.170344 = idf(docFreq=60, maxDocs=3949)\n 0.015625 = fieldNorm(doc=831)\n 4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) [DefaultSimilarity], result of:\n 4.27615E-5 = score(doc=831,freq=6.0 = termFreq=6.0\n), product of:\n 2.1943514E-4 = queryWeight, product of:\n 5.0915627 = idf(docFreq=65, maxDocs=3949)\n 4.3097796E-5 = queryNorm\n 0.1948708 = fieldWeight in 831, product of:\n 2.4494898 = tf(freq=6.0), with freq of:\n 6.0 = termFreq=6.0\n 5.0915627 = idf(docFreq=65, maxDocs=3949)\n 0.015625 = fieldNorm(doc=831)\n 1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) [DefaultSimilarity], result of:\n 1.1542457 = score(doc=831,freq=1.0 = termFreq=1.0\n ), product of:\n 0.1410609 = queryWeight, product of:\n 400.0 = boost\n 8.182606 = idf(docFreq=2, maxDocs=3949)\n 4.3097796E-5 = queryNorm\n 8.182606 = fieldWeight in 831, product of:\n 1.0 = tf(freq=1.0), with freq of:\n 1.0 = termFreq=1.0\n 8.182606 = idf(docFreq=2, maxDocs=3949)\n 1.0 = fieldNorm(doc=831)\n",

It’s unfortunate that the information comes in a format that is not easy to parse, but since it’s plain text, we can read it and analyze it.

Explaining “explain” information

If you look closely at the information in the previous text you’ll notice that Solr reports the score of a document as the maximum value (max of) from a set of other scores. For example, below is a simplified version of the text (the ellipses represent text that I suppressed):

"id:":"\n 1.1542457 = (MATCH) max of:\n 4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcohol in 831) ... ... 4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) ... ... 1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) ... ..."

In this example, the score of 1.1542457 for the document was the maximum of three scores (4.409502E-5, 4.27615E-5, 1.1542457). Notice that the two smaller scores are in E-notation (4.409502E-5 is 0.00004409502). If you look closely you’ll see that each of those scores is associated with a different field in Solr where the search term, alcohol, was found.

From the text above we can determine that the text alcohol was found in the ALLTEXTUNSTEMMED, ALLTEXT, and research_areas fields. We can also tell that for this particular search we are giving the research_areas field a boost of 400, which explains why that particular score was much higher than the rest.
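To make that reading mechanical, here is a minimal Ruby sketch that picks the per-field lines out of the explain text and reports which field produced the winning score. It is not the parser used on the site; `FieldMatch`, `WEIGHT_LINE` and `field_matches` are names invented for this example, and the pattern assumes single-term matches in the explain format shown in this post.

```ruby
# One parsed "weight(...)" line from the explain text.
# boost is nil when the line shows no ^boost.
FieldMatch = Struct.new(:field, :term, :boost, :score)

# Matches lines such as:
#   1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) ...
WEIGHT_LINE = /^\s*([\d.E+-]+) = \(MATCH\) weight\((\w+):([^\s^)]+)(?:\^([\d.]+))? in \d+\)/

def field_matches(explain_text)
  explain_text.each_line.map do |line|
    m = WEIGHT_LINE.match(line)
    next unless m
    FieldMatch.new(m[2], m[3], m[4] && m[4].to_f, m[1].to_f)
  end.compact
end

# The three top-level lines from the example in this post.
SAMPLE_EXPLAIN = <<~TEXT
  1.1542457 = (MATCH) max of:
    4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcohol in 831) [DefaultSimilarity], result of:
    4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) [DefaultSimilarity], result of:
    1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) [DefaultSimilarity], result of:
TEXT

best = field_matches(SAMPLE_EXPLAIN).max_by(&:score)
puts "#{best.field} (boost #{best.boost}) scored #{best.score}"
```

Because the document score is the "max of" the field scores, the entry with the largest score is the one that determined the ranking; here that is the boosted research_areas field.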

The information that I omitted in the previous example provides a more granular explanation of how each of those individual field scores was calculated. For example, below is the detail for the research_areas field:

1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) [DefaultSimilarity], result of:\n 1.1542457 = score(doc=831,freq=1.0 = termFreq=1.0\n), product of:\n 0.1410609 = queryWeight, product of:\n 400.0 = boost\n 8.182606 = idf(docFreq=2, maxDocs=3949)\n 4.3097796E-5 = queryNorm\n 8.182606 = fieldWeight in 831, product of:\n 1.0 = tf(freq=1.0), with freq of:\n 1.0 = termFreq=1.0\n 8.182606 = idf(docFreq=2, maxDocs=3949)\n 1.0 = fieldNorm(doc=831)\n",

Again, if we look closely at this text we see that the score of 1.1542457 for the research_areas field was the product of two other factors (0.1410609 x 8.182606). There is even information about how these individual factors were calculated. I will not go into details on them in this blog post but if you are interested this is a good place to start.
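For readers who want to check the arithmetic, the sketch below recomputes the numbers from the values printed in the explain output. It assumes the classic Lucene TF-IDF formulas that the [DefaultSimilarity] label refers to, namely tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the method names are made up for this example, and other similarity implementations will not follow these formulas.

```ruby
# Classic Lucene TF-IDF building blocks (assumed, see the note above).
def tf(freq)
  Math.sqrt(freq)
end

def idf(doc_freq, max_docs)
  1.0 + Math.log(max_docs.to_f / (doc_freq + 1))
end

# queryWeight = boost * idf * queryNorm
def query_weight(boost, idf_value, query_norm)
  boost * idf_value * query_norm
end

# fieldWeight = tf * idf * fieldNorm
def field_weight(tf_value, idf_value, field_norm)
  tf_value * idf_value * field_norm
end

# score = queryWeight * fieldWeight
def term_score(freq:, doc_freq:, max_docs:, boost:, query_norm:, field_norm:)
  idf_value = idf(doc_freq, max_docs)
  query_weight(boost, idf_value, query_norm) *
    field_weight(tf(freq), idf_value, field_norm)
end

# Values copied from the research_areas section of the explain output.
research_areas = term_score(freq: 1.0, doc_freq: 2, max_docs: 3949,
                            boost: 400.0, query_norm: 4.3097796e-5,
                            field_norm: 1.0)
puts research_areas
```

The printed value should agree with the reported 1.1542457 to within rounding, since Solr prints single-precision numbers. Feeding in the ALLTEXTUNSTEMMED values (freq 6, docFreq 60, fieldNorm 0.015625, no boost) should likewise land close to 4.409502E-5.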

Another interesting clue that Solr provides in the explain information is what values were searched for in a given field. For example, if I search for the word alcoholism (instead of alcohol) the Solr explain result would show that in one of the fields it used the stemmed version of the search term and in another it used the original text. In our example, this would look more or less like this:

"id:":"\n 1.1542457 = (MATCH) max of:\n 4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcoholism in 831) ... ... 4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) ... ...

Notice how in the unstemmed field (ALLTEXTUNSTEMMED) Solr used the exact word searched (alcoholism) whereas in the stemmed field (ALLTEXT) it used the stemmed version (alcohol). This is very useful to know if you were wondering why a value was (or was not) found in a given field. Likewise, if you are using (query time) synonyms those will show in the Solr explain results.

Live examples

In our new Researchers@Brown site we have an option to show the explain information from Solr. This option is intended for developers to troubleshoot tricky queries, not for the average user.

For example, if you pass explain=text to a search URL in the site you’ll get the text of the Solr explain output formatted for each of the results (scroll to the very bottom of the page to see the explain results).

Likewise, if you pass explain=matches to a search URL the response will include only the values of the matches that Solr evaluated (along with the field and boost value) for each document.

Source code

If you are interested in the code that we use to parse the Solr explain results you can find it in our GitHub repo. The code for this lives in two classes, Explainer and ExplainEntry.

Explainer takes a Solr response and creates an array with the explain information for each result. This array is comprised of ExplainEntry objects that in turn parse each of the results to make the match information easily accessible. Keep in mind that this code does mostly string parsing and therefore is rather brittle. For example, the code to extract the matches for a given document is as follows:

class ExplainEntry
  ...
  # Keep only the lines that describe a match or a coord factor.
  def get_matches(text)
    lines = text.split("\n").select { |l| l.include?("(MATCH)") || l.include?("coord(") }
  end
end

As you can imagine, if Solr changes the structure of the text that it returns in the explain results this code will break. I get the impression that this format (as ugly as it is) has been stable in many versions of Solr so hopefully we won’t have many issues with this implementation.

Journal of Web Librarianship: Editorial Board

planet code4lib - Wed, 2017-12-13 05:14
Volume 11, Issue 3-4, July-December 2017, Page ebi-ebi

District Dispatch: Keeping public information public: an introduction to federal records laws

planet code4lib - Tue, 2017-12-12 21:17

This is the first post in a three-part series.

The National Archives and Records Administration preserves and documents government and historical records. Photo by Wknight94

The federal government generates an almost unimaginable quantity of information, such as medical research, veterans’ service records, regulations and policy documents, and safety inspection records, to name just a few examples.

For decades, ALA has advocated for policies that ensure government information is appropriately managed, preserved and made available to the public, including through libraries. Federal records laws play important roles in those processes. This series will introduce those laws and highlight two current issues that impact public access to government information: the management of information on federal websites and the preservation and destruction of government information.

What are the fundamental concepts of federal records laws?

Today’s legal framework for managing government information dates back to the 1934 National Archives Act and the 1950 Federal Records Act (FRA). Under the supervision of the National Archives and Records Administration (NARA), federal agencies must:

  1. Create adequate records to document their activities;
  2. Keep the records for a sufficient length of time, based on their value; and
  3. Once the government no longer has an immediate use for the records, transfer those records with permanent value to the National Archives and dispose of the rest.

To highlight some of the significant implications of that framework:

  • Federal agencies do not have the discretion to do whatever they want with their records; they have responsibilities under federal law.
  • The National Archives isn’t just a building with a bunch of neat old documents: it actively oversees the ongoing archival activities of federal agencies. In particular, agencies generally may not destroy records without NARA’s approval.
  • To avoid drowning in a sea of information, the government needs to determine the expected value of a record – in other words, to predict whether a record will be needed next year, next decade, or next century, and then dispose of it after that time.

How do records laws affect public access to government information?

Members of the public use federal records both directly and indirectly:

Directly: The public may access records directly through a federal website, in an agency reading room, or by making a request under the Freedom of Information Act (FOIA) or a similar process. If your library has ever helped a user find information from a federal website, you’ve probably worked with a federal record.

Indirectly: Researchers and editors, both inside and outside of government, may utilize records in creating new information products for a larger audience. For instance, inside government, the Executive Office for Immigration Review publishes a bound volume of selected administrative decisions by compiling individual records into a product of interest to a wider audience. Outside of government, journalists often make FOIA requests for records, then incorporate information gleaned from those records into news stories that can better inform the public about government decision-making. Libraries might then collect the resulting publications that incorporated the information from federal records.

However, these activities can only take place if federal agencies properly create, manage and preserve their records. In an upcoming post, learn more about how agencies manage records on federal websites.


Lucidworks: 7 Predictions for Search in 2018

planet code4lib - Tue, 2017-12-12 18:26

It wasn’t long ago that search technology was stagnant. The hard problems back then were data connectivity and import. Search teams might still struggle with these challenges for specific projects, but they are broadly solved from a technology standpoint. The next set of problems was about language parsing: how to stem words and match them to keywords, phrase matching, and faceting. There is always another tweak, but nearly every search product on the market does those things.

In recent times, search technology came alive with open source technology that allowed it to scale to big data sizes and brought web-scale to every company that needed it. However, as technology has progressed and the amount of data increased, so has our need to find the information we need in a mountain of data. Search technology is exciting again and here are the changes we can expect to see in the coming year!

Personalization becomes extreme

The new frontiers are not in connecting to yet another datasource or how to chop up a word or sentence. The new challenge is to match the results to the person asking the question. Doing this requires gathering more than a collection of documents and terms. It means capturing human behavior and assembling a complete user profile. That profile may be related to their profession (accounting vs sales/marketing vs engineering). That profile may be related to their past searches or purchases. This capability is moving past vendors like Amazon and Google. Other retailers are now incorporating this technology, and we can expect to see it find its way into enterprise search applications.

Search gets more contextual

Who I am, what I’ve done, and what I want are “personalization.” Where I am, whether in terms of geography, the location my mobile phone provides, or the part of an application I’m in, is “context.” Up until now, web applications that provide search have tended to provide a simple context-less search box.

However, the bar is being raised. If you ask your mobile device to provide a list of restaurants you don’t mean “in the whole world,” you want “around me and open right now.” Meanwhile, while working all day on one customer account, when you type “contracts” or “contacts” into any corporate search bar, in most enterprise search applications you get back a list of all documents that have those keywords. That’s “dumb” and search should be smarter and show you files related to the account you’ve been working on all day. The capability is there and this year users are going to start to expect it.

Applications become search-centric

Companies tend to deploy their applications and then add in some kind of “search portal” or series of search applications that are functionally separate to try and tie the search experience and the app experience back together. This asks a lot of the user, who has to context-switch and go back to a different site. To minimize this friction, search is getting integrated into the core of most applications. Whether it be traditional stalwarts like your CMS or CRM or newcomers like Slack, search is no longer an afterthought; it is the main way to interact with the application. In 2018, this will become more of an expectation for internal- and customer-facing applications alike, regardless of their use case.

Machine learning becomes ubiquitous

So much of what we do as human beings is grouping things together because they look somehow similar (clustering) and sorting things into categories (classification). So many business problems are a matter of projecting based on trends (regression). Search has long been used to group stuff together, and finding that stuff has often meant dividing it up into classes. What is different is that we can automate that.

However, it goes beyond this. In an era of rampant fraudulent news, reviews and transactions, machine learning allows search to sort through to the most relevant and most real results. This isn’t a nice-to-have anymore for most retailers, financial services or customer service sites.

In healthcare and insurance, similar types of diagnoses, claims, and notes can automatically be grouped. Across the board, as a user’s profile is filled out, recommendations for that user or for similar items are a necessity in an era where there is more data than information.

Migration to the cloud begins in earnest

The sky is falling! Cloud, ho! Many organizations will cease to run their own data centers. If you make pizzas, you should just do that, not deploy servers all over the planet. With that said, there are legal and technical barriers that have to be overcome before we’re all serverless. Because search is needed behind the firewall as well as in the cloud, for some time we’ll see on-premise and hybrid solutions more commonly than all-cloud ones. The weather report for 2018 is partly cloudy when it comes to search. Expect fewer pure on-premise deployments as more companies turn to hybrid and cloud search installations.

Search application development becomes a solved problem

In 2017, most search application developers were still writing yet another web search box with typeahead and facets from the ground up. Forget some of the more advanced features that you can implement; if you’re still hand-coding basic features step by step, you’re not getting to the finish line very quickly. Most basic features and capabilities for search applications have already been written and the components are out there pre-coded for you. Savvy developers will start to use these pre-written, pre-tested components and frameworks and make their tweaks where necessary rather than reinventing the wheel every time. In 2018, we’ll see the end of from-scratch search applications, at least for standard and mobile websites.

Single Service Solutions will start to abate

This year there were a lot of “new” and “old but reimagined” search solutions that were aimed at just one task, for example a search solution just for Salesforce. It is hard to see a future with a significant market for a solution that does little more than improve a built-in feature, because the vendor can simply add the same functionality itself. These single service search solutions are going to go away. Data wants to be together. Search is more than just “find all customers with ‘Associates’ in the name.” In order to reach the next level of customer engagement and employee productivity and find the answers that you need, you need to be able to augment data with other data. To do this you need a search solution that supports multiple sources and use cases. Sure, it was fun being able to add search to this service, but now you need more. You don’t really want to manage ten single service search solutions for multiple data sources and use cases. In 2018, expect to see some pushback on search solutions that only serve one use case or search one data source.

Search in 2018: Get started now

Search in 2018 is going to be more personal, contextual, spread through all of your applications, powered by machine learning, and in the cloud. In 2018, we’ll stop developing search UIs from scratch and start putting together components that are pre-written, pre-tested, and ready for prime time. In 2018, we’ll stop using single-service solutions and power our search with a solution that supports all of our use cases with multiple data sources.

Get a move on the new year with Lucidworks Fusion and Fusion App Studio. Consider signing up for Fusion Cloud. This will put you on the path to more personalized modern search applications and give you an excellent path through 2018 and beyond.

The post 7 Predictions for Search in 2018 appeared first on Lucidworks.

Terry Reese: MarcEdit MacOS 3.0 Roadmap

planet code4lib - Tue, 2017-12-12 17:33

With MarcEdit 7 out, I’ve turned my focus to completing the MarcEdit MacOS update. Right now, I’m hoping to have this version available by the first of the year. I won’t be doing a long, extended beta, in part because this version utilizes all the business code written for MarcEdit 7. And like MarcEdit 7, the Mac version can be installed alongside the current Mac build (it won’t replace it); this way, you can test the new build while continuing to use the previous software.


Library of Congress: The Signal: The Time and Place for PDF: An Interview with Duff Johnson of the PDF Association

planet code4lib - Tue, 2017-12-12 15:09

The following is a guest post by Kate Murray, Digital Projects Coordinator at the Library of Congress.

The Library of Congress is both a producer and collector of PDFs and has recently joined the PDF Association as a Partner Organization. At the upcoming PDF Day organized by the PDF Association, the Library of Congress will be among the presenters describing the proposed expansion from microfilm to PDF for publishers of daily newspapers registering their issues with the United States Copyright Office, a separate federal department within the Library of Congress. The presentation will cover the anticipated benefits to the publishers, the Library, and its patrons.

In this interview, Kate Murray talks with Duff Johnson, PDF Association Executive Director, about the role of the PDF Association in standards and development communities.

Tell me about the mission and goals of the PDF Association

Duff Johnson, PDF Association Executive Director: The PDF Association is a non-profit industry association for PDF technology developers and implementers. It exists to promote awareness and adoption of ISO-standardized PDF technology. Members include over 130 member-companies, over 20 non-profit organizations as Partners, including our newest members the Library of Congress and National Archives, and a number of individual consultants. Like the ISO committees whose products it promotes, the PDF Association is a vendor-neutral space; the largest and the smallest of companies have the same vote in all Technical Working Groups and on the Board of Directors.

What role does the PDF Association have in the development of PDF standards documents?

Johnson: The PDF Association has a proven track record in terms of formalizing the interpretation and best-practice use of ISO standards for PDF. The organization began as the PDF/A Competence Center in 2006 because the founding companies all saw the need to share a common understanding about the PDF/A (ISO 19005-1) standard, published the previous year. The Isartor Test Suite for PDF/A-1b and several Technical Notes resulted, all of which helped developers produce complementary implementations of PDF/A.

What other projects has the PDF Association participated in?

Johnson: More recently, the PDF Association, as part of the veraPDF Consortium, took a leading role in producing the veraPDF Test Suite, including complete coverage for all conformance levels of PDF/A-1, A-2 and A-3. We also published a new Technical Note detailing the industry-supported understandings reached during the course of the veraPDF project.

How can the Library of Congress gain from PDF Association membership?

Johnson: The Library’s participation will foster enhanced relationships with industry, providing more opportunities for the Library to communicate its needs to the community that develops the software which content producers and publishers use for digital documents, and more. Partner organizations to the PDF Association, now including the Library of Congress as well as the National Archives, can leverage this vendor-neutral space to access expertise from the global community of PDF technology developers, and participate in developing new standards and other formal documents to further best-practice use of PDF.  A recent example of just such an effort was the formation of the PDF/raster TWG to explore the TWAIN Working Group’s request for a PDF subset enabling scanners to produce PDF files directly, without TIFF images. The result, PDF/raster 1.0, is already adopted in the TWAIN Direct specification.


Learn more about PDF Day to be held in January 2018 in Washington, D.C.

How is the PDF Association reaching out to other members of the cultural heritage and government sectors?

Johnson: Inspired in part by the new memberships of the Library of Congress and NARA in the organization, we’re organizing a PDF Day for January 29, 2018 in Washington DC. PDF Day is a PDF Association educational event composed of non-commercial presentations on various aspects of PDF technology. The Library of Congress will be among the presenters. Registration is required for PDF Day, with discounts for government and military personnel; additional details are available at the PDF Day website.

District Dispatch: This week’s activities to save net neutrality

planet code4lib - Tue, 2017-12-12 13:39

This Thursday, the FCC is expected to vote on a proposal from Chairman Ajit Pai that would roll back the strong, enforceable net neutrality protections established in 2015. The meeting will be webcast, 10:30 am – 12:30 pm EST.

As John Lewis said, “Every voice matters, and we cannot let the interests of profit silence the voices of those pursuing human dignity.” The American Library Association (ALA) understands that net neutrality enables opportunities for all by protecting an open and accessible internet – so that every voice, idea, information seeker and person gets a chance to prosper using the dominant communications platform of our day.

As we’ve mentioned in the past, the vote this week will likely be a giant step backwards, but we and our allies will continue to vehemently advocate for a neutral net. Here are a few things going on this week:

Putting the pressure on Congress
Advocates have asked members of Congress to step in, as overseers of the FCC, to stop the impending vote on Thursday. Thousands of calls and emails have been sent from across the country, including nearly 37,000 emails using the ALA’s library-specific action alert. The ALA also is one of more than 150 groups (including individual libraries!) that have signed a joint letter to House and Senate committee leaders.

Also yesterday, 21 Internet and tech leaders, headlined by Tim Berners-Lee, Vint Cerf and Steve Wozniak, and including inventors, innovators and creators of many of the fundamental technologies of the Internet, sent a letter to Congress with their own concerns.

Wait for the FTC?
Proponents of the draft order that will be voted on this Thursday have claimed consumers will still be protected from potential internet service provider misbehavior by the Federal Trade Commission (FTC). But any day now, a federal court is expected to rule on a case that has serious implications for the FTC’s ability to help broadband consumers. At the end of last week, consumer groups and advocates including ALA asked the FCC to wait on any decision on net neutrality until this case is decided.

Making some noise
While we hope the efforts above will have an impact, advocates also are focused on activities to coincide with this week’s FCC meeting, including vigils, rallies and continued online engagement. Here are a few ways that you can add your voice:

  • Join Break the Internet Day by calling or emailing Congress via the ALA action alert on December 12.
  • Join protests online with some suggested social media messages on December 13 and 14:
    • Hey @AjitPaiFCC – America’s 120,000 libraries depend on equitable and robust access to the internet to serve our communities. We need #netneutrality!
    • .@FCC – Our libraries’ digital collections, podcasts, video tutorials, and more rely on an open internet. @AjitPaiFCC, keep #netneutrality!
    • #netneutrality is the First Amendment of the internet. @FCC, please protect the right to read, create and share freely without commercial gatekeepers.
    • OR tell us a story about what net neutrality means for your library and tag @ALALibrary, @FCC, @AjitPaiFCC

Stay tuned for additional actions if the FCC continues to ignore millions of people. Know that this would be far from game over: we will seek relief in federal court, along with our many, many allied organizations.

The post This week’s activities to save net neutrality appeared first on District Dispatch.

Open Knowledge Foundation: Requiem for an Internet Dream

planet code4lib - Tue, 2017-12-12 12:15

The dream of the Internet is dying. Killed by its children. We have barely noticed its demise and done even less to save it.

It was a dream of openness, of unprecedented technological and social freedom to connect and innovate. Whilst expressed in technology, it was a dream that was, in essence, political and social. A dream of equality of opportunity, of equality of standing, and of liberty. A world where anyone could connect and almost everyone did.

No-one controlled or owned the Internet; no one person or group decided who got on it or who didn’t. It was open to all.

But that dream is dying. Whilst the Internet will continue in its literal, physical sense, its spirit is disappearing.

In its place, we are getting a technological infrastructure dominated by a handful of platforms which are proprietary, centralized and monopolized.

Slowly, subtly, we no longer directly access the Net. Instead, we live within the cocoons created by the Internet’s biggest children. No longer do you go online: you go on Facebook or you Google something. In those cocoons we seem happy, endlessly-scrolling through our carefully curated feeds, barely, if ever, needing to venture beyond those safe blue walls to the Net beyond.

And if not on Facebook, we’ll be on Google, our friendly guide to the overwhelming, unruly hinterlands of the untamed Net. Like Facebook, Google is helpfully ensuring that we need never leave, that everything is right there on its pages. They are hoovering up more and more websites into the vastness that is the Googleplex. Chopping them up and giving them back to us in the bite-sized morsels we need. Soon we will never need to go elsewhere, not even to Wikipedia, because Google will have helpfully integrated whatever it was we needed; the only things left will be the advertisers who have something to sell (and whose payments Google needs).

As the famous Microsoft mantra went: embrace, extend, extinguish.

Facebook, Google, Apple and the like have done this beautifully, aided by our transition back from the browser to the walled garden of mobile. And this achievement is all the more ironic for its unintended nature; if questioned, Facebook and Google would honestly protest their innocence.

Let me be clear, this is not a requiem for some half-warm libertarianism. The Internet is not a new domain, and it must play by laws and jurisdictions of the states in which it lives. I am no subscriber to independence declarations or visions of brave new worlds.

What I mourn is something both smaller and bigger. The disappearance of something rare and special: proof that digital was different, that platforms at a planetary scale could be open, and that from that magical combination of tech and openness something special flowed. Not only speech and freedom of speech, but also innovation and creativity in all its wondrous fecundity and generous, organized chaos on a scale previously unimagined.

And we must understand that the death of this dream was not inevitable. It is why I hesitate to use the word dream. Dreams always fade in the morning; we always wake up. This was not so much a dream as a possibility. A delicate one, and a rare one. After all, the history of technology and innovation is full of proprietary platforms and exclusive control, of domination by the one or the few.

The Internet was different. It was like language: available to all, almost as a birthright. And in the intoxicating rush of discovery we neglected to realise how rare it was. What a strange and wonderful set of circumstances had caused its birth: massive, far-sighted government investment at DARPA, an incubation in an open-oriented academia, maturity before anyone realised its commercial importance, and its lucky escape in the 1990s from control by the likes of AOL or MSN. And then, as the web took off, it was free, so clearly, unarguably, and powerfully valuable for its openness that none could directly touch it.

The Internet’s power was not a result of technology but of a social and political choice. The choice of openness. The fact that every single major specification of how the Internet worked was open and free for anyone to use. That production grade implementations of those specifications were available as open software — thanks to government support. That a rich Internet culture grew that acknowledged and valued that openness, along with the bottom-up, informal innovation that went with it.

We must see this, because even if it is too late to save the Internet dream, we can use our grief to inspire a renewed commitment to the openness that was its essence, to open information and open platforms. And so, even as we take off our hats to watch the Internet pass in all its funereal splendour, in our hearts we can have hope that its dream will live again.

Image: Fylkesarkivet i Sogn og Fjordane

FOSS4Lib Recent Releases: Zebra - 2.1.3

planet code4lib - Mon, 2017-12-11 21:34

Last updated December 11, 2017. Created by Peter Murray on December 11, 2017.

Package: Zebra
Release Date: Wednesday, December 6, 2017

John Mark Ockerbloom: Adding more, and more structured, information about public domain serials

planet code4lib - Mon, 2017-12-11 21:24

I’ve been happy to hear from a number of people and institutions interested in the IMLS-funded project we now have underway to shed light on the hidden public domain of 20th century serials that I discussed in my last post. I gave a short 6-minute presentation on the project at this fall’s Digital Library Federation Forum, and you can find my presentation slides and script on the Open Science Framework. (You can download either a PowerPoint file or a PDF as you prefer; both have the slides and the notes.)

With the help of Alison Miner and Carly Sewell, I’m now starting to add data to the inventory of serials that is one of the deliverables of the project.  Right now, we’re putting up data from 1966 renewals, where serial contributions from 1938 and 1939 were renewed.  But there’s quite a bit more data in the pipeline, and I hope that we’ll have all of the 1930s covered by the end of this month, and then advance rapidly into the 1940s in the new year.  (Our goal is to get up to 1978’s renewals, which will finish off the 1940s and get to 1950, and from there on the Copyright Office’s online database can be consulted for serial renewals.  We’re aiming to have that completed sometime in the spring.)

I’ve heard from various people who are interested in clearing copyrights for their own serials digitization projects– as well as some projects that are doing it already, like the Hevelin Fanzines project that was also discussed at the DLF Forum.  As I mentioned in my previous post, we intend to publish suggested procedures for doing such copyright-clearing.  We’ll be preparing drafts of such procedures in the new year, and we’ll let interested folks know when such drafts are available for comments and suggestions.

We’re also happy to hear suggestions about other aspects of the project. One suggestion we heard in an early presentation was that the inventory should be shared as downloadable structured data, and not just as a big web page, so that it would be easier for people to repurpose and automatically analyze the data for various purposes. That sounded like a good idea to me, and I got more excited about its possibilities when looking at all the work that people were doing with projects like the FictionMags project, ISFDB and similar projects, where volunteers have crowd-compiled detailed structured contents information for a large number of serials.

So the new data going into the inventory is going in as structured data files, and we’re also slowly refitting existing entries into such files.  That’s meant a bit of a slower startup than we’d originally planned, but we believe this work will pay dividends in the future.  Already it means that we can reuse the same data in multiple contexts– for instance, we can show first-renewal information for Adventure magazine on the Online Books Page issue listings, on a copyright information page, and in the big inventory page.  Updating the underlying structured data file can change what appears in all of these contexts.
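As a rough illustration of that reuse (not the project's actual code), the sketch below renders one record in two different contexts. The field names `title`, `first_renewed_issue` and `renewal_year`, and the sample values, are hypothetical placeholders and do not reflect the real data files.

```python
import json

# Hypothetical record: field names and values are illustrative assumptions,
# not the schema or data used in the actual inventory files.
SAMPLE = '{"title": "Example Serial", "first_renewed_issue": "1938-03", "renewal_year": 1966}'


def listing_line(record):
    """Short note of the kind that could sit beside an issue listing."""
    return f"{record['title']}: first renewed issue {record['first_renewed_issue']}"


def info_page(record):
    """Longer text of the kind that could appear on a copyright information page."""
    return (
        f"Copyright information for {record['title']}\n"
        f"First renewed issue: {record['first_renewed_issue']} "
        f"(renewed {record['renewal_year']})"
    )


if __name__ == "__main__":
    record = json.loads(SAMPLE)
    # Both views read the same record, so editing the data updates both.
    print(listing_line(record))
    print(info_page(record))
```

Because both functions read from the same parsed record, a correction made once in the data file shows up everywhere the record is displayed.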

Moreover, data structures are expandable. When readers asked me to list Amazing Stories and Galaxy science fiction magazines on The Online Books Page, I had to determine which parts of their runs were public domain and thus could be listed without any further inquiries into permission. I looked up copyright renewals for these magazines and then recorded renewal data in the same sorts of structured data files so it could be reused. (Here, for instance, is the automatically generated copyright information page for Amazing Stories.) I also added structured data fields that allowed the inclusion of name identifiers (in particular, Library of Congress Name Authorities), so that authors could be consistently identified and then linked to other information about them, such as contact information for permissions. With these links, and with the links to full issue tables of contents compiled by other projects, it becomes easier to digitize nearly any story in the early years of the magazine, by checking to see whether there is still an active copyright on it, and by sending an inquiry for permissions if there is one.

To be clear: I’m not going to compile lists of all renewals for all the serials in my inventory.   That would take more time than I have– the scope of my 1-year grant only covers the first renewals for the serials that have them, up to 1950.  But if I can create more detailed renewal lists for Amazing Stories using the defined structure, then others who are interested could create similarly structured renewal lists for other serials.  And maybe those lists could be linked, shared or distributed from my inventory as well, if there’s interest.

So before I get much further into the project, I’d like to hear from folks who might be interested in using or compiling this sort of detailed renewal information.  Is this sort of structured information useful to you?  And if so, will the format and structure I’ve defined for the data files work?  Or should it change (something that’s easier to do now than later), or be augmented?

I didn’t find any pre-existing schema that covered detailed article-level copyright renewal data, so I decided to roll my own for starters.  There’s a variety of encoding schemes one could use for it, including XML, JSON, and the various RDF formats.  I figured JSON would be easiest for laypeople and librarians to understand and edit in its  native format, and it can be automatically translated into suitable XML or RDF schemes if desired.  But if you know of good reasons for preferring a different native format, or know of schemes I should reuse or extend instead of reinventing the wheel, I’d be interested in hearing about them.  (I’m especially interested if the format or schema is already in common use by the sorts of folks who compile serial contents information.)  Alternatively, if what I have now is as good a starting point as anything else, I’d be happy to know that, and could then take the time to write up formal documentation for it.
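To give a sense of how mechanical the JSON-to-XML translation can be, here is a minimal sketch using only Python's standard library. The sample record and its field names are hypothetical, not the schema of the actual data files, and the mapping (object keys become element names, list entries become `item` elements) is just one of many possible conventions. It assumes every key is a valid XML element name.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record: field names and values are illustrative assumptions,
# not the schema used by the actual inventory files.
SAMPLE = '{"title": "Example Serial", "renewals": [{"issue": "1938-03", "renewal_year": 1966}]}'


def to_xml(value, tag):
    """Recursively convert a parsed JSON value into an XML element."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        # Each key becomes a child element (keys must be valid XML names).
        for key, child in value.items():
            elem.append(to_xml(child, key))
    elif isinstance(value, list):
        # Each list entry becomes a generic <item> child.
        for entry in value:
            elem.append(to_xml(entry, "item"))
    else:
        # Scalars are stored as element text using Python's str().
        elem.text = str(value)
    return elem


def json_to_xml_string(text, root_tag="serial"):
    """Parse a JSON document and serialize it as an XML string."""
    return ET.tostring(to_xml(json.loads(text), root_tag), encoding="unicode")


if __name__ == "__main__":
    print(json_to_xml_string(SAMPLE))
```

A translation into RDF would need more decisions (identifiers, vocabularies) than this generic mapping, which is one reason to settle the JSON structure first.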

To see the existing files I have, you can go to the big inventory page and follow the “More details” links that you’ll see for certain serials in the list. These lead to copyright information pages for the serials in question, which in turn have links to JSON files at the bottom of each page. I also have most of the JSON files in a GitHub folder that’s part of my Online Books Page project repository.

If you work with this sort of metadata, or would like to, I’d love to hear from you.  If we get this right, I hope this data will spark all kinds of useful work opening access to a wide variety of 20th-century serial publications.


Islandora: New Member: Louisiana State University

planet code4lib - Mon, 2017-12-11 20:01

The Islandora Foundation is very happy to announce that Louisiana State University will be joining us as a Member in the Foundation as of January 2018.

LSU is a longtime user of Islandora, and its latest big Islandora project is the Louisiana Digital Library, a stunning multi-site that encompasses 17 Louisiana archives, libraries, museums, and other repositories, with more than 144,000 digital items. Collections run the gamut from photographs and maps to manuscript materials, books, and oral histories.

