
Open Knowledge Foundation: ROUTETOPA Case Study: Hetor Pilot

planet code4lib - Tue, 2017-12-19 10:00

Since 2015 Open Knowledge International has been part of the consortium of RouteToPA, a European innovation project aimed at improving citizen engagement by enabling meaningful interaction between open data users, open data publishers and open data. In the ROUTETOPA case study series, we shine a light on public administrations, organizations and communities that have adopted and are using ROUTETOPA tools for work and discussions around open data.

This case study narrative was written by Hetor Pilot’s Carmen Luciano, Vanja Annunziata, Maria Anna Ambrosino and Gianluca Santangelo and has been reposted from the RouteToPA website.

Italy has a long cultural tradition and the Campania region in particular is a territory with a huge number of worthy local resources. The Campania region’s cultural heritage must be preserved and promoted: first by public administrations, but also by citizens. Unfortunately, these actions become more and more arduous, especially in a society totally oriented to the technological world, in which people are no longer interested in “old things”.

The Hetor project was born with the aim of “revealing Campania cultural heritage essence via open data power”, combining cultural heritage with new technologies. The project is part of the initiatives organized by the DATABENC Technological district (High Technology Consortium for Cultural Heritage) within the EU H2020 ROUTE-TO-PA Project, and it is based on the creation and accessibility of knowledge concerning Campania cultural heritage.

The term Hetor (‘heart’ in Greek) is connected to the principle of ‘truth’, meaning a shared and participatory construction of knowledge. The project aims to motivate and engage public administrations, local communities and schools in co-producing open data to enhance the local cultural heritage.

Therefore, we have created a website for open data concerning the cultural heritage of the Campania region, which contains official data from national institutions, such as ISTAT, MIBACT, MIUR and Campania Region.

The project is even more ambitious: by logging in to Hetor’s Social Platform for Open Data (SPOD) citizens can hold discussions, using free licensed data that’s available for use all over the world, in addition to data collected on the project repository. They can also co-create contents related to their town, enhancing their local cultural heritage.

Screen grab of a co-created dataset on Hetor’s Social Platform for Open Data and a visualization created from the dataset

To reach these goals, the project follows two main directions:

  • Reuse of data, via various formats (images, GIFs, articles), in order to spread the information collected within the datasets on SPOD;
  • Spreading of data, via a specific communication strategy that uses two main channels of communication: the Hetor Facebook page and the Hetor blog.

The initial activities of the project involved a group of trainees undertaking their ICT Masters program in “Cultural Heritage Information System” at DATABENC. They produced 8 datasets about the cultural heritage resources of the Campania region, including material and immaterial resources, in order to facilitate the creation of touristic itineraries to promote the territory.

In the second phase, students were involved in the project, in particular from 4 schools located in the provinces of Salerno, Avellino and Caserta. At the end of the activities, conducted within the ‘School-to-work transition programme’, the students had produced 19 datasets about their local resources, both tangible and intangible. Communities also collaborated with the project: two groups of citizens in the province of Salerno produced two datasets related to their territory.

The power of the Hetor Project lies in the combination of cultural heritage with ICT: the open data collected on SPOD are the means to promote and enhance the territory. Currently they concern the Campania region, but the initiative could be extended to the national level with citizens’ participation. Everyone can join us, co-creating data in order to enhance their own town and revealing information that even native citizens did not know before! To stay updated on Hetor’s future work, you can read more on this blog and follow updates via Facebook.

Ed Summers: Web Histories

planet code4lib - Tue, 2017-12-19 05:00

Rogers (2017) provides a useful introduction of how to use screencasts of archived web content as a method when doing web history. He credits Jon Udell for coming up with the software as movie technique in his screencast of the editing history of the Wikipedia article Heavy Metal Umlaut. The contribution Rogers makes is applying this method to historical work with web archives, which he provides an example of in his Google and the politics of tabs screencast.

The article contains a description of the method itself, which is to use the Wayback Machine as a data source, Grab Them All for collecting screenshots, and iMovie for knitting the images together into a video which can then be narrated. Identifying which URLs to screencap is an important step in the process, and Rogers explains some of the features of the Wayback Machine that make this easier to understand.

Apart from the method itself Rogers thinks more generally about the historiography of web archives. Specifically he distinguishes between different types of histories that can be conducted with the archived web content:

In the following, narrations or particular goals for telling the history of a website are put forward. They offer means to study the history of the Web (as seen through a single website like Google Web Search), the history of the Web as media (such as how a newspaper has grappled with the new medium) as well as the history of a particular institution (such as the US White House or marriage, as seen through a leading wedding website). Arguably, the first is a form of Web and medium history, the second media history, and the third digital history, however much each also blends the approaches and blurs the distinctions.

Reading the rest of the article helps to understand these distinctions, but it took me several reads until I felt like I understood the differences here. It helped me to consider why the archived web content is being studied.

Is the archived content being studied to better understand:

  1. the website?
  2. the web as a medium?
  3. a real-world phenomenon that leaves traces on, or is entangled with, the web?

As Rogers says, these things blur together quite a bit. It’s hard to imagine studying the web as a medium (2) without looking at specific examples of web content (1), or considering the social, cultural, political and economic factors that gave rise to it. But I still think these categories are a useful rubric or guidepost for characterizing how scholars work with archived web content, particularly when emphasis is placed more in one area than another. As Rogers says it’s also important to consider that the Wayback Machine itself is a historical artifact.

The article also pointed me to some work that has been done on the use of web archives as evidence in legal settings and their authenticity (Andersen, 2013; Gazaryan, 2013; Russell & Kane, 2008), which I’d heard about before (Zittrain, Albert, & Lessig, 2014) but not actually seen referenced from a humanities perspective.

Also, Rogers points to Chun (2011) as a way of talking about different forms of digital ephemerality that historians encounter in web archives. Chun (2016) was already on my reading list for the coming months, but I may have to add this one too, particularly because of the software studies angle it seems to take.


Andersen, H. (2013). A website owner’s practice guide to the wayback machine. J. On Telecomm. & High Tech. L., 11, 251.

Chun, W. H. K. (2011). Programmed visions: Software and memory. MIT Press.

Chun, W. H. K. (2016). Updating to remain the same: Habitual new media. MIT Press.

Gazaryan, K. (2013). Authenticity of archived websites: The need to lower the evidentiary hurdle is imminent. Rutgers Computer & Tech. L.J., 39, 216.

Rogers, R. (2017). Doing web history with the internet archive: Screencast documentaries. Internet Histories, 1–13.

Russell, E., & Kane, J. (2008). The missing link: Assessing the reliability of internet citations in history journals. Technology and Culture, 49(2), 420–429.

Zittrain, J., Albert, K., & Lessig, L. (2014). Perma: Scoping and addressing the problem of link and reference rot in legal citations. Legal Information Management, 14(02), 88–99.

District Dispatch: After the FCC vote: continuing the fight for net neutrality

planet code4lib - Tue, 2017-12-19 03:28

On December 14, a majority of FCC commissioners voted to gut net neutrality protections, limiting the power of ISPs to block, throttle, degrade or assign preference to some online content and services over others. This 3-2 vote to roll back strong, enforceable net neutrality protections was made in the face of widespread protests. Here is where we are today.

What Happened
This FCC decision opens the door for very different consumer experiences of the internet, especially within libraries and for the patrons and communities they serve. For now, though, these changes likely remain farther in the future. The country’s largest ISPs have told customers they will not see a change in how they experience the web (although their past records don’t bode well). With so many eyes on them right now and the certainty of a legal challenge, it is unlikely these companies will take any immediate action that would draw scrutiny. That said, significant changes are almost certain. In particular, ISPs could create faster delivery lanes for their own content (since many of these companies have media interests). This would make it harder for third-party or competing content (like content created by and in libraries) to reach people with the same quality of service.

Commissioners Mignon Clyburn and Jessica Rosenworcel dissented in last week’s decision and detailed some of their concerns about the potential harms to the public interest and the internet ecosystem as we know it.

What’s Next?
As we’ve noted before, the FCC vote is not the final word. There are several avenues supporters of net neutrality could take:

  1. Right after the vote, members of Congress announced their intent to attempt to nullify the FCC’s actions. The Congressional Review Act (CRA) gives Congress the ability and authority to do this; the CRA allows Congress to review a new agency regulation (in this case, Pai’s “Restoring Internet Freedom” Order) and pass a Joint Resolution of Disapproval to overrule it. This would repeal last week’s FCC order, restoring the 2015 Open Internet Order and keeping net neutrality protections in place. This Congressional action would be subject to Presidential approval. Senator Ed Markey (D-MA) and Congressman Mike Doyle (D-PA) have both announced their intentions to introduce resolutions to overturn the FCC’s decision using the authority granted by the CRA, and Democratic leadership in both Houses have urged their colleagues to support this move. Rep. Marsha Blackburn (R-TN) plans to introduce legislation to codify net neutrality rules. Sen. Bill Nelson (D-FL), the top Democrat on the Senate Commerce Committee, is also calling for Congress to preserve the net neutrality rules. Timing: Congress will have 60 legislative days from when this order is published in the Federal Register. That could be about 5 or 6 months. Other legislation could take even longer.
  2. The FCC will be sued about this decision. Just hours after the vote, 18 state attorneys general announced they would be taking the FCC to court. “There is a strong legal argument that with this action, the federal government violated the Administrative Procedure Act,” Washington state Attorney General Bob Ferguson said in his statement.

    And advocacy groups Free Press and the National Hispanic Media Coalition have also announced they will take the Commission to court, disagreeing with the FCC on their interpretation of the Communications Act. They argue the FCC did not justify its action with any real facts for abandoning Title II classification for broadband ISPs. One important thing to remember is that going to court does not mean a judge will grant an injunction on the rules. And without an injunction, ISPs can still move forward with opening up internet fast lanes while the legal challenges make their way through the courts. Timing: We will know more after the rules are published in the Federal Register (probably any time in the next 60 days) and as cases are announced.

  3. In addition to legal action, several states and localities have indicated they would like to hold ISPs accountable for ensuring a neutral net for consumers in their areas. It is still very early. Some suggestions, like those from the Governor of Washington, are not yet formal legislative proposals. It is worth noting the FCC order specifically pre-empts state and local legislation, but it is not clear yet how expansive this pre-emption will be and whether it will pass legal muster. We will be watching these proposals closely and will update members about which pieces are moving and comport with the strong protections and the principles we have supported. Timing: Many state legislatures are introducing bills for their coming sessions, so we expect to see more motion over the next few months.

What ALA is Doing and What You Can Do
ALA is reviewing its options and the best course of action with regard to legal challenges to last week’s vote. In the past, we have submitted extensive friend of the court briefs.

We are also working with allies to encourage Congress to overturn the FCC’s egregious action. You can email your members of Congress today and ask them to use a Joint Resolution of Disapproval under the CRA to restore the 2015 Open Internet Order protections.

ALA will continue to update you on the activities above and other developments as we continue to fight to preserve a neutral internet.

The post After the FCC vote: continuing the fight for net neutrality appeared first on District Dispatch.

District Dispatch: Webinar on public domain now available

planet code4lib - Tue, 2017-12-19 02:29

December’s CopyTalk was so popular, we had to turn people away because the webinar room filled up quickly. For those who missed the show, a recording of “Where oh Where is the Public Domain,” presented by public domain guru Peter Hirtle of the Berkman Klein Center at Harvard University, is now available.

There are so many tricks to the public domain! Can someone license public domain material? What if the work you think is in the public domain is still under copyright in the United Kingdom? Are facts and data (where the compilation itself requires research) protected? Is there any public policy justification for a long (and longer) copyright term? Answer: No!

Most CopyTalk webinars have been archived and are available for viewing.

Tune in for the next CopyTalk. Mark your calendars for the first Thursday of the month at 2 p.m. Eastern Time.

As we look forward to 2018, please know that we are currently seeking webinar topics. Send your ideas to with the subject heading “CopyTalk Idea.”

The post Webinar on public domain now available appeared first on District Dispatch.

DuraSpace News: From Ethiopia to Portugal: Say Goodbye to 2017 and Welcome to 2018

planet code4lib - Tue, 2017-12-19 00:00

From Claudio Cortese, 4Science  4Science is in Africa to support VLIR-UOS. VLIR-UOS is a Flemish initiative that supports partnerships between universities in Flanders (Belgium) and in developing countries, looking for innovative responses to global and local challenges.

DuraSpace News: Migrating to DSpace

planet code4lib - Tue, 2017-12-19 00:00

From Atmire: Are you considering migrating your current repository to DSpace but wondering what the options are? Over the years, Atmire has carried out migrations from a variety of platforms such as EPrints, Fedora, Digital Commons, CONTENTdm, DigiTool, Equella and homegrown software. Because of its wide array of commonly used import facilities and formats, migrating to DSpace may require less work and introduce less risk than you may think.

District Dispatch: House committee passes Higher Education Act

planet code4lib - Mon, 2017-12-18 21:45

As anticipated in a District Dispatch post earlier this fall, the House Education and Workforce Committee this week approved (or marked up) legislation to reauthorize the Higher Education Act on a party line 23-17 vote. This legislation, among other things, threatens a popular student loan program for LIS program graduates.

The Promoting Real Opportunity, Success, and Prosperity Through Education (PROSPER) Act, introduced earlier this month by Committee Chairwoman Virginia Foxx (R-NC), is an authorization bill (not a funding measure) that sets policies for federal student financial aid programs. The PROSPER Act would largely keep the overall structure of the original 1965 Higher Education Act and authorize the Act through 2024 (the most recent authorization expired in 2013). The legislation streamlines student loan programs and expands job training opportunities, among other changes.

The PROSPER Act also threatens the Public Service Loan Forgiveness (PSLF) Program, which supports graduates working in the public sector, often in lower-paying jobs. PSLF allows graduates to erase debt after 10 years of working in public service careers and meeting loan payments. Librarians are among those who currently qualify for benefits under the program. (Under the PROSPER Act, current participants in the program would not be impacted by the proposed elimination.) In addition, demand for PSLF participants in rural communities remains high. ALA is alarmed by the proposed elimination of the PSLF program and has worked throughout the fall to advocate for its retention.

Also threatened by the PROSPER Act are programs for teacher prep under Title II of the Act.

In a 14-hour-long marathon markup of the PROSPER Act, the Committee considered more than 60 amendments, including one introduced by Rep. Joe Courtney (D-CT) to support PSLF that narrowly failed on a 20-19 vote.

The PROSPER Act now heads to the full House of Representatives for consideration. The timing is uncertain as Congress faces a number of budget and tax priorities that must be addressed. In addition, some members from rural communities have raised concern with key provisions such as the elimination of PSLF, which is popular for rural communities seeking to attract college graduates. We expect the PROSPER Act to come up in the House early next year.

The Senate has not announced a timetable for its Higher Education Act legislation, but Senator Lamar Alexander (R-TN), who chairs the Senate Health, Education, Labor, and Pensions Committee has expressed optimism that he can craft a bill with the committee’s ranking Member, Senator Patty Murray (D-WA). PSLF has generally received stronger support in the Senate, and ALA will continue to work with members of Congress to defend PSLF in the coming months.

The post House committee passes Higher Education Act appeared first on District Dispatch.

Eric Lease Morgan: LexisNexis hacks

planet code4lib - Mon, 2017-12-18 17:37

This blog posting briefly describes and makes available two Python scripts I call my LexisNexis hacks.

The purpose of the scripts is to enable the researcher to reformulate LexisNexis full text downloads into tabular form. To accomplish this goal, the researcher is expected to first search LexisNexis for items of interest. They are then expected to do a full text download of the results as a plain text file. Attached ought to be an example that includes about five records. The first of my scripts — — parses the search results into individual records. The second script — — reads the output of the first script and parses each file into individual but selected fields. The output of the second script is a tab-delimited file suitable for further analysis in any number of applications.
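Since the scripts themselves are not reproduced here, the pipeline might be sketched along the following lines. The record delimiter and the field labels below are my assumptions about the shape of a LexisNexis download, not details taken from the actual scripts:

```python
import re

# Assumed record delimiter: LexisNexis full text downloads typically
# separate records with lines like "1 of 5 DOCUMENTS".
DELIMITER = re.compile(r'^\s*\d+ of \d+ DOCUMENTS\s*$', re.MULTILINE)

def split_records(text):
    """Split a full text download into individual records."""
    return [r.strip() for r in DELIMITER.split(text) if r.strip()]

def record_to_row(record, fields=('LENGTH', 'DATE')):
    """Extract selected "FIELD: value" lines from one record and
    return them as a single tab-delimited row."""
    row = []
    for field in fields:
        match = re.search(r'^%s:\s*(.*)$' % field, record, re.MULTILINE)
        row.append(match.group(1).strip() if match else '')
    return '\t'.join(row)
```

The first function corresponds loosely to the record-splitting step, and the second to the field-extraction step; editing the fields tuple is the analogue of editing the second script for additional fields.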

These two scripts can work for a number of people, but there are a few caveats. First, the first script saves its results as a set of files with randomly generated file names. It is possible, albeit unlikely, that files will get overwritten because the same randomly generated file name was… generated. Second, the output of the second script only includes fields required for a specific research question. It is left up to the reader to edit it for additional fields.

In short, your mileage may vary.

District Dispatch: Deadlock on Section 702 surveillance bills

planet code4lib - Mon, 2017-12-18 14:28

Section 702 of the Foreign Intelligence Surveillance Act (FISA) is due to expire on Dec 31st of this year. Currently three Senate and two House reauthorization bills are on the table, but the House and Senate Judiciary and Intelligence Committees are not close to resolving their differences. Experts say that the reauthorization deadline likely will be extended by a short-term continuing resolution, pushing the legislation into next year.

The ALA Washington Office continues to support legislation that would advance the privacy protections of the public, such as the USA Rights Act (S. 1997). The Washington Office has joined coalition partners by signing a letter in support of S. 1997 to close the “backdoor search loophole” by requiring that surveillance of Americans only be done with a warrant. While no legislation pending would specifically impact libraries, the right to privacy is central to the ability of individuals to seek and communicate information freely and anonymously, a core value of libraries.

For more information on FISA and existing legislation see the Electronic Frontier Foundation’s “decoding FISA” resources.

The post Deadlock on Section 702 surveillance bills appeared first on District Dispatch.

Terry Reese: MarcEdit 7 Updates: Focusing on the Task Broker

planet code4lib - Mon, 2017-12-18 14:04

This past weekend, I spent a bit of time doing some additional work around the task broker. Since the release in November, I’ve gotten a lot of feedback from folks that are doing all kinds of interesting things in tasks. And I’ll be honest, many of these task processes are things that I could never, ever, have imagined. These processes have often tripped the broker, which evaluates each task prior to running to determine whether the task actually needs to be run or not. So, I spent a lot of time this weekend with a number of very specific examples, updating the task broker to identify the outliers and let them through. This may mean running a few additional cycles that are going to return zero results, but I think it makes more sense for the broker to pass items through in these edge cases rather than attempting to accommodate every case (also, some of these cases would be really hard to accommodate). Additionally, within the task broker, I’ve updated the process so that it no longer just looks at the task to decide how to process the file. The tool now also reads the file to be processed and, based on file size, auto-scales the buffer of records processed on each pass. This way, smaller files are processed closer to 1 record at a time, while larger files are buffered in record groups of 1500.
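The auto-scaling described above can be sketched roughly as follows. This is an illustration in Python rather than MarcEdit's actual code, and the byte thresholds are my own guesses; only the 1500-record upper bound comes from the post:

```python
import os

def record_buffer_size(size_bytes, max_buffer=1500):
    """Choose how many records to process per pass based on file size.

    Small files are handled close to one record at a time, while large
    files are buffered in groups of up to max_buffer records.  The byte
    thresholds below are illustrative guesses, not MarcEdit's values.
    """
    if size_bytes < 64 * 1024:            # small file: record-at-a-time
        return 1
    if size_bytes < 10 * 1024 * 1024:     # mid-size file: modest buffer
        return 100
    return max_buffer                     # large file: full buffer

# In practice you would call it with the size of the file to process,
# e.g. record_buffer_size(os.path.getsize("records.mrc")).
```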

Anyway, there were a number of changes made this weekend; the full list is below:

  • Enhancement: Task Broker — added additional preprocessing functions to improve processing (specifically, NOT elements in the Replace task, updates to the Edit Field)
  • Enhancement: Task Broker — updated the process that selects the by record or by file approach to utilize file size in selecting the record buffer to process.
  • Enhancement: New Option — added an option to offer preview mode if the file is too large.
  • Enhancement: Results Window — added an option to turn on word wrapping within the window.
  • Enhancement: Main Window — More content allowed to be added to the most recent programs run list; added a shortcut (CTRL+P) to immediately open the Recent Programs List
  • Bug Fix: Constant Data Elements — if the source file isn’t present, an error was thrown rather than a new file being generated automatically. This has been corrected.
  • Update: Task Debugger — Updated the task debugger to make stepping through tasks easier.
  • Bug Fix: Task Processor — commented tasks were sometimes running — this has been corrected in both the MarcEditor and console program.
  • Enhancement: Status Messages been added to many of the batch edit functions in the MarcEditor to provide more user feedback.
  • Enhancement: Added a check so that if you use “X” in places where the tool allows for field selection with “x” or “*”, the selection is case insensitive (this has not been the default, though it has worked that way in MarcEdit 6; it was technically not a supported format)
  • Updated Installer: .NET Requirements set to 4.6 as the minimum. This was done because there are a few problems when running only against 4.5.



DuraSpace News: iPRES 2018 Preliminary Call for Contributions

planet code4lib - Mon, 2017-12-18 00:00

From Sibyl Schaefer, Chronopolis Program Manager and Digital Preservation Analyst, University of California, San Diego

iPRES 2018 BOSTON—Where Art and Science Meet —The Art In the Science & The Science In the Art of Digital Preservation—will be co-hosted by MIT Libraries and Harvard Library on September 24-27, 2018. Help us celebrate the first 15 Years of iPRES, the premier international conference on the preservation and long-term management of digital materials.

DuraSpace News: VIVO Updates for December 17

planet code4lib - Mon, 2017-12-18 00:00

From Mike Conlon, VIVO Project Director

Congratulations Cambridge and Cornell!  The University of Cambridge launched its long awaited VIVO this week.  Congratulations to the team at Cambridge, and to Symplectic, who assisted with the implementation.  Congratulations to the team at Cornell who launched an upgrade to Scholars at Cornell this week.

District Dispatch: ALA invites nominations for 2018 James Madison Award

planet code4lib - Fri, 2017-12-15 16:40

The American Library Association’s (ALA) Washington Office invites nominations for its James Madison Award. The award recognizes an individual or group who has championed, protected, or promoted public access to government information and the public’s right to know at the national level. 

The award is named in honor of President James Madison, known as “the Father of the Constitution.” Madison was a noted advocate for the importance of an informed citizenry, famously writing in 1822:

“A popular Government, without popular information, or the means of acquiring it, is but a prologue to a Farce or a Tragedy; or perhaps both. Knowlege will for ever govern ignorance: and a people who mean to be their own Governours, must arm themselves with the power which knowledge gives.”

Winner of the 2017 Madison Award, Sen. Jon Tester of Montana, was nominated by ALA member Ann Ewbank for his longstanding bipartisan commitment to increasing public access to information. You can watch the presentation and Sen. Tester’s acceptance speech here.

ALA will present the 2018 award during or near Sunshine Week, the week that includes March 16, Madison’s birthday.

Submit your nominations to the ALA Washington Office, care of Gavin Baker, no later than January 19, 2018. Please include a brief statement (maximum one page) about the nominee’s contribution to public access to government information. If possible, please also include a brief biography and contact information for the nominee.

The post ALA invites nominations for 2018 James Madison Award appeared first on District Dispatch.

Brown University Library Digital Technologies Projects: Understanding scoring of documents in Solr

planet code4lib - Fri, 2017-12-15 14:03

During the development of our new Researchers@Brown front-end I spent a fair amount of time looking at the results that Solr gives when users execute searches. Although I have always known that Solr uses a sophisticated algorithm to determine why a particular document matches a search and why one document ranks higher in the search results than another, I have never looked very closely into the details of how this works.

This post is an introduction on how to interpret the ranking (scoring) that Solr reports for documents returned in a search.

Requesting Solr “explain” information

When submitting a search request to Solr, it is possible to request debug information that clarifies how Solr interpreted the client request, along with information on how the score for each document was calculated for the given search terms.

To request this information you just need to pass debugQuery=true as a query string parameter to a normal Solr search request. For example:

$ curl "http://someserver/solr/collection1/select?q=alcohol&wt=json&debugQuery=true"

The response for this query will include debug information with a property called explain where the ranking of each of the documents is explained.
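As a quick illustrative sketch (assuming the wt=json response shape), the per-document explanations can be pulled out of the response like this:

```python
import json

def explain_entries(response_text):
    """Return the per-document 'explain' strings from a Solr JSON
    response that was requested with debugQuery=true."""
    response = json.loads(response_text)
    return response.get("debug", {}).get("explain", {})

# Trimmed example of a debug response:
sample = '{"debug": {"explain": {"id:1": "1.15 = (MATCH) max of: ..."}}}'
for doc_id, explanation in explain_entries(sample).items():
    print(doc_id, "->", explanation)
```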

"debug": { ... "explain": { "id:1": "a lot of text here", "id:2": "a lot of text here", "id:3": "a lot of text here", "id:4": "a lot of text here", ... } }

Raw “explain” output

Although Solr gives information about how the score for each document was calculated, the format it uses to provide this information is horrendous. This is an example of what the explain information for a given document looks like:

"id:":"\n 1.1542457 = (MATCH) max of:\n 4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcohol in 831) [DefaultSimilarity], result of:\n 4.409502E-5 = score(doc=831,freq=6.0 = termFreq=6.0\n ), product of:\n 2.2283042E-4 = queryWeight, product of:\n 5.170344 = idf(docFreq=60, maxDocs=3949)\n 4.3097796E-5 = queryNorm\n 0.197886 = fieldWeight in 831, product of:\n 2.4494898 = tf(freq=6.0), with freq of:\n 6.0 = termFreq=6.0\n 5.170344 = idf(docFreq=60, maxDocs=3949)\n 0.015625 = fieldNorm(doc=831)\n 4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) [DefaultSimilarity], result of:\n 4.27615E-5 = score(doc=831,freq=6.0 = termFreq=6.0\n), product of:\n 2.1943514E-4 = queryWeight, product of:\n 5.0915627 = idf(docFreq=65, maxDocs=3949)\n 4.3097796E-5 = queryNorm\n 0.1948708 = fieldWeight in 831, product of:\n 2.4494898 = tf(freq=6.0), with freq of:\n 6.0 = termFreq=6.0\n 5.0915627 = idf(docFreq=65, maxDocs=3949)\n 0.015625 = fieldNorm(doc=831)\n 1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) [DefaultSimilarity], result of:\n 1.1542457 = score(doc=831,freq=1.0 = termFreq=1.0\n ), product of:\n 0.1410609 = queryWeight, product of:\n 400.0 = boost\n 8.182606 = idf(docFreq=2, maxDocs=3949)\n 4.3097796E-5 = queryNorm\n 8.182606 = fieldWeight in 831, product of:\n 1.0 = tf(freq=1.0), with freq of:\n 1.0 = termFreq=1.0\n 8.182606 = idf(docFreq=2, maxDocs=3949)\n 1.0 = fieldNorm(doc=831)\n",

It’s unfortunate that the information comes in a format that is not easy to parse, but since it’s plain text, we can read it and analyze it.

Explaining “explain” information

If you look closely at the information in the previous text you’ll notice that Solr reports the score of a document as the maximum value (max of) from a set of other scores. For example, below is a simplified version of the text (the ellipses represent text that I suppressed):

"id:":"\n 1.1542457 = (MATCH) max of:\n 4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcohol in 831) ... ... 4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) ... ... 1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) ... ..."

In this example, the score of 1.1542457 for this document was the maximum of three scores (4.409502E-5, 4.27615E-5, 1.1542457). Notice that the scores are in E-notation. If you look closely you’ll see that each of those scores is associated with a different field in Solr where the search term, alcohol, was found.

From the text above we can determine that the term alcohol was found in the ALLTEXTUNSTEMMED, ALLTEXT, and research_areas fields. What’s more, we can tell that for this particular search we are giving the research_areas field a boost of 400, which explains why that score was much higher than the rest.
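To make the “max of” relationship concrete, here is a tiny sketch (the numbers are copied from the sample output above, and I’m assuming a DisMax-style query, which is what produces the “max of” behavior): the overall document score is simply the maximum of the per-field scores.

```ruby
# Per-field scores for the term "alcohol" in document 831, copied from
# the explain output above. A DisMax-style query takes the maximum of
# these as the overall document score.
scores = {
  "ALLTEXTUNSTEMMED" => 4.409502e-5,
  "ALLTEXT"          => 4.27615e-5,
  "research_areas"   => 1.1542457    # boosted by ^400.0
}

doc_score = scores.values.max
puts doc_score   # 1.1542457, the value Solr reported for the document
```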

The information that I omitted in the previous example provides a more granular explanation on how each of those individual field scores was calculated. For example, below is the detail for the research_areas field:

1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831) [DefaultSimilarity], result of:\n 1.1542457 = score(doc=831,freq=1.0 = termFreq=1.0\n), product of:\n 0.1410609 = queryWeight, product of:\n 400.0 = boost\n 8.182606 = idf(docFreq=2, maxDocs=3949)\n 4.3097796E-5 = queryNorm\n 8.182606 = fieldWeight in 831, product of:\n 1.0 = tf(freq=1.0), with freq of:\n 1.0 = termFreq=1.0\n 8.182606 = idf(docFreq=2, maxDocs=3949)\n 1.0 = fieldNorm(doc=831)\n",

Again, if we look closely at this text we see that the score of 1.1542457 for the research_areas field was the product of two other factors (0.1410609 x 8.182606). There is even information about how these individual factors were calculated. I will not go into detail on them in this blog post, but if you are interested this is a good place to start.
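As a sanity check, we can recompute the research_areas score from the factors Solr lists. This is just a sketch using the numbers copied from the explain output above; tiny differences from Solr’s reported value are possible because Solr does this arithmetic in 32-bit floats.

```ruby
# Factors for research_areas:alcohol^400.0, copied from the explain output.
boost      = 400.0
idf        = 8.182606      # idf(docFreq=2, maxDocs=3949)
query_norm = 4.3097796e-5
tf         = 1.0           # tf(freq=1.0)
field_norm = 1.0

# queryWeight and fieldWeight are themselves products of the factors above.
query_weight = boost * idf * query_norm     # ~0.1410609
field_weight = tf * idf * field_norm        # 8.182606

score = query_weight * field_weight
puts score   # ~1.1542457, matching the score Solr reported
```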

Another interesting clue that Solr provides in the explain information is what values were searched for in a given field. For example, if I search for the word alcoholism (instead of alcohol) the Solr explain result would show that in one of the fields it used the stemmed version of the search term and in another it used the original text. In our example, this would look more or less like this:

"id:":"\n 1.1542457 = (MATCH) max of:\n 4.409502E-5 = (MATCH) weight(ALLTEXTUNSTEMMED:alcoholism in 831) ... ... 4.27615E-5 = (MATCH) weight(ALLTEXT:alcohol in 831) ... ...

Notice how in the unstemmed field (ALLTEXTUNSTEMMED) Solr used the exact word searched (alcoholism) whereas in the stemmed field (ALLTEXT) it used the stemmed version (alcohol). This is very useful to know if you are wondering why a value was (or was not) found in a given field. Likewise, if you are using (query time) synonyms, those will show up in the Solr explain results.

Live examples

In our new Researchers@Brown site we have an option to show the explain information from Solr. This option is intended for developers to troubleshoot tricky queries, not for the average user.

For example, if you pass explain=text to a search URL in the site you’ll get the text of the Solr explain output formatted for each of the results (scroll to the very bottom of the page to see the explain results).

Likewise, if you pass explain=matches to a search URL the response will include only the values of the matches that Solr evaluated (along with the field and boost value) for each document.

Source code

If you are interested in the code that we use to parse the Solr explain results you can find it in our GitHub repo. The code for this lives in two classes: Explainer and ExplainEntry.

Explainer takes a Solr response and creates an array with the explain information for each result. The array is composed of ExplainEntry objects, which in turn parse each of the results to make the match information easily accessible. Keep in mind that this code does mostly string parsing and is therefore rather brittle. For example, the code to extract the matches for a given document is as follows:

class ExplainEntry
  # ...
  def get_matches(text)
    text.split("\n").select { |l| l.include?("(MATCH)") || l.include?("coord(") }
  end
end

As you can imagine, if Solr changes the structure of the text that it returns in the explain results this code will break. I get the impression that this format (as ugly as it is) has been stable in many versions of Solr so hopefully we won’t have many issues with this implementation.
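For a taste of what that kind of string parsing looks like in practice, here is a hypothetical, self-contained sketch (not the actual Explainer code; the regex and variable names are mine) that pulls the field, term, boost, and internal document id out of one of the (MATCH) weight(...) lines shown earlier:

```ruby
# One (MATCH) line from the sample explain output above.
line = "1.1542457 = (MATCH) weight(research_areas:alcohol^400.0 in 831)"

# A brittle regex, in the same spirit as the parsing described above:
# field:term, an optional ^boost, and the internal Lucene document id.
if line =~ /weight\((\w+):([^\^\s]+)(?:\^([\d.]+))? in (\d+)\)/
  field, term, boost, doc_id = $1, $2, $3, $4
  puts "field=#{field} term=#{term} boost=#{boost || 'none'} doc=#{doc_id}"
end
```

Like the real code, this breaks as soon as Solr changes the shape of the line, which is the brittleness trade-off mentioned above.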

Journal of Web Librarianship: Editorial Board

planet code4lib - Wed, 2017-12-13 05:14
Volume 11, Issue 3-4, July-December 2017, Page ebi-ebi

District Dispatch: Keeping public information public: an introduction to federal records laws

planet code4lib - Tue, 2017-12-12 21:17

This is the first post in a three-part series.

The National Archives and Records Administration preserves and documents government and historical records. Photo by Wknight94

The federal government generates an almost unimaginable quantity of information, such as medical research, veterans’ service records, regulations and policy documents, and safety inspection records, to name just a few examples.

For decades, ALA has advocated for policies that ensure government information is appropriately managed, preserved and made available to the public, including through libraries. Federal records laws play important roles in those processes. This series will introduce those laws and highlight two current issues that impact public access to government information: the management of information on federal websites and the preservation and destruction of government information.

What are the fundamental concepts of federal records laws?

Today’s legal framework for managing government information dates back to the 1934 National Archives Act and the 1950 Federal Records Act (FRA). Under the supervision of the National Archives and Records Administration (NARA), federal agencies must:

  1. Create adequate records to document their activities;
  2. Keep the records for a sufficient length of time, based on their value; and
  3. Once the government no longer has an immediate use for the records, transfer those records with permanent value to the National Archives and dispose of the rest.

To highlight some of the significant implications of that framework:

  • Federal agencies do not have the discretion to do whatever they want with their records; they have responsibilities under federal law.
  • The National Archives isn’t just a building with a bunch of neat old documents: it actively oversees the ongoing archival activities of federal agencies. In particular, agencies generally may not destroy records without NARA’s approval.
  • To avoid drowning in a sea of information, the government needs to determine the expected value of a record – in other words, to predict whether a record will be needed next year, next decade, or next century, and then dispose of it after that time.

How do records laws affect public access to government information?

Members of the public use federal records both directly and indirectly:

Directly: The public may access records directly through a federal website, in an agency reading room, or by making a request under the Freedom of Information Act (FOIA) or a similar process. If your library has ever helped a user find information from a federal website, you’ve probably worked with a federal record.

Indirectly: Researchers and editors, both inside and outside of government, may utilize records in creating new information products for a larger audience. For instance, inside government, the Executive Office for Immigration Review publishes a bound volume of selected administrative decisions by compiling individual records into a product of interest to a wider audience. Outside of government, journalists often make FOIA requests for records, then incorporate information gleaned from those records into news stories that can better inform the public about government decision-making. Libraries might then collect the resulting publications that incorporated the information from federal records.

However, these activities can only take place if federal agencies properly create, manage and preserve their records. In an upcoming post, learn more about how agencies manage records on federal websites.

The post Keeping public information public: an introduction to federal records laws appeared first on District Dispatch.

Lucidworks: 7 Predictions for Search in 2018

planet code4lib - Tue, 2017-12-12 18:26

It wasn’t long ago that search technology was stagnant. The hard problems back then were data connectivity and import. Search teams might still struggle with these challenges on specific projects, but from a technology standpoint they are broadly solved. The next set of problems concerned language parsing: how to stem words and match them to keywords, phrase matching, and faceting. There is always another tweak, but nearly every search product on the market does those things.

More recently, search came alive with open source technology that allowed it to scale to big data sizes and brought web-scale search to every company that needed it. But as technology has progressed and the amount of data has increased, so has our need to find the information we want in a mountain of data. Search technology is exciting again, and here are the changes we can expect to see in the coming year!

Personalization becomes extreme

The new frontiers are not in connecting to yet another datasource or in how to chop up a word or sentence. The new challenge is to match the results to the person asking the question. Doing this requires gathering more than a collection of documents and terms. It means capturing human behavior and assembling a complete user profile. That profile may be related to the user’s profession (accounting vs. sales/marketing vs. engineering). It may be related to their past searches or purchases. This capability is moving beyond vendors like Amazon and Google: other retailers are now incorporating the technology, and we can expect it to find its way into enterprise search applications.

Search gets more contextual

Who I am, what I’ve done, and what I want are “personalization.” Where I am, whether geographically (my mobile phone provides my location) or in terms of which part of an application I’m in, is “context.” Up until now, web applications that provide search have tended to offer a simple, context-less search box.

However, the bar is being raised. If you ask your mobile device for a list of restaurants you don’t mean “in the whole world,” you mean “around me and open right now.” Meanwhile, after working all day on one customer account, when you type “contracts” or “contacts” into a corporate search bar, most enterprise search applications give you back a list of all documents that contain those keywords. That’s dumb; search should be smarter and show you files related to the account you’ve been working on all day. The capability is there, and this year users are going to start to expect it.

Applications become search-centric

Companies tend to deploy their applications and then add some kind of “search portal” or series of functionally separate search applications to try to tie the search experience and the app experience back together. This asks a lot of the user: they have to context-switch and go to a different site. To minimize this friction, search is getting integrated into the core of most applications. Whether it be traditional stalwarts like your CMS or CRM or newcomers like Slack, search is no longer an afterthought; it is the main way to interact with the application. In 2018, this will become an expectation of internal- and customer-facing applications alike, regardless of their use case.

Machine learning becomes ubiquitous

So much of what we do as human beings is grouping things together because they look somehow similar (clustering) and sorting things into categories (classification). Many business problems come down to projecting based on trends (regression). Search has long been used to group things together, and finding those things has often meant dividing them into classes. What is different now is that we can automate that.

However, it goes beyond this. In an era of rampant fraudulent news, reviews, and transactions, machine learning allows search to sort through to the most relevant and most genuine results. This isn’t a nice-to-have anymore for most retail, financial services, or customer service sites.

In healthcare and insurance, similar types of diagnoses, claims, and notes can automatically be grouped. Across the board, as a user’s profile is filled out, recommendations for that user or for similar items are a necessity in an era where there is more data than information.

Migration to the cloud begins in earnest

The sky is falling! Cloud, ho! Many organizations will cease to run their own data centers: if you make pizzas, you should just do that, not deploy servers all over the planet. That said, there are legal and technical barriers to overcome before we’re all serverless. Because search is needed behind the firewall as well as in the cloud, for some time we’ll see on-premise and hybrid solutions more commonly than all-cloud ones. The weather report for 2018 is partly cloudy when it comes to search: expect fewer pure on-premise deployments as more companies turn to hybrid and cloud search installations.

Search application development becomes a solved problem

In 2017, most search application developers were still writing yet another web search box with typeahead and facets from the ground up. Forget the more advanced features you could implement: if you’re still hand-coding basic features step by step, you’re not getting to the finish line very quickly. Most basic features and capabilities for search applications have already been written, and the components are out there, pre-coded for you. Savvy developers will start to use these pre-written, pre-tested components and frameworks and make their tweaks where necessary rather than reinventing the wheel every time. In 2018, we’ll see the end of from-scratch search applications, at least for standard and mobile websites.

Single Service Solutions will start to abate

This year there were a lot of “new” and “old but reimagined” search solutions aimed at just one task: a search solution just for Salesforce, for example. It is hard to see a significant future market for a product that does little more than improve a feature already built into a platform, when that platform’s vendor can simply add the same functionality itself. These single-service search solutions are going to go away. Data wants to be together. Search is more than just “find all customers with ‘Associates’ in the name.” To reach the next level of customer engagement and employee productivity, and to find the answers you need, you have to be able to augment data with other data. That requires a search solution that supports multiple sources and use cases. Sure, it was fun being able to add search to this one service, but now you need more, and you don’t really want to manage ten single-service search solutions for multiple data sources and use cases. In 2018, expect to see some pushback on search solutions that serve only one use case or search only one data source.

Search in 2018: Get started now

Search in 2018 is going to be more personal, contextual, spread through all of your applications, powered by machine learning, and in the cloud. In 2018, we’ll stop developing search UIs from scratch and start putting together components that are pre-written, pre-tested, and ready for prime time. In 2018, we’ll stop using single-service solutions and power our search with a solution that supports all of our use cases with multiple data sources.

Get a move on the new year with Lucidworks Fusion and Fusion App Studio. Consider signing up for Fusion Cloud. This will put you on the path to more personalized modern search applications and give you an excellent path through 2018 and beyond.

The post 7 Predictions for Search in 2018 appeared first on Lucidworks.

Terry Reese: MarcEdit MacOS 3.0 Roadmap

planet code4lib - Tue, 2017-12-12 17:33

With MarcEdit 7 out, I’ve turned my focus to completing the MarcEdit MacOS update. Right now, I’m hoping to have this version available by the first of the year. I won’t be doing a long, extended beta, in part because this version utilizes all the business code written for MarcEdit 7. And like MarcEdit 7, the Mac version can be installed alongside the current Mac build (it won’t replace it) – this way, you can test the new build while continuing to use the previous software.


