Feed aggregator

David Rosenthal: Authors breeding like rabbits

planet code4lib - Thu, 2015-08-13 15:00
The Wall Street Journal points to another problem with the current system of academic publishing with an article entitled How Many Scientists Does It Take to Write a Paper? Apparently, Thousands:
In less than a decade, Dr. Aad, who lives in Marseilles, France, has appeared as the lead author on 458 scientific papers. Nobody knows just how many scientists it may take to screw in a light bulb, but it took 5,154 researchers to write one physics paper earlier this year—likely a record—and Dr. Aad led the list.

His scientific renown is a tribute to alphabetical order. The article includes this amazing graph from Thomson Reuters, showing the spectacular rise in papers with enough authors that their names had to reflect alphabetical order rather than their contribution to the research. And the problem is spreading:
“The challenges are quite substantial,” said Marcia McNutt, editor in chief of the journal Science. “The average number of authors even on a typical paper has doubled.” Of course, it is true that in some fields doing any significant research requires a large team, and that some means of assigning credit to team members is necessary. But doing so by adding their names to an alphabetized list of authors on the paper describing the results has become an ineffective way of doing the job. If each author gets 1/5154 of the credit for a good paper, it is hardly worth having compared to the whole credit for a single-author bad paper. If each of the 5154 authors gets full credit, the paper generates 5154 times as much credit as it is due. And if the list is alphabetized but is treated as reflecting contribution, Dr. Aad is a big winner.

How long before the first paper is published with more authors than words?

Library of Congress: The Signal: DPOE Interview with Austin Schulz of the Oregon State Archives

planet code4lib - Thu, 2015-08-13 13:30

The following is a guest post by Barrie Howard, IT Project Manager at the Library of Congress.

This post is part of a series about digital preservation training inspired by the Library’s Digital Preservation Outreach & Education (DPOE) Program. This series focuses on exceptional individuals who have, among other things, completed one of the DPOE Train-the-Trainer workshops.

Today’s interview is with Austin Schulz, who led a digital preservation training workshop at the Oregon State Archives during 2014 National Archives Month. He is currently a Reference Archivist at the Oregon State Archives.

Austin Schulz at the 2014 Oregon Heritage Excellence Awards. (Photo by Kimberly Jensen)

Barrie: Can you tell the readers about your experience with the Train-the-Trainer workshop, and how you and others have benefited as a result?

Austin: It was a privilege to attend the September 2011 “Train-the-Trainer Workshop” in Washington, D.C., and I am grateful to have worked with such a wonderful group of people on such an important topic in digital preservation. The presentations by our regional groups during the workshop were very helpful and provided an opportunity to see how the modules look from an audience perspective, as well as the chance to present some of the concepts in front of an audience. I particularly enjoyed the chance to work in regional groups because we could modify the presentations to better fit our prospective audiences. The simplicity and adaptability of the digital preservation concepts covered in the Digital Preservation Outreach & Education (DPOE) baseline curriculum make them applicable to anyone who creates and/or maintains digital content. The modules emphasize the primary aspects that both individuals and organizations need to consider as they develop a digital preservation plan, or improve upon an existing plan.

In the months after attending the 2011 DPOE workshop, I led the first in a series of one-hour workshops here at the Oregon State Archives. We decided to make the workshops open to the public with no charge to attend, so that we could better gauge the level of interest in these types of trainings. The following year we hosted another series of workshops, which were designed to run approximately two hours each and focused on two DPOE modules per workshop (all of which I modified to fit a more general audience). Based on the feedback I received after leading the second round of workshops, and those I presented in 2012, we decided to make some additional changes to the format. Due to the distance some attendees were traveling and the time between the workshops, we decided to present all six modules in a single half-day workshop instead of hosting multiple workshops throughout the month. This has allowed more people to attend and has reduced the staff time needed to present each workshop.

Interest in digital preservation workshops is increasing and we continue to receive requests from both public and private entities regarding these workshops. In response, we have incorporated these Digital Preservation workshops, based on the DPOE curriculum, into our Archives Month celebrations each October and I look forward to utilizing the revised DPOE Curriculum this year.

Barrie: Since becoming an official DPOE Trainer, have you provided any other training than the most recent event? For example, have you developed any distance learning materials from the Curriculum, and delivered any online training?

Austin: I recently had the opportunity to lead a digital preservation workshop for a group of genealogists at the Canby Public Library. This gave me the opportunity to reconfigure the DPOE workshop slides to fit a more specific type of audience. Unfortunately, I have not had the opportunity to create any online training, but we have made the workshop materials available for individuals and organizations that have been unable to attend the live workshops.

Barrie: The DPOE Curriculum, which is built upon the OAIS Reference Model, recently underwent a revision. Have you noticed any significant changes in the materials since you attended the Workshop in 2010? What improvements have you observed?

Austin: The core concepts and much of the content in the DPOE Curriculum has remained largely unchanged but there have been some improvements introduced in the current version. The most significant change that I have noticed is that the new module presentations include very useful notes for trainers regarding the purpose of each slide and tips on how to more effectively present them to the audience. This is a helpful addition that will only increase the effectiveness of DPOE trainers. Each of the revised modules includes slides describing expected outcomes and outputs from each module so that both the instructors and participants have a clear understanding of what should be accomplished.

The revised “Store” module now includes a concise statement of the relationship between Archival Storage and Digital Objects. It also includes a slide on the Ingest stage from the OAIS Reference Model that was not part of the 2010 presentation slides. These additions help to highlight the connection between the DPOE curriculum and the OAIS model in a way that is easier to understand.

In addition to the above changes, I also noted that more descriptive information has been added to the Object-level Metadata slide regarding the types of metadata that should be captured. I believe that all of these changes increase the clarity of the DPOE concepts while providing additional information for trainers presenting the DPOE Curriculum.

Barrie: Regarding training opportunities, could you compare the strengths and challenges of traditional in-person learning environments to distance learning options?

Austin: Distance learning allows participants to access the presentations when it is convenient for them, and requires far fewer resources than in-person trainings. In addition, distance learning allows presenters to reach a much larger group than would be possible with in-person trainings. Challenges that I have encountered with distance learning are that it can be a more difficult environment in which to engage the audience, and the presenter may not know whether the concepts presented were received and understood. Distance learning also depends on technology working for both the presenter and the participants, which can sometimes result in technical difficulties. However, if a presenter is trying to reach as many participants as possible, distance learning does provide a viable avenue for doing just that.

Traditional in-person learning options are often geared towards smaller audiences which makes it easier for a presenter to engage with and assess how well participants are following and understanding the materials as they are being presented. This format makes it easier for participants to ask questions during the presentation and provides the presenter with the opportunity to address individual audience concerns and questions during the relevant parts of the presentation. In my experience, in-person learning environments allow for more effective discussions and participants may find it easier to engage with the presenter. However, in-person learning environments do present some challenges as they require a physical site where the training will be held and staff to present the materials. Participants must travel to the training site which creates a cost and distance barrier that may prevent some people from being able to attend. This also limits the number of people that can actually attend the in-person training. Even with these challenges I prefer the in-person format as both a presenter and participant, as it provides an opportunity for more in-depth analysis of the presentation materials.

Barrie: What’s on the horizon for 2015?

Austin: Earlier this year we applied for a grant to provide basic digital preservation training, using the DPOE Curriculum, to some small and medium-sized historical repositories in Oregon. Many of these repositories currently have a minimal or nonexistent web presence and very little experience in digital preservation or online publishing of historical records. As one of the regional DPOE trainers, I would be involved in editing and leading the training workshops. Unfortunately, the grant we received was not sufficient to completely fund such a project at this point. Therefore, we have decided to scale the project back and are applying for a smaller grant to do a demonstration project this year which still includes a digital preservation component. If the grant is approved, we plan to report back with our findings next year. The following year we will reapply for funding of the original project to provide basic digital preservation training. I am excited to have the opportunity to be involved in such an important project and to utilize the revised DPOE Curriculum to assist smaller historical repositories in Oregon.

Thank you very much for allowing me to be interviewed for The Signal. I have thoroughly enjoyed being a regional DPOE Trainer and look forward to continuing this important work in the future.

Hydra Project: Sufia 6.3.0 released

planet code4lib - Thu, 2015-08-13 08:37

We are pleased to announce a new gem release.

The 6.3.0 version of Sufia includes a new widget in the administrative statistics page allowing users to display the number of deposits for a date range that they select [1] and adds a content block to the homepage where administrative users may post site-wide announcements (such as for system downtime or new features) [2]. It pulls in the latest version of ActiveFedora::Noid which handles minting and validation of short, opaque identifiers for Fedora objects [3]. It also contains the following highlighted fixes and enhancements:

* Numerous UI improvements related to layout, accessibility, and mobile displays [4][5][6][7][8][9]

* Single-use links should work when Turbolinks is on [10]

* Unregistered users should have the ability to see file citations [11]

* Remove hard-coded headers from About page [12]

* Allow downstream users to extend and override the administrative statistics module [13]

See the release notes [14] for the upgrade process (NOTE: requires running a new rake task!) and for an exhaustive list of the work that has gone into this release. Thanks to the 12 contributors for this release, which comprised 51 commits touching 342 files: Carolyn Cole, Drew Myers, Trey Terrell, Michael Tribone, Lynette Rayle, Dan Kerchner, Justin Coyne, Colin Gross, Hector Correa, Adam Wead, and Olli Lyytinen.
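For readers unfamiliar with Noid-style identifiers, here is a minimal Python sketch of how minting and validating short, opaque identifiers with a trailing check character can work in principle. It is a conceptual illustration only and does not reflect the actual Ruby API of ActiveFedora::Noid; the template string and alphabet below are assumptions based on the general NOID approach.

```python
import random

# Betanumeric alphabet commonly used by NOID-style minters (digits plus
# consonants, no vowels, no 'l') -- assumed here for illustration only.
ALPHABET = "0123456789bcdfghjkmnpqrstvwxz"


def check_char(s):
    """Compute a NOID-style check character: position-weighted sum of
    character ordinals, modulo the alphabet size."""
    total = sum(ALPHABET.index(c) * (i + 1) for i, c in enumerate(s) if c in ALPHABET)
    return ALPHABET[total % len(ALPHABET)]


def mint(template="eedeedk", prefix=""):
    """Mint a short, opaque identifier from a NOID-like template:
    'd' = digit, 'e' = extended (betanumeric) char, trailing 'k' = check char."""
    chars = []
    for code in template:
        if code == "d":
            chars.append(random.choice("0123456789"))
        elif code == "e":
            chars.append(random.choice(ALPHABET))
    body = prefix + "".join(chars)
    return body + check_char(body) if template.endswith("k") else body


def valid(identifier):
    """An identifier is valid if its last character matches the recomputed check character."""
    return len(identifier) > 1 and identifier[-1] == check_char(identifier[:-1])


pid = mint()
print(pid, valid(pid))  # valid(pid) is always True for a freshly minted identifier
```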

[1] https://github.com/projecthydra/sufia/pulls/1188

[2] https://github.com/projecthydra/sufia/pulls/1239

[3] https://github.com/projecthydra/sufia/pulls/1037

[4] https://github.com/projecthydra/sufia/pulls/1243

[5] https://github.com/projecthydra/sufia/pulls/1248

[6] https://github.com/projecthydra/sufia/pulls/1249

[7] https://github.com/projecthydra/sufia/pulls/1270

[8] https://github.com/projecthydra/sufia/pulls/1271

[9] https://github.com/projecthydra/sufia/pulls/1273

[10] https://github.com/projecthydra/sufia/pulls/1278

[11] https://github.com/projecthydra/sufia/pulls/1266

[12] https://github.com/projecthydra/sufia/pulls/1239

[13] https://github.com/projecthydra/sufia/pulls/1245

[14] https://github.com/projecthydra/sufia/releases/tag/v6.3.0

FOSS4Lib Recent Releases: Sufia - 6.3.0

planet code4lib - Thu, 2015-08-13 01:27

Last updated August 12, 2015. Created by Peter Murray on August 12, 2015.

Package: Sufia
Release Date: Wednesday, August 12, 2015

DuraSpace News: Invitation to join the DSpace User Interface (UI) Working Group

planet code4lib - Thu, 2015-08-13 00:00

From Tim Donohue, DSpace Tech Lead, DuraSpace

Winchester, MA - Based on the recently approved Roadmap [1], a new working group to oversee the DSpace User Interface (UI) replacement project is underway. We are extending an invitation to any interested community members to join this working group or take part in UI pilots.

James Cook University, Library Tech: Urquhart's Law: really?

planet code4lib - Wed, 2015-08-12 22:39
Sorry, this is just fluff, but this morning's Campus Morning Mail included another bit of fluff that I felt drawn to comment on:

LibX: Which languages does LibX support?

planet code4lib - Wed, 2015-08-12 14:40

We have translations for LibX in a number of languages; the full translations of all terms can be found here. As of 8/15/2015, this includes English, German, French, Italian, Portuguese, and Japanese.

Contributing a new language:

To contribute, download the en_US/messages.json file and translate it. Save the file as UTF-8 and send it to libx.org@gmail.com.

To fix issues with existing translations, do the same. The translation files for the currently supported languages can be found here.
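For contributors who want to sanity-check a translation before sending it in, here is a small, hypothetical Python helper (not part of LibX itself) that compares a translated messages.json against the en_US original and reports missing, extra, or possibly untranslated keys. The "de/messages.json" path is just an example.

```python
import json


def load_messages(path):
    # LibX translation files are UTF-8 encoded JSON, mapping a message name
    # to an object such as {"message": "...", "description": "..."}.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def compare(original_path, translated_path):
    original = load_messages(original_path)
    translated = load_messages(translated_path)
    missing = sorted(set(original) - set(translated))
    extra = sorted(set(translated) - set(original))
    untranslated = sorted(
        key for key in set(original) & set(translated)
        if original[key].get("message") == translated[key].get("message")
    )
    return missing, extra, untranslated


if __name__ == "__main__":
    missing, extra, untranslated = compare("en_US/messages.json", "de/messages.json")
    print("Missing keys:", missing)
    print("Extra keys:", extra)
    print("Possibly untranslated:", untranslated)
```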

Testing new languages/using a different language:

LibX will use the default locale of the underlying browser. Extensions that can switch that locale should therefore affect the language in which LibX’s user interface is displayed, and this is the only way to do so. For the Chrome browser, instructions on how to switch the locale are here. Note that a restart of the browser is required.

In the Library, With the Lead Pipe: New Grads, Meet New Metrics: Why Early Career Librarians Should Care About Altmetrics & Research Impact

planet code4lib - Wed, 2015-08-12 13:00

Photo courtesy of Peter Taylor. https://www.flickr.com/photos/nickstone333/8013446946

In Brief

How do academic librarians measure their impact on the field of LIS, particularly in light of eventual career goals related to reappointment, promotion, or tenure? The ambiguity surrounding how to define and measure impact is arguably one of the biggest frustrations that new librarians face, especially if they are interested in producing scholarship outside of traditional publication models. To help address this problem, we seek to introduce early career librarians and other readers to altmetrics, a relatively new concept within the academic landscape that considers web-based methods of sharing and analyzing scholarly information.

Introduction

For new LIS graduates with an eye toward higher education, landing that first job in an academic library is often the first and foremost priority. But what happens once you land the job? How do new librarians go about setting smart priorities for their early career decisions and directions, including the not-so-long term goals of reappointment, promotion, or tenure?

While good advice is readily available for most librarians looking to advance “primary” responsibilities like teaching, collection development, and support for access services, advice on the subject of scholarship—a key requirement of many academic librarian positions—remains relatively neglected by LIS programs across the country. Newly hired librarians are therefore often surprised by the realities of their long term performance expectations, and can especially struggle to find evidence of their “impact” on the larger LIS profession or field of research over time. These professional realizations prompt librarians to ask what it means to be impactful in the larger world of libraries. Is a poster at a national conference more or less impactful than a presentation at a regional one? Where can one find guidance on how to focus one’s efforts for greatest impact? Finally, who decides what impact is for librarians, and how does one go about becoming a decision-maker?

The ambiguity surrounding how to both define and measure impact quantitatively is a huge challenge for new librarians, particularly for those looking to contribute to the field beyond the publication of traditional works of scholarship. To help address this problem, this article introduces early career librarians and LIS professionals to a concept within the landscape of academic impact measurement that is more typically directed at seasoned librarian professionals: altmetrics, or the creation and study of metrics based on the Social Web as a means for analyzing and informing scholarship1. By focusing especially on the value of altmetrics to early career librarians (and vice versa), we argue that altmetrics can and should become a more prominent part of academic librarians' toolkits at the beginning of their careers. Our approach to this topic is shaped by our own early experiences with the nuances of LIS scholarship, as well as by our fundamental interest in helping researchers who struggle with scholarly directions in their fields.

What is altmetrics & why does it matter?

Altmetrics has become something of a buzzword within academia over the last five years. Offering users a view of impact that looks beyond the world of citations championed by traditional metric makers, altmetrics has grown especially popular with researchers and professionals who ultimately seek to engage with the public—including many librarians and LIS practitioners. Robin, for instance, first learned of their existence in late 2011, when working as a library liaison to a School of Communication that included many public-oriented faculty, including journalists, filmmakers, and PR specialists.

One of the reasons for this growing popularity is the narrow definition of scholarly communication that tends to equate article citations with academic impact. Traditional citation metrics like Impact Factor by nature take for granted the privileged position of academic journal articles, which are common enough in the sciences but less helpful in fields (like LIS) that accept a broader range of outputs and audiences. For instance, when Rachel was an early career librarian, she co-produced a library instruction podcast, which had a sizable audience of regular listeners, but was not something that could be described in the same impact terms as an academic article. By contrast, altmetrics indicators tend to land at the level of individual researcher outputs—the number of times an article, presentation, or (in Rachel's case) podcast is viewed online, downloaded, etc.

Altmetrics also opens up the door to researchers who, as mentioned earlier, are engaged in online spaces and networks that include members beyond the academy. Twitter is a common example of this, as are certain blogs, like those directly sponsored by scholarly associations or publishers. Interested members of the general public, as well as professionals outside of academia, are thus acknowledged by altmetrics as potentially valuable audiences, audiences whose ability to access, share, and discuss research opens up new questions about societal engagement with certain types of scholarship. Consider: what would it mean if Rachel had discovered evidence that her regular podcast listeners included teachers as well as librarians? What if Robin saw on Twitter that a communication professor’s research was being discussed by federal policy makers?

A good example of this from the broader LIS world is the case of UK computer scientist Steve Pettifer, whose co-authored article “Defrosting the Digital Library: Bibliographic Tools for the Next Generation Web” was profiled in a 2013 Nature article for the fact that it had been downloaded by Public Library of Science (PLOS) users 53,000 times between 2008 and 2012, as “the most-accessed review ever to be published in any of the seven PLOS journals”2. By contrast, Pettifer’s article had at that point in time generated about 80 citations, a number that, while far from insignificant, left uncaptured the degree of interest in his research from a larger online community. The fact that Pettifer subsequently included this metric as part of a successful tenure package highlights one of the main attractions of altmetrics for researchers: the ability to supplement citation-based metrics, and to build a stronger case for evaluators seeking proof of a broad spectrum of impact.

The potential of altmetrics to fill gaps for both audiences and outputs beyond traditional limits has also brought it to the crucial attention of funding agencies, the vast majority of which have missions that tie back to the public good. For instance, in a 2014 article in PLOS Biology, the Wellcome Trust, the second-highest spending charitable foundation in the world, openly explained its interest in “exploring the potential value of [article level metrics]/altmetrics” to shape its future funding strategy3. Among the article’s other arguments, it cites the potential for altmetrics to be “particularly beneficial to junior researchers, especially those who may not have had the opportunity to accrue a sufficient body of work to register competitive scores on traditional indicators, or those researchers whose particular specialisms seldom result in key author publications.” Funders, in other words, acknowledge the challenges that (1) early career academics face in proving the potential impact of their ideas; and (2) researchers experience in disciplines that favor a high degree of specialization or collaboration.

As librarians who work in public services, we have both witnessed these challenges in action many times—even in our own field of LIS. Take, for instance, the example of a librarian hoping to publish the results of an information literacy assessment in a high impact LIS journal based on Impact Factor. According to the latest edition of Journal Citation Reports4, the top journal for the category of “Information Science & Library Science” is Management Information Systems Quarterly, a venue that fits poorly with our librarian's information literacy research. The next two ranked journals, Journal of Information Technology and Journal of the American Medical Informatics Association, offer versions of the same conundrum; neither is appropriate for the scope of the librarian's work. Thus, the librarian is essentially locked out of the top three ranked journals in the field, not because his/her research is suspect, but because the research doesn't match a popular LIS specialty. Scenarios like this are very common, and offer weight to the argument that LIS is in need of alternative tools for communicating and contextualizing scholarly impact to external evaluators.

Non-librarians are of course also a key demographic within the LIS field, and have their own set of practices for using metrics for evaluation and review. In fact, for many LIS professionals outside of academia, the use of non-citation based metrics hardly merits a discussion, so accepted are they for tracking value and use. For instance, graduates in programming positions will undoubtedly recognize GitHub, an online code repository and hosting service in which users are rewarded for the number of “forks,” watchers, and stars their projects generate over time. Similarly, LIS grads who work with social media may utilize Klout scores—a web-based ranking that assigns influence scores to users based on data from sites like Twitter, Instagram, and Wikipedia. The information industry has made great strides in flexing its definition of impact to include more social modes of communication, collaboration, and influence. However, as we have seen, academia continues to linger on the notion of citation-based metrics.

This brings us back to the lure of altmetrics for librarians: namely, its potential to redefine, or at least broaden, how higher education thinks about impact, and how impact can be distinguished from additional evaluative notions like “quality.” Our own experiences and observations have led us to believe strongly that this must be done, but also done with eyes open to the ongoing strengths and weaknesses of altmetrics. With this in mind, let us take a closer look at the field of altmetrics, including how it has developed as a movement.

The organization of altmetrics

One of the first key points to know about altmetrics is how it can be organized and understood. Due to its online, entrepreneurial nature, altmetrics as a field can be incredibly quick-changing and dynamic—an issue and obstacle that we'll return to a bit later. For now, however, let's take a look at the categories of altmetrics as they currently stand in the literature.

To date, several altmetrics providers have taken the initiative to create categories that are used within their tools. For instance, PlumX, one of the major altmetrics tools, sorts metrics into five categories, including “social media” and “mentions”, both measures of the various likes, favorites, shares and comments that are common to many social media platforms. Impactstory, another major altmetrics tool, divides its basic categories into “Public” and “Scholar” sections based on the audience that is most likely to be represented within a particular tool. To date, there is no best practice when it comes to categorizing these metrics. A full set of categories currently in use by different altmetrics toolmakers can be found in Figure 1.

Figure 1. Altmetrics categories in use by major altmetrics tools.

One challenge to applying categories to altmetrics indicators is the ‘grass roots’ way in which the movement was built. Rather than having a representative group of researchers get together and say  “let’s create tools to measure the ways in which scholarly research is being used/discussed/etc.”, the movement started with a more or less concurrent explosion of online tools that could be used for a variety of purposes, from social to academic. For example, when a conference presentation is recorded and uploaded to YouTube, the resulting views, likes, shares, and comments are all arguably indications of interest in the presentation’s content, even though YouTube is hardly designed to be an academic impact tool. A sample of similarly flexible tools from which we can collect data relevant to impact is detailed in Figure 2.

Figure 2. Examples of altmetrics data sources used by different toolmakers.

Around the mid-2000s5, people began to realize that online tools could offer valuable insights into the attention and impact of scholarship. Toolmakers thus began to build aggregator resources that purposefully gather data from different social sites and try to present this data in ways that are meaningful to the academic community. However, as it turns out, each tool collects a slightly different set of metrics and has different ideas about how to sort its data, as we saw in Figure 1. The result is that there is no inherent rhyme or reason as to why altmetrics toolmakers track certain online tools and not others, nor to what the data they produce looks like. It's a symptom of the fact that the tools upon which altmetrics are based were not originally created with altmetrics in mind.
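To make the aggregator pattern described above concrete, here is a toy Python sketch that rolls raw counts from several (entirely hypothetical) sources up into broad presentation categories. The source names, numbers, and category labels are invented for illustration; real tools differ in both coverage and category schemes.

```python
# Hypothetical raw counts, as they might be harvested from individual sources.
RAW_COUNTS = {
    "twitter_mentions": 42,
    "mendeley_readers": 17,
    "slideshare_views": 230,
    "github_forks": 3,
}

# Hypothetical mapping from source to a broad presentation category.
CATEGORY_MAP = {
    "twitter_mentions": "social media",
    "mendeley_readers": "readers",
    "slideshare_views": "usage",
    "github_forks": "reuse",
}


def aggregate(raw_counts, category_map):
    """Roll raw per-source counts up into presentation categories."""
    totals = {}
    for source, count in raw_counts.items():
        category = category_map.get(source, "other")
        totals[category] = totals.get(category, 0) + count
    return totals


print(aggregate(RAW_COUNTS, CATEGORY_MAP))
# {'social media': 42, 'readers': 17, 'usage': 230, 'reuse': 3}
```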

Altmetrics tools

Let’s now take a look at some of the altmetrics tools that have proved to be of most use to librarians in pursuit of information about their impact and scope of influence.

Mendeley

Mendeley is a citation management tool, a category that also includes tools like EndNote, Zotero and RefWorks. These tools’ primary purpose is to help researchers organize citations, as well as to cite research quickly in a chosen style such as APA. Mendeley takes these capabilities one step further by helping researchers discover research through its social networking platform, where users can browse through articles relevant to their interests or create/join a group where they can share research with other users.6 Unique features: When registering for a Mendeley account, a user must submit basic demographic information, including a primary research discipline. As researchers download papers into their Mendeley library, this demographic information is tracked, so we can see overall interest for research articles in Mendeley, along with a discipline-specific breakdown of readers, as shown in Figure 3.7

Figure 3. Mendeley Readership Statistics from one of our articles.

Impactstory

Impactstory is an individual subscription service ($60/year as of writing) that creates a sort of ‘online CV’ supplement for researchers. It works by collecting and displaying altmetrics associated with the scholarly products entered by the researcher into their Impactstory account. As alluded to before, one of the biggest innovations in the altmetrics realm in the past few years has been the creation of aggregator tools that collect altmetrics from a variety of sources and present metrics in a unified way. Unique features: Impactstory is an example of a product targeted specifically at authors, e.g. displaying altmetrics for items authored by just one person. It is of particular interest to many LIS researchers because it can track products that aren’t necessarily journal articles. For example, it can track altmetrics associated with blogs, SlideShare presentations, and YouTube videos, all examples of ways in which many librarians like to communicate and share information relevant to librarianship.

Figure 4. The left-side navigation shows the different types of research products for one Impactstory profile.

PlumX

PlumX is an altmetrics tool specifically designed for institutions. Like Impactstory, it collects scholarly products produced by an institution and then displays altmetrics for individuals, groups (like a lab or a department), and for the entire institution. Unique features: Since PlumX’s parent company Plum Analytics is owned by EBSCO, PlumX is the only tool that incorporates article views and downloads from EBSCO databases. PlumX also includes a few sources that other tools don’t incorporate, such as GoodReads ratings and WorldCat library holdings (both metrics sources that work well for books). PlumX products can be made publicly available, such as the one operated by the University of Pittsburgh at http://plu.mx/pitt.

Altmetric

Altmetric is a company that offers a suite of products, all of which are built on the generation of altmetrics geared specifically at journal articles. Their basic product, the Altmetric Bookmarklet, generates altmetrics data for journal articles with a DOI8, with a visual ‘donut’ display that represents the different metrics found for the article (see Figure 5). Unique features: One product, Altmetric Explorer, is geared toward librarians and summarizes recent altmetrics activity for specific journals. This information can be used to gain more insight into a library’s journal holdings, which can be useful for making decisions about the library’s journal collection.

Figure 5. This Altmetric donut shows altmetrics from several different online tools for one journal article.
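For readers curious what article-level data retrieval can look like in practice, the sketch below queries Altmetric's free public API for a single DOI in Python. The endpoint and field names follow our understanding of Altmetric's documented v1 API at the time of writing and should be treated as assumptions to verify against the current documentation; the DOI string is only an example placeholder.

```python
import json
import urllib.error
import urllib.request


def altmetrics_for_doi(doi):
    """Fetch the Altmetric attention record for a DOI, or None if none exists."""
    url = "https://api.altmetric.com/v1/doi/" + doi
    try:
        with urllib.request.urlopen(url) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # Altmetric has no attention data for this DOI
            return None
        raise


# Example DOI placeholder -- substitute the article you want to check.
data = altmetrics_for_doi("10.1371/journal.pcbi.1000204")
if data:
    print("Altmetric score:", data.get("score"))
    print("Tweets:", data.get("cited_by_tweeters_count", 0))
    print("Mendeley readers:", data.get("readers", {}).get("mendeley"))
else:
    print("No altmetrics data recorded for this DOI.")
```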

Current issues & initiatives

Earlier, we mentioned that one of the primary characteristics of altmetrics is its lack of consistency over time. Indeed, the field has already changed significantly since the word altmetrics first appeared in 2010. Some major changes include: the abandonment of ReaderMeter9, one of the earliest altmetrics tools; shifting funding models, including the acquisition of PlumX by EBSCO in January 2014 and the implementation of an Impactstory subscription fee in July 2014; and the adoption of altmetrics into well-established scholarly tools and products such as Scopus, Nature journals, and most recently, Thomson Reuters10.

One exciting initiative poised to bring additional clarity to the field is the NISO (National Information Standards Organization) Altmetrics Initiative. Now in its final stage, the Initiative has three working groups collaborating on a standard definition of altmetrics, use cases for altmetrics, and standards associated with the quality of altmetrics data and the way in which altmetrics are calculated. Advocates of altmetrics (ourselves included) have expressed hope that the NISO Initiative will help bring more stability to this field, and answer confusion associated with the lack of altmetrics standardization.

Criticisms also make up a decent proportion of the conversation about altmetrics. One of the most well known is the possibility of 'gaming', or of users purposefully inflating altmetrics data. For example, a researcher could 'spam' Twitter with links to their article, or could load the article's URL many times, both of which would increase the metrics associated with their article. We've heard about such fears from users before, and they are definitely worth keeping in mind when evaluating altmetrics. However, it's also fair to say that toolmakers are taking measures to counteract this worry—Altmetric, for example, automatically eliminates tweets that appear auto-generated. Still, more sophisticated methods for detecting and counteracting this kind of activity will eventually help build confidence and trust in altmetrics data.

Another criticism associated with altmetrics is one it shares with traditional citation-based metrics: the difficulty of accurately and fairly measuring the scholarly impact of every discipline. As we've seen, many altmetrics tools still focus on journal articles as the primary scholarly output, but for some disciplines, articles are not the only way (or even the main way) in which researchers in that discipline are interacting. Librarianship is a particularly good example of this disciplinary bias. Since librarianship is a 'discipline of practice', so to speak, our day-to-day librarian responsibilities are often heavily influenced by online webinars, conference presentations, and even online exchanges via Twitter, blogs, and other social media. Some forms of online engagement can be captured with altmetrics, but many interactions are beyond the scope of what can be measured. For example, when the two of us present at conferences, we try our best to collect a few basic metrics: audience count; audience assessments; and Twitter mentions associated with the presentation. We also upload presentation materials to the Web when possible, to capture post-presentation metrics (e.g. presentation views on SlideShare). In one case, we did a joint talk that was uploaded to YouTube, which meant we could monitor video metrics over time. However, one of the most telling impact indicators, evidence that a librarian has used the information presented in their own work, is still unlikely to be captured by any of these metrics. Until researchers can say with some certainty that online engagement is an accurate reflection of disciplinary impact, these metrics will always be of limited use when trying to measure true impact.

Finally, there is a concern growing amongst academics regarding the motivations of those pushing the altmetrics movement forward—namely, a concern that the altmetrics toolmakers are the ‘loudest’ voice in the conversation, and are thus representing business concerns rather than the larger concerns of academia. This criticism is actually one that we think strongly speaks to the need for additional librarian involvement in altmetrics on behalf of academic stakeholders, to ensure that their needs are addressed. One good example of librarians representing academia is at the Charleston Conference, where vendors and librarians frequently present together and discuss future trends in the field.

There are thus many uncertainties inherent to the current state of altmetrics. Nevertheless, such concerns do not overshadow the real shift that altmetrics represents in the way that academia measures and evaluates scholarship. Put in this perspective, it is little wonder that many researchers and librarians have found the question of how to improve and develop altmetrics over time to be ultimately worthwhile.

Role of LIS graduates and librarians

As mentioned at the beginning of this article, new academic librarians are in need of altmetrics for the same reasons as all early career faculty: to help track their influence and demonstrate the value of their diverse portfolios. However, the role of LIS graduates relative to altmetrics is also a bit unique, in that many of us also shoulder a second responsibility, which may not be obvious at first to early career librarians. This responsibility goes back to the central role that librarians have played in the creation, development, and dissemination of research metrics since the earliest days of citation-based analysis. Put simply: librarians are also in need of altmetrics in order to provide robust information and support to other researchers—researchers who, more often than not, lack LIS graduates’ degree of training in knowledge organization, information systems, and scholarly communication.

The idea that LIS professionals can be on the front lines of support for impact measurement is nothing new to experienced academic librarians, particularly those in public services roles. According to a 2013 survey of 140 libraries at institutions across the UK, Ireland, Australia and New Zealand, 78.6% of respondents indicated that they either offer, or plan to offer, "bibliometric training" services to their constituents as part of their support for research11. Dozens of librarian-authored guides on the subject of "research impact," "bibliometrics," and "altmetrics" can likewise be found through Google searches. What's more, academic libraries are increasingly stepping up as providers of alternative metrics. For example, many libraries collect and display usage statistics for objects in their institutional repositories.

Still, for early career librarians, the thought of jumping into the role of “metrics supporter” can be intimidating, especially if undertaken without a foundation of practical experience on the subject. This, again, is a reason that the investigation of altmetrics from a personal perspective is key for new LIS graduates. Not only does it help new librarians consider how different definitions of impact can shape their own careers, but it also prepares them down the line to become advocates for appropriate definitions of impact when applied to other vulnerable populations of researchers, academics, and colleagues.

Obstacles & opportunities

Admittedly, there are several obstacles for new librarians who are considering engaging with research metrics. One of the biggest is the lack of discourse in many LIS programs concerning methods for measuring research impact. Metrics are only one small piece of a much larger conversation concerning librarian status at academic institutions, and the impact that status has on scholarship expectations, so it's not a shock that this subject isn't routinely covered by LIS programs. Regardless, it's an area for which many librarians may feel underprepared.

Another barrier that can prevent new librarians from engaging with altmetrics is the hesitation to position oneself as an expert in the area when engaging with stakeholders, including researchers, vendors, and other librarians12. These kinds of mental barriers are nearly universal among professionals13 and can be found, in some form, at nearly every institution14. At Rachel's institution, for example, the culture regarding impact metrics has been relatively conservative and dominated by Impact Factor, so she's been cautious about introducing new impact-related ideas, serving more as a source of information for researchers who seek assistance than as a constant activist for new metrics standards.

Luckily, for every barrier that new LIS grads face in cultivating a professional relationship with altmetrics, there is almost always a balancing opportunity. For example, newly hired academic librarians almost inevitably find themselves in the position of being prized by colleagues for their “fresh perspective” on certain core issues, from technology to higher education culture15. During this unique phase of a job, newly-hired librarians may find it surprisingly comfortable—even easy!—to bring new ideas about research impact or support services to the attention of other librarians and local administrators.

Another advantage that some early career librarians have in pursuing and promoting altmetrics is position flexibility. Librarians who are new to an institution tend to have the option to help shape their duties and roles over time. Early career librarians are also generally expected by their libraries to devote a regular proportion of their time to the goal of professional development, an area for which the investigation of altmetrics fits nicely, both as a practical skill and a possible topic of institutional expertise.

For those librarians who are relatively fresh out of a graduate program, relationships with former professors and classmates can also offer a powerful opportunity for collecting and sharing knowledge about altmetrics. LIS cohorts have the advantage (outside of some job competition) of entering the field at more or less the same time, a fact that strengthens bonds between classmates and can translate into a long-term community of support and information sharing. LIS teaching faculty may also be particularly interested in hearing from recent graduates about the skills and topics they value as they move through the first couple years of a job. Communicating back to these populations is a great way to effect change across existing networks, as well as to prepare for the building of new networks around broad LIS issues like impact.

Making plans to move forward

Finding the time to learn more about altmetrics can seem daunting as a new librarian, particularly understanding how it relates to other "big picture" LIS topics like scholarly communication, Open Access, bibliometrics, and data management. However, this cost acknowledged, altmetrics is a field that quickly rewards those willing to get practical with it. Consequently, we recommend setting aside some time in your schedule to concentrate on three core steps for increasing your awareness and understanding of altmetrics.

#1. Pick a few key tools and start using them.

A practical approach to altmetrics means, on a basic level, practicing with the tools. Even librarians who have no desire to become power users of Twitter can learn a lot about the tool by signing up for a free account and browsing different feeds and hashtags. Likewise, librarians curious about what it’s like to accumulate web-based metrics can experiment with uploading one of their PowerPoint presentations to SlideShare, or a paper to an institutional repository. Once you have started to accumulate an online professional identity through these methods, you can begin to track interactions by signing up for a trial of an altmetrics aggregation tool like Impactstory, or a reader-oriented network like Mendeley. Watching how your different contributions do (or do not!) generate altmetrics over time will tell you a lot about the pros and cons of using altmetrics—and possibly about your own investment in specific activities. Other recommended tools to start: see Table One for more ideas.

#2. Look out for altmetrics at conferences and events.

Low stakes, opportunistic professional development is another excellent strategy for getting comfortable with altmetrics as an early career librarian. For example, whenever you find yourself at a conference sponsored by ALA, ACRL, SLA, or another broad LIS organization, take a few minutes to browse the schedule for any events that mention altmetrics or research impact. More and more, library and higher education conferences are offering attendees sessions related to altmetrics, whether theoretical or practical in nature. Free webinars that touch on altmetrics are also frequently offered by technology-invested library sub-groups like LITA. Use the #altmetrics hashtag on Twitter to help uncover some of these opportunities, or sign up for an email listserv and let them come to you.

#3. Commit to reading a shortlist of altmetrics literature.

Not surprisingly, reading about altmetrics is one of the most effective things librarians can do to become better acquainted with the field. However, whether you choose to dive into altmetrics literature right away, or wait to do so until after you have experienced some fundamental tools or professional development events, the important thing is to give yourself time to read not one or two, but several reputable articles, posts, or chapters about altmetrics. The reason for doing this goes back to the set of key issues at stake in the future of altmetrics. Exposure to multiple written works, ideally authored by different types of experts, will give librarians new to altmetrics a clearer, less biased sense of the worries and ambitions of various stakeholders. For example, we've found the Society for Scholarly Publishing's blog Scholarly Kitchen to be a great source of non-librarian higher education perspectives on impact measurement and altmetrics. Posts on tool maker blogs, like those maintained by Altmetric and Impactstory, have likewise proved to be informative, as they tend to respond quickly to major controversies or innovations in the field. Last but not least, scholarly articles on altmetrics are now widely available—and can be easily discovered via online bibliographies or good old-fashioned search engine sleuthing.

As you can imagine, beginning to move forward with altmetrics can take as little as a few minutes—but becoming well-versed in the subject can take the better part of an academic year. In the end, the decision of how and when to proceed will probably shift with your local circumstances. However, making altmetrics part of your LIS career path is an idea we hope you’ll consider, ideally sooner rather than later.

Conclusion

Over the last five years, altmetrics has emerged in both libraries and higher education as a means of tracing attention and impact; one that reflects the ways that many people, both inside and outside of academia, seek and make decisions about information every day. For this reason, this article has argued that librarians should consider the potential value of altmetrics to their careers as soon as possible (e.g. in their early careers), using a variety of web-based indicators and services to help inform them of their growing influence as LIS professionals and scholars.

Indeed, altmetrics as a field is in a state of development not unlike that of an early career librarian. Its future, for instance, is also marked by some questions and uncertainties—definitions that have yet to emerge, disciplines with which it has yet to engage, etc. And yet, despite these hurdles, both altmetrics and new academic librarians share the power over time to change the landscape of higher education in ways that have yet to be fully appreciated. It is this power and value that the reader should remember when it comes to altmetrics and their use.

And so, new graduates, please meet altmetrics. We think the two of you are going to get along just fine.

Thanks to the In the Library with the Lead Pipe team for their guidance and support in producing this article. Specific thanks to our publishing editor, Erin Dorney, our internal peer reviewer, Annie Pho, and to our external reviewer, Jennifer Elder. You each provided thoughtful feedback and we couldn’t have done it without you.

Recommended Resources

Altmetrics.org. http://altmetrics.org/

Altmetrics Conference. http://www.altmetricsconference.com/

Altmetrics Workshop. http://altmetrics.org/altmetrics15/

Chin Roemer, Robin & Borchardt, Rachel. Meaningful Metrics: A 21st Century Librarian’s Guide to Bibliometrics, Altmetrics, & Research Impact. ACRL Press, 2015.

Mendeley Altmetrics Group. https://www.mendeley.com/groups/586171/altmetrics/

NISO Altmetrics Initiative. http://www.niso.org/topics/tl/altmetrics_initiative/  

The Scholarly Kitchen. http://scholarlykitchen.sspnet.org/tag/altmetrics/

WUSTL Becker Medical Library, “Assessing the Impact of Research.” https://becker.wustl.edu/impact-assessment

  1. There are in fact many extant definitions of altmetrics (formerly alt-metrics). However, this definition is taken from one of the earliest sources on the topic, Altmetrics.org
  2. http://www.nature.com/nature/journal/v500/n7463/full/nj7463-491a.html
  3. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002003
  4. This ranking is according to the 2014 edition of Journal Citation Reports, as filtered for the category “Information Science & Library Science.” There is no way to disambiguate this category into further specialty areas.
  5. While the first reference to “altmetrics” arose out of a 2010 tweet by Jason Priem, the idea behind the value of altmetrics was clearly present in the years leading up to the coining of the term.
  6. One such group in Mendeley is the Altmetrics group, with close to 1000 members as of July 2015.
  7. Another feature that may be of interest to librarians undertaking a lit review – Mendeley can often automatically extract metadata from an article PDF, and can even ‘watch’ a computer folder and automatically add any new PDFs from that folder into the Mendeley library, making research organization relatively painless.
  8. The bookmarklet can be downloaded and installed here: http://www.altmetric.com/bookmarklet.php
  9. http://readermeter.org/ “It’s been a while… but we’re working to bring back ReaderMeter” has been displayed for several years.
  10. Thomson Reuters is currently beta-testing inclusion of “Item Level Usage Counts” into Web of Science, similar to EBSCO’s tracking of article views and downloads. More information on this feature can currently be found in the form of webinars and other events.
  11. Sheila Corrall, Mary Anne Kennan, and Waseem Afzal. “Bibliometrics and Research Data Management Services: Emerging Trends in Library Support for Research.” Library Trends 61.3 (2013): 636-674. Project MUSE. Web. 14 Jul. 2015. http://muse.jhu.edu/journals/library_trends/v061/61.3.corrall02.html
  12. The authors can attest that, even after publishing a book on the topic, this is an obstacle that may never truly be overcome!
  13. One recent study found that 1 in 8 librarians had higher than average levels of Imposter Syndrome, a rate which increases amongst newer librarians. 
  14. An overview of one stakeholder group's perceptions of libraries and librarians, namely that of faculty members, is explored in the Ithaka Faculty Survey.
  15. The value of new graduates' “fresh perspective” is well known anecdotally, but it is also well evidenced by the proliferation of Resident Librarian positions at academic libraries across the country.

LITA: Required Reading

planet code4lib - Wed, 2015-08-12 13:00

Stop what you’re doing and pick up a copy of Program or Be Programmed: Ten Commands for a Digital Age. Douglas Rushkoff’s 21st century call to arms ought to be required reading for librarians (not just those with the word digital in their job title). This is a quick read with big impact and it deserves more than a skim.

Gear Eyes Girl by Anna Sher from the Noun Project

This book caught me at the perfect moment as I’ve just taken on a new role as Scholarly Technologies Librarian for Indiana University, where one of my main job duties will be technology training for staff. I’m in the brainstorming stages now, but I think I’ve already zeroed in on the real challenge. Learning styles and technical abilities aside, one of the biggest obstacles to teaching technology is our attitude toward the technology itself. In terms of programming in particular, Rushkoff writes, “We are intimidated by the whole notion of programming, seeing it as a chore for mathematically inclined menials rather than a language through which we can re-create the world on our own terms.”

I’ve witnessed this first hand and been guilty of it myself. For starters, no matter how many times I use Codecademy or Treehouse, learning a programming language is an incredibly daunting task. It’s a whole new world that can be slightly terrifying. And shouldn’t these things be left to the elite group of nerds who already know how to program anyway? But this is a dangerous and defeatist way of thinking, for as Rushkoff points out, “The irony here is that computers are frightfully easy to learn. Programming is immensely powerful but it is really no big deal to learn.” The issue isn’t that programming is impossible; intimidation and reluctance are the real hurdles.

Computer Programmer by Thinkful from the Noun Project

Teaching technology in the library becomes less about the tool itself than about our attitude and willingness to learn something that isn’t spelled out in our job descriptions. So how do we overcome this mindset? I can’t say for certain, but I suspect there has to be an element of excitement, an understanding that interacting with technology on a deeper level empowers us. Instead of starting with how things work, we can’t move on until we’ve answered the question, “why bother?”

Reading Program or Be Programmed has motivated me and reminded me that books and ideas can be great motivators for librarians. Seems pretty obvious, but somehow we overlook this. I look forward to incorporating big picture ideas like those presented by Rushkoff into training. Until we’re excited or curious, we’re not ready to learn. So before we start teaching, let’s start there.

Hydra Project: Hydra Connect 2015 – important updates

planet code4lib - Wed, 2015-08-12 09:23

This post contains a number of important reminders and updates from the Hydra Connect 2015 organizers.  Please read it carefully!

Booking

At the time of writing, accounting for some tickets set aside for speakers who have still to register, there are only about 40 tickets left for the event.  Our limit of 200 tickets is not extensible, so if you intend to come and have not yet registered, you need to do so very soon – it looks as though we shall sell out.  This year we require registration for any workshops that you hope to attend and so, once registered for the conference, you need to do that too.  All the details are at https://wiki.duraspace.org/display/hydra/Hydra+Connect+2015 .  The page now has full details and timetables for all the ‘formal’ conference sessions.

Poster Show and Tell

We hope that all institutions with staff coming to Hydra Connect will contribute to the popular Poster Show and Tell session on the Tuesday afternoon.  There is a list of institutions at https://wiki.duraspace.org/display/hydra/Poster+sign-up+and+details+of+local+printing+arrangements – please sign up if you haven’t already done so.  The page also has details of local printing arrangements, should you wish to take advantage of them.  We have managed to secure a much cheaper deal than we first anticipated ($26 or $38 depending on size and including delivery to the conference venue).

Lightning talks

The Program Committee has left space on the Wednesday morning for up to 16 lightning talks of no more than five minutes each.  If you’d like to take one of these slots to share some ideas/concerns/rants/raves or whatever, please sign up at https://wiki.duraspace.org/display/hydra/Lightning+talk+sign-up+page .

Unconference sessions

Wednesday afternoon and Thursday are given over to ‘unconference sessions’.  These could include presentations (on technical topics, planning and management, professional development, work life balance, or anything else), panel discussions, hackfests, or meetups (it would be a good time for Interest and Working Groups to meet face-to-face).  This year we have a piece of software called Sessionizer which allows people to propose sessions and/or express interest in attending other people’s sessions and which will ultimately produce a timetable to ensure that the most people can attend the greatest possible number of sessions that they have expressed interest in.  Please feel free to register and to propose sessions at http://connect2015.curationexperts.com/home .  Please also keep an eye on the page and register for things that you find interesting.  The system will remain open until the Tuesday of the conference at which time we need to do the timetabling.

Social events

Apologies that there was a little bit of confusion on the EventBrite site until recently:  the optional conference dinner is on Tuesday 22nd September. (EventBrite had Wednesday in at least one place.)  We shall soon produce some sign-up lists for attendees to create informal dinner groups on the Monday and Wednesday evenings.  These are a good way to find some amenable company for an evening – especially if you are new to the Hydra Community and don’t yet know many others.

We hope to see you in less than six weeks!

Christina Harlow: Openrefine Reconciliation Workshop C4lmdc

planet code4lib - Wed, 2015-08-12 00:00

Here are my slides, handouts, and other information from an OpenRefine Reconciliation Workshop given at the Code4Lib Maryland, DC, and Commonwealth of Virginia 2015 meeting. I want to make them available for others who may be interested in the topic, or those who attended (since this is a lot of information to cover in one workshop). It was built off of experimentation for particular use cases from a data munger’s viewpoint, not a developer’s viewpoint, so any corrections, updates, or additions are very much welcome.

If you have questions, please let me know - @cm_harlow or cmharlow@gmail.com.

Folks should feel free to repurpose this work for other such OpenRefine Reconciliation Service workshops/events/whatever.

Links to the slides (available as HTML file or PDF doc), breakout session handouts/guides, and some sample data: [http://www.github.com/cmh2166/c4lMDCpres]

Draft of My Workshop Talking

I’ve included my rough speaker’s notes below, following each slide.

About this workshop

I’m Christina Harlow, @cm_harlow on twitter and a data munger in Tennessee.

This workshop came together rather late in the day, so there were no installation requirements or preliminary guidance from you on where the focus should be. As such, I want to get a feeling for what you want to hear. Raise your hand for the following questions:

  1. How many people are here with laptops with OpenRefine running?
  2. How many people are here with laptops on which they want to install OpenRefine?
  3. How many people are here just to learn by watching, discussion?

Different focus:

  1. How many people here want to learn how to use OpenRefine generally?
  2. How many people here want to learn specifically how to use reconciliation services?
  3. How many people here want to learn how to build reconciliation services?

Depending on this, I may/may not cover more of the general OpenRefine functions or more on how OpenRefine handles reconciliation objects.

Finally, what is presented comes from the perspective of a data munger, namely myself, who was building out OpenRefine as part of a metadata remediation workflow involving non-technical metadata workers. I’m not a developer. I’m not a programmer. I’m guessing at how this works in a lot of places, and why it works that way, based off of breaking it a lot. Basically…

Left-sharking it

I’m left-sharking it. Please shout out if you understand something that I don’t. I should also mention that this work builds off of some really awesome and talented library tech folks, including…

LibTech Developers to Thank

these 4 people, a few of whom are here today. Thanks go to them first and foremost for experimenting, building, sharing and documenting their work. I built off of what they did. Their Twitter handles are there, so go ahead and tweet at them to tell them how awesome they are.

Slides, Examples, + Install

Everything from today’s presentation - slides, sample data, example workflows - should be in this GitHub repository - http://github.com/cmh2166/c4lMDCpres. Go ahead and clone/download it on your computer for following along with some of the examples and using the sample data if needed.

If you don’t have OpenRefine running but would like to play with it either while I’m going through examples or during the more interactive sessions at the end, go ahead to this site - http://openrefine.org/download.html - and download the installer for your OS. Note: OpenRefine requires certain versions of java packages (detailed on openrefine.org/downloads page). Downloading these specific Java packages if you don’t have them or have to upgrade will be what takes the longest. If you download OpenRefine now and run into an install error, check your java version then prepare for the long wait for the update to download. Also interrupt me to ask for help.

Additionally, we will briefly cover the DERI RDF Extension options for reconciliation. If you are interested in working with that, you’ll need to go to the DERI RDF Extension site to download it, then put that code into the extensions subdirectory in the OpenRefine files (wherever you are keeping the OpenRefine application files on your computer, go there and find the extensions directory. Depending on version/OS, it may be in a lib subdirectory).

Nota bene: the DERI RDF Extension, and LODRefine (or Linked Open Data Refine), which is a fork of OpenRefine with this and other helpful extensions baked in, are very useful but no longer actively supported. Just be aware of this, and that some issues you run into may not have anyone available to answer questions or update the codebase.

Agenda

With modifications from what I asked you at the beginning, this is the rough agenda. We will start with a very brief introduction, more an overview, to OpenRefine, then review 3 different OpenRefine reconciliation options, with the bulk of the time focused on working with and building or modifying the Standard Reconciliation Service API. I’m reviewing 3 options because they all have valid use cases, as well as considerations or issues to keep in mind. They also all require different levels of expertise and can help you get a feel for the data sources you’re working with.

Then I’m proposing we have breakout sessions, where folks can work through what was shown. If people want to learn more about other OpenRefine functions, we can have a breakout that is just about working with library metadata in OpenRefine, and I’m happy to guide that. In the GitHub repository, I’ve included files that have some procedures to walk through for both working with reconciliation services as well as a marked up python file explaining how an example Standard Reconciliation Service API was built.

Quick Intro to OpenRefine

OpenRefine is a power tool for cleaning up data. It has gained a lot of popularity as of late for library data work, as it offers a nice GUI for doing data normalization, enhancement, review and cleanup. It has gone through a number of iterations in the past few years, including management by Google - I mention this because you will still see GoogleRefine instances running and still working, as well as some other Google naming holdovers. Since 2012, it has been an open source and community supported project, with a website at openrefine.org and a pretty active github repository at github.com/OpenRefine/Openrefine. Because it is community-sourced now, the documentation is somewhat haphazard; parts are very good, other parts, lacking or out-of-date. The documentation on the standard reconciliation service API is out-of-date, and although somewhat helpful, read it with a grain of salt.

More about the technology: OpenRefine is built with Java and Jetty, a servlet container and HTTP server that handles requests between the backend of the program and the JavaScript/jQuery-based UI that runs in a web browser. While it does act like a web service, it is not on the Internet; it runs locally, and the projects and data worked on within OpenRefine are stored locally. This structure is how OpenRefine can handle the reconciliation options we’ll be discussing below, all reliant on HTTP requests to either talk to the external data sources or to an intermediary API we create.

The GUI in the web browser is predominately javascript and jquery, while the backend (like I mentioned above) is java. There is no database; all the data you are working with is stored in memory. This means that with larger or really complex datasets, depending on your computer, you will run into performance issues. I generally start to see work-stopping performance issues working with ~50,000 or more ‘standard complexity’ MODS records. Your mileage may vary. There is a way to allocate more memory for the work at startup - check the OpenRefine GitHub wiki docs for doing this in your particular OS.

OpenRefine Functionality Checklist

So most of the people at the workshop declared that they had worked with OpenRefine before and were primarily interested in learning about building the standard reconciliation service API. As such, I’m going to breeze through this part.

OpenRefine has a lot of utility for data cleanup and munging, and these are the core functionalities you’ll hear about and/or use.

For the Import/Export options, OpenRefine out of the box supports working with a number of data formats - CSV, Excel, JSON, XML, RDF/XML, and others - for import. For export, there is support for exporting a full OpenRefine project, CSV, or possibly JSON/XML with the templating option (this requires some wrangling, and the data you’re working with cannot always be faithfully represented upon export to that format). Some extensions, such as the DERI RDF extension, do allow export to RDF serializations - just RDF/XML and RDF N-triples at the moment - with you setting up an RDF skeleton for how to map the records to nodes.

Yet, keep in mind that all of this requires that you transform the original data to a tabular model for working with it as an OpenRefine project, then eventually get that tabular data back to, or transformed into, the data model you want. This is easy if working with CSV, Excel, or other files already in that tabular format. Working with JSON or XML, however, you’ll need to do some data massaging to make OpenRefine work efficiently for you. For heavily-nested JSON or XML, this can be a problem, and OpenRefine may not always be the best option. There are some handy tools, packages, script libraries, and other utilities for moving between data models, and I’m happy to talk about these with you afterwards. For now, the sample datasets in the GitHub repo are already flattened and ready for easy import into OpenRefine.

Cleans - OpenRefine provides a good UI for cleaning, normalizing and updating your data, either in batch or manually, though batch operations are better supported.

Facets - these are very helpful for finding data value outliers, normalizing values, and just getting a handle on the state of your data.

Clusters - when faceting, you can then access a number of grouping algorithms for clustering the values. This can help you normalize data, again, and see what values probably should have the same label/datapoint.

GREL - Google Refine Expression Language. This is a sort of JavaScript-y, application-specific language for performing normalization and data munging work in OpenRefine. Some GREL functions can be very powerful for data cleanup, and it is recommended you check out the OpenRefine GitHub repository, which covers GREL very well.

Extensions - these are written by folks who want to add functionality to OpenRefine, such as the DERI researchers and the DERI RDF Extension. The OpenRefine documentation has a partial list of what is available, though for most you will need to do further searching.

If folks want to learn more about just working with library data in OpenRefine, that can be a breakout, or you can ask me directly at some other point.

OpenRefine Reconciliation

So what do I mean when I say ‘reconciliation’: I want to take values in my dataset in OpenRefine, compare them with data values in an external dataset, and if they are a match (decided through a number of ways and algorithms, matching functions, etc.), I then can change my data value to be the same as the external data value, or link the two (perhaps by just pulling in a URI from the external dataset), or possibly just pull in extra information about the external data value into my project.

OpenRefine Reconciliation Options

There are at least 3 ways to perform reconciliation work in OpenRefine. You can…

  1. Add a column by fetching URL… This is an option within the UI of an OpenRefine project that generates an HTTP GET request to an external data API and then posts the entire data response from that external API in a new column in your project. This method does take a really long time to execute, and it does require that you parse the data response for what you want to pull out in the OpenRefine UI. It does have some use cases, however.
  2. There is the Standard Reconciliation API… which is a RESTful API built to negotiate between OpenRefine and external data. While there are examples and templates for building these APIs, it does require tinkering knowledge of API construction and the related programming language you’ll be working with. These can be hosted for easier use in OpenRefine, or run locally for faster work.
  3. DERI RDF Extension… while this is no longer actively supported, it builds off of the Standard Recon Service API to work with RDF documents and SPARQL Endpoints in a similar way to the standard reconciliation API. For this to work, though, the RDF document you are reconciling is held in memory, meaning that the size of the RDF data will be limited to what OpenRefine can actually handle. Additionally, the SPARQL reconciliation is very much dependent currently on the SPARQL server details - for example, the Getty SPARQL endpoint is currently not able to work with this extension.

Add a column by fetching URL…

This option accesses an external data set by building an external data API query for each cell value in a chosen column in your OpenRefine project. OpenRefine then issues an HTTP GET request to that URL/API query, and stores the full result in a new column beside the seed column. This mostly works with RESTful APIs, as you cannot change or add to the HTTP request information beyond generating an API query as a URL. It also takes a long time to perform, as there is an individual API call made for each cell with a value in the seed column.

To use this method, find the column that contains the data you wish to query the external data API with, go to Top arrow > Edit column > Add column by fetching URLs. In the box that appears, you want to enter the appropriate GREL function(s) to create the API query URL with the cell values; for example, an expression along the lines of "http://api.example.com/search?q=" + escape(value, "url") (with a made-up endpoint) builds one query URL per cell. Once you’ve got that constructed, click on ‘Add column’, and wait as the calls are issued and the responses stored. This can take a while. Once it is complete, you’ll have a new column with the response data, and you can use GREL on that column to parse the results for what you want to find.

This method is useful if:

  1. You have data with very specific links or references to an external data API (such as a column of identifiers), and you want to pull in additional information using that.
  2. Don’t have the time or comfort (yet) for constructing your own standard reconciliation API.
  3. Want to get a better feeling for how a data API works with your dataset as queries.

To work with this method further in the break-outs, you can follow the procedures documented in the GitHub repo file ‘addcolumnexamples.md’, or you can review the Mountain West Digital Library (a DPLA service hub) workflows that use this method with the Geonames API. I’ll warn you about the sample workflows included in the GitHub repo: they are based off legacy documentation that I wrote up for my last job when considering OpenRefine for reconciliation work. They are more of a proof-of-concept, and the use cases modeled in there can be better handled elsewhere now. But they do walk you through this process.

Standard Recon Service API

The Standard Reconciliation Service API in OpenRefine takes the data you wish to reconcile from the OpenRefine UI, uses that to query an external data set (by either connecting to that external dataset’s API or constructing another way to connect in your recon service API), performs auto-matching, ranking, and other work according to your specifications, then generates an OpenRefine reconciliation object, with required metadata, to return to the OpenRefine UI. This reconciliation object is used in the UI to create reconciliation options for each value, as well as offer to the OpenRefine user the ability to pull out the reconciliation IDs, names/labels, or other match factors (matched or not, ranking values).

Another way to explain it is that the OpenRefine standard reconciliation API is a HTTP-based RESTful JSON-formatted API that connects the OpenRefine project to external datasets. It negotiates HTTP POST and GET requests between those. This API can be constructed in a number of languages and frameworks, though I primarily see use of python and the flask ‘micro’framework in the examples and templates I work with.
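
To make these moving parts concrete before going over each one, here is a minimal sketch of the kind of python/flask service those templates produce. This is not the FAST service or any of the linked templates: the endpoint path, the metadata values, and the stubbed search_external_source function are placeholders for whatever external source you end up querying.

import json
from flask import Flask, request, jsonify

app = Flask(__name__)

# Bare-minimum service metadata; OpenRefine asks for this when you register the service.
METADATA = {
    "name": "Example Reconciliation Service",
    "identifierSpace": "http://example.org/ids",      # placeholder value
    "schemaSpace": "http://example.org/schema",       # placeholder value
    "defaultTypes": [{"id": "/example/thing", "name": "Thing"}],
}

def search_external_source(text, entity_type, limit):
    # Stub: a real service would query an external data API here and parse its response.
    # This stand-in just echoes the query back as one fake candidate.
    return [{"id": "placeholder:1", "name": text, "score": 1.0}]

@app.route("/reconcile", methods=["GET", "POST"])
def reconcile():
    queries = request.values.get("queries")
    if not queries:
        # No queries parameter: OpenRefine is asking for the service metadata.
        return jsonify(METADATA)
    results = {}
    # OpenRefine sends batched queries keyed "q0", "q1", ... as a JSON string.
    for key, q in json.loads(queries).items():
        candidates = search_external_source(q.get("query", ""), q.get("type"), q.get("limit", 3))
        results[key] = {"result": [
            {
                "id": c["id"],
                "name": c["name"],
                "type": [q.get("type") or "/example/thing"],
                "score": c["score"],
                "match": c["score"] > 0.9,   # auto-match only high-confidence candidates
            }
            for c in candidates
        ]}
    return jsonify(results)

if __name__ == "__main__":
    app.run(debug=True)

Once something like this is running locally, you would add its URL (for example http://localhost:5000/reconcile) as a standard reconciliation service in the OpenRefine UI; treat the shapes above as a sketch and check them against the examples linked later.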

This reconciliation service API work is originally built off of the Freebase extension in OpenRefine - and this Freebase extension no longer works. Because of this history, however, there are a lot of Freebase-specific decisions in how we construct this API to work with OpenRefine, which we’ll discuss as we go over the parts.

Standard Reconciliation Service API Parts

When building an OpenRefine standard reconciliation service API, you’ll need to have:

  • a reconciliation service endpoint – the URL that can handle HTTP GET requests from OpenRefine asking for information like the recon service metadata, can send HTTP GET requests to an external data source to get matches, can handle HTTP POST requests from that external data source or from OpenRefine with the original values to be reconciled, and can send HTTP POST responses back to OpenRefine with the reconciliation objects.
  • standard reconciliation service metadata: this defines how a reconciliation service API works in OpenRefine, as well as adding additional UI functionalities like preview boxes and further searching of an external data source in the OpenRefine UI itself.
  • entity types. This is a holdover from Freebase, which required that each reconciliation service work only within a chosen namespace (many of the recon service APIs you’ll see don’t have a relevant namespace). You can, however, use the entity types for other purposes, like letting the same reconciliation service choose between external data API search indices, or choosing which rdf:property to match on in the DERI RDF extension reconciliation work.
  • query/response handling: this is where you can construct the external data query, as well as decide what parts of the external data response should become then the id and name for the reconciliation object returned to OpenRefine. How this is performed is heavily dependent on the language you decide to construct your recon service API in, as well as the external data service you’re querying.
  • Other bells and whistles that we will discuss in looking at an example.

Recon Service API Metadata

On the slide is a quote taken from the OpenRefine GitHub wiki documentation on the standard reconciliation service API. This lets us know that for each standard recon API we build, we are required to put in some basic service metadata. When setting a recon service up in the OpenRefine UI, OpenRefine will immediately send a request to the service API’s endpoint and expect back a JSON object with ‘name’, ‘identifierSpace’, and ‘schemaSpace’. Again, these are largely based off of how this worked for Freebase, and often just putting something that serves your own data use case is the best option.

There are some other service metadata options that can be used, including metadata defining a preview window for a reconciliation object in OpenRefine, as well as building a way to query in OpenRefine that external data source. These are seen in the following example of a standard recon service API metadata. Note this example shows all the possible metadata options; the FAST reconciliation service, which we will check out in a minute, has just the minimum to get it up and running.

API Metadata Example Part 1

In this section you can see the required service metadata fields: the name, which will appear in the OpenRefine UI, the identifier space, and the schema space. The view field is what builds the URL/URI for looking at the reconciliation match in the external data source’s system. The preview array defines a pop-out box in the OpenRefine UI for seeing the reconciliation match in the external data source’s interface.

API Metadata Example Part 2

In this section (separated just for slide space reasons), you can see the suggest array, which constructs a ‘search further’ option within the OpenRefine UI by querying the external data source. Note that this search further option does require a type-ahead style search, as well as an external data source API to query (or that you construct one in your API).

Finally, the defaultTypes are for the entity types options discussed above, and will be defined in the recon service API used as an example below.
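
Since the slides themselves are not reproduced in these notes, here is a rough reconstruction of what a fuller set of service metadata can look like, written as a Python dict. Every URL and value below is a placeholder, and the exact keys should be double-checked against the OpenRefine wiki and the example services linked later.

# Illustrative service metadata only; all URLs and values are placeholders.
SERVICE_METADATA = {
    "name": "Example Reconciliation Service",
    "identifierSpace": "http://example.org/ids",
    "schemaSpace": "http://example.org/schema",
    # view builds the link out to the matched record in the external system
    "view": {"url": "http://example.org/record/{{id}}"},
    # preview drives the pop-out preview box in the OpenRefine UI
    "preview": {"url": "http://example.org/record/{{id}}/preview", "width": 430, "height": 300},
    # suggest powers the type-ahead 'search further' box; it assumes the external
    # source (or your own API) can answer suggest-style queries
    "suggest": {"entity": {"service_url": "http://example.org", "service_path": "/suggest"}},
    # defaultTypes is the Freebase-holdover entity type list discussed below
    "defaultTypes": [{"id": "/example/thing", "name": "Thing"}],
}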

Query JSON Example

When OpenRefine is talking to the standard reconciliation service you build, here is a simple example of the JSON query object sent to the recon service API. This just has the query, pulled from the cell value, a limit for how many matches can be returned (this can be defined further in the service API), the entity type to query (here modeled to be the different search indices possible for the external data query), and a type_strict field; I’m honestly not sure what it does, as the external data API does the matching and ranking for us.
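
Because the slide is not reproduced here, the following is an illustrative reconstruction of that query object, written as a Python literal with invented values; OpenRefine keys each query as q0, q1, and so on.

# Invented values, shown only to illustrate the shape of a single query.
sample_queries = {
    "q0": {
        "query": "Twain, Mark",                 # the cell value being reconciled
        "limit": 3,                              # maximum number of candidate matches to return
        "type": "/example/personal-name",        # entity type, here standing in for a search index
        "type_strict": "should",
    }
}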

Reconciled JSON Example

And here is an example of the JSON reconciliation object returned from the API to OpenRefine post-reconciliation queries. This object has a result array with the required metadata of id, name, type, and match. The id and name are parsed from the external data response in our recon service API, the type is determined according to the original JSON query object sent to the recon service API by OpenRefine, and the match field is determined by us in the reconciliation service API (we can decide to match high-confidence results).
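
Again as an illustrative reconstruction (the ids, names, and scores below are invented), the returned object looks roughly like this:

# Invented values, shown only to illustrate the shape of the reconciliation object.
sample_results = {
    "q0": {
        "result": [
            {
                "id": "placeholder:123",
                "name": "Twain, Mark, 1835-1910",
                "type": ["/example/personal-name"],
                "score": 0.98,
                "match": True,    # the service judged this hit safe to auto-match
            },
            {
                "id": "placeholder:456",
                "name": "Twain, Mark (Spirit)",
                "type": ["/example/personal-name"],
                "score": 0.41,
                "match": False,   # left for the user to review in the UI
            },
        ]
    }
}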

Note that OpenRefine sends multiple queries and can receive multiple responses to/from the recon service API; I believe this still works in batches of 10, which is why this method is far more efficient than other methods of reconciliation.

Entity Types

The entity types for reconciliation are usually determined by the external dataset you are working with, as this is a Freebase holdover. The example we will look at has used the entity type options to define queries against different indices in the FAST API. You can only select one entity type per reconciliation run on a column; however, you can rerun reconciliation on a column multiple times.

Let’s take some time now to run through an example of using a Reconciliation Service in the OpenRefine interface, to get a feeling for how this works for the end user.

[see this workshop’s GitHub repo docs for a walk through of using a Recon Service]

Standard Reconciliation Service API Templates

Alright, if just getting a handle on how OpenRefine handles reconciliation and talking with an external API is enough to take on, some great developers have helped with the API construction part. There are some standard OpenRefine reconciliation service API templates you can use: plug in information about a particular data API, then test. This lets you get a recon service API up and running with limited time and limited OpenRefine and programming knowledge.

Most of the templates/examples use python with flask, which is what we will be reviewing. There are examples of python with django, as well as PHP; if you’re interested, I can point those out to you.

Standard Reconciliation Service Examples

All of the following links are given in the GitHub repo’s reconServiceAPI.md file.

  • The FAST Reconciliation Service, using python and flask and not hosted (so you will have to run the API application locally before querying in OpenRefine). This queries data using the OCLC FAST API, with options for querying different indices. It does also use a text.py file for normalizing the OpenRefine queries before sending them off to the OCLC FAST API; this is not required for building an OpenRefine reconciliation service API, but can be very, very helpful in boosting matches.
    • If you look at the GitHub repository for this workshop, you’ll see ‘pythonflaskmarkedupexample.md’. This is the heart of this reconciliation service, but with my extended comments explaining what is happening. I’ll walk through this here and also with folks who need help in break outs.
  • LCNAF - VIAF Reconciliation Service - this has both hosted and non-hosted options for reconciliation with the LCNAF through the use of the OCLC VIAF API (note that means this can only match to a subset of the LCNAF, as VIAF just contains personal names). This I think uses django, but it is definitely constructed in python and worth exploring.
  • VIAF Reconciliation Service Example - this reconciliation service API is currently hosted, and you’d need to do some work to run it locally. It is built in PHP, and for the link in the examples file, it is for the whole site, a part of which is this hosted reconciliation service. Look for the viaf.php file in that example link to see how it is built.
  • Linked in the reconServiceAPI.md file are also a basic and an extended python/flask template for an OpenRefine standard recon service API.

The marked up FAST recon service API example is built off of the basic python/flask template linked to.
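
The text.py normalization step mentioned in the FAST example above is worth borrowing even if you build your own service. Here is a purely made-up sketch of that kind of cleanup (the real text.py does more, and does it differently), just to show why pre-processing the OpenRefine query before it hits the external API can boost matches:

# A made-up example of pre-query normalization; this is not the FAST service's text.py.
import re

def normalize_query(raw):
    text = raw.lower().strip()
    text = re.sub(r"[^\w\s-]", " ", text)   # strip anything that is not a word character, space, or hyphen
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

print(normalize_query("Twain, Mark, 1835-1910."))  # -> "twain mark 1835-1910"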

Now to walk through pythonflaskmarkedupexample.md and take questions.

(just see the file for the comments)

DERI RDF Extension Recon

This again can be used to reconcile against SPARQL Endpoints and RDF data. While it is no longer actively supported, and it has the issues I outlined previously, it can still be very helpful. For one example, I’ve generated a very simple RDF document for the subset of LC Genre Terms we use at my job for our form terms. I can then reconcile our form metadata against that RDF doc and normalize, pull in URIs, and link.

I’m not going to discuss this extensively, but if you want to work on this in a breakout, I’ve included a file in the workshop GitHub repository that walks through using the extension to reconcile against a RDF document downloaded from the Library of Congress Linked Data Service. I would recommend that you take a look at the refine.deri.ie documentation, which is good but brief and not library data-specific.

Breakouts

So now, you guys can work on whatever was discussed or something completely different with my help. There is some starter documentation in the GitHub repo, but a lot of how I’ve built these recon services and service workflows in the past has been trial and error. We’ll pull together towards the end to take questions.

Links + Contact

Some helpful links for this work, as well as my Twitter handle. Please hit me up with questions or improvements to this workshop.

Harvard Library Innovation Lab: Link roundup August 11, 2015

planet code4lib - Tue, 2015-08-11 19:57

A set of links for a rainy day

Herman Miller: The Picnic Posters – Design Milk

Steve’s Picnic Posters

New Smartwatch Will Turn Texts Into Braille | Mental Floss

A smartwatch that turns texts into Braille

Four Oh Four! | FT Labs

404 pages with style (h/t @ethanz)

Very Old Tweets (@VeryOldTweets) | Twitter

A present-day twitterbot that’s retweeting the first 7500 tweets

The Internet Archive Wants To Digitize 40,000 VHS And Betamax Tapes | Fast Company | Business + Innovation

40,000 tape collection of recorded television news is being digitized

District Dispatch: Will Librarians Win in Overtime . . . Rulemaking?

planet code4lib - Tue, 2015-08-11 18:14

Photo credit: Bill Brooks

In an effort to potentially raise the wages of more than 10 million Americans, the U.S. Department of Labor (DOL) has proposed changes to current rules under the Fair Labor Standards Act that govern when and to whom employers must pay overtime for work weeks longer than 40 hours. Under the current rule, which has been revised just twice in the past 40 years, professional workers of many kinds with salaries (not merely total income) of more than $23,660 annually are “exempt” from the overtime pay requirements.

Under the new rule proposed on July 6, 2015 now open for public comment, most professional workers would still fail to qualify for overtime based on the nature of their duties, but – were the rule to go into effect as drafted – any salaried librarian earning less than about $50,440 in 2016 would become newly eligible for overtime. Though not scientifically collected, ALA-APA Salary data anecdotally suggest that many salaried professional librarians could become eligible for overtime under the new rule . . . if and when they worked more than 40 hours per week, that is.

The proposed rule change thus could bring good financial news to some library professionals (and other non-salaried library workers). However, it also theoretically could pose special difficulties for smaller libraries and library systems that employ just one or a few professionals who typically log more than 40 hours per week to keep their facilities accessible to the public. Faced with a choice of required extra pay or fewer hours, an unknown number of systems might need to opt for less service in order to stay within small, inflexible budgets.

These and other complications the new rule could create for non-profit organizations, coupled with strong opposition from the U.S. Chamber of Commerce and many other businesses and their trade associations, made the DOL’s proposed new overtime rules immediately controversial. In addition, key Congressional leaders like Senate “HELP” Committee Chair Lamar Alexander (R-TN) and House Education & the Workforce Chair John Kline (R-MN2) both have publicly opposed the proposed rule changes.  Meanwhile, more than 140 Democratic Members of the Senate and House recently cheered them in an unusual joint letter to the President. It is unclear whether some in Congress would attempt to block the new rules from taking effect if they ultimately are adopted after the Department of Labor concludes its rulemaking proceeding.

Initial comments on the proposed rule changes currently are due on September 4, 2015.  Hundreds of requests for an extension of that deadline, however, have been filed already. ALA headquarters and the Washington Office are closely tracking the matter. Stay tuned….

Additional Resources

Department of Labor’s FAQ and separate Fact Sheet on the proposed rule
National Law Review article (July 2015) about the rulemaking
DoL Fact Sheet on the Fair Labor Standards Act’s current overtime exemption for professionals

The post Will Librarians Win in Overtime . . . Rulemaking? appeared first on District Dispatch.

FOSS4Lib Recent Releases: TemaTres Vocabulary Server - 2.0

planet code4lib - Tue, 2015-08-11 15:45

Last updated August 11, 2015. Created by Peter Murray on August 11, 2015.

Package: TemaTres Vocabulary Server
Release Date: Monday, August 10, 2015

David Rosenthal: Patents considered harmful

planet code4lib - Tue, 2015-08-11 15:00
Although at last count I'm a named inventor on at least a couple of dozen US patents, I've long believed that the operation of the patent system, like the copyright system, is profoundly counter-productive. Since "reform" of these systems is inevitably hijacked by intellectual property interests, I believe that at least the patent system, if not both, should be completely abolished. The idea that an infinite supply of low-cost, government enforced monopolies is in the public interest is absurd on its face. Below the fold, some support for my position.

The Economist is out with a trenchant leader, and a fascinating article on patents, and they agree with me. What is more, they point out that they made this argument first on July 26th, 1851, 164 years ago:
the granting of patents “excites fraud, stimulates men to run after schemes that may enable them to levy a tax on the public, begets disputes and quarrels betwixt inventors, provokes endless lawsuits [and] bestows rewards on the wrong persons.” In perhaps our first reference to what are now called “patent trolls”, we fretted that “Comprehensive patents are taken out by some parties, for the purpose of stopping inventions, or appropriating the fruits of the inventions of others.”

Every one of these criticisms is as relevant today. Alas, even after pointing out the failure of previous reforms, The Economist's current leader ends up arguing for yet another round of "reform".

Even more than in the past those who extract rents via patents are able to divert a minuscule fraction of their ill-gotten gains to rewarding politicians who prevent or subvert reform. Although The Economist's proposals, including a "use-it-or-lose-it" rule, stronger requirements for non-obviousness, and shorter terms, are all worthy, they would all encounter resistance almost as fierce as abolition:
Six bills to reform patents in some way ... have been proposed to the current American Congress. None seeks abolition: any lawmaker brave enough to propose doing away with them altogether, or raising similar questions about the much longer monopolies given to copyright holders, would face an onslaught from the intellectual-property lobby. The Economist's article draws heavily on the work of Michele Boldrin and David Levine:
Reviewing 23 20th-century studies [they] found “weak or no evidence that strengthening patent regimes increases innovation”—all it does is lead to more patents being filed, which is not the same thing. Several of these studies found that “reforms” aimed at strengthening patent regimes, such as one undertaken in Japan in 1988, for the most part boosted neither innovation nor its supposed cause, R&D spending.

The exception was interesting:
A study of Taiwan’s 1986 reforms found that they did lead to more R&D spending in the country and more American patents being granted to Taiwanese people and enterprises. This shows that countries whose patent protection is weaker than others’ can divert investment and R&D spending to their territory by strengthening it. But it does not demonstrate that the overall amount of spending or innovation worldwide has been increased.

It is clear that far too many patents are being filed and granted. First, companies need them for defense:
In much of the technology industry companies file large numbers of patents (see chart 2), but this is mostly to deter their rivals: if you sue me for infringing one of your thousands of patents, I’ll use one of my stash of patents to sue you back.

The number of patents in the stash, rather than the quality of those patents, is what matters for this purpose. And second, counting patents has become a (misleading) measure of innovation:
In some industries and countries they have become a measure of progress in their own right—a proxy for innovation, rather than a spur. Chinese researchers, under orders to be more inventive, have filed a flurry of patents in recent years. But almost all are being filed only with China’s patent office. If they had real commercial potential, surely they would also have been registered elsewhere, too. Companies pay their employees bonuses for filing patents, irrespective of their merits. In the same way that rewarding authors by counting papers leads to the Least Publishable Unit phenomenon, and bad incentives in science, rewarding inventors by counting patents leads to the Least Patentable Unit phenomenon. It is made worse by the prospect of litigation. Companies file multiple overlapping patents on the same invention not just to inflate their number of trading beans, but also to ensure that, even if some are not granted, their eventual litigators will have as many avenues of attack against alleged infringement as possible.

Companies forbid their engineers and scientists to read patents, in case they might be accused of "willful infringement" and be subject to triple damages. Thus the idea that patents provide:
the tools whereby others can innovate, because the publication of good ideas increases the speed of technological advance as one innovation builds upon another.

is completely obsolete. The people who might build new innovations on the patent disclosures aren’t allowed to know about them. The people providing the content as patent lawyers write new patents aren’t allowed to know about related patents already issued. Further:
The evidence that the current system encourages companies to invest in research in a way that leads to innovation, increased productivity and general prosperity is surprisingly weak. A growing amount of research in recent years, including a 2004 study by America’s National Academy of Sciences, suggests that, with a few exceptions such as medicines, society as a whole might even be better off with no patents than with the mess that is today’s system. The Economist noted that the original purpose of patents was that the state share in the rent they allowed to be extracted:
in the early 17th century King James I was raising £200,000 a year from granting patents. The only thing that's changed is that the state now hands patents out cheaply, rather than charging the market rate for their rent-extraction potential. How about limiting the number of patents issued each year and auctioning the slots?

DPLA: New DPLA Contract Opportunity: Metadata Ingestion Development Contractor

planet code4lib - Tue, 2015-08-11 14:45

The Digital Public Library of America (http://dp.la) invites interested and qualified individuals or firms to submit a proposal for development related to Heidrun and Krikri, DPLA’s metadata ingestion and aggregation systems.

Proposal Deadline: 5:00 PM EDT (GMT-04:00), August 31, 2015

Background

DPLA aggregates metadata for openly available digital materials from America’s libraries, archives and museums through its ingestion process. The ingestion process has three steps: 1) harvesting metadata from partner sources, 2) mapping harvested records to the DPLA Metadata Application Profile, an RDF model based on the Europeana Data Model, and 3) enriching the mapped metadata to clean and add value to the data (e.g. normalization of punctuation, geocoding, etc.).

DPLA technology staff has implemented these functions as part of Krikri, a Ruby on Rails engine which provides the core functionality for the DPLA ingestion process. Krikri includes abstract classes and implementations of harvester modules, a metadata mapping domain specific language, and a framework for building enrichments (Audumbla). DPLA deploys Krikri as part of Heidrun. More information about Heidrun can be found on its project page. Krikri uses Apache Marmotta as a backend triple store, PostgreSQL as a backend database, Redis and Resque for job queuing, and Apache Solr and Elasticsearch as search index platforms.

Krikri, Heidrun, Audumbla, and metadata mappings are released as free and open source software under the MIT License. All metadata aggregated by DPLA is released under a CC0 license.

Statement of Needs

The selected contractor will provide programming staff as needed for DPLA related to development of Krikri and Heidrun. These resources will be under the direction of Mark A. Matienzo, DPLA Director of Technology.

DPLA staff is geographically distributed, so there is no requirement for the contractor to be located in a particular place. Responses may provide options or alternatives so that DPLA gets the best value for the price. If the contractor’s staff is distributed, the response should include detail on how communications will be handled between the contractor and DPLA staff. We expect the contractor will provide a primary technical/operations contact and a business contact; these contacts may be the same person, but they must be identified in the response. In addition, we expect that the technical/operations contact and the business contact will be available for occasional meetings between 9:00 AM and 5:00 PM Eastern Time (GMT-04:00).

Core implementation needs include the following two tracks, with Track 1 being the primary deliverable. Work in Track 2, and other work identified by DPLA staff, is subject to available resources remaining in the contract.

  • Track 1 (highest priority; work to be completed by December 24, 2015):
    • 1a. Development of mappings for 20 DPLA hubs (providers) to be used by Heidrun using the Krikri metadata mapping DSL.
      • Harvested metadata includes, but is not limited to, the following schemas and formats: MARCXML, MODS, OAI Dublin Core, Qualified Dublin Core, JSON-LD.
      • Sample mappings can be found in our GitHub repository.
      • Mappings are understood to be specific to each DPLA hub.
      • As needed, revisions and/or refactoring of the metadata mapping DSL implementation may be necessary to support effective mapping.
    • 1b. Development of 5 harvesters for site-specific application programming interfaces, static file harvests, etc.
      • Krikri currently supports harvesting from OAI-PMH providers, CouchDB, and a sample generic API harvester. Heidrun includes an implementation of an existing site-specific API harvester.
    • 1c. As needed, development or modification of enrichment modules for Krikri.
  • Track 2 (additional development work to be completed as resources allow after Track 1):
    • Refactoring and development to allow Krikri applications to more effectively queue batches of jobs to improve concurrency and throughput.
    • Refactoring for support for Rails 4.2 and Blacklight 5.10+.
    • Expanding the Krikri “dashboard” staff-facing application, which currently supports the quality assurance process, to allow non-technical staff to start, schedule, and enqueue harvest, mapping, enrichment, and indexing activities.

All code developed as part of this contract is subject to code review by DPLA technology staff, using GitHub. In addition, implemented mappings will be subject to quality assurance processes led by Gretchen Gueguen, DPLA Data Services Coordinator.

Proposal guidelines

All proposals must adhere to the following submission guidelines and requirements.

  • Proposals are due no later than 5:00PM EDT, Monday, August 31, 2015.
  • Proposals should be sent via email to ingest-contract@dp.la, as a single PDF file attached to the message. Questions about the proposal can also be sent to this address.
  • Please format the subject line with the phrase “DPLA Metadata Ingestion Proposal – [Name of respondent]”.

All proposals should include the following:

  • Pricing, as an hourly rate in US Dollars, and as costs for each work item to be completed in Track 1
  • Proposed staffing plan, including qualifications of project team members (resumes/CVs, links or descriptions of previous projects such as open source contributions)
  • References, listing all clients/organizations with whom the proposer has done business like that required by this solicitation within the last 3 years
  • Qualifications and experience, including
    • General qualifications and development expertise
      • Information about development and project management skills and philosophy
      • Examples of successful projects, delivered on time and on budget
      • Preferred tools and methodologies used for issue tracking, project management, and communication
      • Preferences for change control tools and methodologies
    • Project specific strategies
      • History of developing software in the library, archives, or museum domain
      • Evidence of experience with Ruby on Rails, search platforms such as Solr and Elasticsearch, domain specific language implementations, and queuing systems
      • Information about experience with extract-transform-load workflows and/or metadata harvesting, mapping, and cleanup at scale, using automated processes
      • Information about experience with RDF metadata, triple stores, and implementations of W3C Linked Data Platform specification

Timeline

  • RFP issued: August 11, 2015
  • Work is to be performed no sooner than September 1, 2015.
  • Work for Track 1 must be completed by December 24, 2015.
  • Any additional work, such as Track 2 or other work mutually agreed upon by DPLA and contractor, is to be completed no later than March 31, 2016.

Contract guidelines

  • Proposals must be submitted by the due date.
  • Proposers are asked to guarantee their proposal prices for a period of at least 60 days from the date of the submission of the proposal.
  • Proposers must be fully responsible for the acts and omissions of their employees and agents.
  • DPLA reserves the right to include a mandatory meeting via teleconference to meet with submitters of the proposals individually before acceptance. Top scored proposals may be required to participate in an interview and/or site visit to support and clarify their proposal.
  • The contractor will enter into a contract with DPLA that is consistent with DPLA’s standard contracting policies and procedures.
  • DPLA reserves the right to negotiate with each contractor.
  • There is no allowance for project expenses, travel, or ancillary expenses that the contractor may incur.
  • Individuals or companies based outside the US are eligible to submit proposals, but will have to comply with US and host country labor and tax laws.

About DPLA

The Digital Public Library of America strives to contain the full breadth of human expression, from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science. Since launching in April 2013, it has aggregated 11 million items from 1,600 institutions. The DPLA is a registered 501(c)(3) non-profit.

FOSS4Lib Recent Releases: Koha - Security and maintenance releases - v 3.20.2, 3.18.9, 3.16.13

planet code4lib - Tue, 2015-08-11 12:09
Package: Koha
Release Date: Thursday, July 30, 2015

Last updated August 11, 2015. Created by David Nind on August 11, 2015.

Monthly security and maintenance releases for Koha.

See the release announcements for the details:

SearchHub: Basics of Storing Signals in Solr with Fusion for Data Engineers

planet code4lib - Tue, 2015-08-11 11:36

In April we featured a guest post Mixed Signals: Using Lucidworks Fusion’s Signals API, which is a great introduction to the Fusion Signals API. In this post I work through a real-world e-commerce dataset to show how quickly the Fusion platform lets you leverage signals derived from search query logs to rapidly and dramatically improve search results over a products catalog.

Signals, What’re They Good For?

In general, signals are useful any time information about outside activity, such as user behavior, can be used to improve the quality of search results. Signals are particularly useful in e-commerce applications, where they can be used to make recommendations as well as to improve search. Signal data comes from server logs and transaction databases which record items that users search for, view, click on, like, or purchase. For example, clickstream data which records a user’s search query together with the item which was ultimately clicked on is treated as one “click” signal and can be used to:

  • enrich the results set for that search query, i.e., improve the items returned for that query
  • enrich the information about the item clicked on, i.e., improve the queries for that item
  • uncover similarities between items, i.e., cluster items based on the other items clicked on for the same queries
  • make recommendations of the form:
    • “other customers who entered this query clicked on that”
    • “customers who bought this also bought that”
Signals Key Concepts
  • A signal is a piece of information, event, or action, e.g., user queries, clicks, and other recorded actions that can be related back to a document or documents which are stored in a Fusion collection, referred to as the “primary collection”.
    • A signal has a type, an id, and a timestamp. For example, signals from clickstream information are of type “click” and signals derived from query logs are of type “query”.
    • Signals are stored in an auxiliary collection and naming conventions link the two so that the name of the signals collection is the name of the primary collection plus the suffix “_signals”.
  • An aggregation is the result of processing a stream of signals into a set of summaries that can be used to improve the search experience. Aggregation is necessary because in the usual case there is a high volume of signals flowing into the system but each signal contains only a small amount of information in and of itself.
    • Aggregations are stored in an auxiliary collection and naming conventions link the two so that the name of the aggregations collection is the name of the primary collection plus the suffix “_signals_aggr”.
    • Query pipelines use aggregated signals to boost search results.
    • Fusion provides an extensive library of aggregation functions allowing for complex models of user behavior. In particular, date-time functions provide a temporal decay function so that over time, older signals are automatically downweighted.
  • Fusion’s job scheduler provides the mechanism for processing signals and aggregations collections in near real-time.
Some Assembly Required

In a canonical e-commerce application, your primary Fusion collection is the collection over your products, services, customers, and similar. Event information from transaction databases and server logs would be indexed into an auxiliary collection of raw signal data and subsequently processed into an aggregated signals collection. Information from the aggregated signals collection would be used to improve search over the primary collection and make product recommendations to users.

In the absence of a fully operational ecommerce website, the Fusion distribution includes an example of signals and a script that processes this signal data into an aggregated signals collection using the Fusion Signals REST-API. The script and data files are in the directory $FUSION/examples/signals (where $FUSION is the top-level directory of the Fusion distribution). This directory contains:

  • signals.json – a sample data set of 20,000 signal events. These are ‘click’ events.
  • signals.sh – a script that loads signals, runs one aggregation job, and gets recommendations from the aggregated signals.
  • aggregations_definition.json – examples of how to write custom aggregation functions. These examples demonstrate several different advanced features of aggregation scripting, all of which are outside of the scope of this introduction.

The example signals data comes from a synthetic dataset over Best Buy query logs from 2011. Each record contains the user search query, the categories searched, and the item ultimately clicked on. In the next sections I create the product catalog, the raw signals, and the aggregated signals collections.

Product Data: the primary collection ‘bb_catalog’

In order to put the use of signals in context, first I recreate a subset of the Best Buy product catalog. Lucidworks cannot distribute the Best Buy product catalog data that is referenced by the example signals data, but that data is available from the Best Buy Developer API, which is a great resource both for data and example apps. I have a copy of previously downloaded product data which has been processed into a single file containing a list of products. Each product is a separate JSON object with many attribute-value pairs. To create your own Best Buy product catalog dataset, you must register as a developer via the above URL. Then you can use the Best Buy Developer API query tool to select product records or you can download a set of JSON files over the complete product archives.

I create a data collection called “bb_catalog” using the Fusion 2.0 UI. By default, this creates collections for the signals and aggregated signals as well.

Although the collections panel only lists collection “bb_catalog”, collections “bb_catalog_signals” and “bb_catalog_signals_aggr” have been created as well. Note that when I’m viewing collection “bb_catalog”, the URL displayed in the browser is: “localhost:8764/panels/bb_catalog”:

By changing the collection name to “bb_catalog_signals” or “bb_catalog_signals_aggr”, I can view the (empty) contents of the auxiliary collections:

Next I index the Best Buy product catalog data into collection “bb_catalog”. If you choose to get the data in JSON format, you can ingest it into Fusion using the “JSON” indexing pipeline. See blog post Preliminary Data Analysis in Fusion 2 for more details on configuring and running datasources in Fusion 2.

After loading the product catalog dataset, I check to see that collection “bb_catalog” contains the products referenced by the signals data. The first entry in the example signals file “signals.json” is a search query with query text: “Televisiones Panasonic 50 pulgadas” and docId: “2125233”. I do a quick search to find a product with this id in collection “bb_catalog”, and the results are as expected:

Raw Signal Data: the auxiliary collection ‘bb_catalog_signals’

The raw signals data in the file “signals.json” are the synthetic Best Buy dataset. I’ve modified the timestamps on the search logs in order to make them seem like fresh log data. This is the first signal (timestamp updated):

{ "timestamp": "2015-06-01T23:44:52.533Z", "params": { "query": "Televisiones Panasonic 50 pulgadas", "docId": "2125233", "filterQueries": [ "cat00000", "abcat0100000", "abcat0101000", "abcat0101001" ] }, "type": "click" },

The top-level attributes of this object are:

  • type – As stated above, all signals must have a “type”, and as noted in the earlier post “Mixed Signals”, section “Sending Signals”, the value should be applied consistently to ensure accurate aggregation. In the example dataset, all signals are of type “click”.
  • timestamp – This data has timestamp information. If not present in the raw signal, it will be generated by the system.
  • id – These signals don’t have distinct ids; they will be generated automatically by the system.
  • params – This attribute contains a set of key-value pairs, using a set of pre-defined keys which are appropriate for search-query event information. In this dataset, the information captured includes the free-text search query entered by the user, the document id of the item clicked on, and the set of Best Buy site categories that the search was restricted to. These are codes for categories and sub-categories such as “Electronics” or “Televisions”.

In summary, this dataset is an unremarkable snapshot of user behaviors between the middle of August and the end of October, 2011 (updated to May through June 2015).

The example script “signals.sh” loads the raw signal via a POST request to the Fusion REST-API endpoint: /api/apollo/signals/<collectionName> where <collectionName> is the name of the primary collection itself. Thus, to load raw signal data into the Fusion collection “bb_catalog_signals”, I send a POST request to the endpoint: /api/apollo/signals/bb_catalog.

Like all indexing processes, an indexing pipeline is used to process the raw signal data into a set of Solr documents. The pipeline used here is the default signals indexing pipeline named “_signals_ingest”. This pipeline consists of three stages, the first of which is a Signal Formatter stage, followed by a Field Mapper stage, and finally a Solr Indexer stage.
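If you want to see exactly what those three stages do, the pipeline definition can be fetched from the REST API; this is a read-only check that assumes the standard index-pipelines endpoint pattern:

# Retrieve the JSON definition of the default signals indexing pipeline.
curl -u admin:password123 http://localhost:8764/api/apollo/index-pipelines/_signals_ingest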

(Note that in a production system, instead of doing a one-time upload of some server log data, raw signal data could be streamed into a signals collection on an ongoing basis by using a Logstash or JDBC connector together with a signals indexing pipeline. For details on using a Logstash connector, see the blog post on Fusion with Logstash.)

Here is the curl command I used, running Fusion locally in single server mode on the default port:

curl -u admin:password123 -X POST -H 'Content-type:application/json' http://localhost:8764/api/apollo/signals/bb_catalog?commit=true --data-binary @new_signals.json

This command succeeds silently. To check my work, I use the Fusion 2 UI to view the signals collection, by explicitly specifying the URL “localhost:8764/panels/bb_catalog_signals”. This shows that all 20K signals have been indexed:
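An equivalent command-line check is to ask Solr for the document count directly. The sketch below goes through Fusion's Solr proxy and relies on reading the numFound value out of the response:

# Count the indexed raw signals (numFound should be 20000).
curl -u admin:password123 \
  'http://localhost:8764/api/apollo/solr/bb_catalog_signals/select?q=*:*&rows=0&wt=json'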

Further exploration of the data can be done using Fusion dashboards. To configure a Fusion dashboard using Banana 3, I specify the URL “localhost:8764/banana”. (For details and instructions on Banana 3 dashboards, see this post on log analytics.) I configure a signals dashboard and view the results:

The top row of this dashboard shows that there are 20,000 clicks in the collection “bb_catalog_signals”, all recorded in the last 90 days. The middle row contains a bar chart showing when the clicks came in and a pie chart of the top 200 documents that were clicked on. The bottom row is a table over all of the signals; each signal records a single click.

The pie chart allows us to visualize a simple aggregation of clicks per document. The most popular document got 232 clicks, roughly 1% of the total clicks. The 200th most popular document got 12 clicks, and the vast majority of documents only got one click per document. In order to use information about documents clicked on, we need to make this information available in a form that Solr can use. In other words, we need to create a collection of aggregated signals.

Aggregated Signals Data: the auxiliary collection ‘bb_catalog_signals_aggr’

Aggregation is the “processing” part of signals processing. Fusion runs queries over the documents in the raw signals collection in order to synthesize new documents for the aggregated signals collection. Synthesis ranges from simple counts to sophisticated statistical functions. The nature of the signals collected determines the kinds of aggregations performed. For click signals from query logs, the processing is straightforward: an aggregated signal record contains a search query, a count of the number of raw signals that contained that search query, and information aggregated from all of those raw signals: timestamps, ids of documents clicked on, and search query settings – in this case, the product catalog categories over which the search was carried out.

To aggregate the raw signals in collection “bb_catalog_signals” from the Fusion 2 UI, I choose the “Aggregations” control listed in the “Index” section of the “bb_catalog_signals” home panel:

I create a new aggregation called “bb_aggregation” and define the following:

  • Signal Types = “click”
  • Time Range = “[* TO NOW]” (all signals)
  • Output Collection = “bb_catalog_signals_aggr”

The following screenshot shows the configured aggregation. The circled fields are the fields which I specified explicitly; all other fields were left at their default values.

Once configured, the aggregation is run via controls on the aggregations panel. This aggregation takes only a few seconds to run. When it has finished, the numbers of raw signals processed and aggregated signals created are displayed below the Start/Stop controls. This screenshot shows that the 20,000 raw signals have been synthesized into 15,651 aggregated signals.
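The same aggregation can also be started without the UI. This is a rough sketch only: I'm assuming the aggregator jobs endpoint takes the signals collection name and the aggregation id in that order, so double-check the path against your Fusion REST API reference before relying on it.

# Start the "bb_aggregation" job over the raw signals collection.
curl -u admin:password123 -X POST \
  'http://localhost:8764/api/apollo/aggregator/jobs/bb_catalog_signals/bb_aggregation'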

To check my work, I use the Fusion 2 UI to view the aggregated signals collection, by explicitly specifying the URL “localhost:8764/panels/bb_catalog_signals_aggr”. Aggregated click signals have a “count” field. To see the more popular search queries, I sort the results in the search panel on field “count”:
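From the command line, a sketch of the same check sorts the aggregated records by their count field. The field name “count” is taken from the UI description above; if your aggregated documents store it as a suffixed dynamic field, adjust the sort parameter accordingly.

# Show the ten most frequent aggregated queries.
curl -u admin:password123 \
  'http://localhost:8764/api/apollo/solr/bb_catalog_signals_aggr/select?q=*:*&sort=count+desc&rows=10&wt=json'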

The most popular searches over the Best Buy catalog are for major electronic consumer goods: TVs and computers, at least according to this particular dataset.

Fusion REST-API Recommendations Service

The final part of the example signals script “signals.sh” calls the Fusion REST-API’s Recommendation service endpoints “itemsForQuery”, “queriesForItem”, and “itemsForItems”. The first endpoint, “itemsForQuery”, returns the list of items that were clicked on for a given query phrase. In the “signals.sh” example, the query string is “laptop”. When I do a search on query string “laptop” over collection “bb_catalog”, using the default search pipeline, the results don’t actually include any laptops:

With properly specified fields, filters, and boosts, the results could probably be improved. With aggregated signals, we see improvements right away. I can get recommendations using the “itemsForQuery” endpoint via a curl command:

curl -u admin:password123 http://localhost:8764/api/apollo/recommend/bb_catalog/itemsForQuery?q=laptop

This returns the following list of ids: [ 2969477, 9755322, 3558127, 3590335, 9420361, 2925714, 1853531, 3179912, 2738738, 3047444 ], most of which are popular laptops:
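The other two endpoints can be exercised the same way. The sketches below assume that “queriesForItem” and “itemsForItems” take the clicked-on document id as a docId parameter, mirroring the signals data; check the Recommendations API reference for the exact parameter names.

# Which search queries led users to a given product?
curl -u admin:password123 \
  'http://localhost:8764/api/apollo/recommend/bb_catalog/queriesForItem?docId=2969477'
# Which other products were clicked on by users who clicked on this one?
curl -u admin:password123 \
  'http://localhost:8764/api/apollo/recommend/bb_catalog/itemsForItems?docId=2969477'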

When not to use signals

If the textual content of the documents in your collection provides enough information such that, for a given query, the documents returned are the most relevant documents available, then you don’t need Fusion signals. (If it ain’t broke, don’t fix it.) If the only information about your documents is the documents themselves, you can’t use signals. (Don’t use a hammer when you don’t have any nails.)

Conclusion

Fusion provides the tools to create, manage, and maintain signals and aggregations. It’s possible to build extremely sophisticated aggregation functions, and to use aggregated signals in many different ways. It’s also possible to use signals in a simple way, as I’ve done in this post, with quick and impressive results.

In future posts in this series, we will show you:

  • How to write query pipelines to harness this power for better search over your data, your way.
  • How to harness the power of Apache Spark for highly scalable, near-real-time signal processing.

The post Basics of Storing Signals in Solr with Fusion for Data Engineers appeared first on Lucidworks.

William Denton: Jesus, to his credit

planet code4lib - Tue, 2015-08-11 01:30

A quote from a sermon given by an Anglican minister a couple of weeks ago: “Jesus, to his credit, was a lot more honourable than some of us would have been.”
