
Feed aggregator

Code4Lib Journal: Practical Digital Forensics at Accession for Born-Digital Institutional Records

planet code4lib - Thu, 2016-01-28 16:07
Archivists have developed a consensus that forensic disk imaging is the easiest and most effective way to preserve the authenticity and integrity of born-digital materials. Yet, disk imaging also has the potential to conflict with the needs of institutional archives – particularly those governed by public records laws. An alternative possibility is to systematically employ digital forensics tools during accession to acquire a limited amount of contextual metadata from filesystems. This paper will discuss the development of a desktop application that enables records creators to transfer digital records while employing basic digital forensics tools in the records’ native computing environment to gather record-events from NTFS filesystems.

Code4Lib Journal: RSS Feed 2.0: The Crux of a Social Media Strategy

planet code4lib - Thu, 2016-01-28 16:07
This article explains how the University of Nebraska Kearney Calvin T. Ryan Library improved their social media strategy by using an RSS 2.0 feed to update and sync social media tools and create a slideshow on the library's home page. An example of how to code a well-formed RSS 2.0 feed with XML is given, in addition to the PHP, HTML, and jQuery used to automate the library home page slideshow.

Code4Lib Journal: Video Playback Modifications for a DSpace Repository

planet code4lib - Thu, 2016-01-28 16:07
This paper focuses on modifications to an institutional repository system using the open source DSpace software to support playback of digital videos embedded within item pages. The changes were made in response to the formation and quick startup of an event capture group within the library that was charged with creating and editing video recordings of library events and speakers. This paper specifically discusses the selection of video formats, changes to the visual theme of the repository to allow embedded playback and captioning support, and modifications and bug fixes to the file downloading subsystem to enable skip-ahead playback of videos via byte-range requests. This paper also describes workflows for transcoding videos in the required formats, creating captions, and depositing videos into the repository.

Galen Charlton: Wherein I complain about Pearson’s storage of passwords in plaintext and footnote my snark

planet code4lib - Thu, 2016-01-28 13:54

From a security alert[1] from Langara College:

Langara was recently notified of a cyber security risk with Pearson online learning which you may be using in your classes. Pearson does not encrypt user names or passwords for the services we use, which puts you at risk. Please note that they are an external vendor; therefore, this security flaw has no direct impact on Langara systems.

This has been a problem since at least 2011[2]; it is cold comfort that at least one Pearson service has a password recovery page that outright says that the user’s password will be emailed to them in clear text[3].

There have been numerous tweets, blog posts, and forum posts about this issue over the years. In at least one case[4], somebody complained to Pearson and ended up getting what reads like a canned email stating:

Pearson must strike a reasonable balance between support methods that are accessible to all users, and the risk of unauthorized access to information in our learning applications. Allowing customers to retrieve passwords via email was an industry standard for non-financial applications.

In response to the changing landscape, we are developing new user rights management protocols as part of a broader commitment to tighten security and safeguard customer accounts, information, and product access. Passwords will no longer be retrievable; customers will be able to reset passwords through secure processes.

This is a risible response for many reasons; I can only hope that they actually follow through with their plan to improve the situation in a timely fashion. Achieving the industry standard for password storage as of 1968 might be a good start[5].

In the meantime, I’m curious whether there are any libraries who are directly involved in the acquisition of Pearson services on behalf of their school or college. If so, might you have a word with your Pearson rep?

Adapted from an email I sent to the LITA Patron Privacy Interest Group’s mailing list. I encourage folks interested in library patron privacy to subscribe; you do not have to be a member of ALA to do so.

Footnotes

1. Pearson Cyber Security Risk
2. Report on Plain Text Offenders
3. Pearson account recovery page
4. Pearson On Password Security
5. Wilkes, M V. Time-sharing Computer Systems. New York: American Elsevier Pub. Co, 1968. Print. It was in this book that Roger Needham first proposed hashing passwords.

Hydra Project: And UCSD makes 30!

planet code4lib - Thu, 2016-01-28 09:57

Declan Fleming and his team at the University of California, San Diego, have long been Hydra Partners in everything but name; among many other contributions, in 2012 they hosted the first Hydra Power Steering meeting (at that time called our ‘Strategic Retreat’) and in 2014 they hosted the first Hydra Connect conference.  We are now delighted to be able to announce that they have officially become Hydra’s 30th Partner.

In UCSD’s letter of intent, Brian Schottlaender (the Audrey Geisel University Librarian) writes:

“Working with the Hydra community has already proven to be fruitful, helping us produce a new DAMS product release that manages simple and complex digital objects, and RDF metadata, reading and writing through Hydra. We believe that adopting Hydra into our stack will continue to save significant development time.

“We look forward to becoming Hydra Partners, and to adding our experience to the group and benefitting from that of the other members.”

Welcome UCSD!!!

Ed Summers: API Studies

planet code4lib - Thu, 2016-01-28 05:00

Because of my work in the iSchool and at MITH, and the influence of various people in both, I’m getting increasingly interested in looking at social media through the lens of software and platform studies.

When you think about it, it’s not hard to conceive of the application programming interface or API as a contract or blueprint for what a social media platform allows and does not allow. If you want to get information into or out of a platform and there’s no API endpoint for it, you can’t do it…at least not easily without scraping or other hijinks. So the API is a material expression of a platform’s politics, its governance and business models.

What’s really quite interesting is how these APIs are situated in time as documents. In some ways they are similar to the Terms of Service documents. But since I’m a software developer and not a lawyer, the API documentation feels more explicit or actionable. Either the API call works, or it doesn’t. If you want to test whether it works you try it out. The API docs clearly define the limits of what is possible with the social media platform. This is unlike the ToS, which one can be unintentionally oblivious of, or choose to blithely disregard or even transgress.

But as most people involved in information technology know, there can often be significant gaps between our technical documentation and how things actually work. Sometimes this is the result of drift: the code changes but the documentation does not. Sometimes I think it’s also because the behavior of a complex system isn’t understood well enough to be able to make it understandable in prose.

If this sort of thing interests you too you might be interested in a little service called API Changelog. They monitor the documentation of over 100 different APIs (Twitter, Facebook, Slack, GitHub, Weibo, etc) and email you little alerts when there have been changes. Because of some work I’m involved in I’m particularly interested in the Twitter API, so I’ve been watching the changes for a few months now.

For example consider this change that arrived in my inbox this morning.

There are lots of little changes highlighted here in a diff style format. One thing that popped out at me was the addition of this phrase:

The Twitter Search API searches against a sampling of recent Tweets published in the past 7 days.

Like most people who work with the Twitter API I’ve known about the 7 day limit for some time. What popped out at me (and what I emphasized in the quote) was the use of the word sampling … which indicated to me that not all tweets in that time period are being searched. Of course this raises questions about what sort of sample is being employed, and recalls research into the bias present in Twitter’s search and streaming APIs (González-Bailón, Wang, Rivero, Borge-Holthoefer, & Moreno, 2014), (Driscoll & Walker, 2014).

Now this might sound academic, but really it’s not, right? I’m actually kind of curious how API Changelog itself works. It might be a straight up diff, but there is some smarts to what they are doing since it seems sensitive to the textual content but not the styling of the documentation. It’s interesting how a service for developers that use the APIs and businesses that provide APIs can have a mirror purpose for those that are studying APIs.
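To make that idea concrete, here is a minimal sketch of how a documentation-diffing service could work, assuming you have saved two snapshots of a docs page yourself; the file names are hypothetical, and this is almost certainly not how API Changelog is actually implemented:

# Hypothetical sketch: diff the text of two saved copies of an API docs page,
# ignoring markup and styling, roughly the way a docs-monitoring service might.
import difflib
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lines = []
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append(text)

def page_text(path):
    parser = TextExtractor()
    with open(path) as f:
        parser.feed(f.read())
    return parser.lines

old = page_text("search-api-2015-12.html")   # snapshots you saved earlier (made-up names)
new = page_text("search-api-2016-01.html")
for line in difflib.unified_diff(old, new, fromfile="old", tofile="new", lineterm=""):
    print(line)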

References

Driscoll, K., & Walker, S. (2014). Working within a black box: Transparency in the collection and production of big twitter data. International Journal of Communication, 8, 1745–1764. Retrieved from http://ijoc.org/index.php/ijoc/article/view/2171

González-Bailón, S., Wang, N., Rivero, A., Borge-Holthoefer, J., & Moreno, Y. (2014). Assessing the bias in samples of large online networks. Social Networks, 38, 16–27. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2185134

DuraSpace News: All About the Portland Common Data Model (PCDM) in a "Quickbyte"

planet code4lib - Thu, 2016-01-28 00:00

Austin, Texas  Find out how the Portland Common Data Model (PCDM) gets complex systems with complex data to interoperate in this new DuraSpace Quickbyte —no more than you can chew in a three-minute broadcast. The Portland Common Data Model (PCDM) aims to accomplish data exchange with interoperable data models, which are useful in many contexts, including digital repositories.

DuraSpace News: OR2016 Proposal Deadline Extended

planet code4lib - Thu, 2016-01-28 00:00

From the Open Repositories 2016 Conference organizers

Dublin, Ireland  The final deadline for submitting proposals for the Eleventh International Conference on Open Repositories (@OR2016 and #or2016) has been extended until Monday, Feb. 8, 2016. The conference is scheduled to take place June 13-16 in Dublin and is being hosted by Trinity College Dublin, along with collaborators in the Royal Irish Academy and University College Dublin.

NYPL Labs: Nomadic Classification: Classmark History and New Browsing Tool

planet code4lib - Wed, 2016-01-27 19:58

In the past few months,  NYPL Labs has embarked upon a series of investigations into how legacy classification systems at the library can be used to generate new data and power additional forms of discovery. What follows is some background on the project and some of the institutional context that has prompted these examinations. One of the tools we’re introducing here is “BILLI:  Bibliographic Identifiers for Library Location Information;” read on for more background, and be sure to try the tool out for yourself.

Then there is a completely other type of distribution or hierarchy which must be called nomadic, a nomad nomos, without property, enclosure or measure. Here, there is no longer a division of that which is distributed but rather a division among those who distribute themselves in an open space — a space which is unlimited, or at least without precise limits… Even when it concerns the serious business of life, it is more like a space of play, or a rule of play… To fill a space, to be distributed within it, is very different from distributing the space. —Gilles Deleuze, Difference & Repetition

Classification, the basic process of categorization, is simple in theory but becomes complex in practice. Examples of classification can be seen all around us, from the practical use of organizing the food found in your local grocery store into aisles, to the very specialized taxonomy system that separates the hundreds of different species of the micro-animal Tardigrada. At their core these various systems of categorization are simply based on good faith judgments. Whoever organized your local grocery thought: “Cookies seem pretty similar to crackers, I will put them together in the same aisle.”

A similar, but more evidence-based, process developed the system that categorizes hundreds of thousands of biological species. Classification systems are usually logical but are inherently arbitrary. Uniform application of classification is what makes a system useful. Yet, uniformity is difficult to maintain over long periods of time. Institutional focus shifts, the meaning of words drifts, and even our culture itself changes when measuring time in decades. Faced with these challenges, in the age of barcodes, databases, and networks, the role of traditional classification systems is not diminished, but these systems could benefit from practically leveraging this new environment.

**Nerd Alert! If 19th century classification history is not your thing you might want to skip to A Linked Space.**

Problem Space

Libraries are founded on the principle of classification, the most common and well known form being the call number. This is the code that appears on the spine of a book keeping track of where it should be stored and usually the subject of its content. The most well known form of call number is, of course, the iconic Dewey Decimal System. But there are many other systems employed by libraries due to the nature of the resources being organized and the strengths and weaknesses of a specific classification system. Just as there is no single tool for every job there is no universal system for classification.

The New York Public Library is a good example of that realization as seen in the adoption of multiple call number systems over its 120-year history. The very first system used at the library was developed in 1899 by NYPL’s first president, John Shaw Billings. He wanted to develop a system that could efficiently organize the materials being stored in the library’s soon-to-open main branch, the Stephen A. Schwarzman building. In fact, Billings also contributed to how the new main building should be physically designed:

John Shaw Billings, sketch of the main building layout. Image ID: 465480

The classification scheme he came up with, known as the Billings system, could be thought of as a reflection of the physical layout of the main branch circa 1911. Billings was very practical in the description of his creation, writing:

Upon this classification it may be remarked that it is not a copy of any classification used elsewhere; that it is not especially original; that it is not logical so far as the succession of different departments in relation to the operations of the human mind is concerned; that it is not recommended for any other library, and that no librarian of another library would approve of it.

The system groups materials together by assigning each area of research a letter, A-Z (minus the letter J, more on that later). This letter, the first part of the call number, is known as a classmark.

For example, books cataloged under this system that are Biographies would have the first letter of their classmark, “A”. History is “B”, Literature is “N”, and so on through “Z”, which is Religion. More letters can be added to denote more specific areas within that subject. For example, “ZH” is about Ritual and Liturgy and “ZHW” is more specifically about Ancient and Medieval Ritual and Liturgy.

While this system was used to classify the materials held in the main branch’s stacks, there were also materials held in the reading rooms or special collections around the building. To organize these materials, he reused the same system but added an asterisk in front of the letter to make what he called the star groups — i.e., a classmark starting with a “K” is about Geography, but “*K” is a resource kept in the Rare Books division though not necessarily about Geography. With these star groups, the Billings system became a conflation of subject-, location-, and material-based systems. This overview document gives a good idea of the large range of classification that the Billings system covered:

Top level Billings classmarks


In the 1950s, the uptick in the acquisition of materials made the Billings system too inefficient to quickly catalog materials. While parts of the Billings system continued being used, even through to today, a general shift to a new fixed order system was made in 1956 and then refined in 1970.

The idea behind the fixed order systems is to group materials together by size to most efficiently store them. The library decided that discovery of materials could be achieved not by the classmark but by the resource’s subject headings. Subject headings are added to the record while it is being cataloged and provide a vector of discovery if the same subjects are uniformly applied.

In the United States the most common vocabulary of subject headings is the Library of Congress Subject Headings. Enabling resource  discovery through subject headings obviates the need for call numbers to organize materials. The call number can just be an identifier to physically locate the resource. The first fixed order system grouped items only by size:

Old fixed order system classmarks

The call numbers would look something like “B-10 200”, meaning it was the 200th 17cm monograph cataloged. This system was refined in 1970 to include a bit more contextual information about what the materials were about:

Revised fixed order system classmarks

Since the letter J was previously unused in the Billings system, it was used here as a prefix to the size system to add more context to the fixed order classification. For example, a “JFB 00-200” means it is still a 17cm monograph but is generally about the Humanities & Social Sciences.

While this fixed order system is used for the majority of the monographs acquired by the library there are special collections in the research divisions that use their own call number systems. Resources like archival collections or prints and photographs have records in the catalog that help locate them in their own division. For example, archival collections often have a call number that starts with “MssCol” while rare books at the Schomburg Center start with “Sc Rare”. This final diverse category of classification at NYPL drives home the obvious problem: the sheer number of classification systems at work reduces the call number to an esoteric identifier—especially for obsolete and legacy systems—useful to only the most veteran staff member. These identifiers have great potential to develop new avenues of discovery if they can be leveraged.

A Linked Space

An ambitious 19th-century librarian faced with this problem might come up with a simple solution: Let’s invent a new classification system that incorporates all these various types of call numbers into one centralized system. With the emergence of linked data in library metadata practice, however, when relationships between resources are gaining increased importance, a 21st-century librarian has an even better idea: Let’s link everything together!

In most existing library metadata systems, the call number is a property of the record; it is a field in the metadata that helps to describe the resource. The linked data alternative is to make the classmark its own entity and give it an identifier (URI), which allows us to describe the classmark and start making statements about it. This is nothing new in the library linked data world; the Library of Congress for example, started doing this for some of their vocabularies (including some LCC classmarks) years ago. But to accomplish this task at NYPL it took a combination of technical work, institutional knowledge sleuthing, and a lot of data cleanup.

The first step is to simply get a handle on what classmarks are in use at the library. By aggregating over 24 million catalog records and identifying the unique classmarks, we are able to create a dataset that contains all possible classmarks at the library. This new bottom-up approach to our call number data enables the next step of organization and description.
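As a rough illustration of that bottom-up aggregation step, here is a small sketch; it assumes a plain-text export with one call number per line, and the file name and field layout are made up for the example:

# Count every distinct classmark across a hypothetical export of catalog call numbers.
from collections import Counter

classmarks = Counter()
with open("catalog-call-numbers.txt") as export:       # made-up file name
    for line in export:
        call_number = line.strip()
        if call_number:
            # Treat the leading token as the classmark, e.g. "ZHW" in "ZHW (Smith, 1904)".
            classmarks[call_number.split()[0]] += 1

print(len(classmarks), "unique classmarks")
for mark, count in classmarks.most_common(10):
    print(mark, count)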

Institutional knowledge is hard to retain over 120 years of history. It takes the form of old documents and outdated technical memos. The Billings classification was first documented in a schedule (a monograph book) that lists each classmark and its meaning. Some of these original bound resources are still around the library and have gone through their own data migration journey.

Page from the bound Billings schedule

While the bound books are still in use today, most with copious marginalia, over the years this resource was converted to a typed document and then converted into the digital realm in the form of a MS Word document. This data was stewarded by our colleagues in BookOps who are the authority for cataloging at NYPL. We took this data and converted it into an RDF SKOS vocabulary and reconciled it with the raw classmark data aggregated from the catalog. This new dataset is comprehensive because it aligns what classmarks are supposed to be in use, from the documentation, with what is actually in use, from the data aggregation. Each classmark now has its own URI persistent identifier which we can begin making statements about:

Example triple statements for *H classmark - Libraries

This confusing-looking jumble of text is a bunch of RDF triples in the Turtle syntax describing the *H (Libraries) classmark. It describes the name of the classmark, the type of classmark it is, which narrower classmarks are related to it, how many resources use it in the catalog, and some mappings to other classification systems, among other data. Now that we have all the classmarks in a highly structured semantic data model, we can publish all these statements and start doing even more interesting things with them.
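For readers who have not seen Turtle or SKOS before, the following sketch shows how such a description might be built programmatically with rdflib in Python; the predicates, counts, and mappings here are illustrative guesses, not NYPL's actual data:

# Illustrative only: model the *H (Libraries) classmark as a SKOS concept.
# The narrower classmark, count, and LCC range below are invented for the example.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS, XSD

BILLI = Namespace("http://billi.nypl.org/classmark/")
VOCAB = Namespace("http://billi.nypl.org/vocab/")        # hypothetical predicate namespace

g = Graph()
g.bind("skos", SKOS)

h = BILLI["*H"]
g.add((h, RDF.type, SKOS.Concept))
g.add((h, SKOS.prefLabel, Literal("Libraries", lang="en")))
g.add((h, SKOS.narrower, BILLI["*HA"]))                  # a hypothetical narrower classmark
g.add((h, VOCAB["lccRange"], Literal("Z662-Z1000.5")))   # a hypothetical coarse LCC mapping
g.add((h, VOCAB["resourceCount"], Literal(1234, datatype=XSD.integer)))

print(g.serialize(format="turtle"))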

BILLI Home page

All of these statements about our classmarks are hosted on a system we are calling BILLI: Bibliographic Identifiers for Library Location Information, an homage to NYPL’s mustachioed first president John Shaw Billings.

The BILLI system allows you to explore the classmarks in use at the library, traverse the hierarchies, and link out to resources in the catalog. But the real power of having our classmark information in this linked data form is the ability to start building relationships by creating new data statements.

A logged in staff member can add notes or change the description of a classmark but, more importantly, is able to start linking to other linked data resources. Right now, staff members are able to connect our classmarks to Wikidata and DBpedia, two linked data sources connected to Wikipedia:

Staff interface for mapping classmarks
The system auto-suggests some possible connections, here for the *H Library classmark, and the staff member can select or search for a more appropriate entity. Once connected, we pull in some basic information to enrich the classmark page:

Classmarks page displaying external data


As well as the mapping relations:

New mapping relationships created

Now that a link has been established we can use the metadata found in those two sources in our own discovery system and even apply some of it to the resources in our catalog that use this classmark. While the system is currently only able to build these connections with Wikidata/DBpedia, we can explore other resources that it would make sense to map to and expand the network to library and non-library based linked data sources.

While these classmark pages are here for us humans to interact with, computers can also “read” these pages automatically as they are machine readable through content negotiation. To a computer requesting *H data, the page would appear like this: http://billi.nypl.org/classmark/*H/nt
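One way a client might retrieve the machine-readable form, sketched here with Python's requests library; the Accept header value is an assumption based on the description of content negotiation above, with the /nt URL from the post as a fallback:

# Sketch: ask BILLI for a machine-readable serialization of the *H classmark.
import requests

resp = requests.get("http://billi.nypl.org/classmark/*H",
                    headers={"Accept": "application/n-triples"})   # assumed media type
if "triples" not in resp.headers.get("Content-Type", ""):
    # Fall back to the explicit N-Triples URL mentioned in the post.
    resp = requests.get("http://billi.nypl.org/classmark/*H/nt")

print(resp.text[:500])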

A Space of Play

If classification is arbitrary, it is true that there are a number of other systems under which the library’s materials could be organized. The only limitation is that it would be cost prohibitive to apply a new system to millions of resources. But in this virtual space, existing and even newly-invented systems can be easily overlaid and applied to our materials.

The first classmarks listed on BILLI are grouped under “LCC Range.” This classmark system is based on the existing Library of Congress Classification, a system which has historically not been used at NYPL. LCC is traditionally used at most research libraries, but because NYPL used the Billings and then Fixed Order systems it was never adopted here. Due to new linked data services, however, we were able to retroactively reclassify our entire catalog with LCC classmarks.

Using OCLC’s Classify—a service that returns aggregate data about a resource from all institutions in the OCLC consortium—we are able to find out a resource’s LCC classmark from other libraries that hold the same title. While we were not able to match 100% of our resources to a Classify result we were able to apply a significant number of LCC classmarks to our materials.

For the materials for which we were unable to obtain an LCC, we can use some simple statistics to draw concordances between Billings and LCC. For example, if we have enough resources that have the same LCC and Billings classmark we can assume that the two are equivalent and apply that LCC to all the materials with that Billings classmark. One caveat is that LCC numbers can be very specific in their classification, much more so than Billings. What we need to do in order to map Billings to LCC is create more generalized, or coarser, LCC classmarks.
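The co-occurrence counting described here might look something like the sketch below; the input pairs, agreement threshold, and the crude first-letter "coarsening" are all invented for illustration (the actual approach builds coarse LCC ranges from the published LCC outlines, as described next):

# Sketch: infer a Billings -> coarse LCC concordance from records that carry both.
from collections import Counter, defaultdict

# Hypothetical (billings_classmark, lcc_classmark) pairs harvested from Classify matches.
pairs = [("ZHW", "BV169"), ("ZHW", "BV170"), ("ZHW", "BV176"), ("*H", "Z662")]

lcc_by_billings = defaultdict(Counter)
for billings, lcc in pairs:
    coarse = lcc[0]            # crude coarsening: keep only the top-level LCC class letter
    lcc_by_billings[billings][coarse] += 1

concordance = {}
for billings, counts in lcc_by_billings.items():
    coarse, n = counts.most_common(1)[0]
    if n >= 3 and n / sum(counts.values()) > 0.8:   # require enough agreement to trust it
        concordance[billings] = coarse

print(concordance)   # e.g. {'ZHW': 'B'}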

To accomplish this we parsed the freely available PDF LCC outlines found online into a new dataset with each classmark representing a large range of LCC numbers. Using this invented, yet still related, LCC range classmark we can map our Billings classmarks to their coarse LCC equivalents. Now, a researcher who is perhaps more familiar with LCC can easily navigate to corresponding Billings resources.

The LCC range classmark example represents an exciting opportunity to organize our materials in new ways within a virtual space. In this networked environment classification shifts from rigid hierarchy to fluid, interconnected mappings—the difference between dividing space and filling it.

LITA: Jobs in Information Technology: January 27, 2016

planet code4lib - Wed, 2016-01-27 19:42

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week:

SIU Edwardsville, Electronic Resources Librarian, Asst or Assoc Professor, Edwardsville, IL

Olin College of Engineering, Community and Digital Services Librarian, Needham, MA

Great Neck Library, Information Technology Director, Great Neck, NY

Art Institute of Chicago, Senior Application Developer for Collections, Chicago, IL

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

DPLA: Apply to our 4th Class of Community Reps

planet code4lib - Wed, 2016-01-27 16:00

We’re thrilled to announce today our fourth call for applications for the DPLA Community Reps program! The application for this fourth class of Reps will close on Friday, February 19, 2016.

What is the DPLA Community Reps program? In brief, we’re looking for enthusiastic volunteers who are willing to help us bring DPLA to their local communities through outreach activities or support DPLA by working on special projects. A local community could be a school, a library, an academic campus, a professional network in your space, or another group of folks who you think might be interested in DPLA and what it has to offer. Reps give a small commitment of time to community engagement, collaboration with fellow Reps and DPLA staff, and check-ins with DPLA staff. We have three terrific classes of reps from diverse places and professions.

With the fourth class, we are hoping to strengthen and expand our group geographically and professionally. The single most important factor in selection is the applicant’s ability to clearly identify communities they can serve and plan relevant outreach activities or DPLA-related projects for them. We are looking for enthusiastic, motivated people from the US and the world with great ideas above all else!

To answer general inquiries about what type of work reps normally engage in and to provide information about the program in general, we’re offering an open information and Q&A session with key DPLA staff members and current community reps.

Reps Info Session: Tuesday, February 9, 5:00 PM Eastern

If you would like to join this webinar, please register.

For more information about the DPLA Community Reps program, please contact info@dp.la.

Apply to the Community Reps program

LITA: Intro to Youth Coding Programs, a LITA webinar

planet code4lib - Wed, 2016-01-27 14:00

Attend this informative and fast paced new LITA webinar:

How Your Public Library Can Inspire the Next Tech Billionaire: an Intro to Youth Coding Programs

Thursday March 3, 2016
noon – 1:00 pm Central Time
Register Online, page arranged by session date
(login required)

Kids, tweens, teens and their parents are increasingly interested in computer programming education, and they are looking to public and school libraries as a host for the informal learning process that is most effective for learning to code. This webinar will share lessons learned through youth coding programs at libraries all over the U.S. We will discuss tools and technologies, strategies for promoting and running the program, and recommendations for additional resources. An excellent webinar for youth and teen services librarians, staff, volunteers and general public with an interest in tween/teen/adult services.

Takeaways

  • Inspire attendees about kids and coding, and convince them that the library is key to the effort.
  • Provide the tools, resources and information necessary for attendees to launch a computer coding program at their library.
  • Cultivate a community of coding program facilitators that can share ideas and experiences in order to improve over time.

Presenters:

Kelly Smith spent hundreds of hours volunteering at the local public library before realizing that kids beyond Mesa, Arizona could benefit from an intro to computer programming. With a fellow volunteer, he founded Prenda – a learning technology company with the vision of millions of kids learning to code at libraries all over the country. By day, he designs products for a California technology company. Kelly has been hooked on computer programming since his days as a graduate student at MIT.

Crystle Martin is a postdoctoral research scholar at the Digital Media and Learning Research Hub at the University of California, Irvine. She explores youth and connected learning in online and library settings and is currently researching implementation of Scratch in underserved community libraries, to explore new pathways to STEM interests for youth. Her 2014 book, titled “Voyage Across a Constellation of Information: Information Literacy in Interest-Driven Learning Communities,” reveals new models for understanding information literacy in the 21st century through a study of information practices among dedicated players of World of Warcraft. Crystle holds a PhD in Curriculum & Instruction specializing in Digital Media, with a minor in Library and Information Studies from the University of Wisconsin–Madison; serves on the Board of Directors for the Young Adult Library Services Association; and holds an MLIS from Wayne State University in Detroit, MI.

Justin Hoenke is a human being who has worked in youth services all over the United States and is currently the Executive Director of the Benson Memorial Library in Titusville, Pennsylvania. Before that, he was Coordinator of Teen Services at the Chattanooga Public Library in Chattanooga, TN, where Justin turned The 2nd Floor, a 14,000 square foot space for ages 0-18, into a destination that brings together learning, fun, the act of creating and making, and library service. Justin is a member of the 2010 American Library Association Emerging Leaders class and was named a Library Journal Mover and Shaker in 2013. His professional interests include public libraries as community centers, working with kids, tweens, and teens, library management, video games, and creative spaces. Follow Justin on Twitter at @justinlibrarian and read his blog at http://www.justinthelibrarian.com.

Register for the Webinar

Full details
Can’t make the date but still want to join in? Registered participants will have access to the recorded webinar.

Cost:

  • LITA Member: $45
  • Non-Member: $105
  • Group: $196

Registration Information:

Register Online, page arranged by session date (login required)
OR
Mail or fax form to ALA Registration
OR
call 1-800-545-2433 and press 5
OR
email registration@ala.org

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty, mbeatty@ala.org

SearchHub: example/files – a Concrete Useful Domain-Specific Example of bin/post and /browse

planet code4lib - Wed, 2016-01-27 03:31
The Series

This is the third in a three-part series demonstrating how it’s possible to build a real application using just a few simple commands.  The three parts to this are:

In the previous /browse article, we walked you through to the point of visualizing your search results from an aesthetically friendlier perspective using the VelocityResponseWriter. Let’s take it one step further.

example/files – your own personal Solr-powered file-search engine

The new example/files offers a Solr-powered search engine tuned specially for rich document files. Within seconds you can download and start Solr, create a collection, post your documents to it, and enjoy the ease of querying your collection. The /browse experience of the example/files configuration has been tailored for indexing and navigating a bunch of “just files”, like Word documents, PDF files, HTML, and many other formats.

Above and beyond the default data driven and generic /browse interface, example/files features the following:

  • Distilled, simple, document type navigation
  • Multi-lingual, localizable interface 
  • Language detection and faceting
  • Phrase/shingle indexing and “tag cloud” faceting
  • E-mail address and URL index-time extraction
  • “instant search” (as you type results)
Getting started with example/files

Start up Solr and create a collection called “files”:

bin/solr start
bin/solr create -c files -d example/files

Using the -d flag when creating a Solr collection specifies the configuration from which the collection will be built, including indexing configuration and scripting and UI templates.

Then index a directory full of files:

bin/post -c files ~/Documents

Depending on how large your “Documents” folder is, this could take some time. Sit back and wait for a message similar to the following:

23731 files indexed. COMMITting Solr index changes to http://localhost:8983/solr/files/update… Time spent: 0:11:32.323

And then open /browse on the files collection:

open http://localhost:8983/solr/files/browse

The UI is the App

With example/files we wanted to make the interface specific to the domain of file search.  With that in mind, we implemented a file-domain specific ability to facet and filter by high level “types”, such as Presentation, Spreadsheet, and PDF.   Taking a UI/UX-first approach, we also wanted “instant search” and a localizable interface.

The rest of this article explains, from the outside-in, the design and implementation from UI and URL aesthetics down to the powerful Solr features that make it possible.

URLs are UI too!

“…if you think about how you design them” – Cool URIs

Besides the HTML/JavaScript/CSS “app” of example/files, care was taken on the aesthetics and cleanliness of the other user interface, the URL.  The URLs start with /browse, describing the user’s primary activity in this interface – browsing a collection of documents.

Browsing by document type

Results can be filtered by document “type” using the links at the top.

  

As you click on each type, you can see the “type” parameter changing in the URL request.

For the aesthetics of the URL, we decided filtering by document type should look like this: /browse?type=pdf (or type=html, type=spreadsheet, etc).  The interface also supports two special types: “all” to select all types and “unknown” to select documents with no document type.

At index-time, the type of a document is identified.  An update processor chain (files-update-processor) is defined to run a script for each document.  A series of regular expressions determine the high-level type of the document, based off of the inherent “content_type” (MIME type) field set for each rich document indexed.  The current types are doc, html, image, spreadsheet, pdf, and text.  If a high-level type is recognized,  a doc_type field is set to that value.
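To make the idea concrete, here is a rough restatement of that mapping logic in Python; it is purely illustrative, since the real rules live in the update processor's JavaScript (update-script.js), and the exact patterns below are approximations rather than the shipped ones:

# Illustration only: map a raw MIME type to a coarse doc_type, the way the
# files-update-processor script does at index time (patterns are approximate).
import re

TYPE_PATTERNS = [
    (r"application/pdf", "pdf"),
    (r"text/html|application/xhtml", "html"),
    (r"^image/", "image"),
    (r"spreadsheet|ms-excel|csv", "spreadsheet"),
    (r"wordprocessingml|msword|opendocument\.text", "doc"),
    (r"^text/", "text"),
]

def doc_type(content_type):
    for pattern, high_level in TYPE_PATTERNS:
        if re.search(pattern, content_type):
            return high_level
    return None   # no doc_type field is added when nothing matches

print(doc_type("application/vnd.ms-excel"))   # spreadsheet
print(doc_type("application/octet-stream"))   # None -> shows up under "unknown" in the UI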

No doc_type field is added if the content_type does not have an appropriate higher level mapping, an important aspect to the filtering technique specifics.  The /browse handler definition was enhanced with the following parameters to enable doc_type faceting and filtering using our own “type=…” URL parameter to filter by any of the types, including “all” or “unknown”:

  • facet.field={!ex=type}doc_type
  • facet.query={!ex=type key=all_types}*:*
  • fq={!switch v=$type tag=type case='*:*' case.all='*:*' case.unknown='-doc_type:[* TO *]' default=$type_fq}

There are some details of how these parameters are set worth mentioning here.  Two parameters, facet.field and facet.query, are specified in params.json utilizing the “paramset” feature of Solr.  And the fq parameter is appended in the /browse definition in solrconfig.xml (because paramsets don’t currently allow appending, only setting, parameters). 

The faceting parameters exclude the “type” filter (defined on the appended fq), such that the counts of the types shown aren’t affected by type filtering (narrowing to “image” types still shows “pdf” type counts rather than 0).  There’s a special “all_types” facet query specified that provides the count for all documents within the set constrained by the query and other filters.  And then there’s the tricky fq parameter, leveraging the “switch” query parser that controls how the type filtering works from the custom “type” parameter.  When no type parameter is provided, or type=all, the type filter is set to “all docs” (via *:*), effectively not filtering by type.  When type=unknown, the special -doc_type:[* TO *] filter is used (note the dash/minus sign to negate), matching all documents that do not have a doc_type field.  And finally, when a “type” parameter other than all or unknown is provided, the filter used is defined by the “type_fq” parameter which is defined in params.json as type_fq={!field f=doc_type v=$type}.  That type_fq parameter specifies a field value query (effectively the same as fq=doc_type:pdf, when type=pdf) using the field query parser (which will end up being a basic Lucene TermQuery in this case).

That’s a lot of Solr mojo just to be able to say type=image from the URL, but it’s all about the URL/user experience so it was worth the effort to implement and hide the complexity.

Localizing the interface

The example/files interface has been localized in multiple languages. Notice the blue globe icon in the top right-hand corner of the /browse UI.  Hover over the globe icon and select a language in which to view your collection.

Each text string displayed is defined in standard Java resource bundles (see the files under example/files/browse-resources).  For example, the text (“Find” in English) that appears just before the search input box is specified in each of the language-specific resource files as:

English: find=Find
French: find=Recherche
German: find=Durchsuchen

The VelocityResponseWriter’s $resource tool picks up on a locale setting.  In the browse.vm (example/files/conf/velocity/browse.vm) template, the “find” string is specified generically like this:

$resource.find: <input name="q"…/>

From the outside, we wanted the parameter used to select the locale to be clean and hide any implementation details, like /browse?locale=de_DE.  

The underlying parameter needed to control the VelocityResponseWriter $resource tool’s locale is v.locale, so we use another Solr technique (parameter substitution) to map from the outside locale parameter to the internal v.locale parameter.

This parameter substitution is different than “local param substitution” (used with the “type” parameter settings above) which only applies as exact param substitution within the {!… syntax} as dollar signed non-curly bracketed {!… v=$foo} where the parameter foo (&foo=…) is substituted in. The dollar sign curly bracketed syntax can be used as an in-place text substitution, allowing a default value too like ${param:default}.

To get the URLs to support a locale=de_DE parameter, it is simply substituted as-is into the actual v.locale parameter used to set the locale within the Velocity template context for UI localization. In params.json we’ve specified v.locale=${locale}

Language detection and faceting

It can be handy to filter a set of documents by its language.  Handily, Solr sports two(!) different language detection implementations so we wired one of them up into our update processor chain like this:

<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">content</str>
    <str name="langid.langField">language</str>
  </lst>
</processor>

With the language field indexed in this manner, the UI simply renders its facets (facet.field=language, in params.json), allowing filtering too.

Phrase/shingle indexing and “tag cloud” faceting

Seeing common phrases can be used to get the gist of a set of documents at a glance.  You’ll notice the top phrases change as a result of the “q” parameter changing (or filtering by document type or language).  The top phrases reflect phrases that appear most frequently in the subset of results returned for a particular query and applied filters. Click on a phrase to display the documents in your results set that contain the phrase. The size of the phrase corresponds to the number of documents containing that phrase.

Phrase extraction of the “content” field text occurs by copying to a text_shingles field which creates phrases using a ShingleFilter.  This feature is still a work in progress and needs improvement in extracting higher quality phrases; the current rough implementation isn’t worth adding a code snippet here to imply folks should copy/paste emulate it, but here’s a pointer to the current configuration – https://github.com/apache/lucene-solr/blob/branch_5x/solr/example/files/conf/managed-schema#L408-L427 

E-mail address and URL index-time extraction

One, currently unexposed, feature added for fun is the index-time extraction of e-mail addresses and URLs from document content.  With phrase extraction as described above, the use is to allow for faceting and filtering, but when looking at an individual document we didn’t need the phrases stored and available. In other words, text_shingles did not need to be a stored field, and thus we could leverage the copyField/fieldType technique.  But for extracted e-mail addresses and URLs, it’s useful to have these as stored (multi-valued), not just indexed terms… which means our indexing pipeline needs to provide these independently stored values.  The copyField/fieldType-extraction technique won’t suffice here.  However, we can use a field type definition to help, and take advantage of its facilities within an update script.  Update processors, like the script one used here, allow for full manipulation of an incoming document, including adding additional fields, and thus their value can be “stored”.  Here are the configuration pieces that extract e-mail addresses and URLs from text:

The Solr admin UI analysis tool is useful for seeing how this field type works. The first step, through the UAX29URLEmailTokenizer, tokenizes the text in accordance with the Unicode UAX29 segmentation specification with the special addition to recognize and keep together e-mail addresses and URLs. During analysis, the tokens produced also carry along a “type”. The following screenshot depicts the Solr admin analysis tool results of analyzing an “e-mail@lucidworks.com https://lucidworks.com” string with the text_email_url field type. The tokenizer tags e-mail addresses with a type of, literally, “<EMAIL>” (angle brackets included), and URLs as “<URL>”. There are other types of tokens that URL/email tokenizer emits, but for this purpose we only want to screen out everything but e-mail addresses and URLs. Enter TypeTokenFilter, allowing only a strictly specified set of token type values to pass through. In the screenshot you’ll notice the text “at” was identified as type “<ALPHANUM>”, and did not pass through the type filter. An external text file (email_url_types.txt) contains the types to pass through, and simply contains two lines with the values “<URL>” and “<EMAIL>”.

So now we have a field type that can do the recognition and extraction of e-mail address and URLs. Let’s now use it from within the update chain, conveniently possible in update-script.js. With some scary looking JavaScript/Java/Lucene API voodoo, it’s achieved with the code shown above in update-script.js.  That code is essentially how indexed fields get their terms, we’re just having to do it ourselves to make the values *stored*.
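For readers who just want the gist without the Lucene analysis API, the following is a loose Python approximation of what the script accomplishes; the regular expressions are simplifications, and the output field names are assumptions rather than the ones used in the shipped config:

# Loose approximation, for illustration only: the real config does this with
# UAX29URLEmailTokenizer + TypeTokenFilter driven from update-script.js.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL = re.compile(r"https?://\S+")

def enrich(doc):
    """Add stored, multi-valued e-mail/URL fields to an incoming document dict."""
    content = doc.get("content", "")
    emails = EMAIL.findall(content)
    urls = URL.findall(content)
    if emails:
        doc["emails_ss"] = emails   # assumed field name
    if urls:
        doc["urls_ss"] = urls       # assumed field name
    return doc

print(enrich({"content": "Contact e-mail@lucidworks.com or see https://lucidworks.com"}))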

This technique was originally described in the "Analysis in ScriptUpdateProcessor" section of this presentation: http://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks

example/files demonstration video

Thanks go to Esther Quansah who developed much of the example/files configuration and produced the demonstration video during her internship at Lucidworks.

What’s next for example/files?

An umbrella Solr JIRA issue has been created to note these desirable fixes and improvements – https://issues.apache.org/jira/browse/SOLR-8590 – including the following items:

  • Fix e-mail and URL field names (<email>_ss and <url>_ss, with angle brackets in field names), also add display of these fields in /browse results rendering
  • Harden update-script: it currently errors if documents do not have a “content” field
  • Improve quality of extracted phrases
  • Extract, facet, and display acronyms
  • Add sorting controls, possibly all or some of these: last modified date, created date, relevancy, and title
  • Add grouping by doc_type perhaps
  • Fix debug mode – currently does not update the parsed query debug output (this is probably a bug in data driven /browse as well)
  • Filter out bogus extracted e-mail addresses

The first two items were fixed and patch submitted during the writing of this post.

Conclusion

Using example/files is a great way of exploring the built-in capabilities of Solr specific to rich text files. 

A lot of Solr configuration and parameter trickery makes /browse?locale=de_DE&type=html a much cleaner way to do this: /select?v.locale=de_DE&fq={!field%20f=doc_type%20v=html}&wt=velocity&v.template=browse&v.layout=layout&q=*:*&facet.query={!ex=type%20key=all_types}*:*&facet=on… (and more default params) 

Mission to “build a real application using just a few simple commands” accomplished!   It’s so succinct and clean that you can even tweet it!

https://lucidworks.com/blog/2016/01/27/example_files:$ bin/solr start; bin/solr create -c files -d example/files; bin/post -c files ~/Documents #solr

The post example/files – a Concrete Useful Domain-Specific Example of bin/post and /browse appeared first on Lucidworks.com.

Karen G. Schneider: Holding infinity

planet code4lib - Wed, 2016-01-27 03:21
To see a World in a Grain of Sand
And a Heaven in a Wild Flower
Hold Infinity in the palm of your hand
And Eternity in an hour

This weekend Sandy and I had a scare which you have heard about if you follow me on Facebook. I won’t repeat all of it here, but our furnace was leaking carbon monoxide, the alarms went off, firefighters came, then left, then came back later to greet us as we sat on our stoop in our robes and pajamas, agreeing the second time that it wasn’t bad monitor batteries as they walked slowly through our home, waving their magic CO meter; they stayed a very long time and aired out rooms and closets and… well. I could see that big crow walking over our graves, its eyes shining, before its wingspan unfurled and it rose into the night, disgruntled to have lost us back to the living. After a chilly (but not unbearable) weekend in an unheated house, our landlord, who is a doll, immediately and graciously replaced the 26-year-old furnace with a spiffy new model that is quiet and efficient and not likely to kill us anytime soon.

Meanwhile, we both had colds (every major crisis in my life seems to be accompanied by head colds), and I was trying valiantly to edit my dissertation proposal for issues major and minor that my committee had shared with me. Actually, at first it was a struggle, but then it became a refuge. Had I known I would be grappling with the CO issue later on Saturday, I would not have found so many errands to run that morning, my favorite method of procrastination. But by the next morning, editing my proposal seemed like a really, really great thing to be doing, me with my fully-alive body. I had a huge batch of posole cooking on the range, and the cat snored and Sandy sneezed and when I got tired of working on the dissertation I gave myself a break to work on tenure and promotion letters or to contemplate statewide resource-sharing scenarios (because I am such a fun gal).

I really liked my Public Editor idea for American Libraries and would like to see something happen in that vein, but after ALA I see that it is an idea that needs more than me as its champion, at least through this calendar year. There’s mild to moderate interest, but not enough to warrant dropping anything I’m currently involved in to make it happen. It’s not forgotten, it’s just on a list of things I would like to make happen.

That said, this ALA in Boston–ok, stand back, my 46th, if you count every annual and midwinter–was marvelous for its personal connections. Oh yes, I learned more about scholarly communications and open access and other Things. But the best ideas I garnered came from talking with colleagues, and the best moments did too. Plus two delightful librarians introduced me to Uber and the Flour Bakery in the same madcap hour. I was a little disappointed they weren’t more embarrassed when I told the driver it was my first Uber ride. I am still remembering that roast lamb sandwich. And late-night conversations with George. And early-evening cocktails with Grace. And a proper pub pint with Lisa. And the usual gang for our usual dinner. And a fabulous GLBTRT social. And breakfast with Brett. And how wonderful it was to stay in a hotel where so many people I know were there. And the hotel clerk who said YOU ARE HALF A BLOCK FROM THE BEST WALGREENS IN THE WORLD and he was right.
It’s hard to explain… unless you remember the truly grand Woolworth stores of yesteryear, such as the store at Powell and Market that had a massive candy counter, a fabric and notions section, every possible inexpensive wristwatch one could want for, and a million other fascinating geegaws.

Sometimes these days I get anxious that I need to get such-and-such done in the window of calm. It’s true, it’s better to be an ant than a grasshopper. I would not have spent Saturday morning tootling from store to store in search of cilantro and pork shoulder had I known I would have spent Saturday afternoon and evening looking up “four beeps on a CO monitor” and frantically stuffing two days’ worth of clothes into a library tote bag (please don’t ask why I didn’t use the suitcase sitting right there) as we prepared to evacuate our home. But I truly don’t have that much control over my life. I want it, but I don’t have it. Yes, it’s good to plan ahead. We did our estate planning (hello, crow!) and made notebooks to share with one another (hi crow, again!) and try to be mindful that things happen on a dime. But if I truly believed life was that uncertain, I couldn’t function. On some level I have to trust that the sounds I hear tonight–Sandy whisking eggs for an omelette, cars passing by our house on a wet road, the cat padding from room to room, our dear ginger watchman–will be the sounds I hear tomorrow and tomorrow. Even if I know–if nothing else, from the wide shadow of wings passing over me–that will not always be the case.

Onward into another spring semester. There aren’t many students in the library just yet. They aren’t frantically stuffing any tote bags, not for their lives, not for their graduations, not for even this semester. They’ll get there. It will be good practice.

Mashcat: Upcoming webinars in early 2016

planet code4lib - Tue, 2016-01-26 17:19

We’re pleased to announce that several free webinars are scheduled for the first three months of 2016. Mark your calendars!

26 January 2016 (14:00-17:00 UTC / 09:00-12:00 EST), Owen Stephens: Installing OpenRefine
This webinar will be an opportunity for folks to see how OpenRefine can be installed and to get help doing so, and serves as preparation for the webinar in March. There will also be folks at hand in the Mashcat Slack channel to assist.
Recording / Slides (pptx)

19 February 2016 (18:00-19:00 UTC / 13:00-14:00 EST), Terry Reese: Evolving MarcEdit: Leveraging Semantic Data in MarcEdit
Library metadata is currently evolving — and whether you believe this evolution will lead to a fundamental change in how libraries manage their data (as envisioned via BibFrame) or more of an incremental change (like RDA), one thing that is clear is the merging of traditional library data and semantic data. Over the next hour, I’d like to talk about how this process is impacting how MarcEdit is being developed, and look at some of the ways that libraries can not just begin to embed semantic data into their bibliographic records right now, but also begin to build new services around semantic data sources to improve local workflows and processes.

14 March 2016 (16:00-17:30 UTC / 11:00-12:30 EST), Owen Stephens: (Meta)data tools: Working with OpenRefine
OpenRefine is a powerful tool for analyzing, fixing, improving and enhancing data. In this session the basic functionality of OpenRefine will be introduced, demonstrating how it can be used to explore and fix data, with particular reference to the use of OpenRefine in the context of library data and metadata.

The registration link for each webinar will be communicated in advance. Many thanks to Alison Hitchens and the University of Waterloo for offering up their Adobe Connect instance to host the webinars.

David Rosenthal: Emulating Digital Art Works

planet code4lib - Tue, 2016-01-26 16:00
Back in November a team at Cornell led by Oya Rieger and Tim Murray produced a white paper for the National Endowment for the Humanities entitled Preserving and Emulating Digital Art Objects. It was the result of two years of research into how continuing access could be provided to the optical disk holdings of the Rose Goldsen Archive of New Media Art at Cornell. Below the fold, some comments on the white paper.

Early in the project their advisory board strongly encouraged them to focus on emulation as a strategy, advice that they followed. Their work thus parallels to a considerable extent the German National Library's (DNB's) use of Freiburg's Emulation As A Service (EAAS) to provide access to their collection of CD-ROMs. The Cornell team's contribution includes surveys of artists, curators and researchers to identify their concerns about emulation because, as they write:
emulation is not always an ideal access strategy: emulation platforms can introduce rendering problems of their own, and emulation usually means that users will experience technologically out-of-date artworks with up-to-date hardware. This made it all the more important for the team to survey media art researchers, curators, and artists, in order to gain a better sense of the relative importance of the artworks' most important characteristics for different kinds of media archives patrons.

The major concern they reported was experiential fidelity:
Emulation was controversial for many, in large part for its propensity to mask the material historical contexts (for example, the hardware environments) in which and for which digital artworks had been created. This part of the artwork's history was seen as an element of its authenticity, which the archiving institution must preserve to the best of its ability, or lose credibility in the eyes of patrons. We determined that cultural authenticity, as distinct from forensic or archival authenticity, derived from a number of factors in the eyes of the museum or archive visitor. Among our survey respondents, a few key factors stood out: acknowledgement of the work's own historical contexts, preservation of the work's most significant properties, and fidelity to the artist's intentions, which is perhaps better understood as respect for the artist's authority to define the work's most significant properties.

As my report pointed out (Section 2.4.3), hardware evolution can significantly impair the experiential fidelity of legacy artefacts, and (Section 3.2.2) the current migration from PCs to smartphones as the access device of choice will make the problem much worse. Except in carefully controlled "reading room" conditions the Cornell team significantly underestimate the problem:
Accessing historical software with current hardware can subtly alter aspects of the work's rendering. For example, a mouse with a scroll wheel may permit forms of user interactivity that were not technologically possible when a software-based artwork was created. Changes in display monitor hardware (for example, the industry shift from CRT to LED display) brings about color shifts that are difficult to calibrate or compensate for. The extreme disparity between the speed of current and historical processors can lead to problems with rendering speed, a problem that is unfortunately not trivial to solve.

They overestimate a different part of the problem when they write:
emulators, too, are condemned to eventual obsolescence; as new operating systems emerge, the distance between "current" and "historical" operating systems must be recalculated, and new emulators created to bridge this distance anew. We attempted to establish archival practices that would mitigate these instabilities. For example, we collected preservation metadata specific to emulators that included documentation of versions used, rights information about firmware, date and source of download, and all steps taken in compiling them, including information about the compiling environment. We were also careful to keep metadata for artworks emulator-agnostic, in order to avoid future anachronism in our records.

If the current environment they use to access a digital artwork is preserved, including the operating system and the emulator the digital artwork currently needs, future systems will be able to emulate the current environment. Their description of the risk of emulator obsolescence assumes we are restricted to a single layer of emulation. We aren't. Multi-layer emulations have a long history, for example in the IBM world, and in the Internet Archive's software collection.

Ilya Kreymer's oldweb.today shows that another concern the Cornell team raise is also overblown:
The objective of a 2013 study by the New York Art Resources Consortium (NYARC) was to identify the organizational, economic, and technological challenges posed by the rapidly increasing number of web-based resources that document art history and the art market.[18] One of the conclusions of the study was that regardless of the progress made, "it often feels that the more we learn about the ever-evolving nature of web publishing, the larger the questions and obstacles loom." Although there are relevant standards and technologies, web archiving solutions remain costly, and harvesting technologies as of yet lack the maturity to completely capture the more complex cases. The study concluded that there needs to be organized efforts to collect and provide access to art resources published on the web.

The ability to view old web sites with contemporary browsers provided by oldweb.today should allay these fears.

Ultimately, as do others in the field, the Cornell team takes a pragmatic view of the potential for experiential fidelity, refusing to make the best be the enemy of the good.
The trick is finding ways to capture the experience - or a modest proxy of it - so that future generations will get a glimpse of how early digital artworks were created, experienced, and interpreted. So much of new media works' cultural meaning derives from users' spontaneous and contextual interactions with the art objects. Espenschied, et al. point out that digital artworks relay digital culture and "history is comprehended as the understanding of how and in which contexts a certain artifact was created and manipulated and how it affected its users and surrounding objects."

Terry Reese: MarcEdit update Posted

planet code4lib - Tue, 2016-01-26 06:04

I’ve posted an update for all versions – changes noted here:

The significant change was a shift in how the linked data processing works.  I’ve shifted from hard code to a rules file.  You can read about that here: http://blog.reeset.net/archives/1887

If you need to download the file, you can get it from the automated update tool or from: http://marcedit.reeset.net/downloads.

–tr

Library Tech Talk (U of Michigan): The Next Mirlyn: More Than Just a Fresh Coat of Paint

planet code4lib - Tue, 2016-01-26 00:00

The next version of Mirlyn (mirlyn.lib.umich.edu) is going to take some time to create, but let's take a peek under the hood and see how the next generation of search will work.

Library Tech Talk (U of Michigan): Designing for the Library Website

planet code4lib - Tue, 2016-01-26 00:00

This post is a brief overview of the process of designing for large web-based systems. This includes understanding what makes up an interface and how to start fresh to create a good foundation that won't be regrettable later.
