planet code4lib


Peter Murray: Top Tech Trends, ALA Annual 2015 edition: Local and Unique; New metrics and citation tools

Fri, 2015-06-12 17:50

I threw my hat into the ring to be on the LITA Top Tech Trends panel at the ALA annual conference later this month in San Francisco, and I could not be more excited about not being selected. (You can find more info on this year’s Top Tech Trends session in the ALA Conference Scheduler.) There is a great lineup of panelists this year:

  • Sarah Houghton (Moderator), Director of the San Rafael Public Library – @TheLiB
  • Carson Block, 20-year veteran of libraries and now a library technology consultant – @CarsonBlock
  • Andrea Davis, who has her fingers in so many pies that you should really check out her LinkedIn profile – @detailmatters
  • Grace Dunbar, Vice President of Equinox Software, Inc
  • Bonnie Tijerina, Fellow at the Data & Society Institute in New York City – @bonlth

As part of the process, the committee asks potential panelists to explain two trends, why they are important, and how they will affect libraries. Although I’m not on the panel this year, I thought it would be useful to post my two trends here and see what others thought.

Making the local and the unique available to everyone and everywhere

A core part of the mission of libraries and other cultural heritage organizations has been to collect, preserve, and make available the resources that are unique to their users, their location, and their specialization. Up until recent years, awareness of these resources spread by word-of-mouth, by citation in published literature, by large national union catalog volumes, and by short bibliographic records made available first through OCLC and then by online catalogs. With the conception and execution of projects like DPLA and Europeana, broad audiences are finding digital surrogates (or digital copies) of these resources. And most recently IMLS highlighted its intent to build up capabilities through its focus on a National Digital Platform and its funding of the Hydra-in-a-Box initiative. How does that change the collection development mission? Or the research and reference mission? It isn’t so much a matter of remaining relevant as it is of serving different audiences and different needs. And I think public libraries are impacted the most. How can the experiences of large (typically academic) libraries be applied on a large scale to cultural heritage organizations of all types?

New metrics and new citation tools

Last year’s Library Horizon Report from the New Media Consortium listed “Bibliometrics and Citation Technologies” with a Time-to-Adoption horizon of three years. “Alt Metrics” has taken off to the point where NISO’s Todd Carpenter suggests we should simply call them “metrics” because there is nothing alternative about them. If the tools of the day allow us to measure the impact of our scholars’ work down to the article and dataset level, how will that affect the library’s mission to collect, offer, and preserve materials of local interest? If annotation frameworks like take off, what is the role of libraries in preserving and contextualizing those additions to the scholarly conversation?

So! If you were able to sit on the panel, what would your two top technology trends be for this year?

Eric Lease Morgan: Some automated analysis of Ralph Waldo Emerson’s works

Fri, 2015-06-12 17:26

This page describes a corpus named emerson, and it was programmatically created with a program called the HathiTrust Research Center Workset Browser.

General statistics

An analysis of the corpus’s metadata provides an overview of what and how many things it contains, when things were published, and the sizes of its items:

  • Number of items – 62
  • Publication date range – 1838 to 1956 (histogram : boxplot)
  • Sizes in pages – 20 to 660 (histogram : boxplot)
  • Total number of pages – 11866
  • Average number of pages per item – 191

Possible correlations between numeric characteristics of records in the catalog can be illustrated through a matrix of scatter plots. As you would expect, there is almost always a correlation between pages and number of words. Do others exist? For more detail, browse the catalog.
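A scatter plot shows a correlation visually; a number such as Pearson's r can back it up. The sketch below is only an illustration in Python, not code from the Workset Browser, and the page and word counts in `catalog` are made-up sample values.

```python
from math import sqrt

def pearson(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / sqrt(var_x * var_y)

# Hypothetical catalog rows: (pages, words) per item. Not real corpus data.
catalog = [(20, 5200), (191, 61000), (340, 98000), (660, 210000)]
pages = [p for p, _ in catalog]
words = [w for _, w in catalog]
r = pearson(pages, words)
print(f"pages vs. words: r = {r:.2f}")
```

A value of r close to 1 would agree with the expectation that longer items contain more words; the same function could be pointed at any other pair of numeric columns in a catalog.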

Notes on word usage

By counting and tabulating the words in each item of the corpus, it is possible to measure additional characteristics:

Perusing the list of all words in the corpus (and their frequencies) as well as all unique words can prove to be quite insightful. Are there one or more words in these lists connoting an idea of interest to you, and if so, then to what degree do these words occur in the corpus?

To begin to see how words of your choosing occur in specific items, search the collection.

Through the creation of locally defined “dictionaries” or “lexicons”, it is possible to count and tabulate how specific sets of words are used across a corpus. This particular corpus employs three such dictionaries — sets of: 1) “big” names, 2) “great” ideas, and 3) colors. Their frequencies are listed below:

The distribution of words (histograms and boxplots) and the frequency of words (wordclouds), and how these frequencies “cluster” together can be illustrated:
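As a rough illustration of the lexicon counting described in this section, the Python sketch below measures what share of an item's words come from a given lexicon and picks the item with the highest share. It is one plausible approach, not the Workset Browser's actual method; the `COLORS` word list and the two miniature "items" are hypothetical.

```python
import re
from collections import Counter

# Hypothetical lexicon; the actual word lists used for this corpus are not shown here.
COLORS = {"red", "green", "blue", "white", "black", "yellow"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def lexicon_share(text, lexicon):
    """Return the fraction of words in text that belong to lexicon."""
    words = tokenize(text)
    if not words:
        return 0.0
    counts = Counter(words)
    hits = sum(counts[w] for w in lexicon)
    return hits / len(words)

# Made-up miniature "items" standing in for full texts.
items = {
    "item-a": "The green woods and the blue pond lay under a white sky.",
    "item-b": "Self-reliance is the soul of every great thought.",
}
shares = {name: lexicon_share(text, COLORS) for name, text in items.items()}
most_colorful = max(shares, key=shares.get)
print(most_colorful, round(shares[most_colorful], 3))
```

Dividing by the item's length keeps long items from winning simply because they contain more words overall.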

Items of interest

Based on the information above, the following items (and their associated links) are of possible interest:

  • Shortest item (20 p.) – The wisest words ever written on war / by R.W. Emerson … Preface by Henry Ford. (HathiTrust : WorldCat : plain text)
  • Longest item (660 p.) – Representative men : nature, addresses and lectures. (HathiTrust : WorldCat : plain text)
  • Oldest item (1838) – An address delivered before the senior class in Divinity College, Cambridge, Sunday evening, 15 July, 1838 / by Ralph Waldo Emerson. (HathiTrust : WorldCat : plain text)
  • Most recent (1956) – Emerson at Dartmouth; a reprint of his oration, Literary ethics. With an introd. by Herbert Faulkner West. (HathiTrust : WorldCat : plain text)
  • Most thoughtful item – Transcendentalism : and other addresses / by Ralph Waldo Emerson. (HathiTrust : WorldCat : plain text)
  • Least thoughtful item – Emerson-Clough letters, edited by Howard F. Lowry and Ralph Leslie Rusk. (HathiTrust : WorldCat : plain text)
  • Biggest name dropper – A letter of Emerson : being the first publication of the reply of Ralph Waldo Emerson to Solomon Corner of Baltimore in 1842 ; With analysis and notes by Willard Reed. (HathiTrust : WorldCat : plain text)
  • Fewest quotations – The wisest words ever written on war / by R.W. Emerson … Preface by Henry Ford. (HathiTrust : WorldCat : plain text)
  • Most colorful – Excursions. Illustrated by Clifton Johnson. (HathiTrust : WorldCat : plain text)
  • Ugliest – An address delivered before the senior class in Divinity College, Cambridge, Sunday evening, 15 July, 1838 / by Ralph Waldo Emerson. (HathiTrust : WorldCat : plain text)

Eric Lease Morgan: Some automated analysis of Henry David Thoreau’s works

Fri, 2015-06-12 17:24

This page describes a corpus named thoreau, and it was programmatically created with a program called the HathiTrust Research Center Workset Browser.

General statistics

An analysis of the corpus’s metadata provides an overview of what and how many things it contains, when things were published, and the sizes of its items:

  • Number of items – 32
  • Publication date range – 1866 to 1953 (histogram : boxplot)
  • Sizes in pages – 38 to 556 (histogram : boxplot)
  • Total number of pages – 7918
  • Average number of pages per item – 247

Possible correlations between numeric characteristics of records in the catalog can be illustrated through a matrix of scatter plots. As you would expect, there is almost always a correlation between pages and number of words. Do others exist? For more detail, browse the catalog.

Notes on word usage

By counting and tabulating the words in each item of the corpus, it is possible to measure additional characteristics:

Perusing the list of all words in the corpus (and their frequencies) as well as all unique words can prove to be quite insightful. Are there one or more words in these lists connoting an idea of interest to you, and if so, then to what degree do these words occur in the corpus?

To begin to see how words of your choosing occur in specific items, search the collection.

Through the creation of locally defined “dictionaries” or “lexicons”, it is possible to count and tabulate how specific sets of words are used across a corpus. This particular corpus employs three such dictionaries — sets of: 1) “big” names, 2) “great” ideas, and 3) colors. Their frequencies are listed below:

The distribution of words (histograms and boxplots) and the frequency of words (wordclouds), and how these frequencies “cluster” together can be illustrated:

Items of interest

Based on the information above, the following items (and their associated links) are of possible interest:

Harvard Library Innovation Lab: Link roundup June 12, 2015

Fri, 2015-06-12 16:59

This is the good stuff.

Paul Ford: What is Code? | Bloomberg

well worth the very long read

How 77 Metro Agencies Design the Letter ‘M’ for Their Transit Logo – CityLab

77 different versions of the letter ‘M’ in mass transit signs around the world

Go To Hellman: Protect Reader Privacy with Referrer Meta Tags

HTML 5 referrer meta element is a new and easy way to not overshare.

Can the Swiss Watchmaker Survive the Digital Age?

The clock itself was a first step toward the “quantified self,”

‘Passports’ To Vermont Libraries Encourage Literary Exploration

Take a tour of Vermont libraries. Be sure to get your passport stamped.

District Dispatch: Congressional Republicans open new attack on FCC net neutrality rules

Fri, 2015-06-12 16:25

Yesterday the House Appropriations Subcommittee released – and today approved – its FY 2016 Financial Services Appropriations Bill providing funding for the Federal Communications Commission and other agencies. But House Republicans included a “net neutrality surprise” in their funding bill.

Tucked in this $20.2 billion funding bill is language that would prohibit the FCC from implementing the net neutrality order, issued February 26, 2015, until three specific legal challenges are fully resolved, including any available appeals. This provision could likely delay implementation of the net neutrality order for several years. (The three challenges specifically noted in the bill were Alamo Broadband Inc. v. FCC, United States Telecom Association v. FCC, and CenturyLink v. FCC.)

Overall funding for the FCC would be dramatically cut under the Republicans’ austere budget. Republicans are recommending a $25 million reduction for the FCC below FY 2015 levels and $73 million below the Obama Administration’s requested level. The bill contains $315 million for the FCC. While Appropriations bills, by tradition, generally eschew legislative language, the House appears to be making an exception this year by including non-expenditure-related language that would require the FCC to make proposed regulations publicly available for 21 days before a vote, and prohibit the agency from regulating rates for either wireline or wireless Internet service.

The House Appropriations Subcommittee on Financial Services approved the funding measure on Thursday. The full Committee has not announced a timetable for consideration although the committee is seeking to pass all 12 appropriations bills in the coming weeks. The Senate Appropriations Committee has not released its funding measure.

Efforts to bring any appropriations measures to the floor in the Senate will be a challenge for the Republican Majority. Numerous sources and media reports indicate that Senate Democrats intend to filibuster all Appropriations bills on the Floor unless funding levels are increased. The White House has also indicated the President will likely veto funding bills. It is expected that FY 2016 appropriations bills will not be finalized for several months and may drag on through the Fall, well past the October 1 start of the Fiscal Year.

ALA has worked tirelessly to support the FCC net neutrality order and is greatly concerned that House Republicans are using back-door methods to thwart the implementation of open Internet protections. While this initial flurry of Appropriations activity is a strong statement by Republicans against the FCC Order, ALA will continue to urge this language not be included in any final funding bill in the coming months.

The post Congressional Republicans open new attack on FCC net neutrality rules appeared first on District Dispatch.

Open Library Data Additions: Amazon Crawl: part im

Fri, 2015-06-12 10:02

Part im of Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

Cynthia Ng: Accessible Format Production Part 3: Making Accessible PDF

Fri, 2015-06-12 06:19

Once again, there are numerous programs that can edit PDFs. Unfortunately, I have yet to find a free (or very cheap) one that allows you to edit even the basic pieces I talk about below. Would love to hear if anyone has recommendations. Anyway, that means I will discuss what needs to be done but …

DPLA: Apply to be a new DPLA Service Hub!

Thu, 2015-06-11 16:43


The Digital Public Library of America seeks applicants to serve as Service Hubs in our growing national network.  The application and corresponding instructions are available from the link below.

Service Hub Application

A Service Hub represents a community of institutions to DPLA, provides their partners’ aggregated metadata to DPLA through a single source, and offers tiered services to create a local community of practice.  Service Hubs are geographically based, and DPLA seeks to grow the map of covered states and/or regions by inviting applications through this and future calls for applicants.

The deadline for submission is Monday, July 20, 2015.  Applicants will be notified of their selection on or before August 28, 2015, and will be expected to begin working with DPLA staff immediately following the August announcement to formalize the partnership and begin the process of harvesting metadata.

To answer general inquiries about the application process and to provide information about the hubs network and activities, an open information and Q&A session will be held with key DPLA staff members on June 22 at 4pm eastern.  If you would like to join this webinar, please register ahead of time using the link at

After registering, you will receive a confirmation email containing information about joining the webinar.

LibraryThing (Thingology): LibraryThing Recommends in BiblioCommons

Thu, 2015-06-11 16:34

Does your library use BiblioCommons as its catalog? LibraryThing and BiblioCommons now work together to give you high-quality reading recommendations in your BiblioCommons catalog.

You can see some examples here. Look for “LibraryThing Recommends” on the right side.

Quick facts:

  • As with all LibraryThing for Libraries products, LibraryThing Recommends only recommends other books within a library’s catalog.
  • LibraryThing Recommends stretches across media, providing recommendations not just for print titles, but also for ebooks, audiobooks, and other media.
  • LibraryThing Recommends shows up to two titles up front, with up to three displayed under “Show more.”
  • Recommendations come from LibraryThing’s recommendations system, which draws on hundreds of millions of data points in readership patterns, tags, series, popularity, and other data.

Not using BiblioCommons? Well, you can get LibraryThing recommendations, and much more, integrated in almost every catalog (OPAC and ILS) on earth, with all the same basic functionality, like recommending only books in your catalog, as well as other LibraryThing for Libraries features, like reviews, series and tags.

Check out some examples on different systems here.


BiblioCommons: email or visit See the full specifics here.
Other Systems: email or visit

District Dispatch: Senate leaders rush end-run on personal privacy

Thu, 2015-06-11 16:22

Apparently Senate Majority Leader Mitch McConnell (R-KY) and Senate Intelligence Committee Chairman Richard Burr (R-NC) learned nothing from the overwhelming outpouring of opposition to the worst of the USA PATRIOT Act earlier this month. They’re using the Senate Rules to try to ram through – as early as tomorrow – privacy-hostile legislation that’s received no public hearing or Senate floor debate as an amendment to a Department of Defense funding measure. The bill, innocuously dubbed the Cybersecurity Information Sharing Act (“CISA”), would expose enormous amounts of your personal data to federal, state, and even local law enforcement … without a warrant. Along with many coalition partners, ALA strongly opposed CISA in a letter to Senate leaders last March.

Photo by Yuri Samoilov

Sometimes, you just can’t explain anything better than a colleague, and Gabe Rottman of the ACLU’s Washington Office spoke up yesterday (with many links to detailed background information) about this issue and the immediate procedural threat in the Senate:

“So, what does the bill . . . do? It’s a surveillance bill, pure and simple. It says that any and all privacy laws, including laws requiring a warrant for electronic communications, and those that protect financial, health or even video rental records, do not apply when companies share ‘cybersecurity’ information, broadly defined, with the government.”

Please, take a minute to read his “Playing Politics With Cybersecurity and Privacy” now. Then, take just 60 more seconds to contact your Senators immediately with a simple message:

“VOTE NO on the ‘Burr Amendment’ to the defense bill – or on any such ‘information sharing’ bill without full public hearings and Senate debate.”

The post Senate leaders rush end-run on personal privacy appeared first on District Dispatch.

Eric Lease Morgan: EEBO-TCP Workset Browser

Thu, 2015-06-11 15:25

I have begun creating a “browser” against content from EEBO-TCP in the same way I have created a browser against worksets from the HathiTrust. The goal is to provide “distant reading” services against subsets of the Early English poetry and prose. You can see these fledgling efforts against a complete set of Richard Baxter’s works. Baxter was an English Puritan church leader, poet, and hymn-writer. [1, 2, 3]

EEBO is an acronym for Early English Books Online. It is intended to be a complete collection of English literature from 1475 through 1700. TCP is an acronym for Text Creation Partnership, a consortium of libraries dedicated to making EEBO freely available in the form of XML called TEI (Text Encoding Initiative). [4, 5]

The EEBO-TCP initiative is releasing its efforts in stages. The content of Stage I is available from a number of (rather hidden) venues. I found the content on a University of Michigan Box site to be the easiest to use, albeit not necessarily the most current. [6] Once the content is cached (in the fullest of TEI glory), it is possible to search and browse the collection. I created a local, terminal-only interface to the cache and was able to exploit authority lists, controlled vocabularies, and free text searching of metadata to create subsets of the cache. [7] The subsets are akin to HathiTrust “worksets”: items of particular interest to me.

Once a subset was identified, I was able to mirror (against myself) the necessary XML files and begin to do deeper analysis. For example, I am able to create a dictionary of all the words in the “workset” and tabulate their frequencies. Baxter used the word “god” more than any other, specifically, 65,230 times. [8] I am able to pull out sets of unique words, and I am able to count how many times Baxter used words from three sets of locally defined “lexicons” of colors, “big” names, and “great” ideas. Furthermore, I am able to chart and graph trends of the works, such as when they were written and how they cluster together in terms of word usage or lexicons. [9, 10]
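To make the "dictionary of all the words" step concrete, here is a minimal Python sketch. It is not the browser's source code. It assumes the TEI files use the standard TEI namespace (http://www.tei-c.org/ns/1.0) and keep their prose under a top-level text element; the tiny `SAMPLE_TEI` document is invented for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# A tiny stand-in for a TEI file; real EEBO-TCP files are far larger and richer.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>God is love, and God is light.</p>
    <p>Love the light.</p>
  </body></text>
</TEI>"""

# Assumed namespace mapping for the TEI elements.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def word_frequencies(tei_xml):
    """Count the words found in the <text> element of a TEI document."""
    root = ET.fromstring(tei_xml)
    text_element = root.find("tei:text", TEI_NS)
    if text_element is None:
        return Counter()
    # itertext() walks every descendant, so markup is dropped and only prose remains.
    prose = " ".join(text_element.itertext())
    return Counter(re.findall(r"[a-z]+", prose.lower()))

frequencies = word_frequencies(SAMPLE_TEI)
for word, count in frequencies.most_common(3):
    print(word, count)
```

Reading only the text element keeps header metadata (titles, editors, and so on) out of the counts. Summing the counters of every file in a subset would give a dictionary for the whole "workset".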

I was then able to repeat the process for other subsets, items about: lutes, astronomy, Unitarians, and of course, Shakespeare. [11, 12, 13, 14]

The EEBO-TCP Workset Browser is not as mature as my HathiTrust Workset Browser, but it is coming along. [15] Next steps include: calculating an integer denoting the number of pages in an item, implementing a Web-based search interface to a subset’s full text as well as metadata, putting the source code (written in Python and Bash) on GitHub. After that I need to: identify more robust ways to create subsets from the whole of EEBO, provide links to the raw TEI/XML as well as HTML versions of items, implement quite a number of cosmetic enhancements, and most importantly, support the means to compare & contrast items of interest in each subset. Wish me luck?

More fun with well-structured data, open access content, and the definition of librarianship.

  1. Richard Baxter (the person) –
  2. Richard Baxter (works) –
  3. Richard Baxter (analysis of works) –
  4. EEBO-TCP –
  5. TEI –
  6. University of Michigan Box site –
  7. local cache of EEBO-TCP –
  8. dictionary of all Baxter words –
  9. histogram of dates –
  10. clusters of “great” ideas –
  11. lute –
  12. astronomy –
  13. Unitarians –
  14. Shakespeare –
  15. HathiTrust Workset Browser –

LITA: Congratulations to the LITA UX Contest Winners

Thu, 2015-06-11 14:55

The results are in for LITA’s Contest: Great Library UX Ideas Under $100. Congratulations to winner Conny Liegl, Designer for Web, Graphics and UX at the Robert E. Kennedy Library at California Polytechnic State University for her submission entitled Guerilla Sketch-A-Thon. The LITA President’s Program Planning Team who ran the contest and reviewed the submissions loved how creative the project was and how it engaged users. From the sketches that accompanied the submission, and from looking at the before and after screenshots of the library website, it was clear the designers incorporated ideas from the student sketches.

Conny won a personal one-year, online subscription to Library Technology Reports, generously donated by ALA Tech Source. She gets to have lunch with LITA President Rachel Vacek and the LITA President’s Program speaker and UX expert Lou Rosenfeld at ALA in San Francisco. She gets a free book generously donated by Rosenfeld Media. And finally, her winning submission will be published in Weave, an open-access, peer-reviewed journal for Library User Experience professionals published by Michigan Publishing.

There were so many entries submitted for the contest, picking a single winner was difficult. The Planning Team unanimously agreed to recognize first and second runner-up entries.

The First Runner-Up was the team at the University of Arizona Libraries who submitted their project Wayfinding in the Library. The team included people from multiple departments in their library including the User Experience department, Access & Information Services, and Library Communications. Congrats to Rebecca Blakiston, User Experience Librarian, Shoshana Mayden, Content Strategist, Nattawan Wood, Administrative Associate, Aungelique Rodriguez, Library Communications Student Assistant, and Beau Smith, Usability Testing Student Assistant. Each team member gets a book from Rosenfeld Media.

The Second Runner-Up was the team from Purdue University Libraries who submitted their project Applying Hierarchical Task Analysis Method to Discovery Tool Evaluation. The team consisted of Tao Zhang, Digital User Experiences Specialist, and Marlen Promann, Graduate Research Assistant. Each team member gets a book from Rosenfeld Media.

In the coming months, interviews with the winners from each institution will be posted to the blog.

Brown University Library Digital Technologies Projects: Search relevancy tests

Thu, 2015-06-11 13:37

We are creating a set of relevancy tests for the library’s Blacklight implementation.  These tests use predetermined phrases to search Solr, Blacklight’s backend, mimicking the results a user would retrieve.  This provides useful data that can be systematically analyzed.  We use the results of these tests to verify that users will get the results we, as application managers and librarians, expect.  It also will help us protect against regressions, or new, unexpected problems, when we make changes over time to Solr indexing schema or term weighting.

This work is heavily influenced by colleagues at Stanford who have both written about their (much more thorough at this point) relevancy tests and developed a Ruby Gem to assist others with doing similar work.

We are still working to identify common and troublesome searches but have already seen benefits of this approach and used it to identify (and resolve) deficiencies in title weighting and searching by common identifiers, among other issues.  Our test code and test searches are available on GitHub for others to use as an example or to fork and apply to their own project.
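For readers new to the idea, the shape of such a relevancy test can be sketched in a few lines of Python. This is an illustration only: it is neither our test code nor the Stanford gem. The `search` callable, the record ids, and the `FAKE_INDEX` stand-in are all hypothetical; in practice `search` would wrap a request to Solr.

```python
def check_relevancy(search, expectations, top_n=5):
    """Run each query through `search` and report expected ids missing from the top results.

    `search` is any callable that takes a query string and returns a ranked list
    of document ids.
    """
    failures = {}
    for query, expected_ids in expectations.items():
        top = search(query)[:top_n]
        missing = [doc_id for doc_id in expected_ids if doc_id not in top]
        if missing:
            failures[query] = missing
    return failures

# Hypothetical stand-in for the search backend, so the sketch runs without Solr.
FAKE_INDEX = {
    "moby dick": ["b100", "b205", "b317"],
    "0140283331": ["b999", "b412"],
}

def fake_search(query):
    return FAKE_INDEX.get(query, [])

# Each predetermined search is paired with the record ids we expect near the top.
expectations = {
    "moby dick": ["b100"],           # known title search
    "0140283331": ["b412"],          # identifier search
    "origin of species": ["b777"],   # record the fake index does not return
}
failures = check_relevancy(fake_search, expectations, top_n=1)
print(failures)
```

Running the same expectations after every change to indexing or term weighting is what turns a list of searches into a guard against regressions.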

Brown library staff who have examples of searches not producing expected results, please pass them on to Jeanette Norris or Ted Lawless.

— Jeanette Norris and Ted Lawless

Hydra Project: Booking for Hydra Connect 2015 open!

Thu, 2015-06-11 12:04

We are delighted to announce that booking for Hydra Connect 2015 is now open.  This year’s Connect takes place in Minneapolis, MN, from Monday September 21st to Thursday September 24th.  Details at   It is intended to publish a draft program in the first week of July.

Hydra Connect meetings are intended to be the major annual event in the Hydra year.  Hydra advertises them with the slogan “as a Hydra Partner or user, if you can only make it to one Hydra meeting this academic year, this is the one to attend!”

The three-day meetings are preceded by an optional day of workshops.  The meeting proper is a mix of plenary sessions, lightning talks, show and tell sessions, and unconference breakouts.  The evenings are given over to a mix of conference-arranged activities and opportunities for private meetings over dinner and/or drinks!  The meeting program is aimed at existing users, managers and developers and at new folks who may be just “kicking the tires” on Hydra and who want to know more.

We hope to see you there!


Peter Sefton: Ozmeka: extending the Omeka repository to make linked-data research data collections for (any and) all research disciplines

Thu, 2015-06-11 11:10

Ozmeka: extending the Omeka repository to make linked-data research data collections for (any and) all research disciplines by Peter Sefton, Sharyn Wise, Katrina Trewin is licensed under a Creative Commons Attribution 4.0 International License.

[Update 2015-06-11, fixing typos]

Ozmeka: extending the Omeka repository to make linked-data research data collections for (any and) all research disciplines

Peter Sefton, University of Technology, Sydney; Sharyn Wise, University of Technology, Sydney; Peter Bugeia, Intersect Australia Ltd, Sydney; Katrina Trewin, University of Western Sydney

There have been some adjustments to the authorship on this presentation. Peter Bugeia was on the abstract but didn’t end up contributing to the presentation, whereas Katrina Trewin withdrew her name from the proposal for a while, but then produced the Farms to Freeways collection and decided to come back into the fold. The notes here are written in the first person, to be delivered in this instance by Peter, but they come from all of the authors.

Abstract as submitted

The Ozmeka project is an Australian open source project to extend the Omeka repository system. Our aim is to support Open Scholarship, Open Science, and Cultural Heritage via repository software that can manage a wide range of Research (and Open) Data, both Open and access-restricted, providing rich repository services for the gathering, curation and publishing of diverse data sets. The Ozmeka project places a great deal of importance on integrating with external systems, to ensure that research data is linked to its context, and high quality identifiers are used for as much metadata as possible. This will include links to the ‘traditional’ staples of the Open Repositories conference series, publications repositories, and to the growing number of institutional and discipline research data repositories.

In this presentation we will take a critical look at how the Omeka system, extended with Ozmeka plugins and themes, can be used to manage (a) a large cross-disciplinary archive of research data about water resources, (b) an ethno-historiography built around a published book and (c) large research data sets in a scientific institute, and talk about how this work paves the way for eResearch and repository support teams to supply similar services to researchers in a wide variety of fields. This work is intended to reduce the cost and complexity of creating new research data repository systems.

Slightly different scope now

I will be talking about Dharmae, the database of water-resources-themed research data. The project to put the book data into Omeka took a different turn, and the scientific data repository is still being developed.

How does this presentation fit in to the conference?

Which Conference Themes are we touching on?

  • Supporting Open Scholarship, Open Science, and Cultural Heritage

  • Managing Research (and Open) Data

  • Building the Perfect Repository

  • Integrating with External Systems

  • Re-using Repository Content

Things we want to cover:
  • A bit about the research data projects we’ve worked on.

  • How we’ve implemented Linked Data for metadata (stamping out strings!)

  • What about this Omeka thing?

(The picture is one I took of the conference hotel)

What’s Omeka?

We like to call Omeka the “WordPress of repositories”

It’s a PHP application which is easy to install and get up and running and yes – it is a ‘repository’, it lets you upload digital objects, describe them with Dublin Core Metadata, and no, it’s not perfect.

The Perfect Repository?

So let’s talk about this phrase “the perfect repository”. I have been following Jason Scott at the Internet Archive (who would make a great keynote speaker for this conference, by the way) and his work on rescuing and making available cultural heritage such as computer-related ephemera and programs for obsolete computing and gaming platforms. He uses the phrase “Perfect is the enemy of done” and talks about how making some tradeoffs and compromises and then just doing it means that stuff, you know, actually gets done that otherwise wouldn’t.

No, we’re not calling Omeka “third best”, but one of the points of this talk is that instead of waiting for or trying to build the ‘perfect’ research data repository, you can use Omeka as a low-barrier-to-entry, cheap way to build some kinds of working-data-repositories or data-publishing websites. I have talked to quite a few people who say they have looked at Omeka and decided that it is too simple, too limited for whatever project they were doing. Indeed, it does have some limitations; the two big ones are that it does not handle access control at all and it has no approval workflow, at least not in this version.

The quote on the slide is via the Wikipedia page Perfect is the Enemy of Good

The Portland Common Data Model

Omeka more-or-less implements a subset of the Portland Common Data Model, which I was introduced to yesterday in the Fedora workshop, although as I just mentioned it is not strong on access control, having only a published/unpublished flag on items.

Why Omeka? We’ll come back to this – but the ability to do Linked Data was one of the main attractions of Omeka. We had to add some code to make the relations appear like this, and to make them easier to apply than in the ‘trunk’ version of Omeka 2.x, but that development was not hard or expensive compared to what it might have cost on top of other repository systems with more complex application stacks.

(Note – if you look at the current version of Dharmae, the item relations will appear a little differently, as not all the Ozmeka enhanced code has been rolled out).

Australian National Data Service (ANDS) funded project … to kick-start a major open data collection

I’m going to give you a quick backgrounder on our project by way of introduction: ANDS approached us with a funding opportunity to create an open data collection. Many of you will be familiar with the frustrations of funding rules: our constraint was that we were not allowed to develop software, although we could adapt it.

The UTS team put the word out for publishable research data collections but got little response. Then, thanks to the library eScholarship team, Sharyn met Professor Heather Goodall and Jodi Frawley, who had data from a large Oral History project on the impacts of water management on stakeholders in the Murray Darling Basin – called Talking Fish.

And they had had the amazing foresight – the foresight of the historian – to obtain informed consent to publicly archive the interview data.

Field science in MDB (from Dharmae)

In the image above, MDB means the Murray Darling Basin, a big, long river system with hardly any water in it.

First up I’ll talk about Dharmae. It was conceived as a multi-disciplinary data hub themed around water-related research, with the “ecocultures” concept intended to flag that we welcome scientific data contributors (ecological or otherwise) as well as cultural data, because they are equally crucial if we want research to have an impact on the world.

This position is also supported in the literature of the intergovernmental science policy community and environmental sustainability and resilience research.

One paper expressed it this way – for research to have a transformative impact, it’s not simply more knowledge that we need, but different types of knowledge.

The literature emphasizes the need for improved connectivity between knowledge systems: those applied to researching the natural world, such as science, and those that investigate socio-cultural practices such as social sciences, history and particularly also indigenous knowledge.

But because these different knowledge systems each come with their own practices and terminologies, we have an interesting information science problem:

How to support data deposit and discovery by users from all disciplines?

Linked data & disambiguation

Essentially by using linked data. We extended the open source repository Omeka by allowing all named entities (like places, people, species) to be linked to an authoritative source of truth.

Let’s take location – it is one of the obvious correspondences between scientific and cultural data.

That still doesn’t mean it’s an easy thing to link on. Place names are rarely unique, as we see Kendell noticing above.

But by using authoritative sources, like Geonames, we can disambiguate place names, and better still we can derive their coordinates.
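As a rough illustration of the disambiguation step, here is a minimal Python sketch. It is not the Ozmeka code: the `PlaceRecord` type, the `example.org` URIs, the place entries and their coordinates are all made up for illustration, and a real deployment would look candidates up in an authority such as Geonames rather than in an in-memory list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PlaceRecord:
    """One gazetteer entry. All values below are illustrative placeholders."""
    authority_uri: str  # hypothetical URI, not a real gazetteer identifier
    name: str
    region: str
    lat: float
    lng: float


# Made-up sample data standing in for an authoritative gazetteer.
CANDIDATES = [
    PlaceRecord("https://example.org/place/1", "Sandy Creek", "New South Wales", -34.0, 142.0),
    PlaceRecord("https://example.org/place/2", "Sandy Creek", "Victoria", -36.5, 146.9),
    PlaceRecord("https://example.org/place/3", "Stony Crossing", "New South Wales", -35.1, 143.6),
]


def candidates_for(name: str, gazetteer: List[PlaceRecord]) -> List[PlaceRecord]:
    """Return every entry whose name matches, ignoring case and outer whitespace."""
    wanted = name.strip().lower()
    return [place for place in gazetteer if place.name.lower() == wanted]


def disambiguate(name: str, region: str, gazetteer: List[PlaceRecord]) -> Optional[PlaceRecord]:
    """Pick the single entry matching both name and region, or None if ambiguous."""
    wanted_region = region.strip().lower()
    matches = [p for p in candidates_for(name, gazetteer) if p.region.lower() == wanted_region]
    return matches[0] if len(matches) == 1 else None


place = disambiguate("Sandy Creek", "Victoria", CANDIDATES)
if place is not None:
    # Once the name is pinned to one authority record, coordinates come for free.
    print(place.authority_uri, place.lat, place.lng)
```

The point of returning `None` when the match is not unique is that an ambiguous name should go back to a person to resolve rather than be linked to the wrong place.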

Now we want users of Dharmae who are interested in finding data by location to access it in the way that makes sense to them – and that may not be name.

Lower Darling/Anabranch

In Dharmae readers can search by place name or they can use a map.

Here is one of 12 study regions from the Talking Fish data, showing the Lower Darling and Anabranch above Murray Sunset National Park.

We georeferenced these regions using a Geonode map server, but we have superimposed the researchers’ hand-drawn map as a layer on top to preserve the sense of human-scale interaction.

You can click through from here to read or listen to the oral histories completed in this region, look at photos or investigate the species identified by participants.

You can also search by Indigenous language Community if you prefer.

How else could this be useful?

Lower Darling/Anabranch:

It just so happens that we also have a satellite remote sensing dataset that corresponds reasonably well to this region above the national park.

It shows the Normalized Difference Vegetation Index for the region or the vegetation change over the decade 1996-2006.

Relative increase in vegetation shows as green and relative decrease as pink.
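For readers who have not met the index before, NDVI is computed per pixel from near-infrared and red reflectance as (NIR − Red) / (NIR + Red). The sketch below shows that formula and one plausible way a change map could be coloured. How the actual dataset was processed is not described here, so the differencing step, the `tolerance` parameter and the function names are assumptions.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index for one pixel.

    nir and red are reflectance values; the result lies between -1 and 1.
    """
    total = nir + red
    if total == 0:
        return 0.0  # avoid division by zero on empty pixels
    return (nir - red) / total


def change_colour(ndvi_start: float, ndvi_end: float, tolerance: float = 0.0) -> str:
    """Classify vegetation change between two dates (assumed colouring scheme)."""
    delta = ndvi_end - ndvi_start
    if delta > tolerance:
        return "green"   # relative increase in vegetation
    if delta < -tolerance:
        return "pink"    # relative decrease in vegetation
    return "neutral"


# Hypothetical reflectance values for a single pixel at the two ends of the decade.
start = ndvi(nir=0.30, red=0.20)
end = ndvi(nir=0.50, red=0.10)
print(change_colour(start, end))
```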

Could the interviews with participants from that region provide any clues as to why?

I can’t tell you that, but the point is that the more we enrich and link data, the more possible hypotheses we can generate.

The Graph

Here’s the graph of our solution: We created local records, so that the Dharmae hub could maintain its own set of ‘world-views’ while still interfacing with the semantic web knowledge graph.

This design pattern is something we want to explore more: having a local record for an entity or concept, with links to external authorities. So, for example we might use a DBpedia URI for a person, and quote a particular ‘approved’ version of the Wikipedia page about them so there is a local, stable proxy for an external URI, but the local record is still part of the global graph. With the species data, this will allow researchers to explore the way the participants in Talking Fish talked about fish and compare this to what the Atlas of Living Australia says about nomenclature and distribution.

From the Journey to Horseshoe Bend website at the University of Western Sydney:
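A local-record-with-links pattern like this could be sketched as follows. This is only an illustration of the idea, not how Omeka or Ozmeka stores it: the `LocalRecord` class, its field names and the `example.org` URIs are hypothetical, and `owl:sameAs` is one possible choice of linking property among several.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LocalRecord:
    """A locally held, stable description of an entity, linked to outside authorities."""
    local_uri: str
    label: str
    description: str  # the locally 'approved' text quoted from an external source
    same_as: List[str] = field(default_factory=list)  # external authority URIs

    def to_jsonld(self) -> Dict[str, object]:
        """Serialise as a small JSON-LD document so the record joins the wider graph."""
        return {
            "@context": {
                "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
                "owl": "http://www.w3.org/2002/07/owl#",
                "dcterms": "http://purl.org/dc/terms/",
            },
            "@id": self.local_uri,
            "rdfs:label": self.label,
            "dcterms:description": self.description,
            "owl:sameAs": [{"@id": uri} for uri in self.same_as],
        }


# Hypothetical example: a person known to the hub, pointing at one external authority.
record = LocalRecord(
    local_uri="https://hub.example.org/people/42",
    label="Example Person",
    description="Short biography quoted from an approved revision of an external page.",
    same_as=["https://authority.example.org/resource/Example_Person"],
)
print(json.dumps(record.to_jsonld(), indent=2))
```

The local URI stays stable even if the external page changes, while the `owl:sameAs` links keep the record connected to the global graph.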

TGH Strehlow’s biographical memoir, Journey to Horseshoe Bend, is a vivid ethno-historiographic account of the Aboriginal, settler and Lutheran communities of Central Australia in the 1920’s. The ‘Journey to Horseshoe Bend’ project elaborates on Strehlow’s book in the form of an extensive digital hub – a database and website – that seeks to ‘visualise’ the key textual thematics of Arrernte* identity and sense of “place”, combined with a re-mapping of European and Aboriginal archival objects related to the book’s social and cultural histories.

Thus far the project has produced a valuable collection of unique historical and contemporary materials developed to encourage knowledge sharing and to initiate knowledge creation. By bringing together a wide variety of media – including photographs, letters, journals, Government files, audio recordings, moving images, newspaper, newsletters, interviews, manuscripts, an electronic version of the text and annotations – the researchers hope to ‘open out’ the histories of Central Australia’s Aboriginal, settler and missionary communities.

JTHB research work entailed creating annotations relating to sections of the book text. The existing book text, marked up with TEI, was converted to HTML and the annotations were anchored within the HTML. The plan was to create an Omeka plugin to display the text and co-display or footnote the annotations relating to each part of the text.


  • The existing annotations were incomplete and the research team wished to continue adding annotations and material. This meant that the HTML would need to be continuously edited (outside Omeka), giving rise to issues around workflow, researcher skills, and version control.
  • Cultural sensitivities were also a barrier to open publication (not an Omeka issue but a MODC one)

Katrina Trewin is a data librarian working at the University of Western Sydney. While the Journey to Horseshoe Bend project could not be completed using Omeka due to resource constraints, another project was. Using Omeka, Katrina was able to build a web site around an oral-history data set without needing any development. This work took place in parallel with the work on Dharmae at UTS, so it was not able to make use of some of the innovations introduced in that project, such as enhancements to the Item Relations plugin to allow rich interlinking between resources.

Katrina’s notes:

Material had been in care of researcher for 20+ years.

  • Audio interviews on cassette, photographs, transcripts (some electronic)
  • Digitised all the material
  • Created csv files for upload of item metadata into Omeka
  • Once collections of items were created, then used exhibit plugin to bring material relating to each interviewee together.
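The CSV step in these notes might look something like the sketch below. The column names, the sample rows and the `build_item_csv` helper are all hypothetical; in practice the columns are whatever you choose to map onto Dublin Core elements when you run the import in Omeka.

```python
import csv
import io
from typing import Dict, List

# Assumed column layout: a few Dublin Core style fields plus a file location.
FIELDS = ["Title", "Creator", "Date", "Format", "Description", "File"]

# Made-up sample items; the real collection metadata is not reproduced here.
SAMPLE_ITEMS = [
    {
        "Title": "Interview with Participant A",
        "Creator": "Example Interviewer",
        "Date": "1990",
        "Format": "audio/mpeg",
        "Description": "Digitised from cassette, side 1 and 2",
        "File": "https://files.example.org/interviews/a.mp3",
    },
    {
        "Title": "Photograph of Participant A",
        "Format": "image/jpeg",
        "File": "https://files.example.org/photos/a.jpg",
    },
]


def build_item_csv(items: List[Dict[str, str]]) -> str:
    """Return CSV text with one row per item; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for item in items:
        writer.writerow({name: item.get(name, "") for name in FIELDS})
    return buffer.getvalue()


print(build_item_csv(SAMPLE_ITEMS))
```

Generating the file from a script rather than by hand makes it easy to keep one row per item and to regenerate the upload if the metadata is corrected later.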

Worked well because collection was complete – fine to edit metadata in Omeka but items themselves need to be stable (unlike the JTHB text)

Omeka allows item-level description, which is not possible via the institutional repository. This could have been done in the Omeka interface but was more efficient via CSV upload. The CSV files, bundled item files, readme and Omeka XML output were made available from the institutional repository record for longer-term availability, as a hosting arrangement is not in place.

Chambers, Deborah; Liston, Carol; Wieneke, Christine (2015): Interview material from Western Sydney women’s oral history project: ‘From farms to freeways: Women’s memories of Western Sydney’. University of Western Sydney.

Katrina and team have published all the data as a set of files, with a link to the website, in the institutional research data repository. This screenshot shows the data files available for download for re-use. My team at UTS are doing a similar thing with the Dharmae data.

At UTS we are constructing a growing ‘grid’ of research data services. This diagram is a sketch of how Omeka fits into this bigger picture, showing the Geonode mapping service which supplies map display services and can harvest maps from Omeka as well. In this architecture, all items ultimately end up in an archival repository with a catalogue description, as I showed earlier for the Farms to Freeways data.

Interested? Clone our Ozmeka GitHub repositories


Omeka is a very simple-seeming repository solution which is easy to dismiss for projects that demand the ‘perfect’ repository, but looking beyond its limitations it has some strengths that make it attractive for creating ‘micro repository services’ (Field & McSweeney 2014). Our work has made it easier to set up new research-data repositories that adhere to linked-data principles and create rich semantic web interfaces to data collections. This paves the way for a new generation of micro or workgroup-level research data repositories which link to and re-use a wide range of data sources.


Johnson, Ian. “Heurist Scholar,” 2014.

Kucsma, Jason, Kevin Reiss, and Angela Sidman. “Using Omeka to Build Digital Collections: The METRO Case Study.” D-Lib Magazine 16, no. 3/4 (March 2010). doi:10.1045/march2010-kucsma.

Nahl, Diane. “A Discourse Analysis Technique for Charting the Flow of Micro-Information Behavior.” Journal of Documentation 63, no. 3 (2007): 323–39. doi:

Palmer, Carole L., and Melissa H. Cragin. “Scholarship and Disciplinary Practices.” Annual Review of Information Science and Technology 42, no. 1 (2008): 163–212. doi:10.1002/aris.2008.1440420112.

Palmer, Carole L. “Thematic Research Collections.” Chapter 24 in Schreibman, Susan, Ray Siemens, and John Unsworth. Companion to Digital Humanities (Blackwell Companions to Literature and Culture). Oxford: Blackwell Publishing Professional, 2004.

Simon, Herbert. “Rational Choice and the Structure of the Environment.” Psychological Review 63, no. 2 (1956): 129–38.

Strehlow, Theodor George Henry. Journey to Horseshoe Bend. [Sydney]: Angus and Robertson, 1969.

  • Researchers:
    • Prof. Heather Goodall
    • Dr Michelle Voyer
    • Associate Professor Carol Liston
    • Dr Jodi Frawley
    • Dr Kevin Davies
  • eResearch: Sharyn Wise, Peter Sefton, Mike Lynch, Paul Nguyen, Mike Lake, Carmi Cronje, Thom McIntyre and Kevin Davies, Kim Heckenberg, Andrew Leahy, Lloyd Harischandra
  • Library: Duncan Loxton (eScholarship) & Kendell Powell (Aboriginal & Torres Strait Islander Data Archive Officer), Katrina Trewin, Michael Gonzalez
  • Thanks to: State Library of NSW Indigenous Unit, Atlas of Living Australia, Terrestrial Ecosystems Research Network and our funder, ANDS.

I didn’t have this slide when I presented, and forgot to acknowledge the contribution of all of the above, and anyone who’s been left off by accident.

Peter Murray: Thursday Threads: Advertising and Privacy, Giving Away Linux, A View of the Future

Thu, 2015-06-11 10:49

In just a few weeks there will be a gathering of 25,000 librarians in the streets of San Francisco for the American Library Association annual meeting. The topics on my mind as the meeting draws closer? How patrons intersect with advertising and privacy when using our services. What one person can do to level the information access divide using free software. Where technology in our society is going to take us next. Heady topics for heady times.

On a personal note: funding for my current position at LYRASIS runs out at the end of June, so I am looking for my next challenge. Check out my resume/c.v. and please let me know of job opportunities in library technology, open source, and/or community engagement.

Feel free to send this to others you think might be interested in the topics. If you find these threads interesting and useful, you might want to add the Thursday Threads RSS Feed to your feed reader or subscribe to e-mail delivery using the form to the right. If you would like a more raw and immediate version of these types of stories, watch my Pinboard bookmarks (or subscribe to its feed in your feed reader). Items posted to Pinboard are also sent out as tweets; you can follow me on Twitter. Comments and tips, as always, are welcome.

Internet Users Don’t Care For Ads and Do Care About Privacy

In advertising, an old adage holds, half the money spent is wasted; the problem is that no one knows which half. This should be less of a problem in online advertising, since readers’ tastes and habits can be tracked, and ads tailored accordingly. But consumers are increasingly using software that blocks advertising on the websites they visit. If current trends continue, the saying in the industry may well become that half the ads aimed at consumers never reach their screens. This puts at risk online publishing’s dominant business model, in which consumers get content and services free in return for granting advertisers access to their eyeballs.

Block shock: Internet users are increasingly blocking ads, including on their mobiles, The Economist, 6-Jun-2015

A new report into U.S. consumers’ attitude to the collection of personal data has highlighted the disconnect between commercial claims that web users are happy to trade privacy in exchange for ‘benefits’ like discounts. On the contrary, it asserts that a large majority of web users are not at all happy, but rather feel powerless to stop their data being harvested and used by marketers.

The Online Privacy Lie Is Unraveling, by Natasha Lomas, TechCrunch, 6-Jun-2015

This week The Economist printed a story about how users are starting to use software in their desktop and mobile browsers to block advertisements, and what the reaction may be from websites that rely on advertising to fund their activities. I found it interesting that “younger consumers seem especially intolerant of intrusive ads”; as they get older, of course, more of the population will be using ad-blocking software. Reactions range from gentle prodding to support the website in other ways, to lawsuits against the makers of ad-blocking software, to mixing advertising with editorial content.

Also this week the news outlet TechCrunch reported on a study by the Annenberg School for Communication on how “a majority of Americans are resigned to giving up their data” when they “[believe] an undesirable outcome is inevitable and [feel] powerless to stop it.” This sort of thing is coming up in the NISO Patron Privacy working group discussions that have occurred over the past couple of weeks and will culminate in a day-and-a-half working meeting at ALA. It is also something that I have been blogging about recently.

Welcome to America: Here’s your Linux computer

So, the following Monday I delivered a lovely Core2Duo desktop computer system with Linux Mint 17.1 XFCE installed. This computer was recently surplussed from the public library where I work. Installed on the computer was:

  • LibreOffice, for writing and documenting
  • Klavaro, a touch-typing tutor
  • TuxPaint, a painting program for kids
  • Scratch, to learn computer programming
  • TeamViewer, so I can volunteer to remotely support this computer

In 10 years time, these kids and their mom may well remember that first Linux computer the family received. Tux was there, as I see it, waiting to welcome these youth to their new country. Without Linux, that surplussed computer might have gotten trashed. Now that computer will get two, four, or maybe even six more years use from students who really value what it has to offer them.

Welcome to America: Here’s your Linux computer, by Phil Shapiro, 5-June-2015

This is a heartwarming story of making something out of nearly nothing: a surplus computer, free software, and a little effort. This is a great example of how one person can make a significant difference for a needy family.

What Silicon Valley Can Learn From Seoul

“When I was in S.F., we called it the mobile capital of the world,” [Mike Kim] said. “But I was blown away because Korea is three or four years ahead.” Back home, Kim said, people celebrate when a public park gets Wi-Fi. But in Seoul, even subway straphangers can stream movies on their phones, deep beneath the ground. “When I go back to the U.S., it feels like the Dark Ages,” he said. “It’s just not there yet.”

What Silicon Valley Can Learn From Seoul, by Jenna Wortham, New York Times Magazine, 2-Jun-2015

What is moving the pace of technology faster than Silicon Valley? South Korea. Might that country’s citizens be divining the path that the rest of us will follow?


Eric Hellman: Protect Reader Privacy with Referrer Meta Tags

Thu, 2015-06-11 03:10
Back when the web was new, it was fun to watch a website monitor and see the hits come in. The IP address told you the location of the user, and if you turned on the referer header display, you could see what the user had been reading just before. There was a group of scientists in Poland who'd be on my site regularly; I reported the latest news on nitride semiconductors, and my site was free. Every day around the same time, one of the Poles would check my site, and I could tell he had a bunch of sites he'd look at in order. My site came right after a Russian web site devoted to photographs of unclothed women.

The original idea behind the HTTP referer header (yes, that's how the header is spelled) was that webmasters like me needed it to help other webmasters fix hyperlinks. Or at least that was the rationalization. The real reason for sending the referer was to feed webmaster narcissism. We wanted to know who was linking to our site, because those links were our pats on the back. They told us about other sites that liked us. That was fun. (Still true today!)

The fact that my nitride semiconductor website ranked up there with naked Russian women amused me; reader privacy issues didn't bother me because the Polish scientist's habits were safe with me.

Twenty years later, the referer header seems like a complete privacy disaster. Modern web sites use resources from all over the web, and a referer header, including the complete URL of the referring web page, is sent with every request for those resources. The referer header can send your complete web browsing log to websites that you didn't know existed.

Privacy leakage via the referer header plagues even websites that ostensibly believe in protecting user privacy, such as those produced by or serving libraries. For example, a request to the WorldCat page for What you can expect when you're expecting results in the transmission of referer headers containing the user's request to the following hosts:
  • (with tracking cookies)
  • (with tracking cookies)
None of the resources requested from these third parties actually need to know what page the user is viewing, but WorldCat causes that information to be sent anyway. In principle, this could allow advertising networks to begin marketing diapers to carefully targeted WorldCat users. (I've written about AddThis and how they sell data about you to advertising networks.)

It turns out there's an easy way to plug this privacy leak in HTML5. It's called the referrer meta tag. (Yes, that's also spelled correctly.)

The referrer meta tag is put in the head section of an HTML5 web page. It allows the web page to control the referer headers sent by the user's browser. It looks like this:

<meta name="referrer" content="origin" />
If this one line were used on WorldCat, only the fact that the user is looking at a WorldCat page would be sent to Google, AddThis, and BibTip. This is reasonable: library patrons typically don't expect their visits to a library to be private; they do expect that what they read there should be private.

Because use of third party resources is often necessary, most library websites leak lots of privacy in referer headers. The meta referrer policy is a simple way to stop it. You may well ask why this isn't already standard practice. I think it's mostly lack of awareness. Until very recently, I had no idea that this worked so well. That's because it's taken a long time for browser vendors to add support. Although Chrome and Safari have been supporting the referrer meta tag for more than two years, Firefox only added it in January of 2015. Internet Explorer will support it with the Windows 10 release this summer. Privacy will still leak for users with older browser software, but this problem will gradually go away.

There are 4 options for the meta referrer tag, in addition to the "origin" policy. The origin policy sends only the origin of the originating page (its scheme, host name, and port), not the full URL.

For the strictest privacy, use

<meta name="referrer" content="no-referrer" />

If you use this setting, other websites won't know you're linking to them, which can be a disadvantage in some situations. If the web page links to resources that still use the archaic "referer authentication", they'll break.

The prevailing default policy for most browsers is equivalent to

<meta name="referrer" content="no-referrer-when-downgrade" />

"downgrade" here refers to http links in https pages.

If you need the referer for your own website but don't want other sites to see it, you can use

<meta name="referrer" content="origin-when-cross-origin" />
Finally, if you want the user's browser to send the full referrer, no matter what, and experience the thrills of privacy brinksmanship, you can set

<meta name="referrer" content="unsafe-url" />
Widespread deployment of the referrer meta tag would be a big boost for reader privacy all over the web. It's easy to implement, has little downside, and is widely deployable. So let's get started!
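To make the five policies concrete, here is a small Python sketch of what a browser would put in the referer header under each one. It is a simplified model, not browser code: the `referer_for` function and the example URLs are made up for illustration, and it ignores details such as default-port normalisation.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def _origin(url: str) -> str:
    """Scheme, host and (if given) port of a URL, without path or query."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return f"{parts.scheme}://{host}"


def _full(url: str) -> str:
    """The URL with its fragment (and any user info) stripped."""
    parts = urlsplit(url)
    host = _origin(url).split("://", 1)[1]
    return urlunsplit((parts.scheme, host, parts.path or "/", parts.query, ""))


def referer_for(policy: str, page_url: str, target_url: str) -> Optional[str]:
    """Return the referer value sent when page_url requests target_url, or None."""
    if policy == "no-referrer":
        return None
    if policy == "origin":
        return _origin(page_url) + "/"
    if policy == "unsafe-url":
        return _full(page_url)
    if policy == "no-referrer-when-downgrade":
        downgrade = (urlsplit(page_url).scheme == "https"
                     and urlsplit(target_url).scheme != "https")
        return None if downgrade else _full(page_url)
    if policy == "origin-when-cross-origin":
        same_origin = _origin(page_url) == _origin(target_url)
        return _full(page_url) if same_origin else _origin(page_url) + "/"
    raise ValueError(f"unknown referrer policy: {policy}")


page = "https://catalog.example.org/title/12345?q=pregnancy#reviews"
tracker = "https://ads.example.com/pixel.gif"
for name in ("origin", "no-referrer", "no-referrer-when-downgrade",
             "origin-when-cross-origin", "unsafe-url"):
    print(name, "->", referer_for(name, page, tracker))
```

Running it shows the difference at a glance: under "origin" the third party learns only that the request came from the catalog site, while under the prevailing default it learns exactly which title was being viewed.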


Peter Murray: Can Google’s New “My Account” Page be a Model for Libraries?

Thu, 2015-06-11 00:30

One of the things discussed in the NISO patron privacy conference calls has been the need for transparency with patrons about what information is being gathered about them and what is done with it. The recent announcement by Google of a "My Account" page and a privacy question/answer site got me thinking about what such a system might look like for libraries. Google and libraries are different in many ways, but one similarity we share is that people use both to find information. (This is not the only use of Google and libraries, but it is a primary use.) Might we be able to learn something about how Google puts users in control of their activity data? Even though our motivations and ethics are different, I think we can.

What the Google "My Account" page gives us

Last week I got an e-mail from Google that invited me to visit the new My Account page for "controls to protect and secure my Google account."

Google’s “My Account” home page

I think the heart of the page is the "Security Checkup" tool and the "Privacy Checkup" tool. The "Privacy Checkup" link takes you through five steps:

The five areas that Google offers when you run the “Privacy Checkup”.

  1. Google+ settings (including what information is shared in your public profile)
  2. Phone numbers (whether people can find you by your real life phone numbers)
  3. YouTube settings (default privacy settings for videos you upload and playlists you create)
  4. Account history (what of your activity with Google services is saved)
  5. Ads settings (what demographic information Google knows about you for tailoring ads)

These are broad brushes of control; the settings available here are pretty global. For instance, if you want to see your search history and what links you followed from the search pages, you would need to go to a separate page. In the “Privacy Checkup” the only option that is offered is whether or not your search history is saved. Still, for someone who wants to go with an “everything off” or “everything on” approach, the Privacy Checkup is a good way to do that.

Sidebar: I would also urge you to go through the “Security Checkup”. There you can change your password and account recovery options, see what other services have access to your Google account data, and make changes to account security settings.

The more in-depth settings can be reached by going to the "Personal Information and Privacy" page. This is a really long page, and you can see the full page content separately.

First part of the My Account “Personal Information and Privacy” page. The full screen capture is also available.

There you can see individual searches and search results that you followed.

My Account “Search and App Activity” page

Same with activity on YouTube.

My Account ‘YouTube Activity’ page

Google clearly put some thought and engineering time into developing this. What would a library version of this look like?

Google's Privacy site

The second item in the Google announcement was its privacy site. There they cover these topics:

  • What data does Google collect?
  • What does Google do with the data it collects?
  • Does Google sell my personal information?
  • What tools do I have to control my Google experience?
  • How does Google keep my information safe?
  • What can I do to stay safe online?

Each has a brief answer that leads to more information and sometimes to an action page like updating your password to something more secure or setting search history preferences.

Does this apply to libraries?

It could. It is clearly easier for Google because they have control over all the web properties and can do a nice integration like what is on the My Account page. We will have a more difficult task because libraries use many service providers and there are no programming interfaces that libraries can use to pull together all the privacy settings onto one page. There isn't even consistency of vocabulary or setting labels that service providers could use to build such a page for making choices. Coming to an agreement on:

  1. how service providers should be transparent on what is collected, and
  2. how patrons can opt-in to data collection for their own benefit, see what data has been collected, and selectively delete and/or download that activity

…would be a significant step forward. Hopefully that is the level of detail that the NISO Patron Privacy framework can describe.


DuraSpace News: MOVING Content: Institutional Tools and Strategies for Fedora 3 to 4 Upgrations

Thu, 2015-06-11 00:00

Winchester, MA  The Fedora team has made tools that simplify content migration from Fedora 3 to Fedora 4 available to assist institutions in establishing production repositories. Using the Hydra-based Fedora-Migrate tool — which was built in advance of Penn State’s deadline to have Fedora 4 in production, before the generic Fedora Migration Utilities was released —  Penn State’s ScholarSphere moved all data from production instances of Fedora 3 to Fedora 4 in about 20 hours.