
Feed aggregator

David Rosenthal: Another Vint Cerf Column

planet code4lib - Wed, 2016-10-05 15:00
Vint Cerf has another column on the problem of digital preservation. He concludes:
These thoughts immediately raise the question of financial support for such work. In the past, there were patrons and the religious orders of the Catholic Church as well as the centers of Islamic science and learning that underwrote the cost of such preservation. It seems inescapable that our society will need to find its own formula for underwriting the cost of preserving knowledge in media that will have some permanence. That many of the digital objects to be preserved will require executable software for their rendering is also inescapable. Unless we face this challenge in a direct way, the truly impressive knowledge we have collectively produced in the past 100 years or so may simply evaporate with time.

Vint is right about the fundamental problem but wrong about how to solve it. He is right that the problem isn't not knowing how to make digital information persistent; it is not knowing how to pay to make digital information persistent. Yearning for quasi-immortal media makes the problem of paying for it worse, not better, because quasi-immortal media such as DNA are not merely more expensive, their higher cost is front-loaded. Copyability is inherent in on-line information; that's how you know it is on-line. Work with this grain of the medium, don't fight it.

LITA: Social Media For My Institution – a LITA web course

planet code4lib - Wed, 2016-10-05 14:59

Don’t miss out on this informative LITA web course starting soon.

Social Media For My Institution: from “mine” to “ours”

Instructor: Dr. Plamen Miltenoff
Wednesdays, 10/19/2016 – 11/9/2016
Blended format web course

Register Online, page arranged by session date (login required)

A course for librarians who want to explore the institutional application of social media, based on an established academic course at St. Cloud State University, “Social Media in Global Context”. This course will critically examine the institutional need for social media (SM) and juxtapose it with its private use; discuss the mechanics of choosing among current and future SM tools; present a theoretical introduction to the subculture of social media; and show how to align library SM policies with the goals and mission of the institution. There will be hands-on exercises on the creation and dissemination of textual and multimedia content and on patron engagement, along with brainstorming on strategies suitable for the institution regarding resources (human and technological), workload sharing, storytelling, and branding, and on related issues such as privacy and security.

This is a blended format web course:

The course will be delivered as 4 separate live webinar lectures, one per week on Wednesdays, October 19, 26, November 2, and 9 at 2pm Central. The webinars will also be recorded and distributed through the web course platform, Moodle, for asynchronous participation.

Details here and Registration here

Dr. Plamen Miltenoff is an information specialist and Professor at St. Cloud State University. His education includes several graduate degrees in history and Library and Information Science and in education. His professional interests encompass social Web development and design, gaming and gamification environments. For more information see

And don’t miss other upcoming LITA fall continuing education offerings:

Beyond Usage Statistics: How to use Google Analytics to Improve your Repository
Presenter: Hui Zhang
Tuesday, October 11, 2016
11:00 am – 12:30 pm Central Time
Register Online, page arranged by session date (login required)

Online Productivity Tools: Smart Shortcuts and Clever Tricks
Presenter: Jaclyn McKewan
Tuesday November 8, 2016
11:00 am – 12:30 pm Central Time
Register Online, page arranged by session date (login required)

Questions or Comments?

For questions or comments, contact LITA at (312) 280-4268 or Mark Beatty,

Islandora: Islandora CLAW MVP

planet code4lib - Wed, 2016-10-05 14:08

The Minimum Viable Product document that I have been working on with the CLAW committers over the past few weeks has been made public, and is available for review. It defines the scope of the project, what we think is required for a stable release, proposed content modeling, and the overall design of the various subsystems of the software. We will be using this as a starting point for detailed scoping before attempting future sprints.

Please feel free to review and make comments. All feedback is appreciated.

District Dispatch: Failing to finish funding bills, Congress presses pause

planet code4lib - Wed, 2016-10-05 13:32

Unable to reach agreement on 12 appropriations bills, Congress last week passed H.R. 5325, a “Continuing Resolution” often called a CR, that keeps the Federal government operating through December 9 when Congress will be forced to return for a lame-duck session to finish the appropriations bills. The President signed the CR late last week.

The CR funds nearly all Federal programs, such as LSTA, IAL, and the Library of Congress, at nearly the same levels set in last year’s Omnibus funding package. Many programs, however, see a slight decrease of 0.549%, due in part to “advance” funding increases for veterans’ health approved last year.

Earlier last week, Senate Republicans failed in their efforts to force through a Republican-written CR. Facing unified Democratic opposition and with the threat of a government shutdown – unappealing to all so close to the elections – Republicans and Democrats set aside funding disagreements for the time being.  The President signed the “clean” CR hours after the House passed the funding measure that had passed the Senate Tuesday.

Congress will be forced to address a series of contentious funding issues in a post-election lame-duck session. The most significant disagreements concerned funding relief for the Flint, MI water contamination and the Louisiana flooding disaster. Democrats were successful in blocking the Republicans’ CR, which did not include funding for Flint. The final CR did not include funding for either need, but agreements were reached last week to address both in the lame-duck.

ALA continues to work to ensure strong funding for library programs which fared well in both House and Senate Appropriations Committees. While overall funding for education was cut significantly, both committees recommended small increases for LSTA and the Grants to States Program as well as level funding for Innovative Approaches to Literacy.

The outlook for this year’s lame-duck session could be quite interesting for several reasons. The elections may result in a change of party control in the Senate and/or the White House. The lame-duck session will also include a number of departing Representatives and Senators who will be retiring or will have lost re-election, not to mention the President, who will also be departing in January. We will be watching, with many others, to see whether the lame-duck will be contentious or whether Members will just want to quickly and quietly wrap up business.

The post Failing to finish funding bills, Congress presses pause appeared first on District Dispatch.

Access Conference: Live and Streaming: Access 2016

planet code4lib - Wed, 2016-10-05 11:54

Can’t join us in Fredericton this year? Don’t worry! We’re live streaming the show! Join us each day, starting at 9:00 AM.

Keep watching this space for updates! We’ll add individual talks to our YouTube channel over the next couple of weeks.

Open Knowledge Foundation: Who Will Shape the Future of the Data Society?

planet code4lib - Wed, 2016-10-05 08:49

This piece was originally posted on the blog of the International Open Data Conference 2016, which takes place in Madrid, 6-7th October 2016.

The contemporary world is held together by a vast and overlapping fabric of information systems. These information systems do not only tell us things about the world around us. They also play a central role in organising many different aspects of our lives. They are not only instruments of knowledge, but also engines of change. But what kind of change will they bring?

Contemporary data infrastructures are the result of hundreds of years of work and thought. In charting the development of these infrastructures we can learn about the rise and fall not only of the different methods, technologies and standards implicated in the making of data, but also about the articulation of different kinds of social, political, economic and cultural worlds: different kinds of “data worlds”.

Beyond the rows and columns of data tables, the development of data infrastructures tells tales of the emergence of the world economy and global institutions; different ways of classifying populations; different ways of managing finances and evaluating performance; different programmes to reform and restructure public institutions; and how all kinds of issues and concerns are rendered into quantitative portraits in relation to which progress can be charted – from gender equality to child mortality, biodiversity to broadband access, unemployment to urban ecology.

The transnational network assembled in Madrid for the International Open Data Conference has the opportunity to play a significant role in shaping the future of these data worlds. Many of those present have made huge contributions towards an agenda of opening up datasets and developing capacities to use them. Thanks to these efforts there is now global momentum around open data amongst international organisations, national governments, local administrations and civil society groups – which will have an enduring impact on how data is made public.

Perhaps, around a decade after the first stirrings of interest in what we now know as “open data”, it is time to have a broader conversation around not only the opening up and use of datasets, but also the making of data infrastructures: of what issues are rendered into data and how, and the kinds of dynamics of collective life that these infrastructures give rise to. How might we increase public deliberation around the calibration and direction of these engines of change?

Anyone involved with the creation of official data will be well aware that this is not a trivial proposition. Not least because of the huge amount of effort and expense that can be incurred in everything from developing standards, commissioning IT systems, organising consultation processes and running the social, technical and administrative systems which can be required to create and maintain even the smallest and simplest of datasets. Reshaping data worlds can be slow and painstaking work. But unless we instate processes to ensure alignment between data infrastructures and the concerns of their various publics, we risk sustaining systems which are at best disconnected from and at worst damaging towards those whom they are intended to benefit.

What might such social shaping of data infrastructures look like? Luckily there is no shortage of recent examples – from civil society groups campaigning for changes in existing information systems (such as advocacy around the UK’s company register), to cases of citizen and civil society data leading to changes in official data collection practices, to the emergence of new tools and methods to work with, challenge and articulate alternatives to official data. Official data can also be augmented by “born digital” data derived from a variety of different platforms, sources and devices which can be creatively repurposed in the service of studying and securing progress around different issues.

While there is a great deal of experimentation with data infrastructures “in the wild”, how might institutions learn from these initiatives in order to make public data infrastructures more responsive to their publics? How can we open up new spaces for participation and deliberation around official information systems at the same time as building on the processes and standards which have developed over decades to ensure the quality, integrity and comparability of official data? How might participatory design methods be applied to involve different publics in the making of public data? How might official data be layered with other “born digital” data sources to develop a richer picture around issues that matter? How do we develop the social, technical and methodological capacities required to enable more people to take part not just in using datasets, but also reshaping data worlds?

Addressing these questions will be crucial to the development of a new phase of the open data movement – from the opening up of datasets to the opening up of data infrastructures. Public institutions may find they have not only new users, but new potential contributors and collaborators as the sites where public data is made begin to multiply and extend outside of the public sector – raising new issues and challenges related to the design, governance and political economics of public information systems.

The development of new institutional processes, policies and practices to increase democratic engagement around data infrastructures may be more time consuming than some of the comparatively simpler steps that institutions can take to open up their datasets. But further work in this area is vital to secure progress on a wide range of issues – from tackling tax base erosion to tracking progress towards commitments made at the recent Paris climate negotiations.

As a modest contribution to advancing research and practice around these issues, a new initiative called the Public Data Lab is forming to convene researchers, institutions and civil society groups with an interest in the making of data infrastructures, as well as the development of capacities that are required for more people to not only take part in the data society, but also to more meaningfully participate in shaping its future.

Stuart Yeates: How would we know when it was time to move from TEI/XML to TEI/JSON?

planet code4lib - Tue, 2016-10-04 21:20
This post inspired by TEI Next by Hugh Cayless.

How would we know when it was time to move from TEI/XML to TEI/JSON?

If we stand back and think about what it is we (the TEI community) need from the format:
  1. A common format for storing and communicating Texts and augmentations of Texts (Transcriptions, Manuscript Description, Critical Apparatus, Authority Control, etc, etc.).
  2. A body of documentation for shared use and understanding of that format.
  3. A method of validating Texts in the format as being in the format.
  4. A method of transforming Texts in the format for computation, display or migration.
  5. The ability to reuse the work of other communities so we don't have to build everything for ourselves (Unicode, IETF language tags, URIs, parsers, validators, outsourcing providers who are tooled up to at least have a conversation about what we're trying to do, etc)
[Everyone will have their slightly different priorities for a list like this, but I'm sure we can agree that a list of important functionality could be drawn up and expanded to a requirements list at a sufficiently granular level so we can assess different potential technologies against those items.]
If we really want to ponder whether TEI/JSON is the next step after TEI/XML we need to compare the two approaches against such a list of requirements. Personally I'm confident that TEI/XML will come out in front right now. Whether JavaScript has the potential to replace XSLT as the preferred method for really exciting interfaces to TEI/XML docs is a much more open question, in my mind.
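
To make requirements 3 and 4 concrete, here is a minimal sketch of what the existing stack already gives us, using off-the-shelf tooling (Python's lxml with a RELAX NG schema and an XSLT stylesheet). The file names are placeholders rather than part of any TEI distribution; the point is only that a TEI/JSON candidate would need equivalents for both steps before the comparison could come out in its favour:

# Sketch only: file names are placeholder assumptions.
from lxml import etree

# Requirement 3: validate a Text in the format as being in the format.
schema = etree.RelaxNG(etree.parse("tei_all.rng"))
doc = etree.parse("my_text.xml")
if not schema.validate(doc):
    for error in schema.error_log:
        print(error)

# Requirement 4: transform the Text for computation, display or migration.
transform = etree.XSLT(etree.parse("tei_to_html.xsl"))
print(str(transform(doc))[:200])  # first 200 characters of the rendered output

Both steps lean entirely on validators and transformers built by other communities (requirement 5); a JSON serialisation would have to identify or build equivalents, such as JSON Schema plus some agreed transformation idiom, before it could claim parity.
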
That's not to say that the criticisms of XML aren't true (they are) or valid (they are) or worth repeating (they are), but perfection is commonly the enemy of progress.

Terry Reese: MarcEdit Task Management–planned behavior changes

planet code4lib - Tue, 2016-10-04 21:08

Below are two messages from a conversation on the MarcEdit list around the Task Manager/Task Functionality in MarcEdit.  I’ve been discussing some changes to the way that this works – particularly related to how the networked task management works.  Since this is a widely used function of the application, I’m pulling these two messages out and making them available here.  If you see something that gives you pause, let me know.



Following up – I spent a bit of time working on this and as far as I can tell, everything now works as I’ve described below.  This definitely complicates the management of tasks a bit on my end – but after doing a bit of work – here are the results.  Please look over the notes here and below and let me know if you see any potential issues.  If I don’t hear anything, I’ll include this in my update over the weekend.

Simplifying Network Sharing:

When you setup network sharing, the tool now recognizes if your path is a networked path (or something else) and will automatically do a couple things if it’s a networked path:

Adds an import option to the Preferences (option only shows if your path is a network path)

If you click the copy option, it will create a .network folder in the local tasks folder and then place a copy of all your items into the network space and into the _tasks.txt file on the network.

On startup, MarcEdit automatically will update your .network task folder with the current data in the defined network folder.

When off-line, MarcEdit automatically will use the data in the .network folder (a local cached copy of your networked data)

When off-line and using a networked path, selecting the task manager will show the locally cached copy of your tasks.

When you have a networked folder, MarcEdit creates the local .network cache as read-only.  You’ll see this is enforced in the editor.

Changes to the _tasks.txt file – file paths are no longer embedded.  Just the name of the file.  The assumption is that all files will be found in the defined tasks directory.  And the program will determine if there is an old path or just a filename, and will ignore any old path information, extracting just the filename and using it with the defined task paths now managed by the program.

Within the TASKS themselves, MarcEdit no longer stores paths to task lists.  Like the _tasks.txt file, just the name of the task file is stored, with the assumption being that the task list will be in the defined task folder.  This means imports and exports can be done through the tool, or just by copying and pasting files into the managed task folder.

Finally – in Windows, networked drives can cause long delays when they are offline.  I’ve created a somewhat novel approach to checking this data – but it means that I’ve set a low timeout (~150 ms) – which means that if your network has a lot of latency (i.e., is not responsive), MarcEdit could think it’s offline.  I can make this value configurable if necessary, but the timeout here really has an impact on parts of the program because the tasks data is read into the UI of the Editor.  In my testing, the timeouts appear to be appropriate. 
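
The offline check itself isn't spelled out above, but its general shape is easy to sketch: probe the configured network path from a worker thread and give up after the short timeout so a dead share can't stall the editor. MarcEdit is a .NET application, so the Python below is only an illustration of the idea; the function names and the use of a thread pool are illustrative assumptions, not the actual implementation:

# Illustrative sketch only, not MarcEdit code: decide whether the configured
# network task folder is reachable without letting a dead share hang the UI.
import os
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def network_tasks_available(network_path, timeout=0.150):
    """Return True if the network task folder answers within `timeout` seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(os.path.isdir, network_path)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return False  # treat a slow share as offline and fall back to the cache
    finally:
        pool.shutdown(wait=False)  # don't let a hung probe block the caller

def task_folder(network_path, local_cache):
    """Use the network folder when reachable, otherwise the read-only .network cache."""
    return network_path if network_tasks_available(network_path) else local_cache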

These changes have just been implemented on the windows/linux side right now – I’ll move them to the mac version tonight. 

If you’d like to try these changes, let me know – I won’t release these till the weekend, but would be willing to build a test build for folks interested in testing how this works. 

Finally, I know how much the tasks function is used, so I’m giving time for feedback before going forward.  If you see something described that you think may be problematic, let me know.



From: Terry Reese
Subject: Proposed TASK changes…Feedback requested. [Re: Saving tasks lists on the network….]

Following up on this – I’m looking at a behavior change and want to make certain that this won’t cause anyone problems.  Let me explain how the tasks works currently (for access) and how I’ve reworked the code.  Please let me know if anyone is using a workflow where this might be problematic.


When a task is created or cloned, the task and its full path is saved to the _tasks.txt file (in your local preferences).  Example:

Main task with a whole lot of text is right here               C:\Users\reese.2179\AppData\Roaming\marcedit\macros\tasksfile-2016_03_18_052022564.txt         SHIFT+F1
AAR Library         C:\Users\reese.2179\AppData\Roaming\marcedit\macros\tasksfile-2016_03_18_052317664.txt        
TLA        C:\Users\reese.2179\AppData\Roaming\marcedit\macros\tasksfile-2016_04_19_113756870.txt        

This is a tab-delimited list – with the first value being the name, the second value the path to the task, and the third being a shortcut if assigned.  The task naming convention has changed a number of times through the years.  Currently, I use GUIDs to prevent any possibility of collision between users.  This is important when saving to the network.

When a user saves to the network, internally the tool just changes where it looks for the _tasks.txt file.  But data is still saved in that location using the full network path.  So, if users have different drive letters mapped, or someone uses a UNC path and someone else doesn’t, then you have problems.  Also, when setting up a network folder, if you have existing tasks, you have to import and export them into the folder.  Depending on how deep your task references are, that may require some manual intervention.

If you are off the network and you save to a network path, your tasks are simply unavailable unless you keep a copy in your local store (which most wouldn’t).


Here’s what I propose to do.  This will require changing the import and export processes (trivial), making a change to the preferences (trivial), and updating the run code for the macros (a little less trivial). 

First, the _tasks.txt file will have all filepaths removed.  Items will only be referenced by filenames.  Example:

Main task with a whole lot of text is right here    tasksfile-2016_03_18_052022564.txt               SHIFT+F1
AAR Library         tasksfile-2016_03_18_052317664.txt      
TLA        tasksfile-2016_04_19_113756870.txt      

The program already sets the taskfile path so that it can run tasks that are on the network; this change continues that process of normalization.  Now, no file paths will be part of the string, and access to task locations will be completely tied to the information set in the preferences. 
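
As an illustration of that normalization, a reader of the new-style _tasks.txt would split each tab-delimited line into name, file reference and optional shortcut, throw away any legacy path component, and resolve the bare filename against whichever task folder is currently configured. The sketch below is just a reading of that description in Python; MarcEdit itself is a .NET application and its actual code will differ:

# Sketch of the described normalization; not MarcEdit's actual code.
import os

def read_tasks(tasks_file, task_dir):
    """Yield (name, resolved_path, shortcut) tuples from a tab-delimited _tasks.txt."""
    with open(tasks_file, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            parts = line.rstrip("\n").split("\t")
            name = parts[0]
            task_ref = parts[1] if len(parts) > 1 else ""
            shortcut = parts[2] if len(parts) > 2 else ""
            # Ignore any legacy path information: keep only the filename and
            # resolve it against the currently configured task folder.
            filename = os.path.basename(task_ref.replace("\\", "/"))
            yield name, os.path.join(task_dir, filename), shortcut

With paths stripped at read time, importing or exporting a task list is just a matter of copying the task files and _tasks.txt between managed folders, which is where the portability gains below come from.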


  • This will make tasks more portable as I no longer have to fix paths.
  • Makes it easier to provide automatic import of tasks when first setting up a network location
  • Allows me, for the first time, to provide local caching of networked tasks so that if a user is configured to use a network location and is offline, their tasks will continue to be available.  At this point, I’m thinking caches would be updated when the user opens MarcEdit, or makes a change to a network task.  The assumption, for simplicity’s sake, is that if you are using a network drive, you cannot edit the local cached copies of networked tasks – and I’ll likely find a way to enforce that to avoid problems with caching and updates.

There are also some trade-offs:
  • If you use a networked task, you won’t be able to edit the offline cache when you are offline.  You’ll still have access to your tasks (which isn’t true today), but you won’t be allowed to change them.
  • This isn’t possible with the task manager, but in theory, someone could code tasks to live in locations other than the places MarcEdit manages.  This would go away.  All tasks would need to be managed within the defined network path, or the local tasks folder.
  • Using GUIDs for filenames, it shouldn’t happen – but there is always the theoretical chance of duplicate files being created.  I’ll need to ensure I keep checks to make sure that doesn’t happen, which complicates management slightly.

Second – Tasks themselves….

Presently, when a task references a task list, it uses the full path.  Again, the path will be reduced to just the filename.  The same pros and cons apply here as above.

For users using tasks, you honestly shouldn’t see a difference.  The first time you’d run the program, MarcEdit would likely modify the tasks that you have (network or otherwise) and things would just go as is.  If you are creating a new networked folder, you’d see a new option in the locations preference that would allow you to automatically copy your current tasks to the network when setting that up.  And, if you are offline and a networked user, you’ll find that you now have access to your networked tasks.  Though, this part represents one more behavior change – presently, if you have networked tasks, you can take yourself offline and create and run tasks local to you.  In the above format, that goes away since MarcEdit will be caching your network tasks locally in a .network location.  If the current functionality is still desired (i.e., folks have workflows where they are connected to the network for shared tasks, but disconnect to run tasks local only to them), I may be able to set up something so that the task runner checks both the .network and local task directories.  My preference would be not to, but I understand that the current behavior has been around for years now, and I really would like to minimize the impact of making these changes.


David Rosenthal: RU18?

planet code4lib - Tue, 2016-10-04 20:00
LOCKSS is eighteen years old! I told the story of its birth three years ago.

There's a list of the publications in that time, and talks in the last decade, on the LOCKSS web site.

Thanks again to the NSF, Sun Microsystems, and the Andrew W. Mellon Foundation for the funding that allowed us to develop the system, and to the steadfast support of the libraries of the LOCKSS Alliance, and the libraries and publishers of the CLOCKSS Archive that has sustained it in production.

Brown University Library Digital Technologies Projects: Researchers@Brown Ranks #1 in SEO Analysis

planet code4lib - Tue, 2016-10-04 17:27

At the 2016 VIVO national conference in August 2016, Anirvan Chatterjee from the University of California, San Francisco gave a presentation on Search Engine Optimization (SEO) — strategies for increasing a site’s ranking in search results.  He analyzed 90 Research Networking Systems (RNS) to determine the proportion of faculty profile pages appearing among the top 3 search results on Google.  His analysis ranked Researchers@Brown #1 out of the 90 sites tested.

Chatterjee’s talk was entitled “The SEO State of the Union 2016: 5 Data-Driven Steps to Make your Research Networking Sites Discoverable by Real Users, Based on Real Google Results”

The report of the research,  “RNS SEO 2016: How 90 research networking sites perform on Google” is available here:

David Rosenthal: Panel on Software Preservation at iPRES

planet code4lib - Tue, 2016-10-04 14:10
I was one of five members of a panel on Software Preservation at iPRES 2016, moderated by Maureen Pennock. We each had three minutes to answer the question "what have you contributed towards software preservation in the past year?" Follow me below the fold for my answer.

So, what have I contributed towards software preservation in the past year? In my case, the answer won't take much of my three minutes. I published a 37-page report, funded by the Andrew W. Mellon Foundation, entitled Emulation and Virtualization as Preservation Strategies, wrote 15 blog posts on emulation, and gave 4 talks on the topic.

Obviously, you need to read each and every word of them, but for now I will condense this mass of words into six key points for you:
  1. The barriers to delivering emulations of preserved software are no longer technical. Rhizome's Theresa Duncan CD-ROMs, the Internet Archive's Software Library, and Ilya Kreymer's oldweb.today are examples of emulations transparently embedded in the Web. Fabrice Bellard's v86 JavaScript emulator allows you not merely to run Linux or OpenBSD in your browser, but even to boot your own floppy, CD-ROM or disk image in it.
  2. The cost of preparing such emulations is still too high. Some progress has been made towards tools for automatically extracting the necessary technical metadata from CD-ROMs, but overall the state of ingest tooling is inadequate.
  3. Most ways in which emulations are embedded in Web pages are not themselves preservable. The Web page embeds not merely the software and hardware environment to be emulated, which should remain the same, but also the particular technology to be used to emulate it, which will change. That's not the only problem. One-size-fits-all emulation delivers some users a miserable experience. The appropriate emulation technology to use depends on the user's device, browser, latency, bandwidth, etc. What's needed is a standard for representing preserved system images and metadata, an emulation mime-type, and an analog of pdf.js, code that is downloaded to work out an appropriate emulation at dissemination time (a purely illustrative sketch of this idea follows this list).
  4. Emulation of software that connects to the Internet is a nightmare, for two reasons. First, it will almost certainly use some network services, which will likely not be there when needed. Second, as the important paper Familiarity Breeds Contempt shows, the software will contain numerous vulnerabilities. It will be compromised as soon as it connects.
  5. Except for open source environments, the legal framework in which software is preserved and run in emulation is highly problematic, being governed by both copyright, and the end user license agreement, of every component in the software stack from the BIOS through the operating system to the application. Because software is copyright, national libraries with copyright deposit should be systematically collecting it to enable future emulations. The only systematic collection I'm aware of is by NIST on behalf of the FBI for forensic purposes.
  6. Even if the cost of ingest could be greatly reduced, a sustainable business model for ingesting, preserving and disseminating software is nowhere to be seen.
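
By way of illustration of point 3, dissemination-time selection might look something like the sketch below: a preserved item carries a manifest describing the environment to be emulated, and code fetched at access time picks an emulation strategy based on what the user's device can handle. No such standard exists yet; every field name, value and selection rule here is invented purely for the sake of the example.

# Purely hypothetical sketch: the manifest fields and selection rules are
# invented for illustration; no such standard currently exists.
manifest = {
    "image": "preserved-system.img",       # preserved system image
    "platform": "x86-pc",                  # hardware environment to emulate
    "ram_mb": 16,
    "peripherals": ["cdrom", "soundblaster"],
}

def choose_emulation(manifest, client):
    """Pick an emulation strategy at dissemination time, in the spirit of pdf.js."""
    if manifest["platform"] != "x86-pc":
        return "remote-emulation"       # stream frames from a server-side emulator
    if client.get("wasm") and client.get("bandwidth_mbps", 0) >= 5:
        return "in-browser-emulation"   # ship the emulator and image to the browser
    return "remote-emulation"

print(choose_emulation(manifest, {"wasm": True, "bandwidth_mbps": 20}))
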
As usual, the full text of these remarks with links to the sources will go up on my blog shortly after this session.

I was also asked to prepare a response to one of the questions to be used to spark debate:
Economic sustainability: What evidence is required to determine commercial viability for software preservation services? Can cultural heritage institutions make a business case to rights holders that preserving software can co-exist with commercial viability?

I decided to re-formulate the two questions:
  • Can closed-source software vendors be persuaded to allow their old software to be run in emulation? The picture here is somewhat encouraging. Some major vendors are cooperating to some extent. For example, Microsoft's educational licenses allow their old software to be run in emulation. Microsoft has not objected to the Internet Archive's Windows 3.x Showcase. I've had encouraging conversations with senior people at other major vendors. But this is all about use of old software without payment. It is not seen as depriving the vendor of income.
  • Is there a business model to support emulation services? This picture is very discouraging. Someone is going to have to pay the cost of emulation. As Rhizome found when the Theresa Duncan CD-ROMs went viral, if the end user doesn't pay there can be budget crises. If the end user does pay, it's a significant barrier to use, and it starts looking like it is depriving the vendor of income. Some vendors might agree to cost-recovery charging. But typically there are multiple vendors involved. Consider emulating an old Autodesk-on-Windows environment. That is two vendors. Do they both agree to the principle of cost-recovery, and to the amount of cost-recovery?

    Update: On the fly, I pointed out the analogy between software preservation and e-journal preservation, the other area of preservation that handles expensive commercial content. Three approaches have emerged in e-journal preservation:
    • LOCKSS implements a model in which each purchaser preserves their own purchase. This does not look like it deprives the vendor of income.
    • CLOCKSS implements a model analogous to software escrow, in which software that is no longer available from the vendor is released from escrow. This does not look like it deprives the vendor of income.
    • Portico implements a subscription model, which could be seen as diverting income from the vendor to the archive. Critical to Portico's origin was agreement to this model by Elsevier, the dominant publisher. Other publishers then followed suit.
    This suggests that persuading dominant software publishers to accept a business model is critical.

Open Knowledge Foundation: Open Knowledge Finland Summer 2016 Update

planet code4lib - Tue, 2016-10-04 12:00

This blog post is part of our summer series featuring chapter updates from across the Open Knowledge Network and was written by the team of Open Knowledge Finland.

Summer is a great time in Finland. It’s so sunny that everyone seems to be on holiday! However, there was no time for extended holidays at Open Knowledge Finland – we had a very busy summer. Here is our update for the Network, with key news from the last few months.

Open Knowledge Finland has a new board!

One of the most exciting changes this year was our annual meeting. OKFFI held its annual meeting on Monday May 30 at the Helsinki office. Nearly 40 people (well over 10% of members) attended face-to-face or online – quite a good number, in fact!

Antti ‘Jogi’ Poikola was unanimously selected to continue as the chairman. The new board consists of 3 old members (Jogi, as well as Lilli Linkola and Mika Honkanen) and no less than 5 new members – Susanna Ånäs, Liisi Soroush, Raoul Plommer, Mikael Seppälä and Jessica Parland-von Essen. In its first meeting, each board member was assigned a primary and secondary role as follows:

Antti Poikola – chairman and  web communications

Mika Honkanen – vice chairman and  2nd treasurer

Lilli Linkola – secretary and working group contact

Mikael Seppälä – treasurer and working group contact

Raoul Plommer – web communications and tools and international relations

Susanna Ånäs – internal communications and international relations

Liisi Soroush – collaboration networks and member secretary

Jessica Parland-von Essen – external communications  and collaboration networks

With the new board, it is nice to see the gender split is at 50-50. It is also a great sign that there are a lot of people who want to apply for the board (13 candidates) and that we have great new people aboard to help steer the community. Congratulations and good luck to the board!

Open Knowledge Finland is growing!

Currently, 8 people are employed by Open Knowledge Finland. However, this number will soon decrease slightly as projects are ending. For this year, we have had a number of new people joining us – Emilia Hjelm, John Sperryn, Konsta Happonen. Previously active members like Heidi Laine, Mika Honkanen have received part-time contracts. On average, we have about 4-5 FTE in staff.

In terms of finances, we have managed to grow at a good pace – from just under 200k eur in 2014, to about 300k eur in 2015 – and still on the rise, a total of nearly 500 000 eur in total turnover expected in 2016. The funding is short-term, fragmented and diverse – which is both exciting as well as a cause of concern (for longer term planning).

Open Knowledge Finland currently has over 350 members – and hosts an Open Knowledge Network of nearly 4000 people in Finland.

MyData 2016 gathered about 700 international people to Helsinki – and accelerated the movement for human-centric personal data

2016 is the year of MyData. Open Knowledge Finland is all about the free flow of information. Open data, open knowledge, open collaboration – and, we believe this also includes free (user-controlled) flow of personal information. The MyData movement encompasses concepts and tools not only to build more transparent societies – but also to develop effective services and create new business in the digital domain.

Actions around the MyData conceptual framework represent the BIGGEST concentration of effort for us this year. In particular, Open Knowledge Finland’s key actions for the fall of 2016 were geared towards the MyData 2016 conference (31 Aug – 2 Sep) and the Ultrahack MyData hackathon running in parallel with the conference.

We had some 700 visitors in total – over 500 conference visitors, over 100 hackers or hack visitors, and over 30 partner organisations involved. Amazingly, we had 140+ speakers in 40+ sessions. Visitors came from about 30 countries. The feedback has been excellent – a great result for a first-time conference!

Check out the event images on the Flickr pool: Conference video archive is available at . Please stay tuned to and @mydata2016 on Twitter. More wrap-ups and posts to follow. And yes, MyData 2017 is on the drawing board! Follow @MyData2017 to keep up on the plans for next year!

That’s not all, folks!

In addition to MyData, many of our 9 working groups have interesting ongoing projects, ranging in size, duration and scope. In a nutshell, here are a few of the active ones:

The 3 year EU project “D-CENT” (Democracy WG) is wrapping up soon. D-CENT is a Europe-wide project creating privacy-aware tools and applications for direct democracy and economic empowerment. Together with citizens and developers, we are creating a decentralised social networking platform for large-scale collaboration and decision-making. Contact :

Yhtäköyttä (Democracy WG), “Common knowledge practices in research and decision-making”, is our first project for the Finnish Government’s analysis and assessment of research activities (VN TEAS) coordinated by the Prime Minister’s Office (VNK). The aim of the project is to find out what kind of tools and methods could be used in government in order to utilize knowledge management and research data better and to improve evidence-based decision making. This project will involve theoretical study, 30+ interviews and 4 experiments in new tools and methods such as data visualization, open impact assessment, real-time document editing, real-time fact-checking. Contact:

Cost-effective utilization of open data and basic registers: The research project’s goal is to better understand and measure the impacts of open data and the use of the basic public registers. As an outcome, we expect policy recommendations and suggestions for new methods, processes or technical changes to help improve cost-efficient publishing of open data and increase the impact of the basic registers. Contact ;

Open Citizen Science: Citizen science has most notably been used as a method for creating observational data for life science research. Against the backdrop of current technological advancement, we need to create Citizen Science v 2.0 – open, diverse, responsible, dialogic and academically excellent. In terms of outcomes, we envision a set of concrete recommendations for national stakeholders; we will create understanding, awareness and discussion about citizen science as a scientific method and a community; and we will catalyze a research agenda for a new kind of open citizen science. Contact:

Creative Commons Licensing Support: As Creative Commons licenses are the official recommended license for open data in the Finnish governmental sector, awareness and instructions for using them in practice are needed across many sectors of society, including for public open bids, content creation subcontracting, and data purchasing. Contact:

Other projects…to be updated in the next blog! See also summary of OK Finland projects in a few slides.

Get in touch!

During the autumn, we will also be having an extra general meeting and plan to change our rules to better accommodate for scaling. Stay tuned – more to follow!

Want to get in touch? Contact executive director Teemu Ropponen, or the international collaboration team: board members Raoul Plommer & Susanna Ånäs.

LITA: Using Google Statistics for your Repository

planet code4lib - Mon, 2016-10-03 19:06

Don’t miss this new LITA webinar!

Beyond Usage Statistics: How to use Google Analytics to Improve your Repository

Presenter: Hui Zhang
Tuesday, October 11, 2016
11:00 am – 12:30 pm Central Time

Register Online, page arranged by session date (login required)

Librarians and repository managers are increasingly asked to take a data-centric approach to content management and impact measurement. Usage statistics, such as page views and downloads, have been widely used for demonstrating repository impact. However, usage statistics limit your capacity to identify user trends and patterns, such as how many visits come from crawlers, originate from a mobile device, or are referred by a search engine. Knowing these figures will help librarians optimize digital contents for better usability and discoverability. This 90-minute webinar will teach you the concepts of metrics and dimensions, along with hands-on activities showing how to use Google Analytics (GA) on library data from an institutional repository. Be sure to check the details page for takeaways and prerequisites.

Details here and Registration here

Hui Zhang is the Digital Application Librarian at Oregon State University Libraries and Press. He has years of experience in generating impact reports for major platforms such as DSpace and Hydra Sufia using Google Analytics or a local statistics index. Besides repository development, his interests include altmetrics, data visualization, and linked data.

And don’t miss other upcoming LITA fall continuing education offerings:

Social Media For My Institution: from “mine” to “ours”
Instructor: Plamen Miltenoff
Starting Wednesday October 19, 2016, running for 4 weeks
Register Online, page arranged by session date (login required)

Online Productivity Tools: Smart Shortcuts and Clever Tricks
Presenter: Jaclyn McKewan
Tuesday November 8, 2016
11:00 am – 12:30 pm Central Time
Register Online, page arranged by session date (login required)

Questions or Comments?

For questions or comments, contact LITA at (312) 280-4268 or Mark Beatty,

John Mark Ockerbloom: Forward to Libraries update, and some thoughts on sustainability and scale

planet code4lib - Mon, 2016-10-03 18:35

It’s been a while since I posted about Forward to Libraries, but if you’ve been following my Github repo, you may have noticed that it’s had a steady stream of updates and growth.  If making connections across library collections or between libraries and Wikipedia interests you, the service is more comprehensive and wide-ranging than ever.  Here’s an update:

Number of libraries

We now support forwarding to over 1,000 library systems worldwide, running dozens of off-the-shelf and custom-developed catalog and discovery systems.  I’ve expanded coverage in all 50 states, Canada, the UK, and Australia, and have also added links to more countries in all inhabited continents.  (Because the system currently focuses on Library of Congress headings, it works best with Anglo-American catalogs, but it yields acceptable results in some other catalogs as well, like the big Norway research library union catalog I just added today.)  While major research libraries and big-city public libraries are well-represented, I’ve also been trying to add HBCUs, community colleges, rural library networks, and other sometimes-overlooked communities.  And I respond quickly to requests from users to add their libraries.

Wikipedia coverage

Forward to Libraries can be used to make links between library collections, and to and from my millions-strong Online Books Page listings.  But it’s also known for its interoperation with Wikipedia, where it knows how to link between over half a million distinct Library of Congress name and subject headings and their corresponding English language Wikipedia articles.  The majority of these mappings are name mappings provided by OCLC’s VIAF, which in its most recent data dump includes over 485,000 VIAF identifiers that include both a Library of Congress Name Authorities identifier and an English Wikipedia article link.  An additional 50,000 or so topical, demographic, and geographical LC subject headings are also mapped.  These mappings derive from exact matches, algorithmic matching for certain kinds of heading similarities, and a manually curated mappings file that has grown over time to include more than 22,000 correspondences.

What you can do with this data

If you’re browsing a topic or an author’s works on the Online Books Page, you can follow links to searches for the same topic or author in any of the 1000+ libraries currently in my dataset.  (If your library’s not one of them, just ask me to add it.)  If the topic or the author is covered in one of the 500,000 corresponding Wikipedia articles the system knows of, you’ll also be offered a link to the relevant article.

If you’re browsing certain Wikipedia articles, you’ll also find links from them back to library searches in your favorite of those 1000+ library systems (or any other of those systems you wish to search).  Right now those links use templates that must be manually placed, so there are only about 2500 Wikipedia articles with those links, but any Wikipedia editor can add the templates to additional articles.  (A bot could potentially add more quickly, but that would require some negotiation with the Wikipedia community that I haven’t undertaken to date.)  If you’re involved in a Wikipedia-library collaboration project (like this one), you may want to add one of these templates when editing articles on topics that are likely to have relevant source materials in multiple libraries.  (The most common template used is the Library Resources Box, generally added to the External Links or Further Reading section of an article.)

If you’re interested in offering a similar library or Wikipedia linking service from your own catalog or discovery system, I’d be interested in hearing from you.  You can either point to my forwarding service (using a standard linking syntax) or implement your own forwarder based on my code and data on Github.  Right now it requires some effort and expertise to implement either method, but I’m happy to work with interested libraries or developers to make forwarding easier to implement.

Scaling and sustaining the service

The Forward to Libraries service still runs largely as a 1-person part-time project.  (The Wikipedia templates are largely placed by others, and the service fundamentally depends on regularly updated data sets from OCLC, the Library of Congress, and Wikipedia, but I maintain the code and coordinate the central referral database myself.)

Part-time, 1-person projects raise some common questions:  “Will they be sustained?”  and “Can they scale?”  I wasn’t sure myself of the answers to those questions when I started developing this service.  Fortunately, I went ahead and introduced it anyway, and three-going-on-four years later, I’m happy to say that the basic service *is* in fact more sustainable and scalable than I’d thought it might be.   The code is fairly simple, and doesn’t require a lot of updating to keep running.  The main scale issues for the basic service have to do with the number of library systems and the number of topic mappings in the system, and those are both manageable.

The number of libraries turns out to be the more challenging factor to manage.  Libraries change their catalogs and discovery systems on a regular basis, and when they do, search links that worked for their old catalogs often don’t work for the new ones.  I have an automated tool that I run occasionally to flag libraries that return an error code to my search links; it’s not very sophisticated, but it does alert me to many libraries whose profiles I need to update, without my having to check all of them manually.  (If you find any libraries where forwarding is no longer working, you can also use the suggestion form to alert me to the problem.)  The more pressing scaling problem at the moment is the user experience: right now, when you’re asked to choose a library, the program shows a list of links to all 1000+ libraries currently in the system.  That can be a bit much to handle, especially for users whose data is metered.  Updating the library choice form to only show local libraries after the user selects the state, province or country they’re in will cut down on the data sent to the user; that may cost the user an extra click, but the tradeoff seems worth it at this point.
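
The flagging tool itself isn't included in the post, but the basic idea is simple to sketch: send each library's search URL a probe query and flag anything that doesn't answer with a success code. The sketch below is illustrative only; the profile file layout and the probe term are assumptions, and the production tool is presumably more forgiving of slow or quirky catalogs:

# Illustrative sketch only; the real checker differs in its details.
# Assumes a tab-delimited file of "library name<TAB>search URL template",
# where {TERM} marks where the search term belongs.
import requests

PROBE_TERM = "history"  # an arbitrary term most catalogs will match

def flag_broken_libraries(profile_file):
    with open(profile_file, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 2:
                continue
            name, template = parts[:2]
            url = template.replace("{TERM}", PROBE_TERM)
            try:
                status = requests.get(url, timeout=30).status_code
            except requests.RequestException as err:
                print(f"ERROR {name}: {err}")
                continue
            if status >= 400:
                print(f"FLAG {name}: HTTP {status}")

flag_broken_libraries("library_profiles.tsv")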

The number of topic mappings, on the other hand, has been easier to manage than I’d thought.  VIAF publishes updated data files for names about once a month, and I can run a script over it to automatically update my service’s name heading mappings when a new file comes out. Likewise, Wikipedia is now providing twice-a-month dump files of its English language encyclopedia.  I can download one of Wikipedia’s files in a couple of hours, and then run a script that flags any topical articles I map to that have gone away, changed their title, or redirected to other articles.  I can then change my mappings file accordingly in under an hour. Library of Congress subject headings change as well, but they don’t change very fast.  New and changed topical headings are published online about once a month, and one can generally add or change mappings from one of their updates within a few hours.  I spend a few spare-time hours each month adding mappings for older subject headings, so if the current rate of LCSH growth holds, in theory I could cover *all* LCSH topical headings in my manual mappings file after some number of years.  In practice, I don’t have to do this, especially as topical mappings also start to get collaboratively managed in places like Wikidata.  (I’m not currently working with their data set, but hope to do so in the future.)
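
The dump-scanning script isn't reproduced here either, but the same staleness check (has a mapped article gone away, been renamed, or turned into a redirect?) can be sketched against the live MediaWiki API instead of a dump file. That substitution, and the idea of feeding it batches of mapped titles, are assumptions made for the example:

# Sketch using the live MediaWiki API rather than the dump files the
# production script reads; batching and error handling are minimal.
import requests

API = "https://en.wikipedia.org/w/api.php"

def check_articles(titles):
    """Report mapped article titles that are missing or now redirect elsewhere."""
    resp = requests.get(API, params={
        "action": "query",
        "titles": "|".join(titles),   # up to 50 titles per request
        "redirects": 1,
        "format": "json",
    }, timeout=30).json()
    query = resp.get("query", {})
    for r in query.get("redirects", []):
        print(f"REDIRECT: {r['from']} -> {r['to']}")
    for page in query.get("pages", {}).values():
        if "missing" in page:
            print(f"MISSING: {page['title']}")

check_articles(["Library of Congress Subject Headings", "Online Books Page"])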

Broadening the service

While I can maintain and gradually grow the basic service with my current resources, broadening it would require more.  Not every library, particularly outside the US, uses Library of Congress Subject Headings, so it would be nice to offer mappings to more subject ontologies and languages.  Similarly, not everyone likes to work with Wikipedia (often with good reason), and it’d be nice to support links to and from alternative knowledge bases as well.  The basic architecture of Forward to Libraries is capable of handling multiple library and web-based ontologies, but additional coding and data maintenance would be required.  There are also various ways to build on the service to engage more deeply with libraries and current topics of interest; I’ve explored some ideas along these lines, but haven’t had the time to implement them.

Things continuing as they are, though, I should be able to maintain and grow the basic Forward to Libraries service for quite some time to come.  I’m thankful to the people and the data providers that have made this service possible. And if you’re interested in doing more with it, or helping develop it in new directions, I’d be very glad to talk with you.

LibUX: User Experience Debt

planet code4lib - Mon, 2016-10-03 14:59

An IA Summit presentation by Andrew Wright that demonstrates a way to think about the gap between the current user experience of a site and its potential.

User Experience Debt: Creating Awareness and Acting on Missed Opportunities

FOSS4Lib Recent Releases: Sufia - 7.2.0

planet code4lib - Mon, 2016-10-03 13:27

Last updated October 3, 2016. Created by Peter Murray on October 3, 2016.

Package: Sufia
Release Date: Sunday, October 2, 2016

Terry Reese: MarcEdit Updates

planet code4lib - Mon, 2016-10-03 05:00

The following updates have been made to all versions of MarcEdit:

MarcEdit MacOS:

** 1.9.20
* Update: Linked Data Rules File: Rules file updated to add databases for the Japanese Diet library, 880 field processing, and the German National Library.
* Enhancement: Task Manager: Added a new macro/delimiter.  {current_file} will print the current filename if set.
* Enhancement: Task Manager: Added a new macro/delimiter.  {current_filename} will print the current filename if set.
* Bug Fix: RDA Helper: Abbreviation expansion is failing to process specific fields when config file is changed.
* Bug Fix: MSXML Engine: In an effort to allow the xsl:strip-whitespace element, I broke this process.   The work around has been to use the engine.  However, I’ll correct this.  Information on how you emulate the xsl:strip-whitespace element will be here:
* Bug Fix/Enhancement: Open Refine Import: OpenRefine’s release candidate changes the tab delimited output slightly.  I’ve added some code to accommodate the changes.
* Enhancement: MarcEdit Linked Data Platform: adding enhancements to make it easier to add collections and update the rules file
* Enhancement: MarcEdit Linked Data Platform: updating the rules file to include a number of new endpoints
* Enhancement: MarcEdit Linked Data Platform: adding new functionality to the rules file to support the recoding of the rules file for UNIMARC.
* Enhancement: Edit Shortcut: Adding a new edit short cut to find fields missing words
* Enhancement: OCLC API Integration: added code to integrate with the validation.  Not sure this makes its way into the interface yet, but code will be there.
* Enhancement: Saxon.NET version bump
* Enhancement: Autosave option when working in the MarcEditor.  Saves every 5 minutes.  Will protect against crashes.
* Enhancement: MarcEditor: Remove Blank Lines function

MarcEdit Windows/Linux:

* Enhancement: MarcEditor: Remove Blank Lines
* Bug Fix: COM Object — updating array signature
* Update: Automatic Updater: Due to a change in my webhost, automatic updating stopped working.  That should be corrected after this update is applied.

The Windows update includes fixes to the automated update mechanism, which broke due to my webhost making some unannounced changes.  Users will likely need to manually download the update from:  Following this update, this shouldn’t be necessary any longer.


Ed Summers: Nicolini (8)

planet code4lib - Sun, 2016-10-02 04:00

In chapter 8 Nicolini brings attention to the study of discourse in practice theory. Here’s how he situates this movement to looking at discourse in relation to the other practice theories that have been introduced so far:

As we have seen in previous chapters, different theories of practice grant very different roles to language and discourse. For example, while Bourdieu almost ignores the role of linguistic matters (see my criticisms in Chapter 3), Schatzki describes practices as nexuses of ‘sayings and doings’, warning that no priority should be granted to either of the terms (see Chapter 7). Thirdly, and most importantly, the study of discourse can offer some fundamental insights into the general understanding of practice. This is because, over the last five decades, a number of research programmes have developed the idea that discourse is, first and foremost, a form of action, a way of making things happen in the world, and not a mere way of representing it. From this perspective, language is seen as a discursive practice, a form of social and situated action. (p. 189)

He cites Gee (2011) to distinguish lowercase discourse from uppercase Discourse. discourse is taken as the mundane analysis of everyday language in order to accomplish some activity. Discourse on the other hand is concerned with the way linguistic and non-linguistic systems are integrated, and how this can make notions of identity, knowledge and power legible. Nicolini treats this more like a continuum with the local and the global at different poles. He takes a look at three different approaches to discourse analysis, which are relevant to practice theory.

Conversation Analysis (CA)

CA grew out of ethno-methodology, but has established itself as a separate study to some extent. It makes the following assumptions:

  • all interaction is structurally organized (patterns, repetitions)
  • sequencing of interactions is important (context, temporality)
  • examination of audio/video recordings is extremely important in order to study interactions as they happened and in detail

It takes a very disciplined view of discourse and attempts to keep second order theoretical considerations such as notions of power, politics, ideology at a distance. There are some techniques or strategies such as the turn-taking machine which provides a way of examining transcripts to look for patterns in speech exchange.

Critical Discourse Analysis (CDA)

CDA was largely established by Michel Foucault, whose approach to discourse shifted from an archaeological approach to one focused on genealogy (Davidson, 1986). One feature of CDA is the idea of discursive formations, or sets of rules that determine what can and cannot be said, as well as what is allowed and not allowed to be said. These formations are situated in relation to other formations, which taken as a whole form an order of discourse. Foucault’s view of discourse can be summarized as:

  • discourse does not so much reflect the state of affairs in the world as it defines and structures the actual state of affairs
  • CDA is not limited to studying the mundane use of language, but takes into consideration all sorts of relations between discursive and non-discursive elements.
  • discourse necessarily involves the deployment of power and systems of domination
  • power and discourse must be studied from the bottom up, deriving the discursive formations from analysis at the local level, rather than starting from notions of established power structures.

Rather than looking at repetition as with a turn-taking machine, Foucault is interested in looking at why and how practices establish and maintain regularities across time and space. Fairclough (2013) could provide a jumping off point to learn more about the application of CDA. Intertextuality, or the ways in which texts are interconnected, is an important concept in CDA. It focuses not only on content and structure, but on the material chains and stabilized networks along which texts move, or do not move. He uses this diagram from Fairclough (1992), which might also be worth looking at:

This focus on the text or documents as they are situated in social practices seems like it could be a useful way of looking at the construction of web archives. I guess Caswell (2016) could be a useful touchstone here if I continue down this particular road. It also looks like Engeström, who I’ve encountered earlier in Chapter 5, had some criticisms of the local/global dichotomy that CDA sets up. I’ve been meaning to follow up on some of his work, so maybe his connection to this idea of intertextuality is another vote to head in that direction.

Mediated Discourse Analysis (MDA)

As he said previously, Mediated Discourse Analysis charts a middle path between the linguistic focus of CA and the politically minded CDA by focusing on practice and social action. Nicolini cites Scollon (2002) when outlining the five aspects of MDA, which draw heavily from practice theory:

  • mediated action: mundane or micro actions
  • mediational means: tools and other artifacts that amplify/constrain actions
  • practice: actions with a history
  • site of engagement: a place where the actions occur
  • nexus of practice: relationships between practices

MDA was pioneered by Ron Scollon near the end of his life and, according to Nicolini, much work remains in establishing it as a theory/model for actual work. In particular Nicolini thinks MDA could be brought into conversation with Actor Network Theory (Latour, 2005; Law, 2009). Scollon worked at Georgetown University, which makes me wonder if his papers are there. He was a linguist who worked closely with his wife Suzanne Wong Scollon on intercultural communication and discourse analysis. It’s interesting that he would end his career as a linguist shifting attention to actions and practices in this way.


Caswell, M. (2016). ‘The archive’ is not an archives: On acknowledging the intellectual contributions of archival studies. Reconstruction. Retrieved from

Davidson, A. I. (1986). In D. C. Hoy (Ed.), Foucault: A critical reader. Oxford: Wiley-Blackwell.

Fairclough, N. (1992). Discourse and social change. Oxford: Polity Press.

Fairclough, N. (2013). Critical discourse analysis: The critical study of language. Routledge.

Gee, J. P. (2011). An introduction to discourse analysis: Theory and method. Routledge.

Latour, B. (2005). Reassembling the social: An introduction to actor-network-theory. Oxford University Press.

Law, J. (2009). In B. S. Turner (Ed.), The new Blackwell companion to social theory (pp. 141–158). Wiley-Blackwell.

Scollon, R. (2002). Mediated discourse: The nexus of practice. London: Routledge.

Jason Ronallo: IIIF Examples #1: Wellcome Library

planet code4lib - Sun, 2016-10-02 03:31

As I’m improving the implementation of IIIF on the NCSU Libraries Rare and Unique Digital Collections site, I’m always looking for examples from other implementations for how they’re implementing various features. This will mostly be around the Presentation and Content Search APIs where there could be some variability.

This is just a snapshot look at some features for one resource on the Wellcome Library site, the example may not be good or correct, and could be changed by the time that you read this. I’m also thinking out loud here and asking lots of questions about my own gaps in knowledge and understanding. In any case I hope this might be helpful to others.


The example I’m looking at is the “[Report of the Medical Officer of Health for Wandsworth District, The Board of Works (Clapham, Putney, Streatham, Tooting & Wandsworth)]”. You ought to be able to see the resource in an embedded view below with UniversalViewer:

If you visit the native page and scroll down, you’ll see some other information. They’ve taken some care to expose tables within the text and show snippets of those tables. This is a really nice example of how older texts might be used in new research. And even though Universal Viewer provides some download options, they provide a separate download option for the tables. Clicking on one of the tables displays the full table. Are they using OCR to extract tables? How do they recognize tables and then ensure that the text is correct?

Otherwise the page includes the barest amount of metadata and the “more information” panel in UV does not provide much more.

This collection includes a good page with more information about these medical reports including a page about how to download and use the report data.

Presentation API

I’m now going to go down and read through sections of the presentation manifest for this resource and note what I find interesting.


The manifest links back to the HTML page with a related property at the top level like this:

"related": { "@id": "", "format": "text/html" }

There has been some discussion about where the right place is to put a link back from the manifest to an HTML page for humans to see the resource. The related property may not be the right one, and a different property may end up being recommended instead.


The manifest has a seeAlso for an RDF (turtle) representation of the resource:

"seeAlso": { "@id": "", "format": "text/turtle" }

What other types of seeAlso links are different manifests providing?
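Pulling these linking properties out of a manifest is straightforward to check. Below is a rough sketch of how I might inspect the related and seeAlso links with a little Python; the manifest URL is a placeholder rather than Wellcome’s actual identifier, and both properties can legally be a single object, a list, or a bare URI string.

import requests

MANIFEST_URL = "https://example.org/iiif/b0000000/manifest"  # placeholder, not a real manifest

manifest = requests.get(MANIFEST_URL).json()

def as_list(value):
    # related and seeAlso may be a single object or a list, so normalize.
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

for prop in ("related", "seeAlso"):
    for link in as_list(manifest.get(prop)):
        if isinstance(link, str):
            # The spec also allows a bare URI string here.
            print(prop, link)
        else:
            print(prop, link.get("@id"), link.get("format"))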


Several different services are given. Included are standard services for Content Search and autocomplete. (I’ll come back to those in a bit.) There are also a couple of services outside of the context.

The first is an extension around access control. Looking at the context you can see different levels of access shown: open, clickthrough, and credentials. Not having looked closely at the Authentication specification, I don’t know yet whether these are aligned with that or not. The other URLs here don’t resolve to anything, so I’m not certain what their purpose is.

{ "@context": "", "@id": "", "profile": "", "accessHint": "open" }

There’s also a service that appears to be around analytics tracking. From the context document it appears that there are other directives that can be given to UV to turn off/on different features. I don’t remember seeing anything in the UV documentation on the purpose and use of these though.


One thing I’m interested in is how organizations name and give ids (usually HTTP URLs) to resources that require them. In this case the id of the only sequence is a URL that ends in “/s0” and resolves to an error page. The label for the sequence is “Sequence s0”, which could have been automatically generated when the manifest was created. This lack of a meaningful id and label value for a sequence is understandable, since these types of things wouldn’t regularly get named in a metadata workflow.

This leaves me with the question of which ids ought to have something at the other end of the URI? Should every URI give you something useful back? And is there the assumption that HTTP URIs ought to be used rather than other types of identifiers–ones that might not have the same assumptions about following them to something useful? Or are these URIs just placeholders for good intentions of making something available there later on?


Both PDF and raw text renderings are made available. It was examples like this one that helped me see where to place renderings so that UniversalViewer would display them in the download dialog. The PDF is oddly extensionless, but it does work, and it has a nice cover page that includes a persistent URL and a statement on conditions of use. The raw text would be suitable for indexing the resource, but it is one long string of text without breaks, so not really for reading.
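For reference, here is a rough sketch of the shape I think is involved: a rendering list on the sequence, which UniversalViewer can surface in its download dialog. The ids and labels below are placeholders I made up, not values from the Wellcome manifest.

import json

sequence_fragment = {
    "@id": "https://example.org/iiif/b0000000/sequence/s0",  # placeholder id
    "@type": "sc:Sequence",
    "rendering": [
        {
            "@id": "https://example.org/pdf/b0000000",   # placeholder PDF link
            "format": "application/pdf",
            "label": "Download as PDF",
        },
        {
            "@id": "https://example.org/text/b0000000",  # placeholder raw text link
            "format": "text/plain",
            "label": "Download raw text",
        },
    ],
    "canvases": [],  # canvases omitted from this sketch
}

print(json.dumps(sequence_fragment, indent=2))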


The viewingHint given here is “paged.” I’ll admit that one of the more puzzling things to me about the specification is exactly what viewing experience is being hinted at with the different viewing hints. How does each of these affect, or not affect, different viewers? Are there examples of what folks expect to see with each of the different viewing hints?


The canvases don’t dereference and seem to follow the same sort of pattern as sequences by adding “/canvas/c0” to the end of an identifier. There’s a seeAlso that points to OCR as ALTO XML. Making this OCR data available with all of the details of bounding boxes of lines and words on the text is potentially valuable. The unfortunate piece is that there’s no MIME type for ALTO XML, so the format here is generic and does not unambiguously indicate that this type of file will contain ALTO OCR.

"seeAlso": { "@id": "", "format": "text/xml", "profile": "", "label": "METS-ALTO XML" }

Even more interesting, for each canvas they deliver otherContent that includes an annotation list of the text of the page. Each line of the text is a separate annotation, and each annotation points to the relevant canvas including bounding boxes. Since the annotation list is on the canvas, the on property for each annotation has the same URL except for a different fragment hash for the bounding box of the line. I wonder if there is code for extracting annotation lists like this from ALTO? Currently each annotation is not separately dereferenceable and uses a similar approach to the one seen with sequences and canvases: incrementing a number on the end of the URL to distinguish them.
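I haven’t found a ready-made tool for this, but a conversion along these lines doesn’t look too hard. Here is a rough, untested sketch that walks ALTO TextLine elements and turns each into a painting annotation with an xywh fragment; it assumes the ALTO v2 namespace, pixel coordinates (a MeasurementUnit of pixel), and placeholder ids.

import json
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}  # namespace differs by ALTO version

def alto_lines_to_annotation_list(alto_path, canvas_id, list_id):
    # Each TextLine becomes one sc:painting annotation targeting the canvas
    # with an xywh media fragment built from the line's HPOS/VPOS/WIDTH/HEIGHT.
    root = ET.parse(alto_path).getroot()
    resources = []
    for line in root.findall(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "") for s in line.findall("alto:String", ALTO_NS)]
        xywh = ",".join(line.get(k, "0") for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        resources.append({
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "resource": {"@type": "cnt:ContentAsText", "chars": " ".join(words)},
            "on": canvas_id + "#xywh=" + xywh,
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": list_id,
        "@type": "sc:AnnotationList",
        "resources": resources,
    }

# Usage with made-up paths and ids:
# print(json.dumps(alto_lines_to_annotation_list("page0.xml",
#     "https://example.org/iiif/b0000000/canvas/c0",
#     "https://example.org/iiif/b0000000/contentAsText/c0"), indent=2))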

In looking closer at the images on the canvas, I learned more about how to think about the width/height of the canvas as opposed to the width/height of the image resource, and what the various ids within a canvas, image, and resource ought to mean. I’m currently getting two important bits wrong. The id for the image should be for the image as an annotation and not for the Image API service. Similarly the id for the image resource ought to actually be an image (with format given), and again not the URL of the Image API. For this reason the dimensions of the canvas can be different from (and in many cases larger than) the dimensions of the single image resource given, as it can be expensive to load very large images.
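To make that concrete for myself, here is a sketch of the canvas/image/resource pattern as I now understand it, with placeholder ids and dimensions: the annotation and the image resource each get their own id, the resource id is an actual image with a format, and only the service block points at the Image API endpoint.

import json

canvas = {
    "@id": "https://example.org/iiif/b0000000/canvas/c0",  # placeholder canvas id
    "@type": "sc:Canvas",
    "label": "Page 1",
    "width": 2480,    # canvas dimensions can be larger than the resource below
    "height": 3508,
    "images": [{
        "@id": "https://example.org/iiif/b0000000/annotation/a0",  # the image as an annotation
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {
            "@id": "https://example.org/images/img0/full/1024,/0/default.jpg",  # an actual image
            "@type": "dctypes:Image",
            "format": "image/jpeg",
            "width": 1024,
            "height": 1449,
            "service": {
                "@context": "http://iiif.io/api/image/2/context.json",
                "@id": "https://example.org/images/img0",  # the Image API endpoint
                "profile": "http://iiif.io/api/image/2/level1.json",
            },
        },
        "on": "https://example.org/iiif/b0000000/canvas/c0",
    }],
}

print(json.dumps(canvas, indent=2))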

Content Search API

Both content search and autocomplete are provided for this resource. Both provide a simple URL structure where you can just add “?q=SOMETHING” to the end and get a useful response.

The first thing I noticed about the content search response is that it is delivered as “text/plain” rather than JSON or JSON-LD, which is something I’ll let them know about. Otherwise this looks like a nice implementation. The hits include both the before and after text, which could be useful for highlighting the text off of the canvas.
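Here is a rough sketch of what querying a service like this looks like, assuming the response follows the Content Search 1.0 shape with resources and hits; the service URL below is a placeholder rather than the Wellcome endpoint.

import requests

SEARCH_SERVICE = "https://example.org/iiif/b0000000/search"  # placeholder service URL

results = requests.get(SEARCH_SERVICE, params={"q": "medical"}).json()

# Matched annotations carry the text and the canvas region they sit on.
for anno in results.get("resources", []):
    print(anno.get("resource", {}).get("chars"), "->", anno.get("on"))

# Hits carry the before/after context that could drive highlighting.
for hit in results.get("hits", []):
    print(hit.get("before", ""), "|", hit.get("after", ""))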

The annotations themselves use ids that include the bounding box as a way to distinguish between them. Again they’re fine identifiers but don’t have anything at the other end. So here’s an example annotation where “1697,403,230,19” is used for both the media fragment and the annotation URL:

{ "@id": ",403,230,19", "@type": "oa:Annotation", "motivation": "sc:painting", "resource": { "@type": "cnt:ContentAsText", "chars": "Medical" }, "on": ",403,230,19" }

The autocomplete service is simple enough:


I learned a bit about how ids are used in this implementation. I hope this helps give some pointers to where others can learn from the existing implementations in the wild.

