planet code4lib

Planet Code4Lib - http://planet.code4lib.org

Jonathan Rochkind: Concurrency in Rails 5.0

Sun, 2017-01-01 17:10

My previous posts on concurrency in ActiveRecord have been some of the most popular on this blog (which I’d like to think means concurrency is getting more popular in Rails-land), so I’m going to share what I know about some new concurrency architecture in Rails5 — which is no longer limited to ActiveRecord.

(update: Hours before I started writing this, unbeknownst to me, matthewd submitted a Rails PR for a Rails Guide, with some really good stuff; I have only skimmed it so far, but you might wanna go there either before, after, or in lieu of this).

I don’t fully understand the new stuff, but since it’s relatively undocumented at present, and has some definite gotchas, as well as definite potentially powerful improvements — sharing what I got seems helpful. This will be one of my usual “lots of words” posts, get ready!

The new architecture primarily involves ActiveSupport::Reloader (a global one of which is in Rails.application.reloader) and ActiveSupport::Executor (a global one of which is in Rails.application.executor). Also ActiveSupport::Dependencies::Interlock (a global one of which is at ActiveSupport::Dependencies.interlock).

Why you need to know this

This matters if you create any threads in a Rails app yourself, beyond the per-request threads a multi-threaded app server like Puma will manage for you. Rails takes care of multi-threaded request dispatch (with the right app server), but if you're doing any kind of what I'll call "manual concurrency" yourself (Thread.new, any invocations of anything in concurrent-ruby (recommended), probably any celluloid (not sure), etc.), you have to pay attention to using the new architecture, both to do what Rails wants and to avoid deadlocks if dev-mode-style class-reloading is happening.

If you’re getting apparent deadlocks in a Rails5 app that does multi-threaded concurrency, it’s probably about this.

If you are willing to turn off dev-mode class-reloading and auto-loading altogether, you can probably ignore this.

What I mean by “dev-mode class-reloading”

Rails 5 by default generates your environments/development.rb with config.cache_classes = false and config.eager_load = false. Classes are auto-loaded only on demand (eager_load == false), and are also sometimes unloaded to be reloaded on next access (cache_classes == false). (The details of when/how/which/if they are unloaded are outside the scope of this blog post, but have also changed in Rails 5.)

You can turn off all auto-loading with config.cache_classes = true and config.eager_load = true, the Rails 5 defaults in production. All classes are loaded/require'd en masse on boot, and are never unloaded. This is what I mean by 'turn off dev-mode class-reloading and auto-loading altogether'.

The default Rails 5 generated environments/test.rb has config.cache_classes = true and config.eager_load = false: classes are only loaded on demand with auto-loading (eager_load == false), but are never unloaded.
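To make the three combinations concrete, here's a sketch of the relevant lines (they live inside the Rails.application.configure do ... end block of each generated environment file; the generated files have more settings and comments):

```ruby
# config/environments/development.rb (Rails 5 default): autoload on demand, reload on change
config.cache_classes = false
config.eager_load    = false

# config/environments/production.rb (Rails 5 default): load everything at boot, never reload
config.cache_classes = true
config.eager_load    = true

# config/environments/test.rb (Rails 5 default): autoload on demand, never unload
config.cache_classes = true
config.eager_load    = false
```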

I am not sure if there’s any rational purpose for having config.cache_classes = false, config.eager_load = true, probably not.

I think there was a poorly documented config.autoload in previous Rails versions, with confusing interactions with the above two settings; I don't think it exists (or at least does anything) in Rails 5.

Good News

Prior to Rails 5, Rails dev-mode class-reloading and auto-loading were entirely un-thread-safe. If you were using any kind of manual concurrency, then you pretty much had to turn off dev-mode class-reloading and auto-loading. Which was too bad, cause they're convenient and make dev more efficient. If you didn't, it might sometimes work, but in development (or possibly test) you'd often see those pesky exceptions involving something like "constant is missing", "class has been redefined", or "is not missing constant" (I'm afraid I can't find the exact errors, but perhaps some of these seem familiar).

Rails 5, for the first time, has an architecture which theoretically lets you do manual concurrency in the presence of class reloading/autoloading, thread-safely. Hooray! This is something I had previously thought was pretty much infeasible, but it's been (theoretically) pulled off, hooray. This for instance theoretically makes it possible for Sidekiq to do dev-mode-style class-reloading, although I'm not sure if the latest Sidekiq release actually still has this feature, or whether they had to back it out.

The architecture is based on some clever concurrency patterns, so it theoretically doesn't impact performance or concurrency measurably in production, or even, for the most part, significantly in development.

While the new architecture most immediately affects class-reloading, the new API is, for the most part, not written in terms of reloading, but is a higher-level API for signaling what you are doing about concurrency: "I'm doing some concurrency here", in various ways. This is great, and should be good for the future of Just Works concurrency in Rails in ways other than class reloading too. If you are using the new architecture correctly, it theoretically makes ActiveRecord Just Work too, with less risk of leaked connections, without having to pay lots of attention to it. Great!

I think matthewd is behind much of the new architecture, so thanks matthewd for trying to help move Rails toward a more concurrency-friendly future.

Less Good News

While the failure mode for concurrency used improperly with class-reloading in Rails 4 (which was pretty much any concurrency with class-reloading, in Rails 4) was occasional hard-to-reproduce mysterious exceptions, the failure mode for concurrency used improperly with class-reloading in Rails 5 can be a reproduces-every-time deadlock. Where your app just hangs, and it's pretty tricky to debug why, especially if you aren't even considering "class-reloading and new Rails 5 concurrency architecture", which, why would you?

And all the new stuff is, at this point, completely undocumented. (Update: some docs in rails/rails #27494; I hadn't seen that before I wrote this.) So it's hard to know how to use it right. (I would quite like to encourage an engineering culture where significant changes without docs are considered just as problematic to merge/release as significant changes without tests… but we're not there yet.) (The Autoloading and Reloading Constants Guide, to which this is very relevant, has not been updated for this ActiveSupport::Reloader stuff, and I think is probably no longer entirely accurate. That would be a good place for some overview docs…).

The new code is a bit tricky and abstract, a bit hard to follow. Some anonymous modules at some points made it hard for me to use my usual already grimace-inducing methods of code archeology reverse-engineering, where I normally count on inspecting class names of objects to figure out what they are and where they're implemented.

The new architecture may still be buggy.  Which would not be surprising for the kind of code it is: pretty sophisticated, concurrency-related, every rails request will touch it somehow, trying to make auto-loading/class-reloading thread-safe when even ordinary ruby require is not (I think this is still true?).  See for instance all the mentions of the “Rails Reloader” in the Sidekiq changelog, going back and forth trying to make it work right — not sure if they ended up giving up for now.

The problem with maybe buggy combined with lack of any docs whatsoever — when you run into a problem, it’s very difficult to tell if it’s because of a bug in the Rails code, or because you are not using the new architecture the way it’s intended (a bug in your code). Because knowing the way it’s intended to work and be used is a bit of a guessing game, or code archeology project.

We really need docs explaining exactly what it’s meant to do how, on an overall architectural level and a method-by-method level. And I know matthewd knows docs are needed. But there are few people qualified to write those docs (maybe only matthewd), cause in order to write docs you’ve got to know the stuff that’s hard to figure out without any docs. And meanwhile, if you’re using Rails5 and concurrency, you’ve got to deal with this stuff now.

So: The New Architecture

I’m sorry this is so scattered and unconfident, I don’t entirely understand it, but sharing what I got to try to save you time getting to where I am, and help us all collaboratively build some understanding (and eventually docs?!) here. Beware, there may be mistakes.

The basic idea is that if you are running any code in a manually created thread, that might use Rails stuff (or do any autoloading of constants), you need to wrap your “unit of work” in either Rails.application.reloader.wrap { work } or Rails.application.executor.wrap { work }.  This signals “I am doing Rails-y things, including maybe auto-loading”, and lets the framework enforce thread-safety for those Rails-y things when you are manually creating some concurrency — mainly making auto-loading thread-safe again.
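For example, a manually created background thread that touches app code might look something like this. This is a minimal sketch of the reloader flavor, suitable when you're completely outside the request cycle (more on which one to pick below); SomeModel stands in for any autoloadable class in your app:

```ruby
Thread.new do
  Rails.application.reloader.wrap do
    # Anything in here can safely reference autoloadable constants, and will
    # see freshly reloaded code under dev-mode class-reloading.
    SomeModel.do_some_work
  end
end
```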

When do you pick reloader vs executor? Not entirely sure, but if you are completely outside the Rails request-response cycle (not in a Rails action method, but instead something like a background job), manually creating your own threaded concurrency, you should probably use Rails.application.reloader.  That will allow code in the block to properly pick up new source under dev-mode class-reloading. It’s what Sidekiq did to add proper dev-mode reloading for sidekiq (not sure what current master Sidekiq is doing, if anything).

On the other hand, if you are in a Rails action method (which is probably already wrapped in a Rails.application.reloader.wrap), I believe you can't use a (now nested) Rails.application.reloader.wrap without deadlocking things up. So there you use Rails.application.executor.wrap.

What about in a rake task, or rails runner executed script?  Not sure. Rails.application.executor.wrap is probably the safer one — it just won’t get dev-mode class-reloading happening reliably within it (won’t necessarily immediately, or even ever, pick up changes), which is probably fine.

But to be clear, even if you don’t care about picking up dev-mode class-reloading immediately — unless you turn off dev-mode class-reloading and auto-loading for your entire app — you still need to wrap with a reloader/executor to avoid deadlock — if anything inside the block possibly might trigger an auto-load, and how could you be sure it won’t?

Let’s move to some example code, which demonstrates not just the executor.wrap, but some necessary use of ActiveSupport::Dependencies.interlock.permit_concurrent_loads too.

An actual use case I have — I have to make a handful of network requests in a Rails action method, I can’t really push it off to a bg job, or at any rate I need the results before I return a response. But since I’m making several of them, I really want to do them in parallel. Here’s how I might do it in Rails4:
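Roughly something like this, a simplified sketch; the service URLs and the bare Net::HTTP calls are just illustrative stand-ins:

```ruby
require 'net/http'

# In a controller action: fire off several requests in parallel threads,
# then wait for all of them before rendering. (URLs are illustrative.)
def show
  urls = [
    "http://example.com/service_a",
    "http://example.com/service_b",
    "http://example.com/service_c"
  ]

  threads = urls.map do |url|
    Thread.new { Net::HTTP.get_response(URI(url)) }
  end

  # Thread#value joins each thread and returns its block's return value.
  @responses = threads.map(&:value)

  render "show"
end
```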

In Rails4, that would work… mostly. With dev-mode class-reloading/autoloading on, you’d get occasional weird exceptions. Or of course you can turn dev-mode class-reloading off.

In Rails5, you can still turn dev-mode class-reloading/autoloading off and it will still work. But if you have autoload/class-reload on, instead of an occasional weird exception, you'll get a nearly(?) universal deadlock. Here's what you need to do instead:
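Again a sketch with the same illustrative requests; the important additions are the Rails.application.executor.wrap around each thread's work, and ActiveSupport::Dependencies.interlock.permit_concurrent_loads around the spot where the request thread blocks waiting on the other threads:

```ruby
require 'net/http'

def show
  urls = [
    "http://example.com/service_a",
    "http://example.com/service_b",
    "http://example.com/service_c"
  ]

  threads = urls.map do |url|
    Thread.new do
      # Tell Rails this thread is doing framework-y work (possibly including
      # autoloads), so the autoload interlock can keep things thread-safe.
      Rails.application.executor.wrap do
        Net::HTTP.get_response(URI(url))
      end
    end
  end

  # While we block waiting on the child threads, allow them to perform
  # autoloads. Without this, the request thread keeps its share of the
  # autoload interlock while it waits, and the child threads can deadlock.
  ActiveSupport::Dependencies.interlock.permit_concurrent_loads do
    @responses = threads.map(&:value)
  end

  render "show"
end
```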

And it should actually work reliably, without intermittent mysterious “class unloaded” type errors like in Rails4.

ActiveRecord?

Previously, one big challenge with using ActiveRecord under concurrency was avoiding leaked connections.

I think that if your concurrent work is wrapped in Rails.application.reloader.wrap do or Rails.application.executor.wrap do, this is no longer a problem: they'll take care of returning any pending checked-out AR db connections to the pool at the end of the block.

So you theoretically don’t need to be so careful about wrapping every single concurrent use of AR in a ActiveRecord::Base.connection_pool.with_connection  to avoid leaked connections.

But I think you still can, and it won't hurt, and it should sometimes lead to shorter, finer-grained checkouts of db connections from the pool, which matters if you potentially have more threads than your AR connection pool size. I am still wrapping in ActiveRecord::Base.connection_pool.with_connection, out of superstition if nothing else.
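Concretely, that looks something like this (again a sketch; SomeModel is hypothetical):

```ruby
Thread.new do
  Rails.application.executor.wrap do
    # Check a connection out only for as long as it's actually needed,
    # rather than holding one for the whole life of the thread.
    ActiveRecord::Base.connection_pool.with_connection do
      SomeModel.where(active: true).count
    end
  end
end
```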

Under Test with Capybara?

One of the things that makes Capybara feature tests so challenging is that they inherently involve concurrency — there’s a Rails app running in a different thread than your tests themselves.

I think this new architecture could theoretically pave the way to making this all a lot more intentional and reliable, but I’m not entirely sure, not sure if it helps at all already just by existing, or would instead require Capybara to make use of the relevant API hooks (which nobody’s prob gonna write until there are more people who understand what’s going on).

Note though that Rails 4 generated a comment in config/environments/test.rb that says "If you are using a tool that preloads Rails for running tests [which I think means Capybara feature testing], you may have to set [config.eager_load] to true."  I'm not really sure how true this was in even past versions of Rails (whether it was necessary or sufficient). This comment is no longer generated in Rails 5, and eager_load is still generated to be false … so maybe something improved?

Frankly, that’s a lot of inferences, and I have been still leaving eager_load = true under test in my Capybara-feature-test-using apps, because the last thing I need is more fighting with a Capybara suite that is the closest to reliable I’ve gotten it.

Debugging?

The biggest headache is that a bug in the use of the reloader/executor architecture manifests as a deadlock — and I’m not talking the kind that gives you a ruby ‘deadlock’ exception, but the kind where your app just hangs forever doing nothing. This is painful to debug.

These deadlocks in my experience are sometimes not entirely reproducible, you might get one in one run and not another, but they tend to manifest fairly frequently when a problem exists, and are sometimes entirely reproducible.

The first step is experimentally turning off dev-mode class-reloading and auto-loading altogether (config.eager_load = true, config.cache_classes = true), and seeing if your deadlock goes away. If it does, it's probably got something to do with not properly using the new Reloader architecture. In desperation, you could just give up on dev-mode class-reloading, but that'd be sad.

Rails 5.0.1 introduces a DebugLocks feature intended to help you debug these deadlocks:

Added new ActionDispatch::DebugLocks middleware that can be used to diagnose deadlocks in the autoload interlock. To use it, insert it near the top of the middleware stack, using config/application.rb:

config.middleware.insert_before Rack::Sendfile, ActionDispatch::DebugLocks

After adding, visiting /rails/locks will show a summary of all threads currently known to the interlock.

PR, or at least initial PR, at rails/rails #25344.

I haven’t tried this yet, I’m not sure how useful it will be, I’m frankly not too enthused by this as an approach.

References
  • Rails.application.executor and Rails.application.reloader are initialized here, I think.
  • Not sure the design intent of: Executor being an empty subclass of ExecutionWrapper; Rails.application.executor being an anonymous sub-class of Executor (which doesn't seem to add any behavior either? Rails.application.reloader does the same thing fwiw); or if further configuration of the Executor is done in other parts of the code.
  • Sidekiq PR #2457 Enable code reloading in development mode with Rails 5 using the Rails.application.reloader, I believe code may have been written by matthewd. This is a good intro example of using the architecture as intended (since matthewd wrote/signed off on it), but beware churn in Sidekiq code around this stuff dealing with issues and problems after this commit as well; not sure if Sidekiq later backed out of this whole feature?  But the Sidekiq source is probably a good one to track.
  • A dialog in Rails Github Issue #686 between me and matthewd, where he kindly leads me through some of the figuring out how to do things right with the new arch. See also several other issues linked from there, and links into Rails source code from matthewd.
Conclusion

If I got anything wrong, or you have any more information you think useful, please feel free to comment here — and/or write a blog post of your own. Collaboratively, maybe we can identify if not fix any outstanding bugs, write docs, maybe even improve the API a bit.

While the new architecture holds the promise to make concurrent programming in Rails a lot more reliable — making dev-mode class-reloading at least theoretically possible to do thread-safely, when it wasn’t at all possible before — in the short term, I’m afraid it’s making concurrent programming in Rails a bit harder for me.  But I bet docs will go a long way there.


Filed under: General

Meredith Farkas: 2016 wasn’t all bad

Sat, 2016-12-31 17:45

As I alluded to in my last post, this year was a difficult one for me personally that ended up turning out for the better. I know that many of us have felt dispirited and beaten down since the election and feel like 2016 was a flaming dumpster fire of a year, so I’ve decided to look back at the things I’m grateful for in 2016.

Here’s my list, in no particular order:

  • Starting my third year as a librarian at Portland Community College. I have never felt more at home in a job and more challenged in fun and exciting ways. I love the work I get to do, my colleagues in the library, and the disciplinary faculty who are so devoted to their students. I’m right where I want to be.
  • Taking up archery. I’m not a very sporty person, so I was surprised by how much I enjoy shooting arrows at a target. It’s an elegant and challenging sport that I have fallen in love with.
  • Meeting Bruce Springsteen! Twice!!
  • Living in Oregon. Portland is such an amazing city with great food and culture, and Oregon is so full of natural wonders that there is always a new hiking trail, mountain, meadow, or waterfall to discover. It’s not perfect (what place is?), but I feel incredibly privileged to live here.
  • Attending the ACRL Oregon/Washington joint conference, also known as librarian camp. I always come away from that conference feeling so much love for our profession and so much hope in our future. The Pacific NW is full of amazing and collaborative librarians and I feel proud to be a part of this community.
  • Attending my dear high school friend Melodie’s wedding in Florida. There are few people in this world I will ever love the way I do Melodie and I was brought to tears being a part of this happy occasion in her life.
  • Finally finding an exercise routine I could stick with. I’ve been doing it faithfully for a year now and I’ve lost weight, gained muscle, and feel terrific!
  • The fact that my son just keeps getting more and more awesome. I don’t know how I deserved such a brilliant, kind, curious, compassionate, and cuddly kid. I feel especially grateful that I get to spend the entire summer break with him going on adventures.
  • Teaching LIS courses for San Jose State is a lot of work, but is always rewarding. I love teaching and I am constantly amazed by the caliber of students coming out of library school (or at least San Jose State).
  • Splurging for MLB at Bat so we could watch NY Mets games anytime and anywhere we wanted, even on my phone while stuck in traffic on the way home from Seattle.
  • Attending the Nutcracker with my son. We’ve done the abbreviated “Nutcracker Tea” for the past three years, but this December, we got front row center seats to the Oregon Ballet Theater’s stunning rendition of George Balanchine’s The Nutcracker with a live orchestra. It’s a wonderful tradition to have with my buddy.
  • Visiting Sunriver and Bend (in Central Oregon) in August. I love bike riding and I spend as much time as I can in Sunriver riding my bike through forests and along the Deschutes River. We rented a house and did a lot of good eating, hiking, swimming, and miniature golfing on this trip too.
  • My husband’s unconditional love. I don’t deserve everything he’s done for me and the faith he has had in me, but I am so grateful for it. He’s the best person I know.
  • Attending more Hillsboro Hops games this past summer. Our local single A minor league team’s games are SO fun and cheap! I just bought a 4-game season ticket package for the three of us in 2017 for less than a single major league game ticket would cost.
  • Going to Universal Studios Hollywood with my husband and son. What a fantastic day that was!!
  • I am terrible at keeping happy secrets, so I was incredibly proud of myself for actually surprising my husband with a romantic anniversary trip to the wine country in Walla Walla, Washington. He didn’t even know where we were going as we drove there!!! It was a perfect mini-vacation with bike riding, wine drinking, and hot tubbing.
  • All of the concerts I attended! This year I tried to get out to more concerts — both rock/pop, jazz, and classical. Highlights included Pink Martini, Bruce Springsteen, Belly, the Oregon Symphony’s program with Holst’s The Planets and Natasha Peremsky, the Danish String Quartet, Chico Freeman, and Tony Starlight vs. Jazz.
  • Seeing so many people mobilizing to do good after Trump was elected. It feels like we are going into a very dark time now, but I am heartened by seeing so many people coming together to form a resistance and help people who are or might be negatively impacted by Trump. Most people of privilege have been complacent for a long time, and I think the election shook many of us out of our comfort zones. If Trump’s election mobilizes people in the long-term to fight for things like racial and gender equality, transgender rights, and the environment, then that’s a good thing. I have to remain hopeful; I have a kid I love very much who is going to inherit this world from us and I want to help make it better (or at least not worse).
  • I feel so fortunate to be a part of our amazing profession. When I see librarians passionately standing up for the values of our profession — especially those around access and privacy — I feel very proud to be a librarian. I remain optimistic that we will remain faithful to these values during what I’m sure will be a challenging next four years.

And I’m grateful to you, my friends who read this blog, and especially those of you who have engaged with me here or via social media. There have been moments in my life where I’ve felt very alone and isolated, and this blog has sometimes served as a lifeline for me over the past 12 years. Thank you for reading it and for connecting with me. I wish you all good things in the coming year.

What was the best thing that happened to you in 2016? What are you most grateful for?

Conal Tuohy: A tool for Web API harvesting

Sat, 2016-12-31 05:31

A medieval man harvesting metadata from a medieval Web API

As 2016 stumbles to an end, I’ve put in a few days’ work on my new project Oceania, which is to be a Linked Data service for cultural heritage in this part of the world. Part of this project involves harvesting data from cultural institutions which make their collections available via so-called “Web APIs”. There are some very standard ways to publish data, such as OAI-PMH, OpenSearch, SRU, RSS, etc, but many cultural heritage institutions instead offer custom-built APIs that work in their own peculiar way, which means that you need to put in a certain amount of effort in learning each API and dealing with its specific requirements. So I’ve turned to the problem of how to deal with these APIs in the most generic way possible, and written a program that can handle a lot of what is common in most Web APIs, and can be easily configured to understand the specifics of particular APIs.

This program, which I’ve called API Harvester, can be configured by giving it a few simple instructions: where to download the data from, how to split up the downloaded data into individual records, where to save the record files, how to name those files, and where to get the next batch of data from (i.e. how to resume the harvest). The API Harvester does have one hard requirement: it is only able to harvest data in XML format, but most of the APIs I’ve seen offered by cultural heritage institutions do provide XML, so I think it’s not a big limitation.

The API Harvester software is open source, and free to use; I hope that other people find it useful, and I’m happy to accept feedback or improvements, or examples of how to use it with specific APIs. I’ve created a wiki page to record example commands for harvesting from a variety of APIs, including OAI-PMH, the Trove API, and an RSS feed from this blog. This wiki page is currently open for editing, so if you use the API Harvester, I encourage you to record the command you use, so other people can benefit from your work. If you have trouble with it, or need a hand, feel free to raise an issue on the GitHub repository, leave a comment here, or contact me on Twitter.

Finally, a brief word on how to use the software: to tell the harvester how to pull a response apart into individual records, and where to download the next page of records from (and the next, and the next…), you give it instructions in the form of “XPath expressions”. XPath is a micro-language for querying XML documents; it allows you to refer to elements and attributes and pieces of text within an XML document, to perform basic arithmetic and manipulate strings of text. XPath is simple yet enormously powerful; if you are planning on doing anything with XML it’s an essential thing to learn, even if only to a very basic level. I’m not going to give a tutorial on XPath here (there are plenty on the web), but I’ll give an example of querying the Trove API, and briefly explain the XPath expressions used in that example:

Here’s the command I would use to harvest metadata about maps, relevant to the word “oceania”, from the Trove API, and save the results in a new folder called “oceania-maps” in my Downloads folder:

java -jar apiharvester.jar
directory="/home/ctuohy/Downloads/oceania-maps"
retries=5
url="http://api.trove.nla.gov.au/result?q=oceania&zone=map&reclevel=full"
url-suffix="&key=XXXXXXX"
records-xpath="/response/zone/records/*"
id-xpath="@url"
resumption-xpath="/response/zone/records/@next"

For legibility, I’ve split the command onto multiple lines, but this is a single command and should be entered on a single line.

Going through the parts of the command in order:

  • The command java launches a Java Virtual Machine to run the harvester application (which is written in the Java language).
  • The next item, -jar, tells Java to run a program that’s been packaged as a “Java Archive” (jar) file.
  • The next item, apiharvester.jar, is the harvester program itself, packaged as a jar file.

The remainder of the command consists of parameters that are passed to the API harvester and control its behaviour.

  • The first parameter, directory="/home/ctuohy/Downloads/oceania-maps", tells the harvester where to save the XML files; it will create this folder if it doesn’t already exist.
  • With the second parameter, retries=5, I’m telling the harvester to retry a download up to 5 times if it fails; Trove’s server can sometimes be a bit flaky at busy times; retrying a few times can save the day.
  • The third parameter, url="http://api.trove.nla.gov.au/result?q=oceania&zone=map&reclevel=full", tells the harvester where to download the first batch of data from. To generate a URL like this, I recommend using Tim Sherratt’s excellent online tool, the Trove API Console.
  • The next parameter url-suffix="&key=XXXXXXX" specifies a suffix that the harvester will append to the end of all the URLs which it requests. Here, I’ve used url-suffix to specify Trove’s “API Key”; a password which each registered Trove API user is given. To get one of these, see the Trove Help Centre. NB XXXXXXX is not my actual API Key.

The remaining parameters are all XPath expressions. To understand them, it will be helpful to look at the XML content which the Trove API returns in response to that query, and which these XPath expressions apply to.

  • The first XPath parameter, records-xpath="/response/zone/records/*", identifies the elements in the XML which constitute the individual records. The XPath /response/zone/records/* describes a path down the hierarchical structure of the XML: the initial / refers to the start of the document, the response refers to an element with that name at the “root” of the document, then /zone refers to any element called zone within that response element, then /records refers to any records elements within any of those zone elements, and the final /* refers to any elements (with any name) within any of those records elements. In practice, this XPath expression identifies all the work elements in the API’s response, and means that each of these work elements (and its contents) ends up saved in its own file.
  • The next parameter, id-xpath="@url" tells the harvester where to find a unique identifier for the record, to generate a unique file name. This XPath is evaluated relative to the elements identified by the records-xpath; i.e. it gets evaluated once for each record, starting from the record’s work element. The expression @url means “the value of the attribute named url”; the result is that the harvested records are saved in files whose names are derived from these URLs. If you look at the XML, you’ll see I could equally have used the expression @id instead of @url.
  • The final parameter, resumption-xpath="/response/zone/records/@next", tells the harvester where to find a URL (or URLs) from which it can resume the harvest, after saving the records from the first response. You’ll see in the Trove API response that the records element has an attribute called next which contains a URL for this purpose. When the harvester evaluates this XPath expression, it gathers up the next URLs and repeats the whole download process again for each one. Eventually, the API will respond with a records element which doesn’t have a next attribute (meaning that there are no more records). At that point, the XPath expression will evaluate to nothing, and the harvester will run out of URLs to harvest, and grind to a halt.

Happy New Year to all my readers! I hope this tool is of use to some of you, and I wish you a productive year of metadata harvesting in 2017!

Information Technology and Libraries: President's Message: Focus on Information Ethics

Fri, 2016-12-30 22:48
President's Message: Focus on Information Ethics

Information Technology and Libraries: Editorial Board Thoughts: Metadata Training in Canadian Library Technician Programs

Fri, 2016-12-30 22:48
Editorial Board Thoughts: Metadata Training in Canadian Library Technician Programs

Information Technology and Libraries: Technology Skills in the Workplace: Information Professionals’ Current Use and Future Aspirations

Fri, 2016-12-30 22:48
Information technology serves as an essential tool for today's information professional, with a need for ongoing research attention to assess the technological directions of the field over time. This paper presents the results of a survey of the technologies used by library and information science (LIS) practitioners, with attention to the combinations of technologies employed and the technology skills that practitioners wish to learn.  The most common technologies employed were: email, office productivity tools, web browsers, library catalog and database searching tools, and printers, with programming topping the list of most-desired technology skill to learn. Generally similar technology usage patterns were observed for early and later-career practitioners. Findings also suggested the relative rarity of emerging technologies, such as the makerspace, in current practice.

Information Technology and Libraries: Accessibility of Vendor-Created Database Tutorials for People with Disabilities

Fri, 2016-12-30 22:48
Many video, screencast, webinar, or interactive tutorials are created and provided by vendors for use by libraries to instruct users in database searching. This study investigates whether these vendor-created database tutorials are accessible for people with disabilities, to see whether librarians can use these tutorials instead of creating them in-house.  Findings on accessibility were mixed.  Positive accessibility features and common accessibility problems are described, with recommendations on how to maximize accessibility.

Information Technology and Libraries: Analyzing Digital Collections Entrances: What Gets Used and Why It Matters

Fri, 2016-12-30 22:48
This paper analyzes usage data from Hunter Library's digital collections using Google Analytics for a period of twenty-seven months from October 2013 through December 2015. The authors consider this data analysis to be important for identifying collections that receive the largest number of visits. We argue this data evaluation is important in terms of better informing decisions for building digital collections that will serve user needs. The authors also study the benefits of harvesting to sites such as the DPLA and consider this paper will contribute to the overall literature on Google Analytics and its use by libraries.

LITA: Jobs in Information Technology: December 28, 2016

Wed, 2016-12-28 22:24

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

University at Albany, State University of New York, Director of Technical Services and Library Systems, Albany, NY

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Meredith Farkas: My year in reading 2016

Wed, 2016-12-28 21:58

2016 has been one hell of a year. It started out for me with optimistic giddiness, then crashed into the land of extreme stress and fear and stayed there rather longer than I would have liked. But what I’d thought was the end of so many good things in my life actually marked the beginning of a fantastic new chapter. While I wish I could have skipped the painful lessons and jumped right to the end, I’m grateful for all I learned this year. I’m happier and healthier for having gone through it. That doesn’t mean I’m not freaked out as hell by the incoming presidential administration or really saddened by the deaths of so many actors, writers, and musicians I loved, but I also feel tremendously lucky for what I have in my life. So much love.

Compiling the list of books I read this year takes me back to some of the sadder times, because I can remember where I was while I was reading each one. Those books listed in bold were among my Top 10 for the year. Those with an asterisk are ones I either didn’t finish or didn’t really like. This list does not include the books I read to my son this year because it would be a VERY long list otherwise. Maggie Nelson’s visceral, honest, and poetic essays in Bluets and The Argonauts were, without question, two of the three best things I read this year. If you haven’t read Maggie Nelson, what are you waiting for?!?!?

Novels:

  • Rich and Pretty: A Novel by Rumaan Alam
  • Fifteen Dogs by Andre Alexis
  • Did You Ever Have a Family by Bill Clegg
  • The Sunlit Night by Rebecca Dinerstein
  • The Green Road by Anne Enright
  • Days of Awe: A novel by Lauren Fox
  • The Nightingale by Kristin Hannah
  • The First Bad Man by Miranda July (after slogging through the whole thing, I’m still not sure what to think of this book and whether or not I should give it an asterisk)
  • Modern Lovers by Emma Straub
  • Crossing to Safety by Wallace Stegner
  • The Engagements by J. Courtney Sullivan
  • Gold Fame Citrus: A Novel by Claire Vaye Watkins
  • A Little Life: A Novel by Hanya Yanagihara

Short-Story Collections:

  • A Manual for Cleaning Women by Lucia Berlin
  • What We Talk About When We Talk About Love by Raymond Carver (a re-read, many times over)
  • American Housewife: Stories* by Helen Ellis
  • You Should Pity Us Instead by Amy Gustine
  • Fortune Smiles: Stories by Adam Johnson
  • The Dream Life of Astronauts: Stories by Patrick Ryan

Memoirs/Essays/Non-Fiction:

  • Desert Solitaire: A Season in the Wilderness by Edward Abbey
  • Not That Kind of Girl: A Young Woman Tells You What She’s “Learned”* by Lena Dunham
  • Where Nobody Knows Your Name: Life in the Minor Leagues of Baseball by John Feinstein
  • Wishful Drinking by Carrie Fisher
  • The Sisters Are Alright: Changing the Broken Narrative of Black Women in America by Tamara Winfrey Harris
  • When Breath Becomes Air by Paul Kalanithi
  • Is Everyone Hanging Out Without Me? (And Other Concerns) by Mindy Kaling
  • Bluets by Maggie Nelson
  • The Argonauts by Maggie Nelson
  • Dear Mr. You* by Mary-Louise Parker
  • The Faraway Nearby by Rebecca Solnit
  • Born to Run by Bruce Springsteen

Poetry:

  • All of Us: The Collected Poems by Raymond Carver
  • E. E. Cummings Complete Poems by e. e. cummings
  • Collected Poems by Edna St. Vincent Millay
  • Felicity by Mary Oliver
  • The Selected Poetry of Rainer Maria Rilke by Ranier Maria Rilke
  • Poems New and Collected by Wislawa Szymborska

Here are some books I hope to read in 2017. If you have any feedback on them (must-reads or don’t-reads) let me know!

  • Either Secondhand Time or Voices from Chernobyl (or both) by Svetlana Alexievich
  • Willful Disregard: A Novel About Love by Lena Andersson
  • The Elegance of the Hedgehog by Muriel Barbery
  • Evicted: Poverty and Profit in the American City by Matthew Desmond
  • Eleven Hours by Pamela Erens
  • Abandon Me by Melissa Febos
  • My Brilliant Friend by Elena Ferrante
  • Homegoing by Yaa Gyasi
  • Before the Fall by Noah Hawley
  • Lab Girl by Hope Jahren
  • Furiously Happy by Jenny Lawson
  • It Can’t Happen Here by Sinclair Lewis
  • The Association of Small Bombs by Karan Mahajan
  • Nutshell by Ian McEwan
  • Norwegian by Night by Derek Miller
  • The Bluest Eye by Toni Morrison
  • The Sympathizer by Viet Thanh Nguyen
  • Commonwealth by Ann Patchet
  • Eleanor and Park by Rainbow Rowell (the last Rowell book I haven’t read!)
  • Men Explain Things to Me by Rebecca Solnit
  • The Paying Guests by Sarah Waters
  • The Underground Railroad by Colson Whitehead

May 2017 be kinder to all of us. And if it isn’t, I hope you find some good books that transport you somewhere else (at least temporarily).

Jason Ronallo: Testing DASH and HLS Streams on Linux

Wed, 2016-12-28 20:08

I’m developing a script to process some digitized and born digital video into adaptive bitrate formats. As I’ve gone along trying different approaches for creating these streams, I’ve wanted reliable ways to test them on a Linux desktop. I keep forgetting how I can effectively test DASH and HLS adaptive bitrate streams I’ve created, so I’m jotting down some notes here as a reminder. I’ll list both the local and online players that you can try.

While I’m writing about testing both DASH and HLS adaptive bitrate formats, really we need to consider 3 formats, as HLS can be delivered as MPEG-2 TS segments or fragmented MP4 (fMP4). Since mid-2016 and iOS 10+, HLS segments can be delivered as fMP4. This now allows you to use the same fragmented MP4 files for both DASH and HLS streaming. Until uptake of iOS 10 is greater you likely still need to deliver video with HLS-TS as well (or go with an HLS-TS everywhere approach). While DASH can use any codec, I’ll only be testing fragmented MP4s (though maybe not fully conformant to DASH-IF AVC/264 interoperability points). So I’ll break down testing by DASH, HLS-TS, and HLS-fMP4 when applicable.

The important thing to remember is that you’re not playing back a video file directly. Instead these formats use a manifest file which lists out the various adaptations (different resolutions and bitrates) that a client can choose to play based on bandwidth and other factors. So what we want to accomplish is the ability to play back video by referring to the manifest file instead of any particular video file or files. In some cases the video files will be self-contained muxed video and audio, and byte range requests will be used to serve up segments; in other cases the video is segmented, with the audio either in a separate single file or segmented similarly to the video. In fact, depending on how the actual video files are created, they may even lack data necessary to play back independent of another file. For instance it is possible to create a separate initialization MP4 file that includes the metadata that allows a client to know how to play back each of the segment files that lack this information. Also, all of these files are intended to be served up over HTTP. They can also include links to text tracks like captions and subtitles. Support for captions in these formats is lacking for many HTML5 players.

Also note that all this testing is being done on Ubuntu 16.04.1 LTS (the Xubuntu variant), and it is possible I’ve compiled some of these tools myself (like ffmpeg) rather than using the version in the Ubuntu repositories.

Playing Manifests Directly

I had hoped that it would be fairly easy to test these formats directly without putting them behind a web server. Here’s what I discovered about playing the files without a web server.

GUI Players

Players like VLC and other desktop players have limited support for these formats, so even when they don’t work in these players that doesn’t mean the streams won’t play in a browser or on a mobile device. I’ve had very little luck using these directly from the file system. Assume for this post that I’m already in a directory with the video manifest files: cd /directory/with/video

So this doesn’t work for a DASH manifest (Media Presentation Description): vlc stream.mpd

Neither does this for an HLS-TS manifest: vlc master.m3u8

In the case of HLS it looks like VLC is not respecting relative paths the way it needs to. Some players appear like they’re trying to play HLS, but I haven’t found a Linux GUI player yet that can play the stream directly from the file system like this. Suggestions?

Command Line Players

DASH

Local testing of DASH can be done with the GPAC MP4Client: MP4Client stream.mpd

This works and can tell you if it is basically working and a separate audio file is synced, but only appears to show the first adaptation. I also have some times when it will not play a DASH stream that plays just fine elsewhere. It will not show you whether the sidecar captions are working and I’ve not been able to use MP4Client to figure out whether the adaptations are set up correctly. Will the video sources actually switch with restricted bandwidth? There’s a command line option for this but I can’t see that it works.

HLS

For HLS-TS it is possible to use the ffplay media player that uses the ffmpeg libraries. It has some of the same limitations as MP4Client as far as testing adaptations and captions. The ffplay player won’t work though for HLS-fMP4 or MPEG-DASH.

Other Command Line Players

The mpv media player is based on MPlayer and mplayer2 and can play back both HLS-TS and HLS-fMP4 streams, but not DASH. It also has some nice overlay controls for navigating through a video including knowing about various audio tracks. Just use it with mpv master.m3u8. The mplayer player also works, but seems to choose only one adaptation (the lowest bitrate or the first in the list?) and does not have overlay controls. It doesn’t seem to recognize the sidecar captions included in the HLS-TS manifest.

Behind a Web Server

One simple solution to be able to use other players is to put the files behind a web server. While local players may work, these formats are really intended to be streamed over HTTP. I usually do this by installing Apache and allowing symlinks. I then symlink from the web root to the temporary directory where I’m generating various ABR files. If you don’t want to set up Apache you can also try web-server-chrome which works well in the cases I’ve tested (h/t @Bigggggg_Al).

GUI Players & HTTP

I’ve found that the GStreamer based Parole media player included with XFCE can play DASH and HLS-TS streams just fine. It does appear to adapt to higher bitrate versions as it plays along, but Parole cannot play HLS-fMP4 streams yet.

To play a DASH stream: parole http://localhost/pets/fmp4/stream.mpd

To play an HLS-TS stream: parole http://localhost/pets/hls/master.m3u8

Are there other Linux GUIs that are known to work?

Command Line Players & HTTP

ffplay and MP4Client also work with localhost URLs. ffplay can play HLS-TS streams. MP4Client can play DASH and HLS-TS streams, but for HLS-TS it seems to not play the audio.

Online Players

And once you have a stream already served up from a local web server, there are online test players that you can use. No need to open up a port on your machine since all the requests are made by the browser to the local server which it already has access to. This is more cumbersome with copy/paste work, but is probably the best way to determine if the stream will play in Firefox and Chromium. The main thing you’ll need to do is set CORS headers appropriately. If you have any problems with this check your browser console to see what errors you’re getting. Besides the standard Access-Control-Allow-Origin “*” for some players you may need to set headers to accept pre-flight Access-Control-Allow-Headers like “Range” for byte range requests.

The Bitmovin MPEG-DASH & HLS Test Player requires that you select whether the source is DASH or HLS-TS (or progressive download). Even though Linux desktop browsers do not natively support playing HLS-TS this player can repackage the TS segments so that they can be played back as MP4. This player does not work with HLS-fMP4 streams, though. Captions that are included in the DASH or HLS manifests can be displayed by clicking on the gear icon, though there’s some kind of double-render issue with the DASH manifests I’ve tested.

Really when you’re delivering DASH you’re probably using dash.js underneath in most cases so testing that player is useful. The DASH-264 JavaScript Reference Client Player has a lot of nice features like allowing the user to select the adaptation to play and display of various metrics about the video and audio buffers and the bitrate that is being downloaded. Once you have some files in production this can be helpful for seeing how well your server is performing. Captions that are included in the DASH manifest can be displayed.

The hls.js player has a great demo site for each version that has a lot of options to test quality control and show other metrics. The other nice part about this demo page is that you can just add a src parameter to the URL with the localhost URL you want to test. I could not get hls.js to work with HLS-fMP4 streams, though there is an issue to add fMP4 support. Captions do not seem to be enabled.

There is also the JW Player Stream Tester. But since I don’t have a cert for my local server I need to use the JW Player HTTP stream tester instead of the HTTPS one. I was successfully able to test DASH and HLS-TS streams with this tool. Captions only displayed for the HLS stream.

The commercial Radiant media player has a DASH and HLS tester that can be controlled with URL parameters. I’m not sure why the streaming type needs to be selected first, but otherwise it works well. It knows how to handle DASH captions but not HLS ones, and it does not work with HLS-fMP4.

The commercial THEOplayer HLS and DASH testing tool only worked for my HLS-TS stream and not the DASH or HLS-fMP4 streams I’ve tested. Maybe it was the test examples given, but even their own examples did not adapt well and had buffering issues.

Wowza has a page for video test players but it seems to require a local Wowza server be set up.

What other demo players are there online that can be used to test ABR streams?

I’ve also created a little DASH tester using Plyr and dash.js. You can either enter a URL to an MPD into the input or append a src parameter with the URL to the MPD to the test page URL. To make it even easier to use, I created a short script that allows me to launch it from a terminal just by giving it the MPD URL. This approach could be used for a couple of the other demos above as well.

One gap in my testing so far is the Shaka player. They have a demo site, but it doesn’t allow enabling an arbitrary stream.

Other Tools for ABR Testing

In order to test automatic bitrate switching it is useful to test that bandwidth switching is working. Latest Chromium and Firefox nightly both have tools built into their developer tools to simulate different bandwidth conditions. In Chromium this is under the network tab and in Firefox nightly it is only accessible when turning on the mobile/responsive view. If you set the bandwidth to 2G you ought to see network requests for a low bitrate adaptation, and if you change it to wifi it ought to adapt to a high bitrate adaptation.

Summary

There are decent tools to test HLS and MPEG-DASH while working on a Linux desktop. I prefer using command line tools like MP4Client (DASH) and mpv (HLS-TS, HLS-fMP4) for quick tests that the video and audio are packaged correctly and that the files are organized and named correctly. These two tools cover both formats and can be launched quickly from a terminal.

I plan on taking a DASH-first approach, and for desktop testing I prefer to test in video.js if caption tracks are added as track elements. With contributed plugins it is possible to test DASH and HLS-TS in browsers. I like testing with Plyr (with my modifications) if the caption file is included in DASH manifest since Plyr was easy to hack to make this work. For HLS-fMP4 (and even HLS-TS) there’s really no substitute to testing on an iOS device (and for HLS-fMP4 on an iOS 10+ device) as the native player may be used in full screen mode.

Harvard Library Innovation Lab: Physical Pitch Decks

Thu, 2016-12-22 16:00

I’ve been playing with physical pitch decks lately. Slides as printed cards.

PowerPoint, Deck.js, and the like are fantastic when sharing with large groups of people — from a classroom full of folks to a web full of folks. But, what if easy and broad sharing isn’t a criteria for your pitch deck?

You might end up with physical cards like I did when I recently pitched Private Talking Spaces. The cards are surprisingly good!! Just like non-physical slides, they can provide outlines for talks and discussions, but they’re so simple (just paper and ink), they won’t get in the way when sharing ideas with small groups.

The operation of the cards is as plain as can be – just take the card off the top, flip it upside down, and put it to the side. 

n cards = n screens in the world of physical pitch decks. I wish we had multiple projectors in rooms! In the photo above, I pin my agenda card up top.

I drew the slides in Adobe Illustrator. They’re six inches square and printed on sturdy paper. If you’d like to make your own, here’s my .ai file and here’s a .pdf version.

It feels like there’s something here. Some depth. If you’ve had success with physical pitch decks, please send me pointers. Thanks!!

David Rosenthal: Walking Away From The Table

Thu, 2016-12-22 16:00
Last time we were buying a car, at the end of a long and frustrating process we finally decided that what we wanted was the bottom end of the range, with no options. The dealer told us that choice wasn't available in our market. We said "OK, call us if you ever find a car like that" and walked away. It was just over two weeks before we got the call. At the end of 2014 I wrote:
The discussions between libraries and major publishers about subscriptions have only rarely been actual negotiations. In almost all cases the libraries have been unwilling to walk away and the publishers have known this. This may be starting to change; Dutch libraries have walked away from the table with Elsevier.

Actually, negotiations continued and a year later John Bohannon reported for Science that a deal was concluded:
A standoff between Dutch universities and publishing giant Elsevier is finally over. After more than a year of negotiations — and a threat to boycott Elsevier's 2500 journals — a deal has been struck: For no additional charge beyond subscription fees, 30% of research published by Dutch researchers in Elsevier journals will be open access by 2018. ... The dispute involves a mandate announced in January 2014 by Sander Dekker, state secretary at the Ministry for Education, Culture and Science of the Netherlands. It requires that 60% of government-funded research papers should be free to the public by 2019, and 100% by 2024.

By being willing to walk away, the Dutch achieved a partial victory against Elsevier's defining away of double-dipping, their insistence that article processing charges were in addition to subscriptions, not instead of subscriptions. This is a preview of the battle over the EU's 2020 open access mandate.

The UK has just concluded negotiations, and a major German consortium is in the midst of them. Below the fold, some commentary on their different approaches.

In the UK, JISC Collections negotiates a national deal with each publisher; universities can purchase from the publisher under the deal. Tim Gowers reports that:
The current deal that the universities have with Elsevier expires at the end of this year, and a new one has been negotiated between Elsevier and Jisc Collections, the body tasked with representing the UK universities.

According to Gowers, JISC Collections had some key goals in the negotiation:
  1. No real-terms price increases.
  2. An offsetting agreement for article processing charges.
  3. No confidentiality clauses.
  4. A move away from basing price on “historic spend”.
  5. A three-year deal rather than a five-year deal.
Gowers quotes a JISC representative saying:
We know from analysis of the experiences of other consortia that Elsevier really do want to reach an agreement this year. They really hate to go over into the next year …

A number of colleagues from other consortia have said they wished they had held on longer …

If we can hold firm even briefly into 2017 that should have quite a profound impact on what we can achieve in these negotiations.

This isn't what happened:
But this sensible negotiating strategy was mysteriously abandoned, on the grounds that it had become clear that the deal on offer was the best that Jisc was going to get.

Gowers' assessment of the eventual deal against the goals is bleak:
  1. it is conceivable that [JISC] will end up achieving their first aim of not having any real-terms price increases: this will depend on whether Brexit causes enough inflation to cancel out such money-terms price increases as there may or may not be
  2. there is no offsetting agreement.
  3. when Elsevier insisted on confidentiality clauses, [JISC] meekly accepted this. ... It is for that reason that I have been a bit vague about prices above.
  4. "The agreement includes the ability for the consortium to migrate from historical print spend and reallocate costs should we so wish." I have no information about whether any “migration” has started, but my guess would be that it hasn’t
  5. the deal is for five years and not for three years.
In other words, because JISC wasn't prepared to walk away they achieved little or nothing. In particular, they failed to make the progress the Dutch had already achieved against Elsevier's double-dipping on APCs.

Contrast this with the German DEAL project:
The DEAL project, headed by HRK (German Rectors' Conference) President Prof Hippler, is negotiating a nationwide license agreement for the entire electronic Elsevier journal portfolio with Elsevier. Its objective is to significantly improve the status quo regarding the provision of and access to content (Open Access) as well as pricing. It aims at relieving the institutions' acquisition budgets and at improving access to scientific literature in a broad and sustainable way.

In order to improve their negotiating power, about 60 major German research institutions including Göttingen University cancelled their contracts with Elsevier as early as October 2016. Others have announced that they will follow this example.

The 60 institutions are preparing their readers for the result of cancellation:
From 1 January 2017 on, Göttingen University — as well as more than 60 other major German research institutions — is expected to have no access to the full texts of journals from the publisher Elsevier. In Göttingen, this applies to 440 journals. There will be access to most archived issues of Göttingen journals (PDF 95 KB), but there may be no access to individual Göttingen e-packages for the economic sciences in particular (PDF 89 KB).

From this time on, we will offer you a free ordering service for articles from these journals: please send your email request, citing the necessary bibliographical data, to our colleagues at the library (email). Should an inter-library loan not be possible, we will endeavor to procure the article via another delivery route for you. This service will be free of charge.

Elsevier's press release indicates that DEAL is sticking to the strategy JISC abandoned:
Since such negotiations for 600+ institutions are complex, both sides have met regularly during the second half of this year and it was a mutual agreement to pause talks until early in the new year.

And that being hard-nosed has an impact:
In fact, it was those institutions themselves that informed us of their intention not to auto-renew their expiring individual access agreements based on the assumption that a national deal would be reached by the end of 2016. It goes without saying that all institutions, even if they cancelled their contracts, will be serviced beyond 2016 should they so choose.

It will be very interesting to see how this plays out.

LITA: Jobs in Information Technology: December 22, 2016

Thu, 2016-12-22 15:11

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Brandeis University Library, Digital Initiatives Librarian, Boston, MA

UC Riverside, University Library, Business Systems Analyst, Riverside, CA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

FOSS4Lib Recent Releases: YAZ - 5.19.0

Thu, 2016-12-22 13:23

Last updated December 22, 2016. Created by Peter Murray on December 22, 2016.

Package: YAZ
Release Date: Friday, December 16, 2016

Open Knowledge Foundation: New Report: Making Citizen-Generated Data Work

Thu, 2016-12-22 10:00

Read the full report here.

We are pleased to announce a new research series investigating how citizens and civil society create data to drive sustainable development. The series follows on from earlier papers on Democratising The Data Revolution and how citizen-generated data can change what public institutions measure. The first report, “Making Citizen-Generated Data Work”, asks what makes citizens and others want to produce and use citizen-generated data. It was written by me, Shazade Jameson, and Eko Prasetyo.

“The goal of Citizen-Generated Data is to monitor, advocate for, or drive change around an issue important to citizens”

The report demonstrates that citizen-generated data projects are rarely the work of individual citizens. Instead, they often depend on partnerships to thrive and are supported by civil society organisations, community-based organisations, governments, or businesses. These partners play a necessary role in providing resources, support, and knowledge to citizens. In return, they can harness data created by citizens to support their own mission. Thus, citizens and their partners often gain mutual benefits from citizen-generated data.

“How can the success of these projects be encouraged, and what factors support strategic uptake of citizen-generated data in the short and long term?”

But if CGD projects rely on partnerships, who has to be engaged, and through which incentives, to enable CGD projects to achieve their goals? How are such multi-stakeholder projects organised, and which resources and expertise do partners bring into a project? What can other projects do to support and benefit their own citizen-generated data initiatives? This report offers recommendations to citizens, civil society organisations, policy-makers, donors, and others on how to foster stronger collaborations.

Read the full report here.

Brown University Library Digital Technologies Projects: Fedora 4 – testing

Wed, 2016-12-21 19:04

Fedora 4.7.1 is scheduled to be released on 1/5/2017, and testing is important to ensure good quality releases (release testing page for Fedora 4.7.1).

Sanity Builds

Some of the testing is for making sure the Fedora .war files can be built with various options on different platforms. To perform this testing, you need to have three required dependencies installed and then run a couple of commands.

Dependencies

Java 8 is required for running Fedora. Git is required to clone the Fedora code repositories. Finally, Fedora uses Maven as its build/management tool. For each of these dependencies, you can install it from your package manager or download it directly (Java, Git, Maven).
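A quick sanity check (just a suggestion, not part of the official release-testing instructions) is to confirm all three are on your PATH before building:

java -version     # should report a 1.8.x runtime
git --version
mvn -version      # also prints which Java installation Maven will use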

Build Tests

Once your dependencies are installed, it’s time to build the .war files. First, clone the repository you want to test (e.g. fcrepo-webapp-plus):

git clone https://github.com/fcrepo4-exts/fcrepo-webapp-plus

Next, in the directory you just created, run the following command to test building it:

mvn clean install

If the output shows a successful build, you can report that to the mailing list. If an error was generated, you can ask the developers about that (also on the mailing list). The generated .war files will be installed to your local Maven repository (as noted in the output of the “mvn clean install” command).
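If you want to double-check where the installed artifact landed, one way (a sketch, assuming Maven’s default local repository location of ~/.m2/repository) is simply to search for it:

find ~/.m2/repository -name "fcrepo-webapp-plus*.war"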

Manual Testing

Another part of the testing is to perform different functions on a deployed version of Fedora.

Deploy

One way to deploy Fedora is on Tomcat 7. After downloading Tomcat, uncompress it and run ./bin/startup.sh. You should see the Tomcat Welcome page at localhost:8080.

To deploy the Fedora application, shut down your Tomcat instance (./bin/shutdown.sh) and copy the fcrepo-webapp-plus war file you built in the steps above to the Tomcat webapps directory. Next, add the following line to a new setenv.sh file in the bin directory of your Tomcat installation (update the fcrepo.home directory as necessary for your environment):

export JAVA_OPTS="${JAVA_OPTS} -Dfcrepo.home=/fcrepo-data -Dfcrepo.modeshape.configuration=classpath:/config/file-simple/repository.json"
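Putting those deployment steps together, a minimal sketch might look like the following. The Tomcat directory, the location of the built war file, and the /fcrepo-data directory are assumptions here; substitute your own paths.

cd /path/to/apache-tomcat-7        # your Tomcat installation (assumed path)
./bin/shutdown.sh                  # stop Tomcat if it is running
cp /path/to/fcrepo-webapp-plus-4.7.1-SNAPSHOT.war webapps/   # the war built above
cat > bin/setenv.sh <<'EOF'
export JAVA_OPTS="${JAVA_OPTS} -Dfcrepo.home=/fcrepo-data -Dfcrepo.modeshape.configuration=classpath:/config/file-simple/repository.json"
EOF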

By default, the fcrepo-webapp-plus application is built with WebACLs enabled, so you’ll need a user with the “fedoraAdmin” role to be able to access Fedora. Edit your Tomcat conf/tomcat-users.xml file to add the “fedoraAdmin” role and give that role to whatever user you’d like to log in as.
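As a rough illustration only (suitable for a throwaway test instance, with a made-up username and password), the edited conf/tomcat-users.xml could be as small as this, written here from the shell:

# Illustrative only: overwrites the stock file with a minimal one.
# The username and password are placeholders; choose your own.
cat > conf/tomcat-users.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<tomcat-users>
  <role rolename="fedoraAdmin"/>
  <user username="testadmin" password="changeme" roles="fedoraAdmin"/>
</tomcat-users>
EOF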

Now start Tomcat again, and you should be able to navigate to http://localhost:8080/fcrepo-webapp-plus-4.7.1-SNAPSHOT/ and start testing Fedora functionality.
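Once Tomcat is up, a quick smoke test from the command line (assuming the REST endpoint is mounted at /rest as in the stock Fedora webapp, and using the hypothetical testadmin user from the sketch above) is to request the repository root:

curl -u testadmin:changeme http://localhost:8080/fcrepo-webapp-plus-4.7.1-SNAPSHOT/rest/   # placeholder credentials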

Library of Congress: The Signal: “Volun-peers” Help Liberate Smithsonian Digital Collections

Wed, 2016-12-21 15:56

Scan of Chamaenerion Latifolium. US National Herbarium, Smithsonian.

The Smithsonian Transcription Center creates indexed, searchable text by means of crowdsourcing…or as Meghan Ferriter, project coordinator at the TC, describes it, “harnessing the endless curiosity and goodwill of the public.” As of the end of the current fiscal year, 7,060 volunteers at the TC have transcribed 208,659 pages.

The scope, planning and execution of the TC’s work – the in-house coordination among the Smithsonian’s units and the external coordination of volunteers — is staggering to think about. The Smithsonian Institution is composed of 19 museums, archives, galleries and libraries; nine research centers; and a zoo. Fifteen of the Smithsonian units have collections in the TC, which is run by Ching-hsien Wang, Libraries and Archives System Support Branch manager with the Smithsonian Institution Office of the Chief Information Officer.

Ferriter said, “To manage a project of this scope, one must understand and troubleshoot the system and unit workflows as well as work with unit representatives as they select content and set objectives for their projects.  Neither simply building a tool nor merely inviting participation is enough to sustain and grow a digital project, whatever the scale.”

The TC benefits from the Smithsonian’s online collections. Though individual units may have their own databases, they all link to a central repository, the Smithsonian’s “Enterprise Digital Asset Network,” or EDAN, which is searchable from the Smithsonian’s Collections Search Center. The TC leverages the capabilities of EDAN and builds on the foundation of data and collections-management systems supported by the Office of the Chief Information Officer. In some cases, for example, a unit may have digitized a collection and the TC arranges for volunteers to add metadata.

Ching-hsien Wang.

Each unit has a different goal for its digital collections. The goal for one project might be to transcribe handwritten notes; the goal for another project might be to key in text from a scanned document. A project might call for geotagging or adding metadata from controlled vocabularies (pre-set tags, used to avoid ambiguities or sloppy mistakes). But the source for each TC project is always a collection of digital files that a volunteer can access online.

Sharing data across the Smithsonian’s back end is an impressive technological feat but it’s only half of this story. The other half is about the relationship between the TC and the volunteers. And the pivotal component that enables the two sides to engage effectively: trust.

The TC’s role at the Smithsonian is as an aggregator, making bulk data available for volunteers to process and directing the flow of volunteer-processed data to the main repository. So, more than just trafficking in data, the TC nurtures its relationships with volunteers by means of technical fail-safe resources and down-to-earth, sincere human engagement.

Ferriter shows her respect for the volunteers when she refers to them as “volunpeers.” Ferriter said, ” ‘Volunpeers’ indicates the ways unit administrators and Smithsonian staff experience the TC along with volunteers. ‘Volunpeers’ underscores the values articulated by volunteers describing their activities and personal goals on the TC, including to learn, to help and to give back to something bigger….Establishing a collaborative space that uses peer-review resources brings to the foreground what is being done together rather than exclusively highlighting what is being done by particular individuals.”

TC staff made a crucial discovery when they figured out that what motivated people to volunteer was a sincere desire to help. Wang said, “Volunteers feel privileged and take the responsibility seriously. And they like that the Smithsonian values what they do.”

Meghan Ferriter.

Ferriter said, “Volunteers indicated they were seeking increased behind-the-scenes access as a reward for participating, rather than receiving discounts or merchandise from Smithsonian vendors.” So TC staff developed a close relationship with the volunteers, and they remain in constant contact by means of social media.

“Communicating in an authentic way is central to my strategy,” Ferriter said. “Being authentic includes being vulnerable and expressing real enthusiasm. It also entails revealing my lack of knowledge while learning alongside volunteers. My strategy incorporates an inclusive attitude with the intent of shortening the distance of institutional authority and public positioning.”

Institutional authority — or the perception of institutional authority — can be a potential obstacle to finding volunteers. Wang said the Smithsonian — like other staid old institutions — was perceived several years ago to have an image problem. She said that research indicated, “People think it’s nothing but old white men scientists.” Wang and Ferriter do not suggest that the solution is for the TC to appear young and hip and “with it.” Rather the TC demonstrates its inclusiveness in a very real and sincere way: by reaching out to any and all volunteers and treating them with appreciation and respect.

Volunteers are always publicly credited for their work. They can download and review PDFs of what they’ve done once a project is finished. Ferriter said, “I advise Smithsonian staff members who want to be part of the Transcription Center, ‘You need to understand that there is a commitment that you’re making to participate in this project, which requires you to be involved with communicating with the public, to answer their questions, to tell them specific details about projects, to be prepared to provide a behind-the-scenes tour.’”

Scan of handwritten document from “The Legend of Sgenhadishon.” National Anthropological Archives, the Smithsonian.

Each project includes three steps: transcription, review and approval. One of the remarkable results of the TC/volunteer relationship is that the review process has become so thorough and consistently reliable, and volunteers behave so professionally and responsibly, that there is often little change required during the approval phase. This trust in the reviewers — trust that the reviewers earn and deserve — saves the Smithsonian a significant amount of staff time.

Another remarkable result of the volunteers’ dedication is that TC staff have found their manual transcriptions to be statistically far superior to OCR, which often tends to be “dirty” and requires additional time and labor to correct.

Ferriter said that as successful as the Transcription Center is, as evidenced by the amount of digital collections it has made keyword searchable, there remain further opportunities to look at the larger picture of inter-related data. “The story may be more than merely what is contained within the TC project,” Ferriter said. “There are opportunities to connect the project to its significance in history, science and other related SI and cultural heritage collections.”

When those opportunities arise, the volunpeers will no doubt help make the connections happen.

FOSS4Lib Recent Releases: veraPDF - 0.26

Wed, 2016-12-21 14:17

Last updated December 21, 2016. Created by Peter Murray on December 21, 2016.

Package: veraPDF
Release Date: Wednesday, December 21, 2016

Open Knowledge Foundation: PersonalData.IO helps you get access to your personal data

Wed, 2016-12-21 13:00

PersonalData.IO is a free and open platform for citizens to track their personal data and understand how it is used by companies. It is part of the MyData movement, promoting a human-centric approach to personal data management.

A lot of readers of this blog will be familiar with Freedom of Information laws, a legal mechanism that forces governments to be more open. Individuals, journalists, startups and other actors can use this “right-to-know” to understand what the government is doing and try to make it function better. There are even platforms that help facilitate the exercise of this right, like MuckRock, WhatDoTheyKnow or FragDenStaat. These platforms also have an education function around information rights.

In Europe we enjoy a similar right with respect to personal data held by private companies, but it is often very hard to exercise it. We want to change that, with PersonalData.IO.

Image credit: Kevin O’Connor (CC BY)

What is personal data?

In European law, the definition of personal data is extremely broad: “any information relating to an identified or identifiable natural person”. Unlike in the U.S., the concept of identifiability is crucial in defining personal data, and it is ever-expanding to match technical possibilities: if some intermediate identifier (license plate, cookie, dynamic IP address, phone number, etc.) can reasonably be traced back to you given the likely evolution of technology, all the data associated with that identifier becomes personal data.

Why should you care?

Holding personal data often translates into power over people, which in turn becomes economic might (at the extreme, think Facebook, Google, etc). This situation often creates uncomfortable issues of transborder transparency and accountability, but also hinders the appearance of other innovative uses for the data, for instance for research, art, business, education, advocacy, journalism, etc.

The PersonalData.IO portal

Leveraging the same mechanisms as FOI portals, we are focused on making such requests easier to initiate, to follow through, to share and then to clone. Processing the requests in the open helps increase the pressure on companies to comply. In practice, we have taken the Froide software developed by Open Knowledge Germany, themed it to our needs and made some basic modifications in the workflow. Our platform is growing its user base slowly, but we benefit from many network effects: for any given company, you only need one person to go through the process of getting their hands on their data, and afterwards everyone benefits!

MyData

Getting to the data is only the first step. Even then, the bar to making it really useful is still pretty high. In May 2018, new regulations will come into force in Europe to help individuals leverage their personal data even more: individuals will enjoy a new right to data portability, i.e. the right to transfer data from one service to another.

In anticipation, a whole movement focused on human-centric personal data management has arisen, called MyData. Open Knowledge Finland recently organised a conference with tons of people building new services giving you more control over all that data! I am looking forward to a tool helping individuals turn their personal data into Open Data (by scrubbing out direct identifiers, for instance). Many companies will also benefit from the Frictionless Data project, since there will be a requirement to transfer that data “in a structured, commonly used, machine-readable and interoperable format”.

Image credit: Salla Thure (Public Domain)

In anticipation of this exciting ecosystem, we want to build experiences expanding access to this data with PersonalData.IO and to encourage companies to broaden their view of what constitutes personal data. The more data is considered personal data, the more you will be in control. Feel free to join us!

You can sign up to our mailing list or directly to the portal itself and initiate new requests. You can also follow us on Twitter or contact us directly. We welcome individual feedback and ideas and are always looking for new partners, developers and contributors!
