Feed aggregator

Max Planck Digital Library: HTTPS only for MPG/SFX and MPG.eBooks

planet code4lib - Fri, 2017-11-17 09:37

As of next week, all http requests to the MPG/SFX link resolver will be redirected to a corresponding https request.

The Max Planck Society electronic Book Index is scheduled to be switched to https only access the week after, starting on November 27, 2017.

Regular web browser use of the above services should not be affected.

Please thoroughly test any solutions that integrate these services via their web APIs.

Please consider re-subscribing to MPG.eBooks RSS feeds.
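For integrations that store resolver or feed addresses, one way to prepare is to rewrite them to https up front rather than rely on the redirect. A minimal sketch in R, assuming the stored links are plain strings; `upgrade_to_https` is a hypothetical helper and the `example.org` hosts are placeholders, not the real service addresses:

```r
## Hypothetical helper: rewrite stored http:// addresses to https://.
## Addresses that already use https are left untouched.
upgrade_to_https <- function(urls) {
  sub("^http://", "https://", urls, ignore.case = TRUE)
}

## Placeholder hosts, for illustration only
stored <- c("http://resolver.example.org/path",
            "https://ebooks.example.org/feed")
upgraded <- upgrade_to_https(stored)
print(upgraded)
```

Rewriting the scheme only covers the address itself; clients that pin certificates or refuse redirects would still need their own testing.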

Karen G. Schneider: What burns away

planet code4lib - Fri, 2017-11-17 05:29

We are among the lucky ones. We did not lose our home. We did not spend day after day evacuated, waiting to learn the fate of where we live. We never lost power or Internet. We had three or four days where we were mildly inconvenienced because PG&E wisely turned off gas to many neighborhoods, but we showered at the YMCA and cooked on an electric range we had been planning to upgrade to gas later this fall (and just did, but thank you, humble Frigidaire electric range, for being there to let me cook out my anxiety). We kept our go-bags near the car, and then we kept our go-bags in the car, and then, when it seemed safe, we took them out again. That, and ten days of indoor living and wearing masks when we went out, was all we went through.

But we all bear witness.

The Foreshadowing

It began with a five-year drought that crippled forests and baked plains, followed by a soaking-wet winter and a lush spring that crowded the hillsides with greenery. Summer temperatures hit records several times, and the hills dried out as they always do right before autumn, but this time unusually crowded with parched foliage and growth.

The air in Santa Rosa was hot and dry that weekend, an absence of humidity you could snap between your fingers. In the southwest section of the city, where we live, nothing seemed unusual. Like many homes in Santa Rosa our home does not have air conditioning, so for comfort’s sake I grilled our dinner, our 8-foot backyard fence buffering any hint of the winds gathering speed northeast of us. We watched TV and went to bed early.

Less than an hour later one of several major fires would be born just 15 miles east of where we slept.

Reports vary, but accounts agree it was windy that Sunday night, with windspeeds ranging between 35 and 79 miles per hour, and a gust northwest of Santa Rosa reaching nearly 100 miles per hour. If the Diablo winds were not consistently hurricane-strength, they were exceptionally fast, hot, and dry, and they meant business.

A time-lapse map of 911 calls shows the first reports of downed power lines and transformers coming in around 10 pm. The Tubbs fire was named for a road that is named for a 19th-century winemaker who lived in a house in Calistoga that burned to the ground in an eerily similar fire in 1964. In three hours this fire sped 12 miles southwest, growing in size and intent as it gorged on hundreds and then thousands of homes in its way, breaching city limits and expeditiously laying waste to 600 homes in the Fountaingrove district before it tore through the Journey’s End mobile home park, then reared back on its haunches and leapt across a six-lane divided section of Highway 101, whereupon it gobbled up big-box stores and fast food restaurants flanking Cleveland Avenue, a business road parallel to the highway. Its swollen belly, fat with miles of fuel, dragged over the area and took out buildings in the random manner of fires. Kohl’s and KMart were totaled and Trader Joe’s was badly damaged, while across the street from KMart, JoAnn Fabrics was untouched. The fire demolished one Mexican restaurant, hopscotched over another, and feasted on a gun shop before turning its ravenous maw toward the quiet middle-class neighborhood of Coffey Park, making short work of thousands more homes.

Santa Rosa proper is itself only 41 square miles, approximately 13 miles north-south and 9 miles east-west, including the long tail of homes flanking the Annadel mountains. By the time Kohl’s was collapsing, the “wildfire” was less than 4 miles from our home.

I woke up around 2 am, which I tend to do a lot anyway. I walked outside and smelled smoke, saw people outside their homes looking around, and went on Twitter and Facebook. There I learned of a local fire, forgotten by most in the larger conflagration, but duly noted in brief by the Press Democrat: a large historic home at 6th and Pierson burned to the ground, possibly from a downed transformer, and the fire licked the edge of the Santa Rosa Creek Trail for another 100 feet. Others in the West End have reported the same experience of reading about the 6th Street house fire on social media and struggling to reconcile the reports of this fire with reports of panic and flight from areas north of us and videos of walls of flame.

At 4 am I received a call that the university had activated its Emergency Operations Center and I asked if I should report in. I showered and dressed, packed a change of clothes in a tote bag, threw my bag of important documents in my purse, and drove south on my usual route to work, Petaluma Hill Road. The hills east of the road flickered with fire, the road itself was packed with fleeing drivers, and halfway to campus I braked at 55 mph when a massive buck sprang inches in front of my car, not running in that “oops, is this a road?” way deer usually cross lanes of traffic but yawing to and fro, its eyes wide. I still wonder, was it hurt or dying.

As I drove onto campus I thought, the cleaning crew. I parked at the Library and walked through the building, already permeated with smoky air. I walked as quietly as I could, so that if they were anywhere in the building I would hear them. As I walked through the silent building I wondered, is this the last time I will see these books? These computers? The new chairs I’m so proud of? I then went to the EOC and found the cleaning crew had been accounted for, which was a relief.

At Least There Was Food And Beer

A few hours later I went home. We had a good amount of food in the house, but like many of us who were part of this disaster but not immediately affected by it, I decided to stock up. The entire Santa Rosa Marketplace–Costco and Trader Joe’s, Target–on Santa Rosa Avenue was closed, and Oliver’s had a line outside of people waiting to get in. I went to the “G&G Safeway”–the one that took over a down-at-the-heels family market known as G&G and turned it into a spiffy market with a wine bar, no less–and it was without power, but open for business and, thanks to a backup system, able to take ATM cards. I had emergency cash on me but was loath to use it until I had to.

Sweating through an N95 mask I donned to protect my lungs, I wheeled my cart through the dark store, selecting items that would provide protein and carbs if we had to stuff them in our go-bags, but also fresh fruit and vegetables, dairy and eggs–things I thought we might not see for a while, depending on how the disaster panned out. (Note, we do already have emergency food, water, and other supplies.) The cold case for beer was off-limits–Safeway was trying to retain the cold in its freezer and fridge cases in case it could save the food–but there was a pile of cases of Lagunitas Lil Sumpin Sumpin on sale, so that with a couple of bottles of local wine went home with me too.

And with one wild interlude, for most of the rest of the time we stayed indoors with the windows closed. I sent out email updates and made phone calls, kept my phone charged and read every Nixle alert, and people at work checked in with one another. My little green library emergency contact card stayed in my back pocket the entire time. We watched TV and listened to the radio, including extraordinary local coverage by KSRO, the Little Station that Could; patrolled newspapers and social media; and rooted for Sheriff Rob, particularly after his swift smack-down of a bogus, Breitbart-fueled report that an undocumented person had started the fires.

Our home was unoccupied for a long time before we moved in this September, possibly up to a decade, while it was slowly but carefully upgraded. The electric range was apparently an early purchase; it was a line long discontinued by Frigidaire, with humble electric coils. But it had been unused until we arrived, and was in perfect condition. If an electric range could express gratitude for finally being useful, this one did. I used it to cook homey meals: pork loin crusted with Smithfield bacon; green chili cornbread; and my sui generis meatloaf, so named because every time I make it, I grind and add meat scraps from the freezer for a portion of the meat mixture. (It would be several weeks before I felt comfortable grilling again.) We cooked. We stirred. We sauteed. We waited.

On Wednesday, we had to run an errand. To be truthful, it was an Amazon delivery purchased that Saturday, when the world was normal, and sent to an Amazon locker at the capacious Whole Foods at Coddington Mall, a good place to send a package until the mall closes down because the northeast section of the city is out of power and threatened by a massive wildfire. By Wednesday, Whole Foods had reopened, and after picking up my silly little order–a gadget that holds soda cans in the fridge–we drove past Russian River Brewing Company and saw it was doing business, so we had salad and beer for lunch, because it’s a luxury to have beer at lunch and the fires were raging and it’s so hard to get seating there nights and weekends, when I have time to go there, but there we were. We asked our waiter how he was doing, and he said he was fine but he motioned to the table across from ours, where a family was enjoying pizza and beer, and he said they had lost their homes.

There were many people striving for routine during the fires, and to my surprise, even the city planning office returned correspondence regarding some work we have planned for our new home, offering helpful advice on the permitting process required for minor improvements for homes in historic districts. Because it turns out developers and engineers could serenely ignore local codes and build entire neighborhoods in Santa Rosa in areas known to be vulnerable to wildfire; but to replace bare dirt with a little white wooden picket fence, or to restore front windows from 1950s-style plate glass to double-hung wooden windows with mullions–projects intended to reinstate our house to its historic accuracy, and to make it more welcoming–requires a written justification of the project, accompanying photos, “Proposed Elevations (with Landscape Plan IF you are significantly altering landscape) (5 copies),” five copies of a paper form, a Neighborhood Context and Vicinity Map provided by the city, and a check for $346, followed by “8-12 weeks” before a decision is issued.

The net result of this process is like the codes about not building on ridges, though much less dangerous; most people ignore the permitting process, so that the historic set piece that is presumably the goal is instead rife with anachronisms. And of course, first I had to bone up on the residential building code and the historic district guidelines, which contradict one another on key points, and because the permitting process is poorly documented I have an email traffic thread rivaling in word count Byron’s letters to his lovers.

But the planning people are very pleasant, and we all seemed to take comfort in plodding through the administrivia of city bureaucracy as if we were not all sheltering in place, masks over our noses and mouths, go-bags in our cars, while fires raged just miles from their office and our home.

The Wild Interlude, or, I Have Waited My Entire Career For This Moment

Regarding the wild interlude, the first thing to know about my library career is that nearly everywhere I have gone where I have had the say-so to make things happen, I have implemented key management. That mishmosh of keys in  a drawer, the source of so much strife and arguments, becomes an orderly key locker with numbered labels. It doesn’t happen overnight, because keys are control and control is political and politics are what we tussle about in libraries because we don’t have that much money, but it happens.

Sometimes I even succeed in convincing people to sign keys out so we know who has them. Other times I convince people to buy a locker with a keypad so we sidestep the question of where the key to the key locker is kept. But mostly, I leave behind the lockers, and, I hope, an appreciation for lockers. I realize it’s not quite as impressive as founding the Library of Alexandria, and it’s not what people bring up when I am introduced as a keynote speaker, and I have never had anyone ask for a tour of my key lockers nor have I ever been solicited to write a peer-reviewed article on key lockers. However unheralded, it’s a skill.

My memory insists it was Tuesday, but the calendar says it was late Monday night when I received a call that the police could not access a door to an area of the library where we had high-value items. It would turn out that this was a rogue lock, installed sometime soon after the library opened in 2000, that unlike others did not have a master registered with the campus, an issue we have since rectified. But in any event, the powers that be had the tremendous good fortune to contact the person who has been waiting her entire working life to prove beyond doubt that KEY LOCKERS ARE IMPORTANT.

After a brief internal conversation with myself, I silently nixed the idea of offering to walk someone through finding the key. I said I knew where the key was, and I could be there in twenty minutes to find it. I wasn’t entirely sure this was the case, because as obsessed as I am with key lockers, this year I have been preoccupied with things such as my deanly duties, my doctoral degree completion, national association work, our home purchase and household move, and the selection of geegaws like our new gas range (double oven! center griddle!). This means I had not spent a lot of time perusing this key locker’s manifest. So there was an outside chance I would have to find the other key, located somewhere in another department, which would require a few more phone calls. I was also in that liminal state between sleep and waking; I had been asleep for two hours after being up since 2 am, and I would have agreed to do just about anything.

Within minutes I was dressed and again driving down Petaluma Hill Road, still busy with fleeing cars.  The mountain ridges to the east of the road roiled with flames, and I gripped the steering wheel, watching for more animals bolting from fire. Once in the library, now sour with smoke, I ran up the stairs into my office suite and to the key locker, praying hard that the key I sought was in it. My hands shook. There it was, its location neatly labeled by the key czarina who with exquisite care had overseen the organization of the key locker. The me who lives in the here-and-now profusely thanked past me for my legacy of key management, with a grateful nod to the key czarina as well. What a joy it is to be able to count on people!

Items were packed up, and off they rolled. After a brief check-in at the EOC, home I went, to a night of “fire sleep”–waking every 45 minutes to sniff the air and ask, is fire approaching?–a type of sleep I would have for the next ten days, and occasionally even now.

How we speak to one another in the here and now

Every time Sandy and I interact with people, we ask, how are you. Not, hey, how are ya, where the expected answer is “fine, thanks” even if you were just turned down for a mortgage or your mother died. But no, really, how are you. Like, fire-how-are-you. And people usually tell you, because everyone has a story. Answers range from: I’m ok, I live in Petaluma or Sebastopol or Bodega Bay (in SoCo terms, far from the fire), to I’m ok but I opened my home to family/friends/people who evacuated or lost their homes; or, I’m ok but we evacuated for a week; or, as the guy from Home Depot said, I’m ok and so is my wife, my daughter, and our 3 cats, but we lost our home.

Sometimes they tell you and they change the subject, and sometimes they stop and tell you the whole story: when they first smelled smoke, how they evacuated, how they learned they did or did not lose their home. Sometimes they have before-and-after photos they show you. Sometimes they slip it in between other things, like our cat sitter, who mentioned that she lost her apartment in Fountaingrove and her cat died in the fire but in a couple of weeks she would have a home and she’d be happy to cat-sit for us.

Now, post-fire, we live in that tritest of phrases, a new normal. The Library opened that first half-day back, because I work with people who like me believe that during disasters libraries should be the first buildings open and the last to close. I am proud to report the Library also housed NomaCares, a resource center for those at our university affected by the fire. That first Friday back we held our Library Operations meeting, and we shared our stories, and that was hard but good. But we also resumed regular activity, and soon the study tables and study rooms were full of students, meetings were convened, work was resumed, and the gears of life turned. But the gears turned forward, not back. Because there is no way back.

I am a city mouse, and part of moving to Santa Rosa was our decision to live in a highly citified section, which turned out to be a lucky call. But my mental model of city life has been forever twisted by this fire. I drive on 101 just four miles north of our home, and there is the unavoidable evidence of a fire boldly leaping into an unsuspecting city. I go to the fabric store, and I pass twisted blackened trees and a gun store totaled that first night. I drive to and from work with denuded hills to my east a constant reminder.

But that’s as it should be. Even if we sometimes need respite from those reminders–people talk about taking new routes so they won’t see scorched hills and devastated neighborhoods–we cannot afford to forget. Sandy and I have moved around the country in our 25 years together, and we have seen clues everywhere that things are changing and we need to take heed. People like to lapse into the old normal, but it is not in our best interests to do so.

All of our stories are different. But we share a collective loss of innocence, and we can never return to where we were. We can only move forward, changed by the fire, changed forever.

William Denton: Org clocktables II: Summarizing a month

planet code4lib - Fri, 2017-11-17 04:46

In Org clocktables I: The daily structure I explained how I track my time working at an academic library, clocking in to projects categorized as PPK (“professional performance and knowledge,” our term for “librarianship”), PCS (“professional contributions and standing,” which covers research, professional development and the like), or Service. I do this by checking in and out of tasks with the magic of Org.

I’ll add a day to the example I used before, to make it more interesting. This is what the raw text looks like:

* 2017-12 December

** [2017-12-01 Fri]
:LOGBOOK:
CLOCK: [2017-12-01 Fri 09:30]--[2017-12-01 Fri 09:50] =>  0:20
CLOCK: [2017-12-01 Fri 13:15]--[2017-12-01 Fri 13:40] =>  0:25
:END:

*** PPK

**** Libstats stuff
:LOGBOOK:
CLOCK: [2017-12-01 Fri 09:50]--[2017-12-01 Fri 10:15] =>  0:25
:END:

Pull numbers on weekend desk activity for A.

**** Ebook usage
:LOGBOOK:
CLOCK: [2017-12-01 Fri 13:40]--[2017-12-01 Fri 16:30] =>  2:50
:END:

Wrote code to grok EZProxy logs and look up ISBNs of Scholars Portal ebooks.

*** PCS

*** Service

**** Stewards' Council meeting
:LOGBOOK:
CLOCK: [2017-12-01 Fri 10:15]--[2017-12-01 Fri 13:15] =>  3:00
:END:

Copious meeting notes here.

** [2017-12-04 Mon]
:LOGBOOK:
CLOCK: [2017-12-04 Mon 09:30]--[2017-12-04 Mon 09:50] =>  0:20
CLOCK: [2017-12-04 Mon 12:15]--[2017-12-04 Mon 13:00] =>  0:45
CLOCK: [2017-12-04 Mon 16:00]--[2017-12-04 Mon 16:15] =>  0:15
:END:

*** PPK

**** ProQuest visit
:LOGBOOK:
CLOCK: [2017-12-04 Mon 09:50]--[2017-12-04 Mon 12:15] =>  2:25
:END:

Notes on this here.

**** Math print journals
:LOGBOOK:
CLOCK: [2017-12-04 Mon 16:15]--[2017-12-04 Mon 17:15] =>  1:00
:END:

Check current subs and costs; update list of print subs to drop.

*** PCS

**** Pull together sonification notes
:LOGBOOK:
CLOCK: [2017-12-04 Mon 13:00]--[2017-12-04 Mon 16:00] =>  3:00
:END:

*** Service

All raw Org text looks ugly, especially all those LOGBOOK and PROPERTIES drawers. Don’t let that put you off. This is what it looks like on my screen with my customizations (see my .emacs for details):

Much nicer in Emacs.
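The figure after `=>` on each CLOCK line in the raw text is simply the gap between the two timestamps. As a rough cross-check outside Emacs, the following base R sketch recomputes it. `clock_duration` is a hypothetical helper, not part of Org, and it assumes both timestamps are in the same time zone.

```r
## Hypothetical helper: recompute the "H:MM" duration of one Org CLOCK line.
clock_duration <- function(line) {
  ## Pull out the two bracketed timestamps, e.g. "[2017-12-01 Fri 09:30]"
  stamps <- regmatches(line, gregexpr("\\[[^]]+\\]", line))[[1]]
  ## Drop the brackets and the weekday name, leaving "YYYY-MM-DD HH:MM"
  stamps <- sub("^\\[([0-9]{4}-[0-9]{2}-[0-9]{2}) [^ ]+ ([0-9]{2}:[0-9]{2})\\]$",
                "\\1 \\2", stamps)
  times <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M", tz = "UTC")
  ## Difference in whole minutes, formatted the way Org prints it
  mins <- as.integer(round(as.numeric(difftime(times[2], times[1], units = "mins"))))
  sprintf("%d:%02d", mins %/% 60L, mins %% 60L)
}

morning <- clock_duration("CLOCK: [2017-12-01 Fri 09:30]--[2017-12-01 Fri 09:50] =>  0:20")
ebooks <- clock_duration("CLOCK: [2017-12-01 Fri 13:40]--[2017-12-01 Fri 16:30] =>  2:50")
print(c(morning, ebooks))  ## "0:20" "2:50"
```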

At the bottom of the month I use Org’s clock table to summarize all this.

#+BEGIN: clocktable :maxlevel 3 :scope tree :compact nil :header "#+NAME: clock_201712\n"
#+NAME: clock_201712
| Headline             | Time    |      |      |
|----------------------+---------+------+------|
| *Total time*         | *14:45* |      |      |
|----------------------+---------+------+------|
| 2017-12 December     | 14:45   |      |      |
| \_  [2017-12-01 Fri] |         | 7:00 |      |
| \_    PPK            |         |      | 3:15 |
| \_    Service        |         |      | 3:00 |
| \_  [2017-12-04 Mon] |         | 7:45 |      |
| \_    PPK            |         |      | 3:25 |
| \_    PCS            |         |      | 3:00 |
#+END

I just put in the BEGIN/END lines and then hit C-c C-c and Org creates that table. Whenever I add some more time, I can position the pointer on the BEGIN line and hit C-c C-c and it updates everything.

Now, there are lots of commands I could use to customize this, but this is pretty vanilla and it suits me. It makes it clear how much time I have down for each day and how much time I spent in each of the three pillars. It’s easy to read at a glance. I fiddled with various options but decided to stay with this.

It looks like this on my screen:

Much nicer in Emacs.

That’s a start, but the data is not in a format I can use as is. The times are split across different columns, there are multiple levels of indents, there’s a heading and a summation row, etc. But! The data is in a table in Org, which means I can easily ingest it and process it in any language I choose, in the same Org file. That’s part of the power of Org: it turns raw data into structured data, which I can process with a script into a better structure, all in the same file, mixing text, data and output.

Which language, though? A real Emacs hacker would use Lisp, but that’s beyond me. I can get by in two languages: Ruby and R. I started doing this in Ruby, and got things mostly working, then realized how it should go and what the right steps were to take, and switched to R.

Here’s the plan:

  • ignore “Headline” and “Total time” and “2017-12 December” … in fact, ignore everything that doesn’t start with “\_”
  • clean up the remaining lines by removing “\_”
  • the first line will be a date stamp, with the total day’s time in the first column, so grab it
  • after that, every line will either be a PPK/PCS/Service line, in which case grab that time
  • or it will be a new date stamp, in which case capture that information and write out the previous day’s information
  • continue on through all the lines
  • until the end, at which point a day is finished but not written out, so write it out
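The first steps of the plan can be tried out in base R on a hand-built stand-in for a few rows of the clock table. This is only a sketch: the column names `day_total` and `pillar_total` are placeholders chosen for illustration, not what Org actually passes in.

```r
## Hand-built stand-in for a few clock table rows (column names are assumptions)
raw <- data.frame(
  Headline = c("*Total time*", "2017-12 December", "\\_  [2017-12-01 Fri]",
               "\\_    PPK", "\\_    Service"),
  day_total = c("", "", "7:00", "", ""),
  pillar_total = c("", "", "", "3:15", "3:00"),
  stringsAsFactors = FALSE
)

## Keep only rows whose headline starts with the "\_" marker
rows <- raw[startsWith(raw$Headline, "\\_"), ]

## Strip the marker and the indentation that follows it
rows$Headline <- sub("^\\\\_ *", "", rows$Headline)

## A cleaned headline that starts with "[" is a date stamp;
## anything else is a PPK/PCS/Service line
rows$is_date <- startsWith(rows$Headline, "[")

print(rows)
```

Telling date rows from pillar rows by the leading bracket is one possible test; comparing against the three pillar names works just as well.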

I did this in R, using three packages to make things easier. For managing the time intervals I’m using hms, which seems like a useful tool. It needs to be a very recent version to make use of some time-parsing functions, so it needs to be installed from GitHub. Here’s the R:

library(tidyverse)
library(hms) ## Right now, needs GitHub version
library(stringr)

clean_monthly_clocktable <- function (raw_clocktable) {
  ## Clean up the table into something simple
  clock <- raw_clocktable %>%
    filter(grepl("\\\\_", Headline)) %>%
    mutate(heading = str_replace(Headline, "\\\\_ *", "")) %>%
    mutate(heading = str_replace(heading, "] .*", "]")) %>%
    rename(total = X, subtotal = X.1) %>%
    select(heading, total, subtotal)

  ## Set up the table we'll populate line by line
  newclock <- tribble(~date, ~ppk, ~pcs, ~service, ~total)

  ## The first line we know has a date and time, and always will
  date_old <- substr(clock[1,1], 2, 11)
  total_time_old <- clock[1,2]

  date_new <- NA
  ppk <- pcs <- service <- vacation <- total_time_new <- "0:00"

  ## Loop through all lines ...
  for (i in 2:nrow(clock)) {
    if (clock[i,1] == "PPK") {
      ppk <- clock[i,3]
    } else if (clock[i,1] == "PCS") {
      pcs <- clock[i,3]
    } else if (clock[i,1] == "Service") {
      service <- clock[i,3]
    } else {
      date_new <- substr(clock[i,1], 2, 11)
      total_time_new <- clock[i,2]
    }

    ## When we see a new date, add the previous date's details to the table
    if (! is.na(date_new)) {
      newclock <- newclock %>%
        add_row(date = date_old, ppk, pcs, service, total = total_time_old)
      ppk <- pcs <- service <- "0:00"
      date_old <- date_new
      date_new <- NA
      total_time_old <- total_time_new
    }
  }

  ## Finally, add the final date to the table, when all the rows are read.
  newclock <- newclock %>%
    add_row(date = date_old, ppk, pcs, service, total = total_time_old)

  newclock <- newclock %>%
    mutate(ppk = parse_hm(ppk),
           pcs = parse_hm(pcs),
           service = parse_hm(service),
           total = parse_hm(total),
           lost = as.hms(total - (ppk + pcs + service))) %>%
    mutate(date = as.Date(date))
}

All of that is in a SRC block like below, but I separated the two in case it makes the syntax highlighting clearer. I don’t think it does, but such is life. Imagine the above code pasted into this block:

#+BEGIN_SRC R :session :results values
#+END_SRC

Running C-c C-c on that will produce no output, but it does create an R session and set up the function. (Of course, all of this will fail if you don’t have R (and those three packages) installed.)

With that ready, now I can parse that monthly clocktable by running C-c C-c on this next source block, which reads in the raw clock table (note the var setting, which matches the #+NAME above), parses it with that function, and outputs cleaner data. I have this right below the December clock table.

#+BEGIN_SRC R :session :results values :var clock_201712=clock_201712 :colnames yes
clean_monthly_clocktable(clock_201712)
#+END_SRC

#+RESULTS:
| date       | ppk      | pcs      | service  | total    | lost     |
|------------+----------+----------+----------+----------+----------|
| 2017-12-01 | 03:15:00 | 00:00:00 | 03:00:00 | 07:00:00 | 00:45:00 |
| 2017-12-04 | 03:25:00 | 03:00:00 | 00:00:00 | 07:45:00 | 01:20:00 |

This is tidy data. It looks like this:

Again, in Emacs

That’s what I wanted. The code I wrote to generate it could be better, but it works, and that’s good enough.

Notice all of the same dates and time durations are there, but they’re organized much more nicely—and I’ve added “lost.” The “lost” count is how much time in the day was unaccounted for. This includes lunch (maybe I’ll end up classifying that differently), short breaks, ploughing through email first thing in the morning, catching up with colleagues, tidying up my desk, falling into Wikipedia, and all those other blocks of time that can’t be directly assigned to some project.
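The “lost” column is plain subtraction: the day's total minus the three pillars. A small base R sketch of that arithmetic, without hms, using the two days from the results table; `to_minutes`, `to_hm` and `lost_time` are hypothetical helpers written for this illustration.

```r
## Hypothetical helpers: convert "H:MM" strings to minutes and back
to_minutes <- function(hm) {
  parts <- strsplit(hm, ":", fixed = TRUE)
  vapply(parts, function(p) as.integer(p[1]) * 60L + as.integer(p[2]), integer(1))
}
to_hm <- function(mins) sprintf("%d:%02d", mins %/% 60L, mins %% 60L)

## lost = total - (ppk + pcs + service), worked in minutes
lost_time <- function(total, ppk, pcs, service) {
  to_hm(to_minutes(total) - (to_minutes(ppk) + to_minutes(pcs) + to_minutes(service)))
}

lost <- lost_time(total   = c("7:00", "7:45"),
                  ppk     = c("3:15", "3:25"),
                  pcs     = c("0:00", "3:00"),
                  service = c("3:00", "0:00"))
print(lost)  ## "0:45" "1:20"
```

Those two figures match the 0:20 + 0:25 and 0:20 + 0:45 + 0:15 clocked directly against the day headings in the raw text, which is where the unassigned time lives.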

My aim is to keep track of the “lost” time and to minimize it, by a) not wasting time and b) properly classifying work. Talking to colleagues and tidying my desk is work, after all. It’s not immortally important work that people will talk about centuries from now, but it’s work. Not everything I do on the job can be classified against projects. (Not the way I think of projects—maybe lawyers and doctors and the self-employed think of them differently.)

The one technical problem with this is that when I restart Emacs I need to rerun the source block with the R function in it, to set up the R session and the function, before I can rerun the simple “update the monthly clocktable” block. However, because I don’t restart Emacs very often, that’s not a big problem.

The next stage of this is showing how I summarize the cleaned data to understand, each month, how much of my time I spent on PPK, PCS and Service. I’ll cover that in another post.

District Dispatch: House passes OPEN Act to improve public access to government data

planet code4lib - Thu, 2017-11-16 17:40

The House of Representatives passed the OPEN Government Data Act on Nov. 15, 2017, as part of the bipartisan Foundations for Evidence-Based Policymaking Act.

On Wednesday, November 15, the House of Representatives passed ALA-supported legislation to improve public access to government data. The Open, Public, Electronic, and Necessary (OPEN) Government Data Act was included as part of the Foundations for Evidence-Based Policymaking Act (H.R. 4174), which the House passed by voice vote. Passage of the bill represents a victory for library advocates, who have supported the legislation since it was first introduced last year.

The OPEN Government Data Act would make more government data freely available online, in machine-readable formats, and discoverable through a federal data catalog. The legislation would codify and build upon then-President Obama’s 2013 executive order. ALA President Jim Neal responded to passage of the bill by saying,

ALA applauds the House’s passage of the OPEN Government Data Act today. This bill will make it easier for libraries to help businesses, researchers and students find and use valuable data that makes American innovation and economic growth possible. The strong bipartisan support for this legislation shows access to information is a value we can all agree on.

With this vote, both the House and the Senate have now passed the OPEN Government Data Act, albeit in different forms. In September, the Senate passed the OPEN bill as an attachment to the annual defense bill, but the provision was later removed in conference with the House. This shows that the Senate supports the fundamental concepts of the OPEN bill – now the question is whether the Senate will agree to the particular details of H.R. 4174 (which also contains new provisions that will require negotiation).

ALA hopes that Congress will soon reach agreement to send the OPEN Government Data Act to the President’s desk so that taxpayers can make better use of these valuable public assets. ALA thanks House Speaker Paul Ryan (R-WI), Reps. Trey Gowdy (R-SC), Derek Kilmer (D-WA), and Blake Farenthold (R-TX), and Sens. Patty Murray (D-WA), Brian Schatz (D-HI), and Ben Sasse (R-NE), for their leadership in unlocking data that will unleash innovation.


The post House passes OPEN Act to improve public access to government data appeared first on District Dispatch.

David Rosenthal: Techno-hype part 2

planet code4lib - Thu, 2017-11-16 16:00

Don't, don't, don't, don't believe the hype!
-- Public Enemy

Enough about the hype around self-driving cars, now on to the hype around cryptocurrencies.

Sysadmins like David Gerard tend to have a realistic view of new technologies; after all, they get called at midnight when the technology goes belly-up. Sensible companies pay a lot of attention to their sysadmins' input when it comes to deploying new technologies.

Gerard's Attack of the 50 Foot Blockchain: Bitcoin, Blockchain, Ethereum & Smart Contracts is a must-read, massively sourced corrective to the hype surrounding cryptocurrencies and blockchain technology. Below the fold, some tidbits and commentary. Quotes not preceded by links are from the book, and I have replaced some links to endnotes with direct links.

Gerard's overall thesis is that the hype is driven by ideology, which has resulted in cult-like behavior that ignores facts, such as:
Bitcoin ideology assumes that inflation is a purely monetary phenomenon that can only be caused by printing more money, and that Bitcoin is immune due to its strictly limited supply. This was demonstrated trivially false when the price of a bitcoin dropped from $1000 in late 2013 to $200 in early 2015 - 400% inflation - while supply only went up 10%.

There's recent evidence for this in the collapse of the SegWit2x proposal to improve Bitcoin's ability to scale. As Timothy B Lee writes:
There's a certain amount of poetic justice in the fact that leading Bitcoin companies trying to upgrade the Bitcoin network were foiled by a populist backlash. Bitcoin is as much a political movement as it is a technology project, and the core idea of the movement is a skepticism about decisions being made behind closed doors.

Gerard quotes Satoshi Nakamoto's release note for Bitcoin 0.1:
The root problem with conventional currency is all the trust that's required to make it work. The central bank must be trusted not to debase the currency, but the history of fiat currencies is full of breaches of that trust. Banks must be trusted to hold our money and transfer it electronically, but they lend it out in waves of credit bubbles with barely a fraction in reserve. We have to trust them with our privacy, trust them not to let identity thieves drain our accounts. Their massive overhead costs make micropayments impossible.

And points out that:
Bitcoin failed at every one of Nakamoto's aspirations here. The price is ridiculously volatile and has had multiple bubbles; the unregulated exchanges (with no central bank backing) front-run their customers, paint the tape to manipulate the price, and are hacked or just steal their user's funds; and transaction fees and the unreliability of transactions make micropayments completely unfeasible.

Instead, Bitcoin is a scheme to transfer money from later to earlier adopters:
Bitcoin was substantially mined early on - early adopters have most of the coins. The design was such that early users would get vastly better rewards than later users for the same effort.

Cashing in these early coins involves pumping up the price, then selling to later adopters, particularly in the bubbles. Thus Bitcoin was not a Ponzi or pyramid scheme, but a pump-and-dump. Anyone who bought in after the earliest days is functionally the sucker in the relationship.

Satoshi Nakamoto mined (but has never used) nearly 5% of all the Bitcoin there will ever be, a stash now notionally worth $7.5B. The distribution of notional Bitcoin wealth is highly skewed:
a Citigroup analysis from early 2014 notes: "47 individuals hold about 30 percent, another 900 hold a further 20%, the next 10,000 about 25% and another million about 20%".

Not that the early adopters' stashes are circulating:
Dorit Ron and Adi Shamir found in a 2012 study that only 22% of then-existing bitcoins were in circulation at all, there were a total of 75 active users or businesses with any kind of volume, one (unidentified) user owned a quarter of all bitcoins in existence, and one large owner was trying to hide their pile by moving it around in thousands of smaller transactions.

In the Citigroup analysis, Steven Englander wrote:
The uneven distribution of Bitcoin wealth may be the price to be paid for getting a rapid dissemination of the Bitcoin payments and store of value technology. If you build a better mousetrap, everyone expects you to profit from your invention, but users benefit as well, so there are social benefits even if the innovator grabs a big share.

Well, yes, but in this case the 1% of the population who innovated appear to have grabbed about 80% of the wealth, which is a bit excessive.

Since there are very few legal things you can buy with Bitcoin (see Gerard's Chapter 7) this notional wealth is only real if you can convert it into a fiat currency such as USD with which you can buy legal things. There are two problems doing so.

First, Nakamoto's million-Bitcoin hoard is not actually worth $7.5B. It is worth however many dollars other people would pay for it, which would be a whole lot less than $7.5B:
large holders trying to sell their bitcoins risk causing a flash crash; the price is not realisable for any substantial quantity. The market remains thin enough that single traders can send the price up or down $30, and an April 2017 crash from $1180 to 6 cents (due to configuration errors on Coinbase's GDAX exchange) was courtesy of 100 BTC of trades.

Second, Jonathan Thornburg was prophetic but not the way he thought:
A week after Bitcoin 0.1 was released, Jonathan Thornburg wrote on the Cryptography and Cryptography Policy mailing list: "To me, this means that no major government is likely to allow Bitcoin in its present form to operate on a large scale."

Governments have no problem with people using electricity to compute hashes. As Dread Pirate Roberts found out, they have ways of making their unhappiness clear when this leads to large-scale purchases of illicit substances. But they get really serious when this leads to large-scale evasion of taxes and currency controls.

Governments and the banks they charter like to control their money. The exchanges on which, in practice, almost all cryptocurrency transactions take place are, in effect, financial institutions but are not banks. To move fiat money to and from users the exchanges need to use actual banks. This is where governments exercise control, with regulations such as the US Know Your Customer/Anti Money Laundering regulations. These make it very difficult to convert Bitcoin into fiat currency without revealing real identities and thus paying taxes or conforming to currency controls.

Gerard stresses that Bitcoin is in practice a Chinese phenomenon, both on the mining side:
From 2014 onward, the mining network was based almost entirely in China, running ASICs on very cheap subsidised local electricity (There has long been speculation that much of this is to evade currency controls - buy electricity in yuan, sell bitcoins for dollars)

And on the trading side:
Approximately 95% of on-chain transactions are day traders on Chinese exchanges; Western Bitcoin advocates are functionally a sideshow, apart from the actual coders who work on the Bitcoin core software.

Gerard agrees with my analysis in Economies of Scale in Peer-to-Peer Networks that economics made decentralization impossible to sustain:
Everything about mining is more efficient in bulk. By the end of 2016, 75% of the bitcoin hashrate was being generated in one building, using 140 megawatts - or over half the estimated power used by all of Google's data centres worldwide.

This is the one case where I failed to verify Gerard's citation. The post he links to at NewsBTC says (my emphasis):
According to available information, the Bitmain Cloud Computing Center in Xinjiang, Mainland China will be a 45 room facility with three internal filters maintaining a clean environment. The 140,000 kW facility will also include independent substations and office space.

The post suggests that the facility wasn't to be completed until the following month, and quotes a tweet from Peter Todd (my emphasis):
So that's potentially as much as 75% of the current Bitcoin hashing power in one place

Gerard appears to have been somewhat ahead of the game.

The most interesting part of the book is Gerard's discussion of Bitfinex, and his explanation for the current bubble in Bitcoin. You need to read the whole thing, but briefly:
  • Bitfinex was based on the code from Bitcoinica, written by a 16-year-old. The code was a mess.
  • As a result, in August 2016 nearly 120K BTC (then quoted at around $68M) was stolen from Bitfinex customer accounts.
  • Bitfinex avoided bankruptcy by imposing a 36% haircut across all its users' accounts.
  • Bitfinex offered the users "tokens", which they eventually, last April, redeemed for USD at roughly half what the stolen Bitcoins were then worth.
  • But by then Bitfinex's Taiwanese banks could no longer send USD wires, because Wells Fargo cut them off.
  • So the "USD" were trapped at Bitfinex, and could only be used to buy Bitcoin or other cryptocurrencies on Bitfinex. This caused the Bitcoin price on Bitfinex to go up.
  • Arbitrage between Bitfinex and the other exchanges (which also have trouble getting USD out) caused the price on other exchanges to rise.
Gerard points out that this mechanism drives the current Initial Coin Offering mania:
The trapped "USD" also gets used to buy other cryptocurrencies - the price of altcoins tends to rise and fall with the price of bitcoins - and this has fueled new ICOs ... as people desperately look for somewhere to put their unspendable "dollars". This got Ethereum and ICOs into the bubble as well.

In a November 3 post to his blog, Gerard reports that:
You haven’t been able to get actual money out of Bitfinex since mid-March, but now there are increasing user reports of problems withdrawing cryptos as well (archive).

Don't worry, the Bitcoin trapped like the USD at Bitfinex can always be used in the next ICO! Who cares about the SEC:
Celebrities and others have recently promoted investments in Initial Coin Offerings (ICOs). In the SEC’s Report of Investigation concerning The DAO, the Commission warned that virtual tokens or coins sold in ICOs may be securities, and those who offer and sell securities in the United States must comply with the federal securities laws.

Or the Chinese authorities:
The People's Bank of China said on its website Monday that it had completed investigations into ICOs, and will strictly punish offerings in the future while penalizing legal violations in ones already completed. The regulator said that those who have already raised money must provide refunds, though it didn't specify how the money would be paid back to investors.

This post can only give a taste of an entertaining and instructive book, well worth giving to the Bitcoin enthusiasts in your life. Or you can point them to Izabella Kaminska's interview of David Gerard - it's a wonderfully skeptical take on blockchain technologies and markets.

Open Knowledge Foundation: How mundane admin records helped open Finnish politics: An example of “impolite” transparency advocacy

planet code4lib - Thu, 2017-11-16 09:45

This blogpost was jointly written by Aleksi Knuutila and Georgia Panagiotidou. Their bios can be found at the bottom of the page.

In a recent blog post Tom Steinberg, long-term advocate of transparency and open data, looked back on what advocacy groups working on open government had achieved in the past decade. Overall, progress is disappointing. Freedom of Information laws are under threat in many countries, and for all the enthusiasm for open data, much of the information that is in the public interest remains closed. Public and official support for transparency might be at an all-time high, but that doesn’t necessarily mean that governments are transparent.

Steinberg blames the poor progress on one vice of the advocacy groups: being excessively polite. In his interpretation, groups working on transparency, particularly in his native UK, have relied on collaborative, win-win solutions with public authorities. They had been “like a caged bear, tamed by a zookeeper through the feeding of endless tidbits and snacks”. Significant victories in transparency, however, always had associated losers. Meaningful information about institutions made public will have consequences for people in a position of power. That is why strong initiatives for transparency are rarely the result of “polite” efforts, of collaboration and persuasion. They happen when decision-makers face enough pressure to make transparency seem more attractive than any alternative.

The pressure for opening government information can result from transparency itself, especially when it is forced on government. Here the method with which information is made available matters a great deal. Metahaven, a Dutch design collective, coined the term black transparency for the situations in which disclosure happens in an uninvited or involuntary way. The exposed information may itself demonstrate how its systematic availability can be in the public interest. Yet what can be as revealing in black transparency is the response of the authorities, whose reactions in themselves can show their limited commitment to ideals of openness.

Over the past few years, a public struggle took place in Finland regarding information about who influences legislation. Open Knowledge Finland played a part in shifting the debate and agenda by managing to make public a part of the information in question. The story demonstrates both the value and limitations of opening up data as a method of advocacy.

Finland is not perfect after all

Despite its reputation for good governance, Finnish politics is exceptionally opaque when it comes to information about who wields influence in political decisions. In recent years lobbying has become more professional and increasingly happens through hired communications agencies. Large reforms, such as the overhaul of health care, have been marred by the revolving doors (many links in Finnish) between those who design the rules in government and the interest groups looking to exploit them. Yet lobbying in the country is essentially unregulated, and little information is available about who is consulted or how much different interest groups spend on lobbying. Regulating lobbying is a challenge, and transparency can remain toothless. Still, the European Commission, for instance, keeps an open log of its meetings with interest groups and requires them to publish information about their expenditure on lobbying.

Some mundane administrative records became surprisingly important in the public discussion about transparency. The Finnish parliament, like virtually any public building, keeps a log of people who enter and leave. These visitor logs are kept ostensibly for security and are not necessarily designed to be used for other purposes. Yet Finnish activists and journalists, associated with the NGO Open Ministry and the broadcaster Svenska Yle, seized on these records to study the influence of private interests. After an initiative to reform copyright law was dropped by parliament in 2014, the group filed freedom of information requests to access the parliament’s visitor log, to see who had met with the MPs influential in the case. Parliament refused to release the information, and over two years of debate in courts followed. In December 2016 the supreme administrative court declared the records public.

Despite the court’s decision, parliament still made access difficult. Following the judgment, the parliament administration began to delete the visitor log daily, making the most recent information about who MPs meet inaccessible. The court’s decision still forced them to keep an archive of older data. In apparent breach of law, the administration did not release this information in electronic format. When faced with requests for access to the records, parliament printed them on paper and insisted that people come to their office to view them. The situation was unusual: the institution responsible for legislation had also decided that it could choose not to follow the instructions of the courts that interpret law.

At this stage, Open Knowledge Finland secured the resources for a wider study of the parliament visitor logs. Because of the administration’s refusal to release the data electronically, we were uncertain what the best course of action was. Nobody knew what the content of the logs would be and whether going through them would be worth the effort. Still, we decided that we should collect and make the information available as soon as possible, while the archive that parliament kept still had some possible public relevance. Collecting and processing the data turned out to be a long process.

The hard work of turning documents into data

In the summer of 2017 the parliament’s administrative offices, on a side street behind the iconic main building, became familiar to us. After having our bags scanned in security, the staff would lead us to a meeting room. Two thick folders filled with papers had been placed on the table, containing the logs of all parliamentary meetings for a period of three months. Three of us went each time, armed with cameras and staplers. After removing the staples from the printouts, we would photograph each page with careful, standardised framing. To photograph the entire available archive, data from a complete year, required close to 2,000 images and four visits to the parliament offices.

Taking the photos in a carefully cropped way was important, since the next challenge was to turn these images into electronic format again. Only in this way could we have the data as a structured dataset that could be searched and queried. For this task open source tools proved invaluable. We used Tesseract for extracting the text from the images, and Tabula for making sure that the text was placed in structured tables. This process, known as optical character recognition (OCR), was inevitably prone to errors. Some errors we were able to correct using tools such as OpenRefine, which is able to identify the likely mistakes in the dataset. Despite the corrections, we made sure the dataset includes references to the original photos, so that the digitised content could be verified from them.
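To give a feel for what this kind of error correction involves, here is a minimal base-R sketch. It is not the workflow we used: it is a simplified stand-in for the nearest-neighbour style of clustering that tools like OpenRefine offer, the names are invented, and the threshold `max_edits` is an assumption chosen for the example.

```r
# Hypothetical visitor names as they might come out of OCR (invented data).
ocr_names <- c("Matti Virtanen", "Matti Virtanen", "Matti Virtaneu",
               "Liisa Korhonen", "Liisa Korhonen", "Liisa Korhonen",
               "Llisa Korhonen", "Pekka Nieminen")

# Count each distinct spelling, most frequent first.
counts <- sort(table(ocr_names), decreasing = TRUE)
spellings <- names(counts)

# Pairwise edit (Levenshtein) distances between the distinct spellings.
distances <- adist(spellings)

# Assumed threshold: spellings within this many character edits are
# treated as likely variants of the same name.
max_edits <- 2

# For each spelling, suggest the most frequent spelling within the threshold.
# A spelling is always within distance 0 of itself, so there is always a match.
suggest <- function(i) {
  candidates <- which(distances[i, ] <= max_edits)
  spellings[min(candidates)]  # lowest index = most frequent
}

suggestions <- data.frame(
  spelling  = spellings,
  suggested = vapply(seq_along(spellings), suggest, character(1)),
  n         = as.integer(counts),
  stringsAsFactors = FALSE
)
print(suggestions)
```

A suggestion like this only flags a likely mistake. Two different people can have names one letter apart, which is why a link back to the original photo matters for checking.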

Transforming the paper documents into a useable database required roughly one month of full-time work, spread between our team members. Yet this was only the first step. The content of the visitor log itself was fairly sparse, in most cases only containing dates and names, and little information about people’s affiliations, let alone the content of their meetings. To refine it, we scraped the parliament’s website and connected the names that occur in the log with the identities and affiliations of members of parliament and party staff. Using simple crowdsourcing techniques and public sources of information, we looked at a sample of the 500 people that most frequently visited parliament and tried to understand who they were working for. This stage of refinement required some tricky editorial choices, determining which questions we wanted the data to answer. We chose for instance to classify the most frequent visitors, to be able to answer questions about which parties are most frequently connected to particular types of visitors.
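The shape of that refinement step can be sketched in a few lines of base R. Everything here is invented for illustration: the rows, the `host_party` column (a hypothetical stand-in for the party of the person visited) and the visitor categories are assumptions, not our actual schema or data.

```r
# Invented visit records; `host_party` is a hypothetical column.
visits <- data.frame(
  visitor    = c("A. Lobbyist", "B. Unionist", "A. Lobbyist",
                 "A. Lobbyist", "C. Citizen"),
  host_party = c("Party X", "Party Y", "Party X", "Party Z", "Party Y"),
  stringsAsFactors = FALSE
)

# Hand-made classification of the most frequent visitors (also invented).
visitor_types <- data.frame(
  visitor = c("A. Lobbyist", "B. Unionist"),
  type    = c("private industry", "union"),
  stringsAsFactors = FALSE
)

# Who visits most often?
top_visitors <- sort(table(visits$visitor), decreasing = TRUE)

# Attach the classification; visitors not yet classified stay in the data.
classified <- merge(visits, visitor_types, by = "visitor", all.x = TRUE)
classified$type[is.na(classified$type)] <- "unclassified"

# Cross-tabulate visitor type against the party visited.
by_party <- table(type = classified$type, party = classified$host_party)

print(top_visitors)
print(by_party)
```

The cross-tabulation is the part that reflects an editorial choice: the categories decide which questions the table can answer.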

Collaboration with the media

For data geeks like us, being able to access this information was exciting enough. Yet for our final goal, making a case for better regulation on lobbying, releasing a dataset was not sufficient. We chose to partner with investigative journalists, who would be able to present, verify and contextualise the information to a broader audience. Our own analytical efforts focused on broader patterns and regularities in the data, while journalists who have been covering Finnish politics for a long time were able to find the most relevant facts and narratives from the data. We gave the data under an embargo to some key journalists, so they would have the time and resources to work on the information. Afterwards the data was available to all journalists who requested it for their own work.

We were lucky that there was sustained media interest in the information. Alfred Harmsworth, the founder of the Daily Mirror, is credited with the quote “news is what somebody somewhere wants to suppress; the rest is advertising”. In the same vein, when the story broke that the Finnish parliament had started deleting the most recent data about visitors, interest in the historical records was guaranteed.

Despite the heightened interest, we also became conscious of how difficult it was for the media to interpret data. This was not just because of a lack of technical skills. There simply was such a significant amount of information – details of about 25,000 visits to parliament – that isolating the most meaningful pieces of information or getting an overview of what had happened was a challenge. For news organisations, for whom the dedication of staff even for days on a topic was a significant undertaking, investing in this kind of research was a risk. Even if they spent the time going through the data, the returns were uncertain.

After we released the data to a wider range of publications, many news outlets ended up running fairly superficial stories based on the data, focusing, for instance, on the most frequently occurring names and other quantities, instead of going through the investigative effort of interrogating the significance of the meetings described in the logs. Information that is in the form of lists lends itself easily to clickbait-like titles. For media outlets that could not wait for their competition to beat them to it, this was to be expected. The news coverage was probably weakened by the fact that we could not share the data with a broader public, because it contained personal details that were potentially sensitive. Naomi Colvin, for instance, has suggested that searchable public databases, which open information to wider scrutiny and discovery, can help to beat the fast tempo of the news cycle and maintain the relevance of datasets.

The stories that resulted from the data

What did journalists find when they wrote stories based on the data? Suomen Kuvalehti ran an in-depth feature that included investigations into the private companies that were most active in lobbying. These included a Russian-backed payday loans provider as well as Uber, whose well-funded efforts extend even to Finland. YLE, the Finnish public broadcaster, described the privileged access that representatives of nuclear power enjoyed, while the newspaper Aamulehti showed how individual meetings between legislators and the finance industry had managed to shift regulation. Our own study of the data showed how representatives of private industry were more likely to have access to parties of the governing coalition, while unions and NGOs met more often with opposition parties.

In essence, the stories provided detail about how well-resourced actors were best placed to influence legislation. They confirmed, a cynical person might note, what most people had thought to be the case in the first place. Yet having clear documentation of this phenomenon may well make it harder to ignore. This argument has often been made about recent large leaks, the value of which may lie not in making new facts public but in providing the concrete data that makes the issue impossible to ignore. “From now on, we can’t pretend we don’t know”, as Slavoj Zizek ironically noted on Wikileaks.

Overall the media response was large. According to our media tracking, at least 50 articles were written in response to the release of the data. Several national newspapers ran editorials on the need for establishing rules for lobbying. In response, four political parties, out of the eight represented in parliament, declared that they would start publishing their own meetings with lobbyists. Parliament was forced to concede, and began to release daily snapshots of data about meetings in an electronic format. These were significant victories, both in practices of transparency and in changing the policy agenda.

On the importance of time and resources

For a small NGO such as ours, the digitising and processing of information on this scale would not have been possible until recently, perhaps even five years ago. Our work was expedited by the availability of powerful open source tools for difficult tasks such as optical character recognition and correcting errors. Being a small association had its advantages as well, as we were aided by the network around the organisation, from which we were able to draw volunteers in areas from data science to media strategy. In many cases governments contain the consequences of releasing information through a kind of excess of transparency: they release so many documents, often in formats that are hard to process, that their meaning becomes muddled. When documents can be automatically processed and queried, this strategy weakens.

Still, it would be naive to think that technology is enough to make information advocacy effective or enough to allow everybody to participate in it. This line of work was possible due to some people’s commitment and personal sacrifice that spanned several years, as well as significant amounts of funding at the right moments. Notably, no newsroom would by itself have had the resources to sustain the several months of labour that working through the data required. The strategy of being less “polite”, in Tom Steinberg’s terms, may well be desirable, but the obvious challenge is securing the resources to do it.


Author bios

Dr. Aleksi Knuutila is a social scientist with a focus on civic technologies and the politics of data, and an interest in applying both computational and qualitative methods for investigation and research. As a researcher with Open Knowledge Finland, he has advised the Finnish government on their personal data strategy and studied political lobbying using public sources of data. He is currently working on a toolkit for using freedom of information for investigating how data and analytics are used in the public sector.

Georgia Panagiotidou is a software developer and data visualisation designer, with a focus on the intersections between media and technology. She was part of the Helsingin Sanomat data desk, where she worked to make data stories more reader-friendly. Now, among other things, she works on data journalism projects, most recently with Open Knowledge Finland to digitise and analyse the Finnish parliament’s visitor log. Her interests lie in open data, civic tech, data journalism and media art.

We would like to thank the following people who gave an invaluable contribution to the work: Sneha Das, Jari Hanska, Antti Knuutila, Erkka Laitinen, Juuso Parkkinen, Tuomas Peltomäki, Aleksi Romanov, Liisi Soroush, Salla Thure

District Dispatch: ALA signs trade policy principles

planet code4lib - Wed, 2017-11-15 18:21

Today ALA signed the Washington Principles on Copyright Balance in Trade Agreements, joining over 70 international copyright experts, think tanks and public interest groups. The Principles address the need for balanced copyright policy in trade agreements.

Over the years, trade policies have increasingly implicated copyright and other IP laws, sometimes creating international trade policies that conflict with U.S. copyright law by enforcing existing rights holder interests without considering the interests of new industry entrants and user rights to information. U.S. copyright law is exemplary in promoting innovation, creativity and information sector industries—software, computer design, research—because of fair use, safe harbor provisions, and other exceptions and limitations lacking in other countries.

The Principles were developed at a convening of U.S., Canadian and Mexican law professors and policy experts held by American University Washington College of Law’s Program on Information Justice and Intellectual Property (PIJIP). These three countries are currently engaged in NAFTA negotiations.

The Washington Principles:
• Protect and promote copyright balance, including fair use
• Provide technology-enabling exceptions, such as for search engines and text- and data-mining
• Require safe harbor provisions to protect online platforms from users’ infringement
• Ensure legitimate exceptions for anti-circumvention, such as documentary filmmaking, cybersecurity research, and allowing assistive reading technologies for the blind
• Adhere to existing multilateral commitments on copyright term
• Guarantee proportionality and due process in copyright enforcement

The Principles are supplemented by “Measuring the Impact of Copyright Balance,” new research from American University that finds that balanced copyright policies in foreign countries have a positive effect on the information sector industries in terms of net income and total sales and in the local production of creative and scholarly works and other high-quality output. These positive effects, however, do not harm the revenues of traditional content and entertainment industries. This suggests that industry, creativity and research are more likely to thrive under more open user rights policies that allow for experimentation and transformative use.

The post ALA signs trade policy principles appeared first on District Dispatch.

District Dispatch: Librarians comment on Education Department priorities

planet code4lib - Wed, 2017-11-15 18:01

The American Library Association and librarians across the country submitted comments to the Department of Education (ED) in response to its 11 proposed priorities. The priorities, standard for a new administration, are a menu of goals for the ED to use for individual discretionary grant competitions. Over 1,100 individual comments were filed with the ED, including several dozen from the library community. 

ALA noted the important role of public and school libraries in several key priority areas and how librarians help students of all ages. ALA commented on the role of libraries in providing flexible learning environments, addressing STEM needs, promoting literacy, expanding economic opportunity, as well as assisting veterans in achieving their educational goals.

In its letter to the ED, ALA noted:

“Libraries play an instrumental role in advancing formal educational programs as well as informal learning from pre-school through post-secondary education and beyond. Libraries possess relevant information, technology, experts, and community respect and trust to propel education and learning.”

Many librarians responded to ALA’s Action Alert, urging the ED to include libraries in its priorities, reflecting the range of services available at public and school libraries.

Responding to the need for STEM and computer skills development in Priority 6, one Baltimore City library media specialist wrote:

“Computer science is now foundational knowledge every student needs, yet students, particularly students of color and students on free and reduced lunch in urban and rural areas, do not have access to high-quality computer science courses. Females are not participating in equal numbers in the field of computer science or K-12 computer science courses. This is a problem the computer science community can address by giving teachers access to high-quality computer science professional development and schools access to courses focused on serving underserved communities.”

Highlighting the importance of certified librarians at school libraries, one commenter noted that “certified librarians found in school libraries are instructional partners, curriculum developers, and program developers that meet the objectives of their individual school’s improvement plan. School libraries are a foundational support system for all students.”

Echoing these comments, another school library advocate stated: “School libraries and school librarians transform student learning. They help learners to become critical thinkers, enthusiastic readers, skillful researchers, and ethical users of information. They empower learners with the skills needed to be college, career, and community ready.”

The comment period has closed, but individual comments will be available on the ED website.

The post Librarians comment on Education Department priorities appeared first on District Dispatch.

William Denton: COUNTER data made tidy

planet code4lib - Wed, 2017-11-15 13:16

At work I’m analysing usage of ebooks, as reported by vendors in COUNTER reports. The Excel spreadsheet versions are ugly but a little bit of R can bring them into the tidyverse and give you nice, clean, usable data that meets the three rules of tidy data:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

There are two kinds of COUNTER reports for books: BR1 (“Number of Successful Title Requests by Month and Title”) counts how many times people looked at a book and BR2 (“Number of Successful Section Requests by Month and Title”) counts how many times they looked at a part (like a chapter) of a book. The reports are formatted in the same human-readable way, so this code works for both, but be careful to handle them separately.

Fragment of COUNTER report

They start with seven lines of metadata about the report, and then you get the actual data. There are a few required columns, one of which is the title of the book, but that column doesn’t have a heading! It’s blank! Further to the right are columns for each month of the reporting period. Rows are for books or sections, but there is also a “Total for all titles” row that sums them all up.

This formatting is human-readable but terrible for machines. Happily, that’s easy to fix.
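As a rough illustration of what the fix amounts to, here is a toy wide-to-long reshape in base R. The two book titles, the month column names and the counts are all invented for the example, and the real pipeline later in this post uses tidyr's gather rather than this hand-rolled version.

```r
# Toy COUNTER-style table: one row per book, one column per month.
# Titles, months and counts are made up for illustration.
wide <- data.frame(
  title = c("Book A", "Book B"),
  `2014-01` = c(3, 0),
  `2014-02` = c(1, 2),
  check.names = FALSE
)

# Every column except the title is treated as a month column.
month_cols <- setdiff(names(wide), "title")

# Tidy shape: one row per (title, month) observation.
long <- data.frame(
  title = rep(wide$title, times = length(month_cols)),
  month = rep(month_cols, each = nrow(wide)),
  usage = unlist(wide[month_cols], use.names = FALSE)
)

print(long)
```

Each variable (title, month, usage) now has its own column and each title-month pair has its own row, which is the shape the three rules ask for.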

First, in R, load in some packages:

  • the basic set of tidyverse packages;
  • readxl, to read Excel spreadsheets;
  • lubridate, to manipulate dates; and
  • yulr, my own package of some little helper functions. If you want to use it you’ll need to install it specially, as explained in its documentation.
library(tidyverse)
library(readxl)
library(lubridate)
library(yulr)

As it happens the COUNTER reports are all in one Excel spreadsheet, organized by sheets. Brill’s 2014 report is in the sheet named “Brill 2014,” so I need to pick it out and work on it. The flow is:

  • load in the sheet, skipping the first seven lines (including the one that tells you if it’s BR1 or BR2)
  • cut out columns I don’t want with a minus select
  • use gather to reshape the table by moving the month columns to rows, where the month name ends up in a column named “month;” the other fields that are minus selected are carried along unchanged
  • rename two columns
  • reformat the month name into a proper date, and rename the unnamed title column (which ended up being called X__1) while truncating it to 50 characters
  • filter out the row that adds up all the numbers
  • reorder the columns for human viewing
brill_2014 <- read_xlsx("eBook Usage.xlsx", sheet = "Brill 2014", skip = 7) %>%
  select(-ISSN, -`Book DOI`, -`Proprietary Identifier`, -`Reporting Period Total`) %>%
  gather(month, usage, -X__1, -ISBN, -Publisher, -Platform) %>%
  rename(platform = Platform, publisher = Publisher) %>%
  mutate(month = floor_date(as.Date(as.numeric(month), origin = "1900-01-01"), "month"),
         title = substr(X__1, 1, 50)) %>%
  filter(! title == "Total for all titles") %>%
  select(month, usage, ISBN, platform, publisher, title)

Looking at this I think that date mutation business may not always be needed, but some of the date formatting I had was wonky, and this made it all work.
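For anyone puzzling over that date step, here is a small base R sketch of what it appears to be doing. It assumes the month headers arrive as Excel serial numbers stored as text (which is what the as.numeric(month) call suggests); the serial 41640 and the helper first_of_month are illustrative, not taken from the real spreadsheet.

```r
# Assumed input: an Excel (1900 date system) serial for 1 January 2014,
# read in as text the way a column header would be.
serial <- "41640"

# "1899-12-30" is the origin that R's as.Date documentation suggests
# for serials from Windows Excel.
exact <- as.Date(as.numeric(serial), origin = "1899-12-30")

# The origin used in the pipeline above lands two days later ...
shifted <- as.Date(as.numeric(serial), origin = "1900-01-01")

# ... but snapping to the first of the month hides the difference.
# (Hypothetical base R stand-in for lubridate's floor_date(x, "month").)
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

print(exact)
print(shifted)
print(first_of_month(shifted))
```

That would explain why the step "made it all work" for headers that fall on the first of the month. A serial for a date in the last two days of a month would roll into the next month with the 1900-01-01 origin, so the 1899-12-30 origin is the safer choice if the headers are ever anything other than month starts.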

That line above just works for one year. I had four years of Brill data, and didn’t want to repeat the long line for each, because if I ever need to make a change I’d have to make it four times and if I missed one there’d be a problem. This is the time to create a function. Now my code looks like this:

counter_parse_brill <- function (x) {
  x %>%
    select(-ISSN, -`Book DOI`, -`Proprietary Identifier`, -`Reporting Period Total`) %>%
    gather(month, usage, -X__1, -ISBN, -Publisher, -Platform) %>%
    rename(platform = Platform, publisher = Publisher) %>%
    mutate(month = floor_date(as.Date(as.numeric(month), origin = "1900-01-01"), "month"),
           title = substr(X__1, 1, 50)) %>%
    filter(! title == "Total for all titles") %>%
    select(month, usage, ISBN, platform, publisher, title)
}

brill_2014 <- read_xlsx("eBook Usage.xlsx", sheet = "Brill 2014", skip = 7) %>% counter_parse_brill()
brill_2015 <- read_xlsx("eBook Usage.xlsx", sheet = "Brill 2015", skip = 7) %>% counter_parse_brill()
brill_2016 <- read_xlsx("eBook Usage.xlsx", sheet = "Brill 2016", skip = 7) %>% counter_parse_brill()
brill_2017 <- read_xlsx("eBook Usage.xlsx", sheet = "Brill 2017", skip = 7) %>% counter_parse_brill()

brill <- rbind(brill_2014, brill_2015, brill_2016, brill_2017)

That looks much nicer in Emacs (in Org, of course):

R in Org

I have similar functions for other vendors. They are all very similar, but sometimes a (mandatory) Book DOI field or something else is missing, so a little fiddling is needed. Each vendor’s complete data goes into its own tibble, which I then glue together. Then I delete all the rows where no month is defined (which, come to think of it, I should investigate to make sure these aren’t being introduced by some mistake I made in reshaping the data), I add the ayear column so I can group things by academic year, and where usage of a book in a given month is 0, I make it 0 instead of NA.

ebook_usage <- rbind(brill, ebl, ebook_central, iet, scholars_portal, spie)
ebook_usage <- ebook_usage %>% filter(! is.na(month))
ebook_usage <- ebook_usage %>% mutate(ayear = academic_year(month))
ebook_usage$usage[is.na(ebook_usage$usage)] <- 0
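The academic_year helper is not defined in this post (it presumably comes from yulr), so here is a hedged base R sketch of the same three clean-up steps on a toy data frame. The function academic_year_sketch and its September start month are assumptions made for illustration; the only thing taken from the post is that January 2014 should land in academic year 2013, as in the sample output.

```r
# Hypothetical stand-in for academic_year(): months from start_month
# onward belong to that calendar year, earlier months to the year before.
# The September default is an assumption, not something the post states.
academic_year_sketch <- function(d, start_month = 9L) {
  y <- as.integer(format(d, "%Y"))
  m <- as.integer(format(d, "%m"))
  ifelse(m >= start_month, y, y - 1L)
}

# Toy data: one row with no month, one row with no usage figure.
toy <- data.frame(
  month = as.Date(c("2014-01-01", NA, "2014-09-01")),
  usage = c(NA, 3, 5)
)

toy <- toy[!is.na(toy$month), ]                # drop rows with no month
toy$ayear <- academic_year_sketch(toy$month)   # add the academic year
toy$usage[is.na(toy$usage)] <- 0               # missing usage becomes 0

print(toy)
```

Counting how many rows the first step drops, before dropping them, would be one cheap way to check whether the reshaping is introducing the empty months.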

The data now looks like this (truncating the title even more for display here):

month       usage  ISBN           platform  publisher  title    ayear
2014-01-01      0  9789004216921  BOPI      Brill      A Comme   2013
2014-01-01      0  9789047427018  BOPI      Brill      A Wande   2013
2014-01-01      0  9789004222656  BOPI      Brill      A World   2013

> str(ebook_usage)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1343899 obs. of 7 variables:
 $ month    : Date, format: "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01" ...
 $ usage    : num 0 0 0 0 0 0 0 0 0 0 ...
 $ ISBN     : chr "9789004216921" "9789047427018" "9789004222656" "9789004214149" ...
 $ platform : chr "BOPI" "BOPI" "BOPI" "BOPI" ...
 $ publisher: chr "Brill" "Brill" "Brill" "Brill" ...
 $ title    : chr "A Commentary on the United Nations Convention on t" "A Wandering Galilean: Essays in Honour of Seán Fre" "A World of Beasts: A Thirteenth-Century Illustrate" "American Diplomacy" ...
 $ ayear    : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...

The data is now ready for analysis.

HangingTogether: Who should take the Survey of Research Information Management Practices?

planet code4lib - Wed, 2017-11-15 11:00

Is it someone in the library? Or maybe in the office of institutional research? What about the office of the vice president for research? Or maybe the provost’s or rector’s office?

It depends.

Practices vary regionally and also institutionally, and we hope to learn more about these variations through the international Survey of Research Information Management Practices, which has been collaboratively developed by OCLC Research and euroCRIS. As we articulated in a recent OCLC Research position paper, Research Information Management: Defining RIM and the Library’s Role, because of the enterprise nature of the data inputs and uses, there are numerous institutional stakeholders in research information management, including the research office, institutional research, provost or rector, library, human resources, registrar, and campus communications. In some institutions, such as the University of Arizona, the library has assumed a leading role. In other institutions, such as Universität Münster, the research office takes the lead. 

Connecting the survey with the right person within each institution is a central challenge for this study. And that’s where we need your help in promoting this survey and getting it to the right person at your own institution. While completing the survey itself should take only 10-30 minutes, we recognize that it may take some additional legwork to answer all of the questions on behalf of your institution. That’s why we’ve provided a PDF version of the survey for you to review in advance. We also want to encourage all research institutions and universities to participate, regardless of the status of RIM adoption. One survey per institution, please.

In the meantime, OCLC and euroCRIS are working to promote the survey through multiple channels, including:

Our goal is to collect and share meaningful information on behalf of all stakeholders within the research information management landscape worldwide. We need the leadership and engagement of each of you in the community. Thanks for participating, and contact me if you have questions.

Rebecca Bryant, PhD

Cynthia Ng: Horizon Primer in Brief: Circulation Rules

planet code4lib - Tue, 2017-11-14 23:32
I have spent hours mapping out circulation rules the last few weeks in preparation for a pilot project we are about to run at my library. In the process, I have learnt a great deal about circulation parameters and privileges that I’ve put together in a brief primer below. If I missed anything, please let … Continue reading Horizon Primer in Brief: Circulation Rules

Evergreen ILS: T-Shirts! Voting!

planet code4lib - Tue, 2017-11-14 20:46

Last year we did our first community t-shirt featuring a quote from a community member (pulled from the IRC quotes database).  This year we are doing it again but it will be a new quote.  Please rank your favorites from 1 to 6 and the one with the strongest weight will be used. This shirt will be available at the next International Evergreen Conference for sale along with the limited supply stock of the existing “I’m not a cataloger but I know enough MARC to be fun at parties” shirt.

Voting is done here:

Only one vote per person but make your opinion known!

Open Knowledge Foundation: Visual gateways into science: Why it’s time to change the way we discover research

planet code4lib - Tue, 2017-11-14 15:15

Have you ever noticed that it is really hard to get an overview of a research field that you know nothing about? Let’s assume for a minute that a family member or a loved one of yours has fallen ill and unfortunately, the standard treatment isn’t working. Like many other people, you now want to get into the research on the illness to better understand what’s going on.

You proceed to type the name of the disease into PubMed or Google Scholar – and you are confronted with thousands of results, more than you could ever read.

It’s hard to determine where to start, because you don’t understand the terminology in the field, you don’t know what the main areas are, and it’s hard to identify important papers, journals, and authors just by looking at the results list. With time and patience you could probably get there. However, this is time that you do not have, because decisions need to be made. Decisions that may have grave implications for the patient.

If you have ever had a similar experience, you are not alone. We are all swamped with the literature, and even experts struggle with this problem. In the Zika epidemic in 2015 for example, many people scrambled to get an overview of what was until then an obscure research topic. This included researchers, but also practitioners and public health officials. And it’s not just medicine; almost all areas of research have become so specialized that they’re almost impenetrable from the outside.

But the thing is, there are many people on the outside that could benefit from scientific knowledge. Think about journalists, fact checkers, policy makers or students.

They all have the same problem – they don’t have a way in.

Reuse of scientific knowledge within academia is already limited, but when we’re looking at transfer to practice, the gap is even wider. Even in application-oriented disciplines, only a small percentage of research findings ever influence practice – and even if they do so, often with a considerable delay.

At Open Knowledge Maps, a non-profit organization dedicated to improving the visibility of scientific knowledge for science and society, it is our mission to change that. We want to provide visual gateways into research – because we think it is important not only to provide access to research findings, but also to enable discoverability of scientific knowledge.

At the moment, there is a missing link between accessibility and discoverability – and we want to provide that link.

Imagine a world where you can get an overview of any research field at a glance, meaning you can easily determine the main areas and relevant concepts in the field. In addition, you can instantly identify a set of papers that are relevant for your information need. We call such overviews knowledge maps. You can find an example for the field of heart diseases below. The bubbles represent the main areas and relevant papers are already attached to each of the areas.

Now imagine that each of these maps is adapted to the needs of different types of users: researchers, students, journalists or patients. And not only that: they are all structured and connected and they contain annotated pathways through the literature as to what to read first, and how to proceed afterwards.

This is the vision that we’ve been working on for the past 1.5 years as a growing community of designers, developers, communicators, advisors, partners, and users. On our website, we are offering an openly accessible service, which allows you to create a knowledge map for any discipline. Users can choose between two databases: Bielefeld Academic Search Engine (BASE) with more than 110 million scientific documents from all disciplines, and PubMed, the large biomedical database with 26 million references. We use the 100 most relevant results for a search term as reported by the respective database as a starting point for our knowledge maps. We use text similarity to create the knowledge maps. The algorithm groups those papers together that have many words in common. See below for an example map of digital education.

See map at:
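To give a feel for what "grouping papers that have many words in common" can mean, here is a minimal base R sketch. It is not the Open Knowledge Maps implementation: the four titles are invented, and Jaccard word overlap with average-linkage clustering is just one simple way to turn shared vocabulary into groups.

```r
# Invented titles standing in for search results.
papers <- c(
  a = "digital education in schools",
  b = "digital education for teachers",
  c = "heart disease risk factors",
  d = "risk factors for heart disease"
)

# Turn each title into a set of lower-case words.
tokens <- lapply(strsplit(tolower(papers), "[^a-z]+"), unique)

# Jaccard similarity: shared words divided by all distinct words.
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

n <- length(tokens)
sim <- matrix(0, n, n, dimnames = list(names(papers), names(papers)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- jaccard(tokens[[i]], tokens[[j]])
  }
}

# Cluster on distance (1 - similarity) and cut the tree into two groups.
groups <- cutree(hclust(as.dist(1 - sim), method = "average"), k = 2)
print(groups)
```

On this toy input the two education titles end up in one group and the two heart disease titles in the other. A real service would need much more than this (stop-word removal, weighting, a layout step to draw the bubbles), none of which is attempted here.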

We have received a lot of positive feedback on this service from the community. We are honored and humbled by hundreds of enthusiastic posts in blogs, and on Facebook and Twitter. The service has also been featured on the front pages of reddit and HackerNews, and recently, we won the Open Minds Award, the Austrian Open Source Award. Since the first launch of the service in May 2016, we have had more than 200,000 visits on Open Knowledge Maps. Currently, more than 20,000 users leverage Open Knowledge Maps for their research, work, and studies per month.

The “Open” in Open Knowledge Maps does not only stand for open access articles – we want to go the whole way of open science and create a public good.

This means that all of our software is developed open source. You can also find our development roadmap on GitHub and leave comments by opening an issue. The knowledge maps themselves are licensed under a Creative Commons Attribution license and can be freely shared and modified. We will also openly share the underlying data, for example as Linked Open Data. This way, we want to contribute to the open science ecosystem that our partners, including Open Knowledge Austria, rOpenSci, ContentMine, the Internet Archive Labs and Wikimedia are creating.

Open Knowledge International has played a crucial role in incubating the idea of an open discovery platform, by way of a Panton Fellowship where the first prototype of the search service was created. Since then, the Open Knowledge Network has enthusiastically supported the project, in particular the Austrian chapter as well as Open Knowledge International, Open Knowledge Germany and other regional organisations. Members of the international Open Knowledge community have become indispensable for Open Knowledge Maps, be it as team members, advisors or active supporters. A big shout-out and thank you to you!

As a next step, we want to work on structuring and connecting these maps – and we want to turn discovery into a collaborative process. After all, someone has already gone that way before, and they have the overview and the insights. We want to enable people to communicate this knowledge so that we can start laying pathways through science for each other. We have created a short video to illustrate this idea:

HangingTogether: How much metadata is practical?

planet code4lib - Tue, 2017-11-14 14:00

That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Jennifer Baxmeyer of Princeton, MJ Han of University of Illinois at Urbana-Champaign, and Stephen Hearn of the University of Minnesota. With the increasing availability of online metadata, we are seeing metadata added to discovery environments representing objects of widely varying granularity. For example, an article in Physical Review Letters—Precision Measurement of the Top Quark Mass in Lepton + Jets Final State—has approximately 300 author names for a five page article (some pictured here).

This seems disproportionate, especially when other objects with many contributors such as feature films and orchestral recordings are represented by only a relative handful of the associated names. If all the names associated with a film or a symphony recording were readily available as metadata, would it be appropriate to include them in library cataloging? Ensuring optimal search results in an environment that mixes metadata from varying sources, with differing models of granularity and extensiveness, poses challenges for catalogers and metadata managers.

Abbreviated forms of author names on journal articles make it difficult, and often impossible, to match them to the correct authority form, if it exists. Some discovery systems show only the first two or three lines of author names.  Research Information Management systems make it easier to apply some identity management for local researchers so that they are correctly associated with the articles they have written, which are displayed as part of their profiles. (See for example, Scholars@Duke, Experts@Minnesota or University of Illinois at Urbana-Champaign’s Experts research profiles.)  A number noted that attempts to encourage their researchers to include an ORCID (Open Researcher and Contributor Identifier) in their profiles have met with limited success. Linked data was posited as a potential way to access information across many environments, including those in Research Information Systems, but would require wider use of identifiers.

Much of the discussion was about receiving not enough good metadata from vendors rather than too much. A number of the metadata managers viewed quality as at least as important as granularity. Some libraries reject records that do not meet their minimum standards, while others apply batch updates before loading the records. One criterion for “good enough” metadata is whether it is sufficient to generate an accurate citation. Metadata quality has become a key concern, as evidenced by the Digital Library Federation’s Metadata Assessment Working Group, formed to “measure, evaluate and assess the metadata” in a variety of digital library systems. Record-by-record enhancement was widely considered impractical.

Information surplus will only increase, with accompanying varying levels of metadata granularity. It remains to be seen how the community can bridge if not integrate the various silos of information.



Harvard Library Innovation Lab: Overheard in LIL - Episode 2

planet code4lib - Tue, 2017-11-14 00:00

This week:

A chat bot that can sue anyone and everything!

Devices listening to our every move

And an interview with Jack Cushman, a developer in LIL, about built-in compassion (and cruelty) in law, why lawyers should learn to program, weird internet, and lovely podcast gimmicks (specifically that of Rachel and Griffin McElroy's Wonderful! podcast)

Starring Adam Ziegler, Anastasia Aizman, Brett Johnson, Casey Gruppioni, and Jack Cushman.

Library Tech Talk (U of Michigan): The Voyages of a Digital Collections Audit: Episode 1 (Charting Our Course)

planet code4lib - Tue, 2017-11-14 00:00

The Digital Content & Collections department begins an ambitious full audit of our 280+ digital collections. In this first in a blog series about the endeavor, I note why we are doing this, how we surveyed the digital landscape, how we cemented alliances with others who will help us along the way, and where we're heading next.

District Dispatch: Creating the NLLD special collection

planet code4lib - Mon, 2017-11-13 14:00

This is a guest post from Rosalind Seidel, our fall Special Collections Intern joining us from the University of Maryland (UMD). Rosie has one semester left in her MLIS program at UMD, and hopes to become a rare books and special collections librarian. She graduated from Loyola University New Orleans with a Bachelor of Arts in English Literature and Medieval Studies.

Covers of NLLD participant folders. From top, left to right: “The Card with the Charge” sticker from 1988; “Information Power” button from 1965; “A Word to the Wise” sticker from 1983; “Take time to read” sticker from 1987; “America’s greatest bargain: the library” sticker from 1980.

My first project working in the American Library Association’s Washington Office involved inventorying, processing and creating a finding aid for the wealth of National Library Legislative Day files. Next, this collection will be sent to the ALA Archives where it will be digitized for future access.

Before I began this project, I was unfamiliar with National Library Legislative Day (NLLD) and its purpose. Delving into the files, I quickly learned NLLD is an annual event spanning two days where hundreds of librarians, information professionals and library supporters from across the country come together in Washington, D.C., to meet with their representatives and to advocate for our nation’s libraries.

The files I worked with ranged in date from 1973 to 2016. Such an expanse of years saw quite the development in advocacy for libraries across a 43-year period. Through the files, I got to look at the country and information policy in a whole new light. I began with files from 2016, working my way backward. As I went, it was interesting to see where certain issues arose, and how long they remained focal points. It was surreal, for example, to reach 1994, the year I was born, and see what the ongoing dialogue was, such as “Kids Need Libraries” and “How Stupid Can We Get?” Surely it is because of the work of library advocates on NLLD that I grew up with the state of libraries and access to information that I did, and I owe them a debt of gratitude. Going forward as a young information professional, it will be my place to do the same.

The recurring issues in NLLD’s long history include the Library Services and Construction Act, the Elementary and Secondary Education Act, the Higher Education Act, the White House Conference on Libraries and Information Services, copyright, Title 44, the Library Services and Technology Act, the National Endowment for the Humanities, federal funding for libraries, and access to government information… just to name a few.

What I liked about the NLLD files is that the handouts usually took into account all levels of expertise of NLLD participants which, in turn, made the files an approachable collection. The handouts worked to make all NLLD events, such as lobbying, accessible even to the newest of participants and they also informed and educated participants about the issues on the agenda. I also got to handle letters from various U.S. Presidents in support of National Library Week. From those and other documents, I was able to see how information professionals viewed different administrations and, because of that, when NLLD efforts needed to be strengthened.

Overall, I valued the opportunity I was given to learn more about policy, including policies I was previously unaware of, which has better informed me about the history and state of librarianship. As my internship continues, I will be given the chance to explore the Washington Office’s history and the work they do even more. My internship has allowed me to interact with libraries, government, the information field, and history in incredible ways that I would never have anticipated. I look forward to what is to come.

NLLD 2018 will take place on May 7 and 8. Registration will open on December 1, 2017. To learn more about participating, visit:

The post Creating the NLLD special collection appeared first on District Dispatch.

Richard Wallis: Structured Data: Helping Google Understand Your Site

planet code4lib - Mon, 2017-11-13 13:40

Firstly let me credit the source that alerted me to the subject of this post.

Jennifer Slegg of TheSEMPost wrote an article last week entitled Adding Structured Data Helps Google Understand & Rank Webpages Better. It was her report of a conversation with Google’s Webmaster Trends Analyst Gary Illyes at PUBCON Las Vegas 2017.

Jennifer’s report included some quotes from Gary which go a long way towards clarifying the power, relevance, and importance for Google for embedding Structured Data in web pages.

Those that have encountered my evangelism for doing just that will know that there have been many assumptions about the potential effects of adding Structured Data to your HTML, but this is the first confirmation of those assumptions by Google folks that I am aware of.

To my mind there are two key quotes from Gary, firstly:

But more importantly, add structure data to your pages because during indexing, we will be able to better understand what your site is about.

In the early [only useful for Rich Snippets] days of Schema.org, Google representatives went out of their way to assert that adding Schema.org to a page would NOT influence its position in search results. More recently, ‘non-committal’ could be described as the answer to questions about Schema.org and indexing. Gary’s phrasing is a little interesting, “during indexing, we will be able to better understand”, but you can really only draw certain conclusions from it.

So, this answers one of the main questions I am asked by those looking to me for help in understanding and applying Schema.org:

“If I go to the trouble of adding Schema.org to my pages, will Google [and others] take any notice?” To paraphrase Mr Illyes: yes.

The second key quote:

And don’t just think about the structured data that we documented on [developers.google.com]. Think about any schema that you could use on your pages. It will help us understand your pages better, and indirectly, it leads to better ranks in some sense, because we can rank easier.

So structured data is important, take any schema from [Schema.org] and implement it, as it will help.

This answers directly another common challenge I get when recommending the use of the whole Schema.org vocabulary, and its extensions, as a source of potential Structured Data types for marking up your pages.

The challenge being “where is the evidence that any schema, not documented in the Google Developers Structured Data Pages, will be taken notice of?”

So thanks Gary, you have just made my job, and the understanding of those that I help, a heck of a lot easier.

Apart from those two key points there are some other interesting takeaways from his session as reported by Jennifer.

Their recent increased emphasis on things Structured Data:

We launched a bunch of search features that are based on structured data. It was badges on image search, jobs was another thing, job search, recipes, movies, local restaurants, courses and a bunch of other things that rely solely on structure data, annotations.

It is almost like we started building lots of new features that rely on structured data, kind of like we started caring more and more and more about structured data. That is an important hint for you if you want your sites to appear in search features, implement structured data.

Google’s further increased Structured Data emphasis in the near future:

Next year, there will be two things we want to focus on. The first is structured data. You can expect more applications for structured data, more stuff like jobs, like recipes, like products, etc.

For those who have been sceptical as to the current commitment of Google and others to Schema.org and Structured Data, this should go some way towards settling your concerns.

It is at this point I add in my usual warning against rushing off and liberally sprinkling Schema.org terms across your web pages. It is not like keywords.

The search engines are looking for structured descriptions (the clue is in the name) of the Things (entities) that your pages are describing; the properties of those things; and the relationships between those things and other entities.

Behind Schema.org and Structured Data are some well established Linked Data principles, and to get the most effect from your efforts, it is worth recognising them.

Applying Structured Data to your site is not rocket science, but it does need a little thought and planning to get it right. With organisations such as Google taking notice, like most things in life, it is worth doing right if you are going to do it at all.


Ed Summers: Prospectus

planet code4lib - Mon, 2017-11-13 05:00

I’ve been trying to keep this blog updated as I move through the PhD program at the UMD iSchool. Sometimes it’s difficult to share things here because of fear that the content or ideas are just too rough around the edges. The big assumption being that anybody even finds it, and then finds the time to read the content.

As with most PhD programs the work is leading up to the dissertation. I’m finishing my coursework this semester and so I have put together a prospectus for the research I’d like to do in my dissertation. I’m going to spend the next 8 months or so doing a lot of background reading and writing about it, in order to set up this research. I imagine this prospectus will get revised some more before I share it with my committee, and the trajectory itself will surely change as I work through it. But I thought I’d share the prospectus in this preliminary state to see if anyone has suggestions for things to read or angles to take.

Many thanks to my advisor Ricky Punzalan for his help getting me this far.

Appraisal Practices in Web Archives

It is difficult to imagine today’s scientific, cultural and political systems without the web and the underlying Internet. As the web has become a dominant form of global communications and publishing over the last 25 years we have witnessed the emergence of web archiving as an increasingly important activity. Web archiving is the practice of collecting content from the web for preservation, which is then made accessible at another part of the web known as a web archive. Developing record keeping practices for web content is extremely important for the production of history (Brügger, 2017) and for sustaining the networked public sphere (Benkler, 2006). However, even with widespread practice we still understand very little about the processes by which web content is being selected for an archive.

Part of the reason for this is that the web is an immensely large, decentralized and constantly changing information landscape. Despite efforts to archive the entire web (Kahle, 2007), the idea of a complete archive of the web remains both economically infeasible (Rosenthal, 2012), and theoretically impossible (Masanès, 2006). Features of the web’s Hypertext Transfer Protocol (HTTP), such as code-on-demand (Fielding, 2000), content caching (Fielding, Nottingham, & Reschke, 2014) and personalization (Barth, 2011), have transformed what was originally conceived of as a document oriented web into an information system that delivers information based on who you are, when you ask, and what software you use (Berners-Lee & Fischetti, 2000). The very notion of a singular artifact that can be archived, which has been under strain since the introduction of electronic records (Bearman, 1989), is now being pushed to its conceptual limit.

The web is also a site of constant breakdown (Bowker & Star, 2000) in the form of broken links, failed business models, unsustainable infrastructure, obsolescence and general neglect. Ceglowski (2011) has estimated that about a quarter of all links break every 7 years. Even within highly curated regions of the web, such as scholarly publishing (Sanderson, Phillips, & Sompel, 2011) and jurisprudence (Zittrain, Albert, & Lessig, 2014) rates of link rot can be as high as 50%. Web archiving projects work in varying measures to stem this tide of loss–to save what is deemed worth saving before it becomes 404 Not Found. In many ways web archiving can be seen as a form of repair or maintenance work (Graham & Thrift, 2007; Jackson, 2014) that is conducted by archivists in collaboration with each other, as well as with tools and infrastructures that support their efforts.

Deciding what to keep and what gets to be labeled archival has long been a topic of discussion in archival science. Over the past two centuries archival researchers have developed a body of literature around the concept of appraisal, which is the practice of identifying and retaining records of enduring value. The rapid increase in the amount of records being generated, which began in the mid-20th century, led to the inevitable realization that it is impractical to attempt to preserve the complete documentary record. Appraisal decisions must be made, which necessarily shape the archive over time, and by extension our knowledge of the past (Bearman, 1989; Cook, 2011). It is in the particular contingencies of the historical moment that the archive is created, sustained and used (Booms, 1987; Harris, 2002). The desire for a technology that enables a complete archival record of the web, where everything is preserved and remembered in an archival panopticon, is an idea that has deep philosophical roots, and many social and political ramifications (Brothman, 2001; Mayer-Schönberger, 2011).

Notwithstanding these theoretical and practical complexities, the construction of web archives presents new design opportunities for archivists to work in collaboration with each other, as well as with the systems, services and bespoke software solutions used for performing the work. It is essential for these designs to be informed by a better understanding of the processes by which web content is selected for an archive. What are the approaches and theoretical underpinnings for appraisal in web archiving as a sociotechnical practice? To lay the foundation for answering this question I will review and integrate the research literature in three areas: Archives and Memory, Sociotechnical Systems (STS), and Praxiography.

Clearly, a firm grounding in the literature of appraisal practices in archives is an important dimension of this research project. Understanding the various appraisal techniques that have been articulated and deployed will help in assessing how these techniques are being translated to the appraisal of web content (Maemura, Becker, & Milligan, 2016). Particular attention will be paid to emerging practices for the appraisal of electronic records and web content. Because the web is a significantly different medium than archives have traditionally dealt with, it is important to situate archival appraisal within the larger context of social or collective memory practices (Jacobsen, Punzalan, & Hedstrom, 2013). In addition, the emerging practice of participatory archiving will be examined to gain insight into how the web is altering the gatekeeping role of the archivist.

Appraisal practices for web content necessarily involve the use of computer technology as both the means by which the archival processing is performed and the source of the content that is being archived. Any analysis of appraisal practices must account for the ways in which the archivist and the technology of the web work together as part of a sociotechnical system. While the specific technical implementations of web archiving systems are of interest, the subject of archival appraisal requires that these systems be studied for their social and cultural effects. The interdisciplinary field of software studies provides a theoretical and methodological approach for analyzing computer technologies as assemblages of software, hardware, standards and social practices. Examining the literature of software studies as it relates to archival appraisal will also selectively include reading in the related areas of infrastructure, platform and algorithm studies.

Finally, since archival appraisal is at its core a practice, it is imperative to theoretically ground an analysis of appraisal using the literature of practice theory or praxiography. Praxiography is a broad interdisciplinary field of research that draws upon branches of anthropology, sociology, history of science and philosophy in order to understand practice as a sociomaterial phenomenon. Ethnographic attention to topics such as rules, strategies, outcomes, training, mentorship, artifacts, work and history also provides an approach to empirical study that I plan on using in my research.


2017-11-01 - Prospectus Draft

2017-12-01 - Prospectus Final Draft

2017-12-15 - Committee Review

2018-01-15 - Committee Approval Meeting

2018-09-01 - Proposal Final Draft

2018-10-01 - Proposal Defense

Reading

Archives and Memory

Anderson, K. D. (2011). Appraisal Learning Networks: How University Archivists Learn to Appraise Through Social Interaction. Los Angeles: University of California, Los Angeles.

Bond, L., Craps, S., and Vermeulen, P. (2017). Memory unbound: tracing the dynamics of memory studies. New York: Berghahn.

Bowker, G. C. (2005). Memory practices in the sciences. Cambridge: MIT Press.

Caswell, M. (2014). Archiving the Unspeakable: Silence, Memory, and the Photographic Record in Cambodia. Madison, WI: University of Wisconsin Press.

Daston, L., editor (2017). Science in the archives: pasts, presents, futures. Chicago: University of Chicago Press.

Gilliland, A. J., McKemmish, S., and Lau, A., editors (2016). Research in the Archival Multiverse. Melbourne: Monash University Press.

Halbwachs, M. (1992). On collective memory. Chicago: University of Chicago Press.

Hoskins, A., editor (2018). Digital memory studies: Media pasts in transition. London: Routledge.

Kosnik, A. D. (2016). Rogue Archives: Digital Cultural Memory and Media Fandom. Cambridge: MIT Press.

Van Dijck, J. (2007). Mediated memories in the digital age. Palo Alto: Stanford University Press.

Sociotechnical Theory

Berry, D. (2011). The philosophy of software: Code and mediation in the digital age. New York: Palgrave Macmillan.

Bowker, G. C. and Star, S. L. (2000). Sorting things out: Classification and its consequences. Cambridge: MIT Press.

Brunton, F. (2013). Spam: A shadow history of the Internet. Cambridge: MIT Press.

Chun, W. H. K. (2016). Updating to Remain the Same: Habitual New Media. Cambridge: MIT Press.

Cubitt, S. (2016). Finite Media: Environmental Implications of Digital Technologies. Durham: Duke University Press.

Hu, T. (2015). A Prehistory of the Cloud. Cambridge: MIT Press.

Dourish, P. (2017). The Stuff of Bits: An Essay on the Materialities of Information. Cambridge: MIT Press.

Edwards, P. N. (2010). A vast machine: Computer models, climate data, and the politics of global warming. Cambridge: MIT Press.

Emerson, L. (2014). Reading writing interfaces: From the digital to the bookbound. Minneapolis: University of Minnesota Press.

Kittler, F. A. (1999). Gramophone, film, typewriter. Palo Alto: Stanford University Press.

Galloway, A. R. (2004). Protocol: How control exists after decentralization. Cambridge: MIT Press.

Kelty, C. M. (2008). Two bits: The cultural significance of free software. Durham: Duke University Press.

Kitchin, R. and Dodge, M. (2011). Code/Space: Software and Everyday Life. Cambridge: MIT Press.

Rossiter, N. (2016). Software, Infrastructure, Labor: A Media Theory of Logistical Nightmares. Oxford: Routledge.

Russell, A. L. (2014). Open standards and the digital age. Cambridge: Cambridge University Press.

Practice Theory

Bourdieu, P. (1977). Outline of a Theory of Practice. Cambridge: Cambridge University Press.

Bräuchler, B. and Postill, J. (2010). Theorising media and practice. Bristol: Berghahn Books.

Foucault, M. (2012). Discipline & punish: The birth of the prison. New York: Vintage.

Latour, B. (2005). Reassembling the social: An introduction to actor-network-theory. Oxford: Oxford University Press.

Law, J. (2002). Aircraft stories: Decentering the object in technoscience. Durham: Duke University Press.

Schatzki, T. R., Cetina, K. K., and von Savigny, E. (2001). The practice turn in contemporary theory. Oxford: Routledge.

Wenger, E. (1998). Communities of Practice: Learning, meaning, and identity. Cambridge: Cambridge University Press.


Barth, A. (2011). HTTP state management mechanism (No. 6265). Internet Engineering Task Force.

Bearman, D. (1989). Archival methods. Archives and Museum Informatics, 3(1).

Benkler, Y. (2006). The wealth of networks: How social production transforms markets and freedom. Yale University Press.

Berners-Lee, T., & Fischetti, M. (2000). Weaving the web: The original design and ultimate destiny of the world wide web by its inventor. San Francisco: Harper.

Booms, H. (1987). Society and the formation of a documentary heritage: Issues in the appraisal of archival sources. Archivaria, 24(3), 69–107.

Bowker, G. C., & Star, S. L. (2000). Sorting things out: Classification and its consequences. MIT Press.

Brothman, B. (2001). The past that archives keep: Memory, history, and the preservation of archival records. Archivaria, 51, 48–80.

Brügger, N. (2017). The web as history. (N. Brügger & R. Schroeder, Eds.). UCL Press.

Ceglowski, M. (2011, May). Remembrance of links past.

Cook, T. (2011). We are what we keep; we keep what we are: Archival appraisal past, present and future. Journal of the Society of Archivists, 32(2), 173–189.

Fielding, R. (2000). Representational state transfer (PhD thesis). University of California at Irvine.

Fielding, R., Nottingham, M., & Reschke, J. (2014). Hypertext transfer protocol (HTTP/1.1): Caching (No. 7234). Internet Engineering Task Force.

Graham, S., & Thrift, N. (2007). Out of order: Understanding repair and maintenance. Theory, Culture & Society, 24(3), 1–25.

Harris, V. (2002). The archival sliver: Power, memory, and archives in South Africa. Archival Science, 2(1-2), 63–86.

Jackson, S. J. (2014). Rethinking repair. In P. Boczkowski & K. Foot (Eds.), Media technologies: Essays on communication, materiality and society. MIT Press.

Jacobsen, T., Punzalan, R. L., & Hedstrom, M. L. (2013). Invoking “collective memory”: Mapping the emergence of a concept in archival science. Archival Science, 13(2-3), 217–251.

Kahle, B. (2007). Universal access to all knowledge. The American Archivist, 70(1), 23–31.

Maemura, E., Becker, C., & Milligan, I. (2016). Understanding computational web archives research methods using research objects. In IEEE big data: Computational archival science. IEEE.

Masanès, J. (2006). Web archiving methods and approaches: A comparative study. Library Trends, 54(1), 72–90.

Mayer-Schönberger, V. (2011). Delete: The virtue of forgetting in the digital age. Princeton University Press.

Rosenthal, D. (2012, May). Let’s just keep everything forever in the cloud.

Sanderson, R., Phillips, M., & Sompel, H. V. de. (2011). Analyzing the persistence of referenced web resources with Memento. Open Repositories 2011 Conference.

Zittrain, J., Albert, K., & Lessig, L. (2014). Perma: Scoping and addressing the problem of link and reference rot in legal citations. Legal Information Management, 14(02), 88–99.

