Since writing about the Ferguson Twitter archive a few months ago, three people have emailed me out of the blue asking for access to the data. One was a principal at a small, scaryish defense contracting company, and the other two were from a prestigious university. I’ve also had a handful of people interested here at the University of Maryland, where I work.
I ignored the defense contractor. Maybe that was mean, but I don’t want to be part of that. I’m sure they can go buy the data if they really need it. My response to the external academic researchers wasn’t much more helpful, since I mostly pointed them to Twitter’s Terms of Service, which say:
If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs.
You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a “save as” button) of up to 50,000 public Tweets and/or User Objects per user of your Service, per day.
Any Content provided to third parties via non-automated file download remains subject to this Policy.
It’s my understanding that I can share the data with others at the University of Maryland, but I am not able to give it to the external parties. What I can do is give them the Tweet IDs. But there are 13,480,000 of them.
So that’s what I’m doing today: publishing the tweet ids. You can download them from the Internet Archive:
I’m making it available using the CC-BY license.

Hydration
On the one hand, it seems unfair that this portion of the public record is unshareable in its most information rich form. The barrier to entry to using the data seems set artificially high in order to protect Twitter’s business interests. These messages were posted to the public Web, where I was able to collect them. Why are we prevented from re-publishing them since they are already on the Web? Why can’t we have lots of copies to keep stuff safe? More on this in a moment.
Twitter limits users to 180 requests every 15 minutes (a user is effectively a unique access token). Each request can hydrate up to 100 Tweet IDs using the statuses/lookup REST API call:

180 requests * 100 tweets = 18,000 tweets / 15 min = 72,000 tweets / hour
So to hydrate all of the 13,480,000 tweets will take about 7.8 days. This is a bit of a pain, but realistically it’s not so bad. I’m sure people doing research have plenty of work to do before running any kind of analysis on the full data set. And they can use a portion of it for testing as it is downloading. But how do you download it?
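The arithmetic above is easy to check with a few lines of Python (the figures are the ones from this post):

```python
# Numbers from the post: 13,480,000 tweet ids, hydrated via statuses/lookup.
RATE_LIMIT = 180        # requests per 15-minute window, per access token
IDS_PER_REQUEST = 100   # tweet ids per statuses/lookup call
TOTAL_IDS = 13_480_000

tweets_per_window = RATE_LIMIT * IDS_PER_REQUEST   # 18,000 per 15 minutes
tweets_per_hour = tweets_per_window * 4            # 72,000 per hour
days = TOTAL_IDS / tweets_per_hour / 24

print(round(days, 1))  # 7.8
```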
Gnip, who were recently acquired by Twitter, offer a rehydration API. Their API is limited to tweets from the last 30 days, and similar to Twitter’s API you can fetch up to 100 tweets at a time. Unlike the Twitter API you can issue a request every second. So this means you could download the results in about 1.5 days. But these Ferguson tweets are more than 30 days old. And a Gnip account costs some indeterminate amount of money, starting at $500…
I suspect there are other hydration services out there. But I adapted twarc, the tool I used to collect the data, which already handled rate limiting, to also do hydration. Once you have the tweet IDs in a file you just need to install twarc and run it. Here’s how you would do that on an Ubuntu instance:

sudo apt-get install python-pip
sudo pip install twarc
twarc.py --hydrate ids.txt > tweets.json
After a week or so, you’ll have the full JSON for each of the tweets.

Archive Fever
Well, not really. You will have most of them, but you won’t have the ones that have been deleted. If a user decided to remove a Tweet they made, or to remove their account entirely, you won’t be able to get their Tweets back from Twitter using the API. I think it’s interesting to consider Twitter’s Terms of Service as what Katie Shilton would call a value lever.
The metadata-rich JSON data (which often includes geolocation and other behavioral data) wasn’t exactly posted to the Web in the typical way. It was made available through a Web API designed to be used directly by automated agents, not people. Sure, a tweet appears on the Web, but it’s in with the other half a trillion Tweets out on the Web, all the way back to the first one. Requiring researchers to go back to the Twitter API to get this data, and not allowing it to circulate freely in bulk, means that users have an opportunity to remove their content. Sure, it has already been collected by other people, and it’s pretty unlikely that the NSA are deleting their tweets. But in a way Twitter is taking an ethical position that lets its publishers remove their data. To exercise their right to be forgotten. Removing a teensy bit of informational toxic waste.
As any archivist will tell you, forgetting is an essential and unavoidable part of the archive. Forgetting is the why of an archive. Negotiating what is to be remembered and by whom is the principal concern of the archive. Ironically it seems it’s the people who deserve it the least, those in positions of power, who are often most able to exercise their right to be forgotten. Maybe putting a value lever back in the hands of the people isn’t such a bad thing. If I were Twitter I’d highlight this in the API documentation. I think we are still learning how the contours of the Web fit into the archive. I know I am.
If you are interested in learning more about value levers you can download a pre-print of Shilton’s Value Levers: Building Ethics into Design.
The Centre for Research in Occupational Safety and Health asked me to give a lunch'n'learn presentation on ResearchGate today, which was a challenge I was happy to take on... but I took the liberty of stretching the scope of the discussion to focus on social networking in the context of research and academics in general, recognizing four high-level goals:
- Promotion (increasing citations, finding work positions)
- Finding potential collaborators
- Getting advice from experts in your field
- Accessing others' work
I'm a librarian, so naturally my take veered quickly into the waters of copyright concerns and the burden (to the point of indemnification) that ResearchGate, Academia.edu, Mendeley, and other such services put on their users to ensure that they are in compliance with copyright and the researchers' agreements with publishers... all while heartily encouraging their users to upload their work with a single click. I also dove into the darker waters of r/scholar, LibGen, and SciHub, pointing out the direct consequences that our university has suffered due to the abuse of institutional accounts at the library proxy.
Happily, the audience opened up the subject of publishing in open access journals--not just from a "covering our own butts" perspective, but also from the position of the ethical responsibility to share knowledge as broadly as possible. We briefly discussed the open access mandates that some granting agencies have put in place, particularly in the States, as well as similar Canadian initiatives that have occurred or are still emerging with respect to public funds (SSHRC and the Tri-Council). And I was overjoyed to hear a suggestion that, perhaps, research funded by the Laurentian University Research Fund should be required to publish in an open access venue.
I'm hoping to take this message back to our library and, building on Kurt de Belder's vision of the library as a Partner in Knowledge, help drive our library's mission towards assisting researchers not only in accessing knowledge, but also in most effectively sharing and promoting the knowledge they create.
That leaves lots of work to do, based on one little presentation.
Resources may be divided into groups called classes. The members of a class are known as instances of the class. Classes are themselves resources. They are often identified by IRIs and may be described using RDF properties. The rdf:type property may be used to state that a resource is an instance of a class.

This seems simple, but it is in fact one of the primary areas of confusion about RDF.
If you are not a programmer, you probably think of classes in terms of taxonomies -- genus, species, sub-species, etc. If you are a librarian you might think of classes in terms of classification, like Library of Congress or the Dewey Decimal System. In these, the class defines certain characteristics of the members of the class. Thus, with two classes, Pets and Veterinary science, you can have:

- Pets
  - dogs
  - cats
- Veterinary science
  - dogs
  - cats

In each of those, dogs and cats have different meanings because the class provides a context: either as pets, or information about them as treated in veterinary science.
For those familiar with XML, it has similar functionality because it makes use of nesting of data elements. In XML you can create something like this:

<drink>
  <coffee>
    <price>2.00</price>
  </coffee>
  <tea>
    <price>1.50</price>
  </tea>
</drink>

and it is clear which price goes with which type of drink, and that the bits directly under the <drink> level are all drinks, because that's what <drink> tells you.
Now you have to forget all of this in order to understand RDF, because RDF classes do not work like this at all. In RDF, the "classness" is not expressed hierarchically, with a class defining the elements that are subordinate to it. Instead it works in the opposite way: the descriptive elements in RDF (called "properties") are the ones that define the class of the thing being described. Properties carry the class information through a characteristic called the "domain" of the property. The domain of the property is a class, and when you use that property to describe something, you are saying that the "something" is an instance of that class. It's like building the taxonomy from the bottom up.
This only makes sense through examples. Here are a few:
1. "has child" is of domain "Parent".
If I say "X - has child - 'Fred'" then I have also said that X is a Parent because every thing that has a child is a Parent.
2. "has Worktitle" is of domain "Work"
If I say "Y - has Worktitle - 'Der Zauberberg'" then I have also said that Y is a Work because every thing that has a Worktitle is a Work.
In essence, X or Y is an identifier for something that is of unknown characteristics until it is described. What you say about X or Y is what defines it, and the classes put it in context. This may seem odd, but if you think of it in terms of descriptive metadata, your metadata describes the "thing in hand"; the "thing in hand" doesn't describe your metadata.
Like in real life, any "thing" can have more than one context and therefore more than one class. X, the Parent, can also be an Employee (in the context of her work), a Driver (to the Department of Motor Vehicles), a Patient (to her doctor's office). The same identified entity can be an instance of any number of classes.
"has child" has domain "Parent"
"has licence" has domain "Driver"
"has doctor" has domain "Patient"
X - has child - "Fred" = X is a Parent
X - has license - "234566" = X is a Driver
X - has doctor - URI:765876 = X is a Patient

Classes are defined in your RDF vocabulary, as are the domains of properties. The above statements require an application to look at the definition of the property in the vocabulary to determine whether it has a domain, and then to treat the subject, X, as an instance of the class described as the domain of the property. There is another way to provide the class as context in RDF: you can declare it explicitly in your instance data, rather than, or in addition to, having the class characteristics inherent in your descriptive properties when you create your metadata. The term used for this, based on the RDF standard, is "type," in that you are assigning a type to the "thing." For example, you could say:
X - is type - Parent
X - has child - "Fred"

This can be the same class as you would discern from the properties, or it could be an additional class. It is often used to simplify the programming needs of those working in RDF because it means the program does not have to query the vocabulary to determine the class of X. You see this, for example, in BIBFRAME data. The second line in this example gives two classes for this entity:
a bf:Instance, bf:Monograph .
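The vocabulary lookup that an application performs can be sketched in plain Python. This is a toy illustration, not real RDF tooling: the dictionary stands in for the vocabulary's rdfs:domain declarations, and the property names are the ones used in the examples above.

```python
# Toy vocabulary: property -> class declared as its rdfs:domain.
DOMAINS = {
    "has child": "Parent",
    "has license": "Driver",
    "has doctor": "Patient",
}

def inferred_classes(triples):
    """Infer each subject's classes from the domains of the properties
    used to describe it, the way a domain-aware application would."""
    classes = {}
    for subject, prop, _obj in triples:
        domain = DOMAINS.get(prop)
        if domain is not None:
            classes.setdefault(subject, set()).add(domain)
    return classes

triples = [
    ("X", "has child", "Fred"),
    ("X", "has license", "234566"),
]
print(inferred_classes(triples))
```

Describing X with those two properties yields two classes, Parent and Driver, without any explicit "type" statement in the instance data.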
One thing that classes do not do, however, is to prevent your "thing" from being assigned the "wrong class." You can, however, define your vocabulary to make "wrong classes" apparent. To do this you define certain classes as disjoint, for example a class of "dead" would logically be disjoint from a class of "alive." Disjoint means that the same thing cannot be of both classes, either through the direct declaration of "type" or through the assignment of properties. Let's do an example:
"residence" has domain "Alive"
"cemetery plot location" has domain "Dead"
"Alive" is disjoint "Dead" (you can't be both alive and dead)
X - is type - "Alive" (X is of class "Alive")
X - cemetery plot location - URI:9494747 (X is of class "Dead")

Nothing stops you from creating this contradiction, but some applications that try to use the data will be stumped because you've created something that, in RDF-speak, is logically inconsistent. What happens next is determined by how your application has been programmed to deal with such things. In some cases, the inconsistency will mean that you cannot fulfill the task the application was attempting: if you reach a decision point where "if Alive do A, if Dead do B," your application may be unable to go on.
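A consistency check of this kind is simple to sketch. Again this is a toy, not a real reasoner; the class names are the ones from the example above:

```python
# Declared disjointness between classes (frozenset makes the pair unordered).
DISJOINT = {frozenset({"Alive", "Dead"})}

def is_consistent(asserted_classes):
    """Return False if any two of a thing's asserted classes
    have been declared disjoint in the vocabulary."""
    classes = list(asserted_classes)
    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            if frozenset({a, b}) in DISJOINT:
                return False
    return True

# X was typed "Alive", but "cemetery plot location" has domain "Dead":
print(is_consistent({"Alive", "Dead"}))    # False
print(is_consistent({"Alive", "Parent"}))  # True
```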
All of this is to be kept in mind for the next blog post, which talks about the effect of class definitions on bibliographic data in RDF.
LITA has multiple learning opportunities available over the next several months. Hot topics to keep your brain warm over the winter.
Re-Drawing the Map Series
Presenters: Mita Williams and Cecily Walker
Offered: November 18, 2014, December 9, 2014, and January 6, 2015
All: 1:00 pm – 2:00 pm Central Time
Top Technologies Every Librarian Needs to Know
Presenters: Brigitte Bell, Steven Bowers, Terry Cottrell, Elliot Polak and Ken Varnum,
Offered: December 2, 2014
1:00 pm – 2:00 pm Central Time
Getting Started with GIS
Instructor: Eva Dodsworth, University of Waterloo
Offered: January 12 – February 9, 2015
For details and registration, check out the full descriptions below and follow the links to their web pages.
Join LITA Education and instructors Mita Williams and Cecily Walker in “Re-drawing the Map”–a webinar series! Pick and choose your favorite topic. Can’t make all the dates but still want the latest information? Registered participants will have access to the recorded webinars.
Here are the individual sessions.
Web Mapping: moving from maps on the web to maps of the web
Tuesday Nov. 18, 2014
1:00 pm – 2:00 pm Central Time
Instructor: Mita Williams
Get an introduction to web mapping tools and learn about the stories they can help you to tell!
OpenStreetMaps: Trust the map that anyone can change
Tuesday December 9, 2014,
1:00 pm – 2:00 pm Central Time
Instructor: Mita Williams
Ever had a map send you the wrong way and wished you could change it? Learn how to add your local knowledge to the “Wikipedia of Maps.”
Coding maps with Leaflet.js
Tuesday January 6, 2015,
1:00 pm – 2:00 pm Central Time
Instructor: Cecily Walker
Register Online page arranged by session date (login required)
We’re all awash in technological innovation. It can be a challenge to know what new tools are likely to have staying power — and what that might mean for libraries. The recently published Top Technologies Every Librarian Needs to Know highlights a selected set of technologies that are just starting to emerge and describes how libraries might adapt them in the next few years.
In this webinar, join the authors of three chapters as they talk about their technologies and what they mean for libraries.
December 2, 2014
1:00 pm – 2:00 pm Central Time
Hands-Free Augmented Reality: Impacting the Library Future
Presenters: Brigitte Bell & Terry Cottrell
The Future of Cloud-Based Library Systems
Presenters: Elliot Polak & Steven Bowers
Library Discovery: From Ponds to Streams
Presenter: Ken Varnum
Register Online page arranged by session date (login required)
Getting Started with GIS is a three week course modeled on Eva Dodsworth’s LITA Guide of the same name. The course provides an introduction to GIS technology and GIS in libraries. Through hands on exercises, discussions and recorded lectures, students will acquire skills in using GIS software programs, social mapping tools, map making, digitizing, and researching for geospatial data. This three week course provides introductory GIS skills that will prove beneficial in any library or information resource position.
No previous mapping or GIS experience is necessary. Some of the mapping applications covered include:
- Introduction to Cartography and Map Making
- Online Maps
- Google Earth
- KML and GIS files
- ArcGIS Online and Story Mapping
- Brief introduction to desktop GIS software
Instructor: Eva Dodsworth, University of Waterloo
Offered: January 12 – February 9, 2015
Register Online page arranged by session date (login required)
Questions or Comments?
For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty, firstname.lastname@example.org.
Every two years the International Budget Partnership (IBP) runs a survey, called the Open Budget Survey, to evaluate formal oversight of budgets, how transparent governments are about their budgets and if there are opportunities to participate in the budget process. To easily measure and compare transparency among the countries surveyed, IBP created the Open Budget Index where the participating countries are scored and ranked using about two thirds of the questions from the Survey. The Open Budget Index has already established itself as an authoritative measurement of budget transparency, and is for example used as an eligibility criteria for the Open Government Partnership.
However, countries do not release budget information every two years; they should do so regularly, on multiple occasions in a given year. Since there is a two-year gap between the publication of consecutive Open Budget Survey results, citizens, civil society organisations (CSOs), media and others who want to know how governments are performing between Survey releases have to undertake extensive research themselves. It also means that if they want to pressure governments into releasing budget information and increasing budget transparency before the next Open Budget Index, they can only point to 'official' data that may be up to two years old.
To address this, IBP, together with Open Knowledge, has developed the Open Budget Survey Tracker (the OBS Tracker, http://obstracker.org): an online, ongoing budget data monitoring tool, currently a pilot covering 30 countries. The data are collected by researchers selected from IBP's extensive network of partner organisations, who regularly monitor budget information releases and provide monthly reports. The information included in the OBS Tracker is not as comprehensive as the Survey, because the latter also looks at the content and comprehensiveness of budget information, not only the regularity of its publication. The OBS Tracker does, however, provide a good proxy for rising or falling levels of budget transparency, measured by the release to (or withholding from) the public of key budget documents. This is valuable information for concerned citizens, CSOs and media.
With the Open Budget Survey Tracker, IBP has made it easier for citizens, civil society, media and others to monitor, in near real time (monthly), whether their central governments release information on how they plan to and how they spend the public’s money. The OBS Tracker allows them to highlight changes and facilitates civil society efforts to push for change when a key document has not been released at all, or not in a timely manner.
Niger and the Kyrgyz Republic have improved the release of essential budget information since the latest Open Budget Index results, something that can be seen in the OBS Tracker without having to wait for the next Open Budget Survey release. This puts pressure on other countries to follow suit.
The budget cycle is a complex process which involves creating and publishing specific documents at specific points in time. IBP covers the whole cycle, by monitoring in total eight documents which include everything from the proposed and approved budgets, to a citizen-friendly budget representation, to end-of-the-year financial reporting and the auditing from a country’s Supreme Audit Institution.
In each of the countries included in the OBS Tracker, IBP monitors all these eight documents showing how governments are doing in generating these documents and releasing them on time. Each document for each country is assigned a traffic light color code: Red means the document was not produced at all or published too late. Yellow means the document was only produced for internal use and not released to the general public. Green means the document is publicly available and was made available on time. The color codes help users quickly skim the status of the world as well as the status of a country they’re interested in.
To make monitoring even easier, the OBS Tracker also provides more detailed information about each document for each country, a link to the country’s budget library and more importantly the historical evolution of the “availability status” for each country. The historical visualisation shows a snapshot of the key documents’ status for that country for each month. This helps users see if the country has made any improvements on a month-by-month basis, but also if it has made any improvements since the last Open Budget Survey.
Is your country being tracked by the OBS Tracker? How is it doing? If they are not releasing essential budget documents or not even producing them, start raising questions. If your country is improving or has a lot of green dots, be sure to congratulate the government; show them that their work is appreciated, and provide recommendations on what else can be done to promote openness. Whether you are a government official, a CSO member, a journalist or just a concerned citizen, OBS Tracker is a tool that can help you help your government.
The new date for the November WMS Web services install is this Sunday, November 23rd. This install will include changes to two of our WMS APIs.
Imagine you’re a legal scholar examining the U.S. Supreme Court decisions of the late 1990s to mid-2000s, and you want to understand what resources were consulted to support official opinions. A study in the Yale Journal of Law and Technology indicates you would find that only half of the nearly 555 URL links cited in Supreme Court opinions since 1996 still work. This problem has been widely discussed in the media, and the Supreme Court has indicated it will print all websites cited and place the printouts in physical case files at the Supreme Court, available only in Washington, DC.
On October 24, 2014 Georgetown University Law Library hosted a one-day symposium on this problem which has been studied across legal scholarship and other academic works. The meeting, titled 404/File Not Found: Link Rot, Legal Citation and Projects to Preserve Precedent, presented a broad overview of why websites disappear, why this is particularly problematic in the legal citation context and the proposal of actual solutions and strategies to addressing the problem.
The keynote address was given by Jonathan Zittrain, George Bemis Professor of Law at Harvard Law School. A video of his presentation is now available from the meeting website. In it he details a service created by Harvard Law School Libraries and other law libraries called Perma.cc that allows those with an account to submit links that can be archived at a participating library. The use case for Perma.cc is to support links in new forms of academic and legal writing. Today, over 26,000 links have been archived.
Herbert Van de Sompel of the Los Alamos National Laboratory also demonstrated the Memento browser plug-in that allows users who’ve downloaded the plug-in to see archived versions of a website (if that website has been archived) while they are using the live web. The Internet Archive, The British Library, the UK National Archives and other archives around the world all provide archived versions of websites through Memento. The Memento protocol has been widely implemented, integrated in MediaWiki sites and supports “time travel” to old websites that cover all topics.
Both solutions, Perma.cc and Memento, depend on action by, and coordination of, organizations and individuals affected by the linkrot problem. At the end of his presentation Van de Sompel reiterated that technical solutions exist to deal with linkrot; what is still needed is broad participation in the selection, collection and archiving of web resources, and a sustainable and interoperable infrastructure of tools and services, like Memento and Perma.cc, that connect the archived versions of websites with the scholars, researchers and users who want to access them today and into the future.
Michael Nelson of Old Dominion University, a partner in developing Memento, posted notes on the symposium presentations. For even more background and documentation on the problem of linkrot, the meeting organizers collected a list of readings. The symposium demonstrated the ability of a community, in this case, law librarians, to come together to address a problem in their domain, the results of which benefit the larger digital stewardship community and serve as models for coordinated action.
It’s been a little over a month since we launched GIF IT UP, an international competition to find the best GIFs reusing public domain and openly licensed digital video, images, text, and other material available via DPLA and DigitalNZ. Since then we’ve received dozens of wonderful submissions from all over the world, all viewable in the competition gallery.
The winners of GIF IT UP will have their work featured and celebrated online at the Public Domain Review and Smithsonian.com. Haven’t submitted an entry yet? Well, what are you waiting for? Submit a GIF!

About GIF IT UP
How it works. The GIF IT UP competition has six categories:
- Planes, trains, and other transport
- Nature and the environment
- Your hometown, state, or province
- WWI, 1914-1918
- GIF using a stereoscopic image
- Open category (any reusable material from DigitalNZ or DPLA)
A winner will be selected in each of these categories and, if necessary, a winner will be awarded in two fields: use of an animated still public domain image, and use of video material.
To view the competition’s official homepage, visit http://dp.la/info/gif-it-up/.
Judging. GIF IT UP will be co-judged by Adam Green, Editor of the Public Domain Review and by Brian Wolly, Digital Editor of Smithsonian.com. Entries will be judged on coherence with category theme (except for the open category), thoroughness of entry (correct link to source material and contextual information), creativity, and originality.
Gallery. All entries that meet the criteria outlined below in the Guidelines and Rules will be posted to the GIF IT UP Tumblr Gallery. The gallery entries with the most Tumblr “notes” will receive the people’s choice award and will appear online at the Public Domain Review and Smithsonian.com alongside the category winners.
Deadline. The competition deadline is December 1, 2014 at 5:00 PM EST / December 2, 2014 at 10:00 AM GMT+13.
GIFtastic Resources. You can find more information about GIF IT UP–including select DPLA and DigitalNZ collections available for re-use and a list of handy GIF-making tips and tools–over on the GIF IT UP homepage.
[This is the second in a short series on our 2014 OCLC Research Library Partnership meeting, Libraries and Research: Supporting Change/Changing Support. You can read the first post and also refer to the event webpage, which contains links to slides, videos, photos, and Storify summaries.]

Anja Smit (University Librarian at Utrecht University) [link to video] chaired this session, which focused on the ways in which libraries are or could be supporting eScholarship. In opening she shared a story that reflects how the library is really a creature of the larger institution. At Utrecht the library engaged in scenario planning and identified their future as being all about open access and online access to sources. When they brought faculty in to comment on their plans, they were told that they were “going too fast” and that they needed to slow down. Sometimes researchers request services and sometimes the library just acts to fill a void. But innovation is not only starting but also stopping. The Utrecht experience with VREs is an example of a well-reasoned library “push” of services: they thought they would have 200 research groups actively using the VRE, but only 25 took it up. Annotated books, on the other hand, are an example of “pull,” something requested by researchers. Dataverse (a network for storing data) started as a service in the library that was needed by faculty but ultimately moved to DANS due to scale and infrastructure issues. The decision to discontinue local search was a “pull” decision, based on evidence that researchers were not using it. Ultimately, librarians need to be “embedded” in researcher workflows. If we don’t know what they are doing, we won’t be able to help them.
Ricky Erway (Senior Program Officer, OCLC Research) [link to video] gave her own story of push and pull: OCLC Research was asked by the Research Information Management Interest Group to “do something about digital humanities”. The larger question was, where can libraries make a unique contribution? Ricky and colleague Jennifer Schaffner immersed themselves in the researchers’ perspective regarding processes, issues, and needs, and then tried to see where the library might fill gaps. Their paper [Does Every Research Library Need a Digital Humanities Center?] was written for library directors not already engaged with digital humanities. The answer to the question posed in the title of the paper is, “It depends.” The report suggests that a constellation of engagement possibilities should be considered based on local needs. Start with what you are already offering and ensure that researchers are aware of those services. Scholars’ enthusiasm for metadata was a surprising finding: humanities researchers use and value metadata sources such as VIAF. (Colleague Karen Smith-Yoshimura has previously blogged about contributions to VIAF from the Syriac scholarly community and contributions from the Perseus Catalog.) A challenge for libraries is figuring out when to support, when to collaborate, and when to lead. There is no one size fits all in digital humanities and libraries: not only is it the case that “changes in research are not evenly distributed,” but also every library has its own set of strengths and services which may be good matches for local needs.
Adam Farquhar (Head of Digital Scholarship at the British Library) [link to video] talked about what happens when large digital collections are brought together with scholars. Adam’s role, in brief, is to get the British Library’s digital collections into the hands of scholars so they can create knowledge. Adam and his team have been trying to find ways to take advantage of the digital qualities of digital collections; up to now, most libraries have treated digital collections the same as print collections apart from delivery. This is a mistake, because there are unique aspects to large-scale digital collections and we should be leveraging them. The British Library has a cross-disciplinary team, which is much needed for tackling the challenges at hand. Rather than highlighting the broad range of projects being undertaken at the BL, Adam chose instead to focus on a few small, illustrative examples. In the British Library Labs, developers are invited to sit alongside scholars and co-evolve projects and solutions. The BL Labs Competition is a challenge to encourage people to put forward interesting projects and needs. Winners of the 2014 competition included one from Australia (showing that there is global interest in the BL’s collections). One winner is the Victorian Meme Machine, which will pair Victorian jokes with likely images to illustrate what makes Victorian jokes funny. Another project extracted images from digitized books and put a million images on Flickr (where people go to look for images, not for books). These images have received 160 million views in the last year. These are impressive metrics, especially when you consider that previously no one alive had looked at any of those images. Now lots of people have, and they have been used in a variety of ways, from an art piece at Burning Man, to serious research, to commercial use. Adam’s advice? Relax and take a chance on releasing information into the public domain.
Antal van den Bosch (Professor at the Radboud University Nijmegen) [link to video] spoke from his perspective as a researcher. Scientists have long had the ability to shift from first gear (working at the chalkboard) to 5th or 6th gear (doing work on the Large Hadron Collider). Humanists have recently discovered that there is a 3rd or 4th gear and want to go there. In the humanities there is fast and slow scholarship. In his own field, linguistics and computer science, there is no data like more data. Large, rich corpora are highly valued (and more common over time). One example is Twitter – in the Netherlands, seven million Tweets a day are generated and collected by his institute. Against this corpus, researchers can study the use of language at different times of day and use location metadata to identify use of regional dialect. Another example is the HiTiME (Historical Timeline Mining and Extraction) project, which uses linked data in historical sources to enable the study of social movements in Europe. Within texts, markup of persons, locations, and events allows visualizations including timelines and social networks. Analysis of newspaper archives revealed both labor strikes that happened and those that didn’t. However, library technology was not up to the task of keeping up with the data, so findings were not repeatable, underscoring the need for version control and adequate technological underpinnings. Many times in these projects the software goes along with the data, so storing both data and code is important. Most researchers are not sure where to put their research data and may be using cloud storage like GitHub. Advice and guidance are all well and good, but what researchers really need is storage and easy-to-use services (“an upload button, basically”). In the Netherlands and in Europe, there are long-tail storage solutions for e-research data.
Many organizations and institutions say “here, let me help you with that.” Libraries seem well situated to help with metadata, but researchers want full-text search against very big data sets like Twitter or Google Books. Libraries should be asking themselves if they can host something that big. If libraries can’t offer collections like these, at scale, researchers may not be interested. On the other hand, in the humanities, which has a “long tail of small topics,” there are many single researchers doing small research projects, and here the library may be well positioned to help.
If you are interested in more details you can watch the discussion session that followed:
I’ll be back later to summarize the last two segments of the meeting.
*A few years ago, Jim and I attended one of the ARL 2030 Scenarios workshops. Since that time, I’ve been quite interested in the use of scenario planning as an approach for organizations like libraries that hope to build for resilience.
About Merrilee ProffittMail | Web | Twitter | Facebook | LinkedIn | More Posts (274)
Blessed with the gift-curse of seeing ~24h into the future, I spend it on bad TV.
Monday Nov 17th 2014 (IRC):
- 10:06 danbri: I’ve figured out what the world needs – a new modern WestWorld sequel.
- 10:06 libby: why does the world need that?
- 10:06 danbri: just that it was a great film and it has robots and cowboys and snakes and fembots and a guy who can take his face off and who is a robot and a cowboy. it double ticks all the boxes.
Tuesday Nov 18th 2014 (BBC):
JJ Abrams’ sci-fi drama Westworld has been officially commissioned for a full series by HBO. The Star Wars director is executive producer, whilst Interstellar co-writer Jonathan Nolan will pen the scripts. Sir Anthony Hopkins, Thandie Newton, Evan Rachel Wood, Ed Harris and James Marsden will all star. The show is a remake of a 1973 sci-fi western about a futuristic themed amusement park in which the robots malfunction.
The studio is calling the series, which will debut in 2015, “a dark odyssey about the dawn of artificial consciousness and the future of sin.”
We are delighted to announce that Tufts University has become the latest formal Hydra Partner. Tufts has two Hydra-based projects, the Tufts Digital Library redesign and a New Nation Votes election portal. They are currently working on a Hydra-based administrative interface to allow staff self-deposit in the Tufts Fedora content repository; and the Tufts Digital Image Library, based on Northwestern’s DIL implementation.
In their Letter of Intent, Tufts say that they are committed to the Hydra community in helping solve digital repository and workflow challenges by supporting development and contributing code, documentation and expertise.
From The Fedora Steering Group
Fedora Development - In the past quarter, the development team released two Beta releases of Fedora 4; detailed release notes are here:
On November 19th, the Consumer Financial Protection Bureau and the Institute for Museum and Library Services will offer a free webinar on financial literacy. This session has limited space so please register quickly.
Tune in to the Consumer Financial Protection Bureau’s monthly webinar series intended to instruct library staff on how to discuss financial education topics with their patrons. As part of the series, the Bureau invites experts from other government agencies and nonprofit organizations to speak about key topics of interest.
Tax time is a unique opportunity for many consumers to make financial decisions about how to use their income tax refunds to build savings. In the next free webinar, “Ways to save during tax time: EITC,” finance leaders will discuss what consumers need to do to prepare before filing their income tax returns, the importance of taking advantage of the tax time moment to save, and the ways people can save automatically when filing their returns.
If you would like to be notified of future webinars, or ask about in-person trainings for large groups of librarians, email email@example.com; subject: Library financial education training. All webinars will be recorded and archived for later viewing.
November 19, 2014
2:30–3:30 p.m. EDT
Join the webinar at 2:30 p.m. You do not need to register for this webinar.
If that link does not work, you can also access the webinar by going to www.mymeetings.com/nc/join and entering the following information:
- Conference number: PW9469248
- Audience passcode: LIBRARY
If you are participating only by phone, please dial the following number:
- Phone: 1-888-947-8930
- Participant passcode: LIBRARY
Libraries and Research: Supporting Change/Changing Support was a meeting on 11-12 June for members of the OCLC Research Library Partnership. The meeting focused on how the evolving nature of academic research practices and scholarship is placing new demands on research library services. Shifting attitudes toward data sharing, methodologies in eScholarship, and rethinking the very definition of scholarly discourse: these are all areas that have deep implications for the library. But it is not only the research process that is changing; research universities are evolving in new directions, often becoming more outcome-oriented, changing to reflect the increased importance of impact assessment, and competing for funding. Libraries are taking on new roles and responsibilities to support change in research and in the academy. From our perch in OCLC Research, we can see that as libraries prepare to meet new demands and position themselves for the future, they themselves are changing, both in their organizational structure and in their alliances with other parts of the university and with external entities.
This meeting focused on three thematic areas: supporting change in research; supporting change at the university level; and changing support structures in the library.
For the first time, and in response to an increasing number of active partners in Europe, we held our Partnership meeting outside of the United States. Since we have a number of partners in the Netherlands, we opted to hold our meeting in Amsterdam. We were in a terrific venue, and the beautiful weather didn’t hurt.
Meeting attendees were greeted by Maria Heijne (Director of the University of Amsterdam Library and of the Library of Applied Sciences/Hogeschool of Amsterdam). [Link to video.] Maria highlighted the global perspective represented by those attending the meeting, who hailed from the Netherlands, the United Kingdom, Denmark, Italy, Germany, Australia, Japan, the US and Canada. The UofA library is a unique combination of library, special collections, and museum of archaeology. They offer a strong combination of services for the university and for the city of Amsterdam. Like so many libraries in the Partnership and beyond, the UofA library is preparing for a new facility, and looking to shift effort from cataloging and other backroom functions to working more closely with researchers and other customers.
Titia van der Werf (Senior Program Officer, OCLC Research) introduced the meeting and our themes [link to video], welcoming special guests from DANS, LIBER, RLUK and from OCLC EMEA Regional Council. The OCLC Research Library Partnership focuses on projects that have been defined as being of importance to partners. Examples of work in OCLC Research in support of the Partnership include looking at shifts in publication patterns and shifts in research (as highlighted in the Evolving Scholarly Record report), challenges in restructuring and redefining within the library (reflected in work done by my colleague Jim Michalko), and studying the behavior of researchers so we can understand evolving needs (reflected in our work synthesizing user and behavior studies). We also see interest and uptake in new ways of thinking about cataloging data, recasting metadata as identifiers (such as identifiers for people, subjects, or for works). As research changes, as universities change, so too do libraries need to change.
With that introduction to our meeting, I’ll close. Look for a short series of posts summarizing the remainder of the meeting, focusing on the three themes.
[The event webpage contains links to slides, videos, photos, Storify summaries]
Today I found the following resources and bookmarked them:
- GraphHopper Route Planner: GraphHopper is an efficient routing library and server based on OpenStreetMap data.
- OpenConferenceWare: OpenConferenceWare is an open source web application for events and conferences. This customizable, general-purpose platform provides proposals, sessions, schedules, tracks and more.
Digest powered by RSS Digest
District Dispatch: It’s now or (almost) never for real NSA reform; contacting Congress today critical!
It was mid-summer when Senator Patrick Leahy (D-VT), the outgoing Chairman of the Senate Judiciary Committee, answered the House of Representatives’ passage of an unacceptably weak version of the USA FREEDOM Act by introducing S. 2685, a strong, bipartisan bill of his own. Well, it’s taken until beyond Veterans Day, strong lobbying by civil liberties groups and tech companies, and a tough stand by Senate Majority Leader Harry Reid, but Leahy’s bill and real National Security Agency (NSA) reform may finally get an up-or-down vote in the just-opened “lame duck” session of the U.S. Senate. That result is very much up in the air, however, as this article goes to press.
Now is the time for librarians and others on the front lines of fighting for privacy and civil liberties to heed ALA President Courtney Young’s September call to “Advocate. Today.” And we do mean today. Here’s the situation:
Thanks to Majority Leader Reid, Senators will cast a key procedural vote late on Tuesday afternoon that is, in effect, “do or die” for proponents of meaningful NSA reform in the current Congress. If Senators Reid and Leahy, and all of us, can’t muster 60 votes on Tuesday night just to bring S. 2685 to the floor, then the overwhelming odds are—in light of the last election’s results—that another bill as good at reforming the USA PATRIOT Act as Senator Leahy’s won’t have a prayer of passage for many, many years.
Even if reform proponents prevail on Tuesday, however, our best intelligence is that some Senators will offer amendments intended to neuter or at least seriously weaken the civil liberties protections provided by Senator Leahy’s bill. Other Senators will try to strengthen the bill but face a steep uphill battle to succeed.
Soooooo….. now is the time for all good librarians (and everyone else) to come to the aid of Sens. Leahy and Reid, and their country. Acting now is critical . . . and it’s easy. Just click here to go to ALA’s Legislative Action Center. Once there, follow the user-friendly prompts to quickly find and send an e-mail to both of your U.S. Senators (well, okay, their staffs but they’ll get the message loud and clear) and to your Representative in the House. Literally a line or two is all you, and the USA FREEDOM Act, need. Tell ‘em:
- The NSA’s telephone records “dragnet,” and “gag orders” imposed by the FBI without a judge’s approval, under the USA PATRIOT Act must end;
- Bring Sen. Leahy’s USA FREEDOM Act to the floor of the Senate now; and
- Pass it without any amendments that make its civil liberties protections weaker (but expanding them would be just fine) before this Congress ends!
Just as in the last election, in which so many races were decided by razor-thin margins, your e-mail “vote” could be the difference between finally reforming the USA PATRIOT Act … or not. With the key vote on Tuesday night, there’s no time to lose. As President Young wrote: “Advocate. Today.”
The post It’s now or (almost) never for real NSA reform; contacting Congress today critical! appeared first on District Dispatch.
Open Knowledge Foundation: An unprecedented Public-Commons partnership for the French National Address Database
This is a guest post, originally published in French on the Open Knowledge Foundation France blog
Nowadays, being able to place an address on a map is essential. In France, where address data was still unavailable for reuse, the OpenStreetMap community decided to create its own National Address Database, available as open data. The project rapidly gained attention from the government. This led to the signing last week of an unprecedented Public-Commons partnership between the National Institute of Geographic and Forestry Information (IGN), Group La Poste, the new Chief Data Officer and the OpenStreetMap France community.
In August, before the partnership was signed, we met with Christian Quest, coordinator of the project for OpenStreetMap France. He explained the project and its implications to us.
Here is a summary of the interview, previously published in French on the Open Knowledge Foundation France blog.
Why did OpenStreetMap (OSM) France decide to create an Open National Address Database?
The idea to create an Open National Address Database came about one year ago after discussions with the Association for Geographic Information in France (AFIGEO). An Address Register was the topic of many reports; however, these reports came and went without any follow-up, and there were more and more people asking for address data on OSM.
Address data are indeed extremely useful. They can be used for itinerary calculations or, more generally, to localise any point with an address on a map. They are also essential for emergency services – ambulances, fire-fighters and police forces are very interested in the initiative.
These data are also helpful for the OSM project itself, as they enrich the map and are used to improve the quality of the data. The creation of such a register, with so many entries, required a collaborative effort both to scale up and to be maintained. As such, the OSM-France community naturally took it on. There was also a technical opportunity: OSM-France had previously developed a tool to collect information from the French cadastre website, which enabled them to start the register with a significant amount of information.
Was there no National Address Registry project in France already?
It existed on paper and in slides, but nobody ever saw the beginning of it. It is, nevertheless, a relatively old project, launched in 2002 following the publication of a report on addresses from the CNIG. This report is quite interesting and most of its points are still valid today, but not much has been done since then.
IGN and La Poste were tasked to create this National Address Register, but their commercial interests (selling data) have so far blocked this 12-year-old project. As a result, French address datasets did exist, but these datasets were created for specific purposes, as opposed to the idea of creating a reference dataset for French addresses. For instance, La Poste uses three different address databases: one for mail, one for parcels, and one for advertisements.
Technically, how do you collect the data? Do you reuse existing datasets?
We currently use three main data sources: OSM, which gathers a bit more than two million addresses; the address datasets already available as open data (see list here); and, when necessary, the address data collected from the website of the cadastre. We also use FANTOIR data from the DGFIP, which contains a list of all street names and lieux-dits known to the Tax Office. This dataset is also available as open data.
These different sources are gathered in a common database. Then, we process the data to complete entries and remove duplicates, and finally we package the whole thing for export. The aim is to provide harmonised content that brings together information from various sources, without redundancy. The process is run automatically every night, with the exception of manual corrections made by OSM contributors. Data are then made available as CSV files, shapefiles and in RDF format for semantic reuse. A CSV version is published on GitHub to enable everyone to follow the updates. We also produce an overlay map which allows contributors to improve the data more easily. OSM is used first because it is the only source in which we can collaboratively edit the data. If we need to add missing addresses, or correct them, we use OSM tools.
Is your aim to build the reference address dataset for the country?
This is a tricky question. What is a reference dataset? When more and more public services are using OSM data, does that mean you are looking at a reference dataset?
According to the definition of the French National Mapping Council (CNIG), a geographic reference must enable every reuser to georeference its own data. This definition does not consider any particular reuse. On the other hand, its aim is to enable as much information as possible to be linked to the geographic reference. For the National Address Database to become a reference dataset, it is imperative that the data become more exhaustive. Currently, there is data for 15 million reusable addresses (August 2014) of an estimated total of about 20 million. We have more in our cumulative database, but our export scripts ensure a minimum of quality and coherency and release only after the necessary checks have been made. We are also working on the lieux-dits, which are not address data points, but which are still used in many rural areas in France.
Beyond the question of the reference dataset, you can also see the work of OSM as complementary to that of public entities. IGN aims for homogeneous, complete coverage in its information. This is due to its mission of ensuring equal treatment of territories. We do not have such a constraint. For OSM, the density of data on a territory depends largely on the density of contributors. This is why we can offer a level of detail that is sometimes superior, in particular in the main cities, but this is also the reason why we are still missing data for some départements.
Finally, we think we are well prepared for the semantic web, and we already publish our data in RDF format using a W3C ontology close to the European INSPIRE model for address description.
The agreement reached includes a dual-license framework. You can reuse the data for free under an ODbL license, or you can opt for a non-share-alike license but you have to pay a fee. Is the share-alike clause an obstacle for the private sector?
I don't think so, because the ODbL license does not prevent commercial reuse. It only requires reusers to mention the source and to share any improvement of the data under the same license. For geographical data aimed at describing our land, this share-alike clause is essential to ensure that the common dataset stays up to date. Lands change constantly; data improvements and updates must, therefore, be continuous, and the more people are contributing, the more efficient this process is.
I see it as a win-win situation compared to the previous one, where multiple address datasets were maintained in closed silos, none of which was of acceptable quality for a key register, as it is difficult to stay up to date on your own.
However, for some companies, share-alike is incompatible with their business model, and a double licensing scheme is a very good solution. Instead of taking part in improving and updating the data, they pay a fee which will be used to improve and update the data.
And now, what is next for the National Address Database?
We now need to put in place tools to facilitate contribution and data reuse. Concerning contribution, we want to set up a one-stop application/API, separate from the OSM contribution tools, to enable everyone to report errors, add corrections or upload data. This kind of tool would enable us to easily integrate partners into the project. On the reuse side, we should develop an API for geocoding and address autocompletion, because not everybody will necessarily want to manipulate millions of addresses!
As a last word, OSM is celebrating its tenth anniversary. What does that inspire in you?
First, the success and the power of OpenStreetMap lie in its community, much more than in its data. Our challenge is therefore to maintain and develop this community. This is what enables us to do projects such as the National Address Database, but also to be more reactive than traditional actors when needed, for instance with the current Ebola situation. Centralised and systematic approaches to cartography have reached their limits. If we want better and more up-to-date map data, we will need to adopt a more decentralised way of doing things, with more contributors on the ground. Here’s to ten more years of the OpenStreetMap community!
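The nightly merge-and-dedupe step Christian describes can be sketched roughly as follows. This is an illustrative sketch, not the actual BANO code: the field names, source labels, and priority ordering are all assumptions, chosen only to show the idea of keying addresses on a normalised tuple and letting the collaboratively editable source win.

```python
from dataclasses import dataclass

@dataclass
class Address:
    source: str   # e.g. "osm", "opendata", "cadastre" (labels are assumptions)
    street: str
    number: str
    insee: str    # INSEE municipality code
    lat: float
    lon: float

# OSM wins ties, since it is the only source contributors can edit directly
PRIORITY = {"osm": 0, "opendata": 1, "cadastre": 2}

def merge(addresses):
    """Deduplicate addresses across sources, keeping one record per
    (municipality, street, number) key, from the highest-priority source."""
    best = {}
    for a in addresses:
        key = (a.insee, a.street.lower().strip(), a.number)
        if key not in best or PRIORITY[a.source] < PRIORITY[best[key].source]:
            best[key] = a
    return list(best.values())
```

A real pipeline would also normalise street-name spellings (e.g. against FANTOIR) before keying, and carry quality checks before export, as described in the interview.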
Today, Federal Communications Commission (FCC) Chairman Tom Wheeler held a press call to preview the draft E-rate order that will be circulated at the Commission later this week. The FCC invited Marijke Visser, assistant director of the American Library Association’s (ALA) Program on Networks, to participate in the call. ALA President Courtney Young released a statement in response to the FCC activity, applauding the momentum:
ALA has worked extremely hard on this proceeding to move the broadband bar for libraries so that communities across the nation can more fully benefit from the E’s of Libraries™. That is, as Chairman Wheeler recognizes, libraries provide critical services to our communities across the nation relating to Education, Employment, Entrepreneurship, Engagement and Empowerment.
Of course, the extent to which communities benefit from these services depends on the broadband capacity our libraries have. Unfortunately, for all too many libraries, the bandwidth needed is either not available at all or it is prohibitively expensive.
But what Chairman Wheeler described today will go a long way towards changing the broadband dynamic. With support and guidance from our Senior Counsel, Alan Fishel, ALA stood fast behind our recommendations through many difficult rounds of discussions. After today we have every indication that ALA’s unwavering advocacy and determination over the past year and a half will add up to a series of changes for the E-rate program that will provide desperately needed increased broadband capacity for urban, suburban, and rural libraries across the country.
ALA applauds Chairman Wheeler for his strong leadership throughout the modernization proceeding in identifying a clear path to closing the broadband gap for libraries and schools and ensuring a sustainable E-rate program. The critical increase in permanent funding that the Chairman described during today’s press call will help ensure that libraries can maintain the broadband upgrades we know the vast majority of our libraries are anxious to make. Moreover, the program changes that were referenced today—on top of those the Commission adopted in July—coupled with more funding is without a doubt a win-win for libraries and most importantly for the people in the communities they serve.
Larry Neal, president of the Public Library Association, a division of ALA, and director of the Clinton-Macomb Public Library (MI), also commented on the FCC draft E-rate order.
“The well-connected library opens up literally thousands of opportunities for the people who walk through the doors of their local library,” said Neal. “Libraries are with you from the earliest years with family apps for literacy, through the school years with STEM learning labs, to collaborative workspaces and information resources for small businesses, entrepreneurs, and the next generation of innovators. This should be the story for every library and could be if they had the capacity they needed.”
The post ALA applauds strong finish to the E-rate proceeding appeared first on District Dispatch.
Among his observations are:
- "by some measures the US spends almost 50% more in telecom services than it does for electricity."
- Content is not king; "net of what they pay to content providers, US cable networks appear to be getting more revenue out of Internet access and voice services than out of carrying subscription video, and all on a far smaller slice of their transport capacity".
- True streaming video, with its tight timing constraints, is not a significant part of the traffic. Video is a large part, "but it is almost exclusively transmitted as faster-than-real-time progressive downloads". Doing so allows for buffering to lift the timing constraints.
- "The main function of data networks is to cater to human impatience." Thus "Overprovisioning is not a bug but a feature, as it is indispensable to provide low transmission latency". "Once you have overengineered your network, it becomes clearer that pricing by volume is not particularly appropriate, as it is the size and availability of the connection that creates most of the value."
- "it seems safe to estimate worldwide telecom revenues for 2011 as being close to $2 trillion. About half the revenue ... comes from wireless."
- "with practically all [wireline] costs coming from ... installing the wire to the end user, the marginal costs of carrying extra traffic are negligible. Hence charging according to the volume of traffic cannot easily be justified on the basis of costs."
- "a modern telecom infrastructure for the US, with fiber to almost every premise, would not cost more than $450 billion, well under one year's annual revenue. But there is no sign of willingness to spend that kind of money ... Hence we can indeed conclude that modern telecom is less about high capital investment and far more a game of territorial control, strategic alliances, services and marketing, than of building a fixed infrastructure."
- "Yet another puzzle is the claim that building out fiber networks to the home is impossibly expensive. Yet at the cost of $1,500 per household (in excess of the $1,200 estimate ... for the Google project in Kansas City, were it to reach every household), and at a cost of capital of 8% ..., this would cost only $10 per house per month. The problem is that managers and their shareholders expect much higher rates of return than 8% per year. One of the paradoxes is that the same observers who claim that pension funds cannot hope to earn 8% annually are also predicting continuation of much higher corporate profit rates."
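The progressive-download point above is easy to quantify: whenever the link is faster than the video bitrate, the client accumulates a growing playback buffer, which removes the tight timing constraints of true streaming. A quick sketch with illustrative numbers (these figures are mine, not from the paper):

```python
def buffered_seconds(link_mbps, video_mbps, elapsed_s):
    """Seconds of video sitting in the playback buffer after `elapsed_s`
    seconds of faster-than-real-time progressive download (startup delay
    and protocol overhead ignored)."""
    fetched_s = elapsed_s * link_mbps / video_mbps  # video-seconds downloaded
    return fetched_s - elapsed_s                    # minus video-seconds played

# A 10 Mbps link carrying a 4 Mbps video builds up 90 seconds of slack
# in the first minute of playback:
print(buffered_seconds(10, 4, 60))  # 90.0
```

With that much slack in hand, brief network hiccups never reach the viewer, which is why overprovisioned best-effort networks carry video comfortably.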
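The $10-per-house-per-month figure in the last bullet follows directly from treating the per-household build cost as financed capital. A minimal check of the arithmetic (the function name is mine):

```python
def monthly_carrying_cost(build_cost_usd, annual_rate):
    """Monthly cost of capital on a one-time per-household build cost."""
    return build_cost_usd * annual_rate / 12

# $1,500 per household financed at an 8% annual cost of capital:
print(monthly_carrying_cost(1500, 0.08))  # 10.0
```

At the higher rates of return that managers and shareholders expect, say 15%, the same build cost implies $18.75 per household per month, which is exactly the tension the quoted passage describes.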
Of the articles that were most frequently downloaded [from First Monday] in 1999, 6 of the top 10 were published in previous years! This supports the thesis that easy online access leads to much wider usage of older materials. [Section 9]
After an initial period, frequency of access does not vary with age of article, and stays pretty constant with time (after discounting for general growth in usage). [Section 10]
Now the Google Scholar team has followed their Rise of the Rest paper, which I blogged about here, with a validation of Odlyzko's prediction. Their new paper On the Shoulders of Giants: The Growing Impact of Older Articles takes another look at the effect that the dramatic changes as scholarly communication migrated to the Web have had on the behavior of authors. The two major changes have been:
- The greater accessibility of the literature, caused by digitization of back content, born-digital journals and pre-print archives, and relevance ranking by search engines.
- The great increase in the volume of publication, caused by the greatly reduced cost of on-line publication and the reduction of competition for space.