Feed aggregator

DPLA: Far-reaching “Hydra-in-a-Box” Joint Initiative Funded by IMLS

planet code4lib - Wed, 2015-04-15 13:30

Boston, MA – The Digital Public Library of America (DPLA), Stanford University, and the DuraSpace organization are pleased to announce that their joint initiative has been awarded a $2M National Leadership Grant from the Institute of Museum and Library Services (IMLS). Nicknamed Hydra-in-a-Box, the project aims to foster a new, national library network through a community-based repository system, enabling discovery, interoperability and reuse of digital resources by people from this country and around the world.

This transformative network is based on advanced repositories that not only empower local institutions with new asset management capabilities, but also interconnect their data and collections through a shared platform.

“At the core of the Digital Public Library of America is our national network of hubs, and they need the systems envisioned by this project,” said Dan Cohen, DPLA’s executive director. “By combining contemporary technologies for aggregating, storing, enhancing, and serving cultural heritage content, we expect this new stack will be a huge boon to DPLA and to the broader digital library community. In addition, I’m thrilled that the project brings together the expertise of DuraSpace, Stanford, and DPLA.”

Each of the partners will fulfill specific roles in the joint initiative. Stanford will use its existing leadership in the Hydra Project to develop core components, in concert with the broader Hydra community. DPLA will focus on the connective tissue between hubs, mapping, and crosswalks to DPLA’s metadata application profile, and infrastructure to support metadata enhancement and remediation. DuraSpace will use its expertise in building and serving repositories, and doing so at scale, to construct the back-end systems for Hydra hosting.

“DuraSpace is excited to provide the infrastructure for this project,” said Debra Hanken Kurtz, DuraSpace CEO. “It aligns perfectly with our mission to steward the scholarly and cultural heritage records and make them accessible for current and future generations. We look forward to working with DPLA and Stanford to support their work and that of the community to ensure a robust and sustainable future for Hydra-in-a-Box.”

Over the project’s 30-month time frame, the partners will engage with libraries, archives, and museums nationwide, especially current and prospective DPLA hubs and the Hydra community, to systematically capture the needs for a next-generation, open source, digital repository. They will collaboratively extend the existing Hydra project codebase to build, bundle, and promote a feature-complete, robust digital repository that is easy to install, configure, and maintain—in short, a next-generation digital repository that will work for institutions large and small, and is capable of running as a hosted service. Finally, starting with DPLA’s own metadata aggregation services, the partners will work to ensure that these repositories have the necessary affordances to support networked aggregation, discovery, management and access to these resources, producing a shared, sustainable, nationwide platform.

“The Hydra Project has already demonstrated enormous traction and value as a best-in-class digital repository for institutions like Stanford,” said Tom Cramer, Chief Technology Strategist at the Stanford University Libraries. “And yet there is so much more to do. This grant will provide the means to rapidly accelerate Hydra’s rate of development and adoption–expanding its community, features and value all at once.”

To find out more about the Hydra-in-a-Box initiative contact Dan Cohen (dan@dp.la), Tom Cramer (tcramer@stanford.edu) or Debra Hanken Kurtz (dkurtz@duraspace.org). An information page is available here: https://wiki.duraspace.org/display/hydra/Hydra+in+a+Box.

About DPLA

The Digital Public Library of America (http://dp.la) strives to contain the full breadth of human expression, from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science. Since launching in April 2013, it has aggregated over 8.5 million items from over 1,700 institutions. The DPLA is a registered 501(c)(3) non-profit.

About DuraSpace

DuraSpace (http://duraspace.org) is an independent 501(c)(3) not-for-profit organization providing leadership and innovation for open technologies that promote durable, persistent access to digital data. We collaborate with academic, scientific, cultural, and technology communities by supporting projects (DSpace, Fedora, VIVO) and creating services (DuraCloud, DSpaceDirect, ArchivesDirect) to help ensure that current and future generations have access to our collective digital heritage. Our values are expressed in our organizational byline, “Committed to our digital future.”

About Stanford University Libraries

The Stanford University Libraries (http://library.stanford.edu) is internationally recognized as a leader among research libraries, and in leveraging digital technology to support scholarship in the age of information. It is a founder of both the Hydra Project and the Fedora 4 repository effort, and a leading institution in the International Image Interoperability Framework (IIIF) (http://iiif.io).

About the Hydra Project

The Hydra Project (http://projecthydra.org) is both an open source community and a suite of software that provides a flexible and robust framework for managing, preserving, and providing access to digital assets. The project motto, “One body, many heads,” speaks to the flexibility provided by Hydra’s modern, modular architecture, and the power of combining a robust repository backend (the “body”) with flexible, tailored, user interfaces (“heads”). Co-designed and developed in concert with Fedora 4, the extensible, durable, and widely used repository software, the Hydra/Fedora stack is the centerpiece of a thriving and rapidly expanding open source community poised to easy-to-implement solution.

FOSS4Lib Recent Releases: Mirador - 2.0

planet code4lib - Wed, 2015-04-15 12:51

Last updated April 15, 2015. Created by Peter Murray on April 15, 2015.

Package: Mirador
Release Date: Tuesday, April 14, 2015

FOSS4Lib Recent Releases: pycounter - 0.5a2

planet code4lib - Wed, 2015-04-15 12:48

Last updated April 15, 2015. Created by Peter Murray on April 15, 2015.

Package: pycounter
Release Date: Monday, April 6, 2015

Code4Lib Journal: Special Issue on Diversity in Library Technology Guest Editorial Committee

planet code4lib - Wed, 2015-04-15 10:54
The guest editorial committee for Code4Lib Journal’s Special Issue on Diversity in Library Technology (issue 28) was developed in order to include new voices and perspectives on the journal’s practices and how they support inclusivity. The committee is comprised of eight guest editors and two regular editorial committee members. More information on the development of […]

Code4Lib Journal: Finding and Supporting New Voices: Code4Lib Journal’s Issue 28 on Diversity in Library Technology

planet code4lib - Wed, 2015-04-15 10:54
Welcome to Code4Lib Journal’s special issue on diversity in library technology. As C4LJ’s first-ever special issue, 28 brings together a plethora of voices from the library tech world in order to approach the challenge of inclusivity within our field from all directions. Over a year of development has gone into this project, which has involved […]

Code4Lib Journal: Feminism and the Future of Library Discovery

planet code4lib - Wed, 2015-04-15 10:54
This paper discusses the various ways in which the practices of libraries and librarians influence the diversity (or lack thereof) of scholarship and information access. We examine some of the cultural biases inherent in both library classification systems and newer forms of information access like Google search algorithms, and propose ways of recognizing bias and applying feminist principles in the design of information services for scholars, particularly as libraries re-invent themselves to grapple with digital collections.

Code4Lib Journal: How to Hack it as a Working Parent

planet code4lib - Wed, 2015-04-15 10:54
The problems faced by working parents in technical fields in libraries are not unique or particularly unusual. However, the cross-section of work-life balance and gender disparity problems found in academia and technology can be particularly troublesome, especially for mothers and single parents. Attracting and retaining diverse talent in work environments that are highly structured or with high expectations of unstated off-the-clock work may be impossible long term. (Indeed, it is not only parents that experience these work-life balance problems but anyone with caregiver responsibilities such as elder or disabled care.) Those who have the energy and time to devote to technical projects for work and fun in their off-work hours tend to get ahead. Those tied up with other responsibilities or who enjoy non-technical hobbies do not get the same respect or opportunities for advancement. Such problems mirror the experiences of women on the tenure track in academia, particularly women working in libraries, and they provide a useful corollary for this discussion. We present some practical solutions for those in technical positions in libraries. Such solutions involve strategic use of technical tools, and lightweight project management applications. Technical workarounds are not the only answer; real and lasting change will involve a change in individual priorities and departmental culture such as sophisticated and ruthless time management, reviewing workloads, cross-training personnel, hiring contract replacements, and creative divisions of labor. Ultimately, a flexible environment that reflects the needs of parents will help create a better workplace culture for everyone, kids or no kids.

Code4Lib Journal: But Then You Have to Make It Happen

planet code4lib - Wed, 2015-04-15 10:54
Librarianship as a profession has a strong commitment to diversity and tends to attract professionals ethically inclined to champion inclusion. The authors, both from historically underrepresented populations in library information technology, have a half-century of combined experience in the field and have held positions ranging from technician, systems librarian, instructional technologist, head of circulation, and digital scholarship and services librarian to associate dean in an academic library. The authors share their experiences and discuss how diversity and inclusion must be embraced at the individual level in order to develop a culture of diversity within an organization and to attract and retain diverse technology teams. Internal commitments to supporting a diverse environment are ultimately critical to recognizing, assessing, and fulfilling the needs of patrons. The authors identify and detail individual and grassroots efforts that have led to library technology programming for underserved populations, including programs involving outreach to diverse student and prospective student communities over the course of their careers. They reflect on strategies to create and retain a diverse technology group within the library and to advance and support diversity within the day-to-day work environment. They posit that a mix of experiences is necessary to advocate for access to underrepresented patron populations and to negotiate and implement a truly diverse environment with regard to ethnicity, gender, age, and socioeconomic background.

Code4Lib Journal: Code as Code: Speculations on Diversity, Inequity, and Digital Women

planet code4lib - Wed, 2015-04-15 10:54
All technologies are social. Taking this socio-technological position becomes less a political stance than a necessity when considering the lived experience of digital inequity, divides, and –isms as they are encountered in everyday library work spheres. Personal experience as women and women of color in our respective technological and leadership communities provides both fore- and background to explore the private-public lines delineating definitions of “diversity”, “inequity”, and digital literacies in library practice. We suggest that by not probing these definitions at the most personal level of lived experience, we in the LIS and technology professions will remain well-intentioned, but ineffective, in genuine inclusion.

Code4Lib Journal: User Experience is a Social Justice Issue

planet code4lib - Wed, 2015-04-15 10:54
When we're building services for people, we often have a lot more practice seeing from the computer's point of view than seeing from another person's point of view. The author asks the library technology community to consider several case studies in this problem, including their root causes, and the negative impact of this problem on achieving our mission as library technologists. The author then recommends specific actions that we, as individual contributors and organizations, can take to increase our empathy and improve the user experience we provide to patrons.

Code4Lib Journal: Recognizing Cultural Diversity in Library Interface Development

planet code4lib - Wed, 2015-04-15 10:54
The rapid increase in complex library digital infrastructures has enabled a more full-featured set of resources to become accessible by autonomous users, whether onsite or remote. However, this trend also necessitates careful consideration of the usability of new interfaces for populations with increasing cultural, geographic, and socioeconomic diversity. Researcher Aaron Marcus has become an authority on how cultural principles affect interface perceptions and inform their development. This article will explore Marcus’ work to contextualize diversity issues within usability before exploring the redevelopment strategy for the New York University Libraries’ web presence, which serves a broad and global set of users.

Code4Lib Journal: Transforming Knowledge Creation: An Action Framework for Library Technology Diversity

planet code4lib - Wed, 2015-04-15 10:54
This paper will articulate an action framework for library technology diversity consisting of five dimensions and based on the vision for knowledge creation, the academic library’s fundamental vision. The framework focuses on increasing diversity for library technology efforts based on the desire for transformation and inclusiveness within and across the dimensions. The dimensions are people, content and pedagogy, embeddedness and the global perspective, leadership, and the 5th dimension – bringing it all together.

Code4Lib Journal: “What If I Break It?”: Project Management for Intergenerational Library Teams Creating Non-MARC Metadata

planet code4lib - Wed, 2015-04-15 10:54
Libraries are constantly challenged to meet new user needs and to provide access to new types of materials. We are in the process of launching many new technology-rich initiatives and projects which require investments of staff time, a resource which is at a premium for most new library hires. We simultaneously have people on staff in our libraries with more traditional skill sets who may be able to contribute time and theoretical expertise to these projects, but require training. Incorporating these “seasoned” employees into new initiatives can be a daunting task. In this article, I will share some of the strategies I have used as a metadata project manager for bridging diverse generations of library staff who have various levels of comfort and expertise with technology, and strategies that I have used to reduce the barriers to participation for staff with diverse perspectives and skill sets. These strategies can also be helpful in assisting a new librarian with technology-rich skill sets to more successfully orient themselves when embedded in a “traditional” library setting.

David Rosenthal: The Maginot Paywall

planet code4lib - Tue, 2015-04-14 21:55
Two recent papers examine the growth of peer-to-peer sharing of journal articles. Guillaume Cabanac's "Bibliogifts in LibGen? A study of a text-sharing platform driven by biblioleaks and crowdsourcing" (LG) is a statistical study of the Library Genesis service, and Carolyn Caffrey Gardner and Gabriel J. Gardner's "Bypassing Interlibrary Loan via Twitter: An Exploration of #icanhazpdf Requests" (TW) is a similar study of one of the sources for Library Genesis. Both implement forms of Aaron Swartz's Guerilla Open Access Manifesto, a civil disobedience movement opposed to the malign effects of current copyright law on academic research. Below the fold, some thoughts on the state of this movement.

In the years leading up to WWII, the French built the Maginot Line as an impregnable barrier against a German invasion:
While the fortification system did prevent a direct attack, it was strategically ineffective, as the Germans invaded through Belgium, going around the Maginot Line.

Copyright maximalists, such as the major academic publishers, are in a similar position. The more effective and thus intrusive the mechanisms they implement to prevent unauthorized access, the more they incentivize "guerilla open access".

Some copyright owners are coming to terms with this phenomenon. Today, Hugh Pickens reports that the first 4 of the 10 episodes of Game of Thrones' new season have leaked:
The episodes have already been downloaded almost 800,000 times, and that figure was expected to blow past a million downloads by the season 5 premiere. Game of Thrones has consistently set records for piracy, which has almost been a point of pride for HBO. "Our experience is [piracy] leads to more penetration, more paying subs, more health for HBO, less reliance on having to do paid advertising. If you go around the world, I think you're right, Game of Thrones is the most pirated show in the world. Well, you know, that's better than an Emmy."

LG shows the massive scale on which "guerilla open access" is happening in the field of academic journals. As of the study, Library Genesis hosted nearly 23M articles identified by DOI, 15TB of data. The distribution was heavily skewed to the major publishers, representing 77% of Elsevier's DOIs, 73% of Wiley's and 53% of Springer's, although only 36% of all DOIs. To give some idea of the scale, this is about 60% of Ontario's Scholar's Portal, which has 38M.

Although some open access DOIs are included, the motivation to upload them is much less. A recent estimate by Khabsa and Lee Giles is that 24% of all articles are openly accessible on the Web; their methodology excluded most content from Library Genesis. Not all DOIs from major publishers are paywalled: they publish some open access journals and allow Gold open access (author pays) in some cases. Despite these elements of double counting, it appears likely that at least a majority of all articles, and significantly more than a majority of major publisher articles, can be accessed without passing through a paywall.

Although the bulk of the Library Genesis content arrived via a small number of large uploads, the median upload rate is 2720 new articles/day. Among the sources for them are:
  • The Scholar subreddit, which LG estimates sees about 45 requests/day for articles to be shared via Library Genesis.
  • Sci-Hub, a service using proxies running on networks with subscriptions to paywalled publishers that allows users to enter a DOI. If it is not available from Library Genesis, the service tries proxies at random until one is found that can access the paper, which is both served to the user and added to Library Genesis.
Presumably, the #icanhazpdf hashtag is another of the Library Genesis upload paths. TW analyzed 824 requests from 475 users over 3 months, or about 10/day. 674 of them were for articles, from 493 different journal titles. The mechanism doesn't provide information about how many were satisfied, or how many of the results ended up on Library Genesis.

LG doesn't have an estimate of the Sci-Hub traffic, but unless it is very large there must be other mechanisms filling the large gap between the Scholar subreddit and #icanhazpdf rates and the Library Genesis median upload rate.
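
The size of that gap can be put in rough numbers. The sketch below uses only the figures quoted above; the 90-day length for TW's "3 months" and the generous assumption that every request ends up as a Library Genesis upload are mine, not the papers'.

```python
# Back-of-the-envelope comparison of known request rates with the
# Library Genesis median upload rate. All inputs are figures quoted in
# the post, except ICANHAZPDF_DAYS, which is an assumed 90-day window.
MEDIAN_UPLOADS_PER_DAY = 2720    # Library Genesis median upload rate (LG)
SUBREDDIT_REQUESTS_PER_DAY = 45  # Scholar subreddit estimate (LG)
ICANHAZPDF_REQUESTS = 824        # requests analyzed by TW
ICANHAZPDF_DAYS = 90             # assumption: "3 months" taken as 90 days

icanhazpdf_per_day = ICANHAZPDF_REQUESTS / ICANHAZPDF_DAYS

# Upper bound: treat every request as if it produced one upload.
known_per_day = SUBREDDIT_REQUESTS_PER_DAY + icanhazpdf_per_day
gap_per_day = MEDIAN_UPLOADS_PER_DAY - known_per_day
known_share = known_per_day / MEDIAN_UPLOADS_PER_DAY

print(f"#icanhazpdf requests/day: {icanhazpdf_per_day:.1f}")
print(f"known sources/day:        {known_per_day:.1f}")
print(f"unexplained uploads/day:  {gap_per_day:.0f}")
print(f"share explained:          {known_share:.1%}")
```

Even on these generous assumptions the two measured sources would account for only about 2% of the median daily uploads.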

Admittedly, it takes time for newly published articles to appear outside their paywalls. Some publishers operate "moving walls", so their articles become open access after an embargo period. It takes time for the various mechanisms driving Library Genesis to locate and upload articles. LG shows that their most recent year (2013) has only about half as many articles as the previous year, so the average delay is similar to the moving wall.

Paying to pass through paywalls thus delivers some value, not just access to a minority of the content but also more timely access to some of the majority. Nevertheless, the multi-billion dollar profits of the major publishers, let alone the other multiple billions that represent their costs in supplying their services, are hard to justify. We have already seen that their peer review process fails in its assigned role of ensuring the quality of the papers they publish. Now we see that the majority of the content for which they charge these enormous sums is available without payment.

My previous posts on scholarly communication.

Meredith Farkas: Sinners, saints, and social media take-downs

planet code4lib - Tue, 2015-04-14 21:39

I hate one-dimensional characters in movies and TV. I love complex characters who have good qualities and bad. I like that “The Good Wife” actually isn’t really such a paragon of moral virtue at all. That she has made questionable decisions and struggles with things, just like we all do. I like how many of the “villains” on that show do monstrous things, but still have likable qualities and people they love and who love them in turn. I’m glad we’re seeing more and more shows like that, where characters are as flawed and three-dimensional as we all are.

Yet there seems to be something in us that likes to simplify things when it comes to judging real people. Someone is either good or bad. On the side of right or on the side of evil. And there’s a tendency to either vilify people or put them on a pedestal. But the world is not so black-and-white.

I think few things have made that tendency to simplify as clear to me as the whole Joe Murphy vs. #Teamharpy lawsuit and social media debacle. It seemed like the dominant narrative either had to be that Lisa Rabey and nina de jesus were heroes and saints and Joe Murphy was a monster, or that Joe Murphy was a saint and poor innocent victim and Lisa Rabey and nina de jesus were monsters. I personally don’t believe either is true. Joe Murphy is not a saint, but he has had his reputation damaged (maybe fatally in our profession) for something there may be no evidence of him having done. Calling someone a sexual predator without first-hand knowledge or evidence that they are one (and I’m not saying that victims need to have evidence) seems like a shitty thing to do. But, given the number of negative things I’d heard about Joe from other librarians prior to all this, I’m assuming (and hoping) that Lisa thought she was doing something good in warning people about him.

I’m writing this knowing that I will probably be trolled by someone for it, but c’est la vie. I’m disturbed by the fact that, after all of the petitions, and Facebook drama, and blog posts, and tweets about this no one seems to be talking about this (other than right-wing feminist-hating nut-jobs) since the lawsuit was settled and Lisa and nina published retractions. We shouldn’t let right-wing feminist-hating nut-jobs control the narrative. And we also should be willing to admit when we were wrong and/or stand up for our beliefs if we feel we are right.

When I first wrote a post about all this, social media had been relatively quiet about it. I think there had been a couple of blog posts and the Team Harpy WordPress site was up, but nothing with a lot of vitriol had come out. Most of the rhetoric seemed focused generally on how common sexual harassment is — even in our female-dominated profession — and how important it is that there are whistleblowers who speak out about that behavior. There were posts about the importance of believing victims and supporting whistleblowers. I’d say that people were generally supportive of Lisa and nina, but were not necessarily assuming that Joe was what they said he was.

Soon after, the discussion took a turn for the bizarre, at least to me. The conversation around Joe on Facebook and Twitter became intensely vitriolic, with plenty of people arguing his guilt as if they had inside information. Respected library administrators who have never met Joe were calling him a “douchebag” on Twitter. There was a change.org petition asking him to drop his lawsuit, apologize to nina and Lisa, and compensate them. It was signed by over 1,000 people, including many people I like and respect. I did not sign it. I found it really odd that no one was considering the fact that he might be the victim in this. Instead, Lisa and nina were treated like victims, even though, if they did harm his career without any evidence of a crime, they were very much in the wrong. I find it difficult to believe that over 1,000 other people knew for a fact that he actually was a sexual predator.

It seemed more like people thought he was wrong to have sued them. If someone publicly accused me of a terrible crime with no evidence and damaged my career, wouldn’t I be the injured party and shouldn’t I be able to seek damages in a court of law? The idea that he was squashing their free speech rights was ridiculous. If it’s not true that Joe is a sexual predator, it is slander. It’s one thing to say Joe Murphy is a jerk. That is opinion. But stating that someone is factually something that they don’t know is true is not protected speech. Destroying someone’s reputation is a tremendous and personal violation of another human being. But maybe he deserved it because he was a player and a flirt? How is that any different than “slut-shaming?” I found it disturbing that none of the people I like and respect seemed to be acknowledging this. But maybe everyone but me knew for a fact that it was true?

I don’t like Joe Murphy. I still feel about him exactly the way I did when I wrote my first post. But, as I mentioned then, I think the fact that he was disliked by so many people made it easy for folks to believe him to have done it (and he might consider why so many people were saying awful things about him behind his back, because it’s not just “haters gonna hate”). We’ve all seen the delight people feel when someone powerful (or someone who is perceived of as being privileged) is taken down. I’ve been reading a lot about Jon Ronson’s new book So You’ve Been Publicly Shamed and am looking forward to reading it and learning more about this strange and all-too-common social phenomenon.

In addition to the fact that plenty of people wanted to see him taken down a peg, this was happening at a time when things like gamergate and the recent conversations, articles, and presentations about sexual harassment in librarianship were shining a pretty bright light on this issue. I think people wanted to show their support for women who have been the victims of sexual harassment and this lawsuit gave our community an opportunity to come together to do that.

But let’s remember something here: nina and Lisa were not sexually harassed by Joe Murphy. That was never what anyone was claiming. But many people behaved like Joe was suing the victims of harassment. No. He was suing people who were reporting something they said they’d heard. This wasn’t about believing the victims of sexual harassment. They may have believed they were doing the right thing, but they weren’t harassed by Joe prior to posting what they did.

Now the tide has shifted and the trolls are attacking nina, Lisa, and their supporters (including me, though I wasn’t actually a supporter). I can’t even blame Joe much for engaging in a bit of schadenfreude now (I’ve seen him favoriting some of the trolling tweets his lawyers have been shooting out to me and others). I can’t fathom the suffering he must have endured through all this. I can’t imagine how demoralizing it must have been to have more than 1,000 people in our profession signing a change.org petition against him. But sadly, because he’s put on the mantle of the innocent victim and good-guy, I doubt very much that he is going to examine the behavior that got him here (and I don’t mean the lawsuit).

And that’s the rub. How do we call people like Joe on their shit in a way that might actually create change? Calling them a sexual predator on Twitter without evidence is clearly not it. I believe in the power of social media for good, but I haven’t seen a lot of good come out of it when it comes to calling out powerful men for bad behavior, because many then just position themselves as victims. Has public shaming really ever worked to meaningfully change people’s behavior (again, gotta read Ronson’s book)? But the “whisper network” doesn’t work either. People were saying lots of things about Joe, but the information wasn’t getting to people in power or maybe even Joe himself. Maybe he didn’t know how a lot of people felt about him. I have no idea.

Still, the greatest tragedy here, in my opinion, is that so many women suffer sexual harassment and most of the time the perpetrators get away with it. And this whole sordid affair did little to help the cause of encouraging women to come forward. I’ve been sexually harassed and stalked and never reported any of it. But it was when a faculty member at a former job who used to stand too close to me and would put his arm around my waist sometimes later escalated to grabbing a colleague’s breasts that I realized my silence was hurting other women. Because men who do things like this don’t just do it once. If they get away with something that you consider too minor to report, they may escalate to doing something much worse to someone else. We have to find more ways to help women feel safe reporting harassment. I’m happy that more conferences now have codes of conduct and discernible methods of reporting inappropriate behavior, and that will help, but it’s not enough.

I don’t have anything positive to end with here, so I’ll close with an excerpt from an interview with Jon Ronson where he talks about a situation where a guy at a conference was social media shamed after a woman tweeted about an off-color joke he made and then she was horribly trolled after he said he lost his job because of it. See any parallels?

The strange thing is the impulse to shame often comes from a good place. Like the desire to confront sexism, say. A good example is the tech conference incident: Hank whispers a naff joke about ‘big dongles’ to his friend, Adria hears it and takes offence, posts something on Twitter and the whole thing snowballs.

Ronson: Yeah, everybody involved in that shaming is doing it for social justice reasons. So Adria feels that in calling out Hank she’s calling out a greater truth: that privileged white men don’t know the effect they have on other people. The trolls think they’re doing the right thing because they feel Adria robbed Hank of his employment – so they wanted to get back at her. Everybody involved in that story feels the urge to be a good person – and it’s carnage all round. Everyone is broken by the experience; especially Adria, she has it worse than anybody. I mean, I’m on Hank’s side. Nobody wants to live in a world where you can’t make a dongle joke! But by the end of the story, Hank’s okay, he’s got a new job, but Adria’s unemployed and subjected to death threats. So Adria’s view of the world feels vindicated.


Ed Summers: Tweets and Deletes

planet code4lib - Tue, 2015-04-14 16:20

Archives are full of silences. Archivists try to surface these silences by making appraisal decisions about what to collect and what not to collect. Even after they are accessioned, records can be silenced by culling, weeding and purging. We do our best to document these activities, to leave a trail of these decisions, but they are inevitably deeply contingent. The context for the records and our decisions about them unravels endlessly.

At some point we must accept that the archival record is not perfect, and that it’s a bit of a miracle that it exists at all. But in all these cases it is the archivist who has agency: the deliberate or subliminal decisions that determine what comprises the archival record are enacted by an archivist. In addition the record creator has agency, in their decision to give their records to an archive.

Perhaps I’m over-simplifying a bit, but I think there is a curious new dynamic at play in social media archives, specifically archives of Twitter data. I wrote in a previous post about how Twitter’s Terms of Service prevent distribution of Twitter data retrieved from their API, but do allow for the distribution of Tweet IDs and relatively small amounts of derivative data (spreadsheets, etc).

Tweet IDs can then be hydrated, or turned back into raw original data, by going back to the Twitter API. If a tweet has been deleted you cannot get it back from the API. The net effect this has is of cleaning, or purging, the archival record as it is made available on the Web. But the decision of what to purge is made by the record creator (the creator of the tweet) or by Twitter themselves in cases where tweets or users are deleted.

For example, let’s look at the collection of Twitter data that Nick Ruest has assembled in the wake of the attack on the offices of Charlie Hebdo earlier this year. Nick collected 13 million tweets mentioning four hashtags related to the attacks, for the period of January 9th to January 28th, 2015. He has made the tweet IDs available as a dataset for researchers to use (a separate file for each hashtag). I was interested in replicating the dataset for potential researchers at the University of Maryland, but also in seeing how many of the tweets had been deleted.

So on February 20th (42 days after Nick started his collecting) I began hydrating the IDs. It took 4 days for twarc to finish. When it did I counted up the number of tweets that I was able to retrieve. The results are somewhat interesting:

| hashtag | archived tweets | hydrated | deletes | percent deleted |
|---|---|---|---|---|
| #JeSuisJuif | 96,518 | 89,584 | 6,934 | 7.18% |
| #JeSuisAhmed | 264,097 | 237,674 | 26,423 | 10.01% |
| #JeSuisCharlie | 6,503,425 | 5,955,278 | 548,147 | 8.43% |
| #CharlieHebdo | 7,104,253 | 6,554,231 | 550,022 | 7.74% |
| Total | 13,968,293 | 12,836,767 | 1,131,526 | 8.10% |

It looks like 1.1 million tweets out of the 13.9 million tweet dataset have been deleted. That’s about 8.1%. I suspect even more have been deleted by now. Although that dataset is much smaller, the percentage of deletes for #JeSuisAhmed (10.01%) is quite a bit higher than for #JeSuisCharlie and #CharlieHebdo; #JeSuisJuif, at 7.18%, is actually slightly lower. Could it be that users were concerned about how their tweets would be interpreted by parties analyzing the data?
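The counting step behind the table can be sketched as a set difference between the archived tweet IDs and the IDs that came back from hydration. This is only a minimal sketch, not the code used for this post: the function name `deletion_stats` is hypothetical, and it assumes both inputs are plain iterables of ID strings (for example, one ID per line read from a file).

```python
def deletion_stats(archived_ids, hydrated_ids):
    """Compare archived tweet IDs with the IDs returned by hydration.

    Both arguments are assumed to be iterables of tweet ID strings.
    IDs that were archived but did not come back are counted as deletes.
    """
    archived = set(archived_ids)
    hydrated = set(hydrated_ids) & archived  # ignore any unexpected extras
    deleted = archived - hydrated
    # Guard against an empty archive to avoid dividing by zero.
    percent = 100.0 * len(deleted) / len(archived) if archived else 0.0
    return {
        "archived": len(archived),
        "hydrated": len(hydrated),
        "deletes": len(deleted),
        "percent_deleted": round(percent, 2),
    }


# Toy example: four archived IDs, one of which can no longer be hydrated.
stats = deletion_stats(["1", "2", "3", "4"], ["1", "3", "4"])
print(stats)
```

Using sets keeps the comparison independent of the order in which the API returns tweets, at the cost of holding all IDs in memory.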

Of course, it’s very hard for me to say since I don’t have the deleted tweets. I don’t even know who sent them. A researcher interested in these questions would presumably need to travel to York University to work with the dataset. In a way this seems to be how archives usually work. But if you add the Web as a global, public access layer into the mix it complicates things a bit.

FOSS4Lib Updated Packages: pycounter

planet code4lib - Tue, 2015-04-14 16:04

Last updated April 14, 2015. Created by wooble on April 14, 2015.

pycounter makes working with COUNTER usage statistics in Python easy, including fetching statistics with NISO SUSHI.

Developed by the Health Sciences Library System of the University of Pittsburgh to support importing usage data into our in-house Electronic Resources Management (ERM) system.

Licensed under the MIT license. See the file LICENSE for details.

pycounter is tested on Python 2.6, 2.7, 3.3, 3.4, 3.5, and pypy2.

Package Type: Electronic Resource Management
License: MIT License
Operating System: Browser/Cross-Platform
Technologies Used: SUSHI
Programming Language: Python

Brown University Library Digital Technologies Projects: Josiah

planet code4lib - Tue, 2015-04-14 14:01

Josiah is the library’s public catalog.  Records are exported nightly for the VuFind and new Blacklight discovery systems.

Mark E. Phillips: Extended Date Time Format (EDTF) use in the DPLA: Part 2, EDTF use by Hub

planet code4lib - Tue, 2015-04-14 12:30

This is the second in a series of posts about the Extended Date Time Format (EDTF) and its use in the Digital Public Library of America.  For more background on this topic take a look at the first post in this series.

EDTF Use by Hub

In the previous post I looked at the overall usage of the EDTF in the DPLA and found that it didn’t matter that much if a Hub was classified as a Content Hub or a Service Hub when it came to looking at the availability of dates in item records in the system.  Content Hubs had 83% of their records with some date value and Service Hubs had 81% of their records with date values.

Looking overall at the dates that were present,  there were 51% that were valid EDTF strings and 49% that were not valid EDTF strings.

One thing that should be noted is that many common date formats that we use throughout our work are valid EDTF date strings, even though they may not have been created with the idea of supporting EDTF.  For example, 1999 and 2000-04-03 are both valid EDTF strings (and heavily used date patterns) that are normal to run across in our collections. Many of the “valid EDTF” dates in the DPLA fall into this category.

I wanted to look at how the EDTF was distributed across the Hubs in the dataset and was able to create the following table.

| Hub Name | Items With Date | % of total items with date present | Valid EDTF | Valid EDTF % | Not Valid EDTF | Not Valid EDTF % |
|---|---|---|---|---|---|---|
| ARTstor | 49,908 | 88.6% | 26,757 | 53.6% | 23,151 | 46.4% |
| Biodiversity Heritage Library | 29,000 | 21.0% | 22,734 | 78.4% | 6,266 | 21.6% |
| David Rumsey | 48,132 | 100.0% | 48,132 | 100.0% | 0 | 0.0% |
| Digital Commonwealth | 118,672 | 95.1% | 14,731 | 12.4% | 103,941 | 87.6% |
| Digital Library of Georgia | 236,961 | 91.3% | 188,263 | 79.4% | 48,687 | 20.5% |
| Harvard Library | 6,957 | 65.8% | 6,910 | 99.3% | 47 | 0.7% |
| HathiTrust | 1,881,588 | 98.2% | 1,295,986 | 68.9% | 585,598 | 31.1% |
| Internet Archive | 194,454 | 93.1% | 185,328 | 95.3% | 9,126 | 4.7% |
| J. Paul Getty Trust | 92,494 | 99.8% | 6,319 | 6.8% | 86,175 | 93.2% |
| Kentucky Digital Library | 87,061 | 68.1% | 87,061 | 100.0% | 0 | 0.0% |
| Minnesota Digital Library | 39,708 | 98.0% | 33,201 | 83.6% | 6,507 | 16.4% |
| Missouri Hub | 34,742 | 83.6% | 32,192 | 92.7% | 2,550 | 7.3% |
| Mountain West Digital Library | 634,571 | 73.1% | 545,663 | 86.0% | 88,908 | 14.0% |
| National Archives and Records Administration | 553,348 | 78.9% | 10,218 | 1.8% | 543,130 | 98.2% |
| North Carolina Digital Heritage Center | 214,134 | 82.1% | 163,030 | 76.1% | 51,104 | 23.9% |
| Smithsonian Institution | 675,648 | 75.3% | 44,860 | 6.6% | 630,788 | 93.4% |
| South Carolina Digital Library | 52,328 | 68.9% | 42,128 | 80.5% | 10,200 | 19.5% |
| The New York Public Library | 791,912 | 67.7% | 47,257 | 6.0% | 744,655 | 94.0% |
| The Portal to Texas History | 424,342 | 88.8% | 416,835 | 98.2% | 7,505 | 1.8% |
| United States Government Printing Office (GPO) | 148,548 | 99.9% | 17,894 | 12.0% | 130,654 | 88.0% |
| University of Illinois at Urbana-Champaign | 14,273 | 78.8% | 11,304 | 79.2% | 2,969 | 20.8% |
| University of Southern California. Libraries | 269,880 | 89.6% | 114,293 | 42.3% | 155,573 | 57.6% |
| University of Virginia Library | 26,072 | 86.4% | 21,798 | 83.6% | 4,274 | 16.4% |

Turning this into a graph helps things show up a bit better.

EDTF info for each of the DPLA Hubs

There are a number of things that can be teased out of here. First, a few Hubs already have 100% or nearly 100% of their dates conforming to EDTF, notably David Rumsey’s Hub and the Kentucky Digital Library, both at 100%.  Harvard at 99% and the Portal to Texas History at 98% are also notable.  On the other end of the spectrum we have the National Archives and Records Administration with 98% of their dates being Not Valid, the New York Public Library with 94%, and the J. Paul Getty Trust at 93%.

Use of EDTF Level Features

The EDTF has the notion of feature levels, which include Level 0, Level 1, and Level 2.  Level 0 covers the basic date features such as date, date and time, and intervals.  Level 1 adds features like uncertain/approximate dates, unspecified, extended intervals, years exceeding four digits, and seasons to the mix. Level 2 adds to the feature set with partial uncertain/approximate dates, partial unspecified, sets, multiple dates, masked precision, and extensions of the extended interval and years exceeding four digits.  Finally, Level 2 lets you qualify seasons.  For a full list of the features please take a look at the draft specification at the Library of Congress.

When I was preparing the dataset I also tested the dates to see which feature level they matched.  After starting the analysis I noticed a few bugs in my testing code and added them as issues to the GitHub site for the ExtendedDateTimeFormat Python module available here.  Even with the bugs, which falsely identified one feature as both a Level0 and Level1 feature, and another feature as both Level1 and Level2, I was able to come up with usable data for further analysis.  Because of these bugs there are a few Hubs in the list below whose number of valid EDTF items differs slightly from the list presented in the first part of this post.

| Hub Name | valid EDTF items | valid-level0 | % Level0 | valid-level1 | % Level1 | valid-level2 | % Level2 |
|---|---|---|---|---|---|---|---|
| ARTstor | 26,757 | 26,726 | 99.9% | 31 | 0.1% | 0 | 0.0% |
| Biodiversity Heritage Library | 22,734 | 22,702 | 99.9% | 32 | 0.1% | 0 | 0.0% |
| David Rumsey | 48,132 | 48,132 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| Digital Commonwealth | 14,731 | 14,731 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| Digital Library of Georgia | 188,274 | 188,274 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| Harvard Library | 6,910 | 6,822 | 98.7% | 83 | 1.2% | 5 | 0.1% |
| HathiTrust | 1,295,990 | 1,292,079 | 99.7% | 3,662 | 0.3% | 249 | 0.0% |
| Internet Archive | 185,328 | 185,115 | 99.9% | 212 | 0.1% | 1 | 0.0% |
| J. Paul Getty Trust | 6,319 | 6,308 | 99.8% | 11 | 0.2% | 0 | 0.0% |
| Kentucky Digital Library | 87,061 | 87,061 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| Minnesota Digital Library | 33,201 | 26,055 | 78.5% | 7,146 | 21.5% | 0 | 0.0% |
| Missouri Hub | 32,192 | 32,190 | 100.0% | 2 | 0.0% | 0 | 0.0% |
| Mountain West Digital Library | 545,663 | 542,388 | 99.4% | 3,274 | 0.6% | 1 | 0.0% |
| National Archives and Records Administration | 10,218 | 10,003 | 97.9% | 215 | 2.1% | 0 | 0.0% |
| North Carolina Digital Heritage Center | 163,030 | 162,958 | 100.0% | 72 | 0.0% | 0 | 0.0% |
| Smithsonian Institution | 44,860 | 44,642 | 99.5% | 218 | 0.5% | 0 | 0.0% |
| South Carolina Digital Library | 42,128 | 42,079 | 99.9% | 49 | 0.1% | 0 | 0.0% |
| The New York Public Library | 47,257 | 47,251 | 100.0% | 6 | 0.0% | 0 | 0.0% |
| The Portal to Texas History | 416,838 | 402,845 | 96.6% | 6,302 | 1.5% | 7,691 | 1.8% |
| United States Government Printing Office (GPO) | 17,894 | 16,165 | 90.3% | 875 | 4.9% | 854 | 4.8% |
| University of Illinois at Urbana-Champaign | 11,304 | 11,275 | 99.7% | 29 | 0.3% | 0 | 0.0% |
| University of Southern California. Libraries | 114,307 | 114,307 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| University of Virginia Library | 21,798 | 21,558 | 98.9% | 236 | 1.1% | 4 | 0.0% |

Looking at the top 25% of the data,  you get the following.

EDTF Level Use by Hub

Obviously the majority of dates in the DPLA that are valid EDTF comply with Level0, which includes standard dates like years (1900), year and month (1900-03), year, month and day (1900-03-03), full date and time (2014-03-03T13:23:50), and intervals with any of the dates (yyyy, yyyy-mm, yyyy-mm-dd) in the format of 2004-02/2014-03-23.
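To make the Level0 forms listed above concrete, here is a rough sketch of a checker for just those patterns. It is not the ExtendedDateTimeFormat module used for this analysis, and it is deliberately simplified: the helper names are hypothetical, time zone suffixes on date-and-time values are not handled, and anything outside the patterns named above is simply rejected.

```python
import re
from datetime import datetime

# yyyy, yyyy-mm or yyyy-mm-dd, with two-digit month and day.
_DATE_RE = re.compile(r"\d{4}(?:-\d{2}(?:-\d{2})?)?")
# Full date and time, without any time zone suffix (a simplification).
_DATETIME_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def _valid_date(value):
    """Return True if value is a real yyyy, yyyy-mm or yyyy-mm-dd date."""
    if not _DATE_RE.fullmatch(value):
        return False
    for fmt in ("%Y", "%Y-%m", "%Y-%m-%d"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def is_level0(value):
    """Rough check for the Level0 patterns: date, date and time, interval."""
    if _DATETIME_RE.fullmatch(value):
        try:
            datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
            return True
        except ValueError:
            return False
    parts = value.split("/")
    if len(parts) == 2:
        # An interval is two dates separated by a slash.
        return all(_valid_date(part) for part in parts)
    return _valid_date(value)


for example in ("1900", "1900-03", "2004-02/2014-03-23", "circa 1900"):
    print(example, is_level0(example))
```

Running the shape check first and then `strptime` means that strings with the right shape but an impossible month or day (such as 1900-13) are rejected rather than counted as valid.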

There are a number of Hubs that are making use of Level 1 and Level 2 features, the most notable being the Minnesota Digital Library, which makes use of Level 1 features in 21.5% of their valid EDTF dates.  The Portal to Texas History and the Government Printing Office both make use of Level 2 features as well, with the Portal having them present in 7,691 item records (1.8% of their valid EDTF dates) and GPO in 854 of their item records (4.8%).
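The percentages in the level table are simple shares of each Hub’s valid EDTF items. As a small sketch of that arithmetic (the function name `level_shares` is hypothetical, and it assumes each valid date is counted under exactly one level):

```python
def level_shares(level0, level1, level2):
    """Return the Level0, Level1 and Level2 percentages, to one decimal."""
    total = level0 + level1 + level2
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(round(100.0 * count / total, 1)
                 for count in (level0, level1, level2))


# Minnesota Digital Library counts from the table above.
print(level_shares(26055, 7146, 0))
```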

I have one more post in this series that will take a closer look at which of the EDTF features are being used in the DPLA as a whole and then for each of the Hubs.

Feel free to contact me via Twitter if you have questions or comments.
