Planet Code4Lib

LITA: Brave New Workplace: Text Mining

Wed, 2015-11-25 14:00
Text Mining Visualization from McGill University

Hi there, future text miners. Before we head down the coal chute together, I’ll begin by saying this, and I hope it will reassure you: no matter your level of expertise or your experience writing code or conducting data analysis, you can find an online tool to help you text mine.

The internet is a wild and beautiful place sometimes.

But before we go there, you may be wondering: what’s this Brave New Workplace business all about? Brave New Workplace is my monthly discussion of tech tools and skill sets that can help you adapt to and get to know a new workplace. In our previous two installments I’ve discussed my own techniques and approaches for learning about your coworkers’ needs and common goals. Today I’m going to talk about text mining the results of your survey, and about text mining generally.

Now three months into my new position, I have found that text mining my survey results was only the first step to developing additional awareness of where I could best apply my expertise to library needs and goals. I went so far as to text mine three years of eresource Help Desk tickets and five years of meeting notes. All of it was fun, helpful, and revealing.

Text mining can assist you in information gathering in a variety of ways, but I tend to think it’s helpful to keep in mind the big three.

1. Seeing the big picture (clustering)
2. Finding answers to very specific questions (question answering)
3. Hypothesis generation (concept linkages)

For the purpose of this post, I will focus on tools for clustering your data set. As with any data project, I encourage you to categorize your inputs and vigorously review and pre-process your data. Exclude documents or texts that do not pertain to the subject of your inquiry. You want your data set to be big and deep, not big and shallow.

I will divide my tool suggestions into two categories: beginner and intermediate. For my beginners just getting started, you will not need to use any programming language, but for intermediate, you will.


I know, you’ve seen a million word clouds.

Start yourself off easy and use WordClouds. This simple site will make you a pretty word cloud, and also provide you with a comprehensive word frequency list. Those frequencies are concept clusters, and from them you can begin to see trends in your coworkers’ needs and your workplace’s goals. This is a pretty cool, and VERY user-friendly, way to get started with text mining.

WordClouds eliminates frequently used words, like articles, and gets you to the meat of your texts. You can copy and paste text or upload text files. You can also scan a site URL for text, which is what I’ve elected to do as an example here, examining my library’s home page. The best output of WordClouds is not the word cloud. It’s the easily exportable list of frequently occurring words.

WordCloud Frequency List

To be honest, I often use this WordClouds function before getting into other data tools. It can be a way to better figure out categories of needs, and it’s a great first data mining step that requires almost zero effort. With your frequency list in hand you can do some immediate (and perhaps more useful) data visualization in a simple tool of your choice, for instance Excel.


Excel Graphs for Visualization


Intermediate Tools

Depending on your preferred programming language, many options are available to you. While I have traditionally worked in SPSS for data analysis, I have recently been working in R. The good news about R versus SPSS: R is free and there’s a ton of community collaboration. If you have a question (I often do), it’s easy to find an answer.

Getting started in R with text mining is simple. If you are text mining for the first time, you’ll need to install the necessary packages.

Then save your text files in a folder titled “texts” and load them into R. Once they’re in, you’ll need to pre-process your text to remove common words and punctuation. This guide is excellent at taking you through the steps to process and analyze your data.
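
To make those first steps concrete, here is a minimal sketch in R. It assumes the widely used tm package and a folder of plain-text files named “texts” (the guide linked above may use a slightly different set of packages):

    # Install the text mining package (only needed the first time)
    install.packages("tm")

    library(tm)

    # Load every plain-text file in the "texts" folder into a corpus
    docs <- VCorpus(DirSource("texts"))

    # Pre-process: lowercase everything, strip punctuation and numbers,
    # drop common English stop words, and squeeze extra whitespace
    docs <- tm_map(docs, content_transformer(tolower))
    docs <- tm_map(docs, removePunctuation)
    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, removeWords, stopwords("english"))
    docs <- tm_map(docs, stripWhitespace)

    # Build a document-term matrix for the analysis steps below
    dtm <- DocumentTermMatrix(docs)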

Just as with WordClouds, you can use R to discover term frequencies and visualize them. Beyond this, working in R or SPSS or Python allows you to cluster terms further. You can find relationships between words and examine those relationships in a dendrogram or by k-means. These will let you see the relationships between clusters of terms.
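
Continuing the sketch above (still assuming the tm package and the dtm document-term matrix built from the “texts” corpus), the frequency, dendrogram, and k-means steps might look something like this; the sparsity threshold and the number of clusters are placeholders to experiment with:

    # Term frequencies, most frequent first (the R version of the WordClouds list)
    freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
    head(freq, 20)
    barplot(head(freq, 20), las = 2)

    # Drop rarely used terms so the clusters stay readable
    # (0.5 keeps terms appearing in at least half the documents; adjust to taste)
    dtms <- removeSparseTerms(dtm, 0.5)

    # Hierarchical clustering of terms, drawn as a dendrogram
    d <- dist(t(as.matrix(dtms)), method = "euclidean")
    plot(hclust(d, method = "ward.D2"))

    # k-means clustering of the same terms by their distances
    # (3 clusters is only a starting guess)
    km <- kmeans(d, centers = 3)
    km$cluster

The sorted freq vector is the same kind of list WordClouds exports, so comparing the two is a quick sanity check on your pre-processing.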

Ultimately, the more you text mine, the more familiar you will become with the tools and analyses most valuable for approaching a specific text dataset. Get out there and text mine, kids. It’s a great way to acculturate to a new workplace or just learn more about what’s happening in your library.

Now that we’ve text mined the results of our survey, it’s time to move onto building a Customer Relationship Management system (CRM) for keeping our collaborators and projects straight. Come back for Brave New Workplace: Your Homegrown CRM on December 21st.

William Denton: Zotero and Mozilla

Wed, 2015-11-25 04:31

A quick pointer to Automated Scanning of Firefox Extensions is Security Theater (And Here’s Code to Prove It) by Dan Stillman, lead developer on Zotero, about how extension signing (meant to make Firefox more secure) could cause serious problems for Zotero.

A quote:

For the last few months, we’ve been asking on the Mozilla add-ons mailing list that Zotero be whitelisted for extension signing. If you haven’t been following that discussion, 1) lucky you, and 2) you can read my initial post about it, which gives some context. The upshot is that, if changes aren’t made to the signing process, we’ll have no choice but to discontinue Zotero for Firefox when Firefox 43 comes out, because, due to Zotero’s size and complexity, we’ll be stuck in manual review forever and unable to release timely updates to our users, who rely on Zotero for time-sensitive work and trust us to fix issues quickly. (Zotero users could continue to use our standalone app and one of our lightweight browser extensions, but many people prefer the tighter browser integration that the original Firefox version provides.)

Mozilla should give Zotero the special treatment it deserves. It’s a very important tool, and a crucial part of ongoing research all over the world. Mozilla needs to support it.

Tara Robertson: NOPE-sevier

Wed, 2015-11-25 00:51

This morning I received an email asking me to peer review a book proposal for Chandos Publishing, the Library and Information Studies imprint of Elsevier. Initially I thought it was spam because of some sloppy punctuation and the “Dr. Robertson” salutation.

When other people pointed out that this likely wasn’t spam, my ego was flattered for a few minutes and I considered it. I was momentarily confused: would participating in Elsevier’s book publishing process be evil? Isn’t it different from their predatory pricing models with libraries and roadblocks to sharing research more broadly? I have a lot to learn about scholarly publishing, but I decided that I’m not going to contribute my labour to a company that is a jerk to librarians, researchers, and libraries.

Here are some links I found useful:

Amy Buckland’s pledge to support open access

Mita Williams pointed me to The Cost of Knowledge petition, which I also encourage you to sign.

LITA: Jobs in Information Technology: November 24, 2015

Tue, 2015-11-24 20:56

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week:

Tenure-track – STEM Librarian, Shippensburg University, Shippensburg, PA

Web Services Librarian, Meridian Library District, Meridian, ID

Information Technology & Virtual Services (ITVS) Officer, Pikes Peak Library District, Colorado Springs, CO

Systems & Discovery Services Librarian, Wabash College, Crawfordsville, IN

Systems Administrator, University of Wisconsin-Madison General Library System, Madison, WI

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

DuraSpace News: View from the DuraSpace Tweet-o-sphere

Tue, 2015-11-24 00:00

Winchester, MA  For a quick round-up of current news and information about events and achievements happening around the digital preservation and access ecosystem, visit DuraSpace Today and follow DuraSpace on Twitter.

DuraSpace News: Cineca Highlights Recent Events and Developments

Tue, 2015-11-24 00:00

From Michele Mennielli, Cineca  

Bologna, Italy  In the last two months Cineca attended two very important IT events focused on support for Higher Education. At both events the Italian Consortium presented its research ecosystem related activities, focusing on DSpace and DSpace-CRIS.

DuraSpace News: Provider-NT is Now Neki IT

Tue, 2015-11-24 00:00

From Tiago Ferreira  

Petrópolis, Rio de Janeiro, Brazil  Provider IT Neki Technologies, the Brazilian DuraSpace Registered Service Provider, has undergone a major change during the past few months and is now Neki IT. Located in Petrópolis, Rio de Janeiro, Neki IT has left the Provider Group and is once again running its own structure.

NYPL Labs: Scribe: Toward a General Framework for Community Transcription

Mon, 2015-11-23 22:08

A couple of weeks ago, NYPL Labs was very excited to release Emigrant City, the Library's latest effort to unlock the data found within our collections.

But Emigrant City is a bit different from the other projects we’ve released in one very important way: this one is built on top of a totally new framework called Scribe, built in collaboration with Zooniverse and funded by a grant from the NEH Office of Digital Humanities along with funds from the University of Minnesota. Scribe is the codebase working behind the scenes to support this project.


What is Scribe?

Scribe is a highly configurable, open-source framework for setting up community transcription projects around handwritten or OCR-resistant texts. Scribe provides the foundation of code for a developer to configure and launch a project far more easily than if starting from scratch.

NYPL Labs R&D has built a few community transcription apps over the years. In general, these applications are custom built to suit the material. But Scribe prototypes a way to describe the essential work happening at the center of those projects. With Scribe, we propose a rough grammar for describing materials, workflows, tasks, and consensus. It’s not our last word on the topic, but we think it’s a fine first pass proposal for supporting the fundamental work shared by many community transcription projects.

So, what’s happening in all of these projects?

Our previous community transcription projects run the gamut from requesting very simple, nearly binary input like “Is this a valid polygon?” (as in the case of Building Inspector) to more complex prompts like “Identify every production staff member in this multi-page playbill” (as in the case of Ensemble). Common tasks include:

  • Identify a point/region in a digitized document or image
  • Answer a question about all or part of an image
  • Flag an image as invalid (meaning it’s blank or does not include any pertinent information)
  • Flag others’ contributions as valid/invalid
  • Flag a page or group of pages as “done”

There are many more project-specific concerns, but we think the features above form the core work. How does Scribe approach the problem?

Scribe reduces the problem space to “subjects” and “classifications.” In Scribe, everything is either a subject or a classification: Subjects are the things to be acted upon, classifications are created when you act. Creating a classification has the potential to generate a new subject, which in turn can be classified, which in turn may generate a subject, and so on.

This simplification allows us to reduce complex document transcription to a series of smaller decisions that can be tackled individually. We think breaking tasks into smaller, more atomic units makes projects less daunting for volunteers to begin and easier to continue. This simplification doesn’t come at the expense of quality, however, as projects can be configured to require multiple rounds of review.

The final subjects produced by this chain of workflows represent the work of several people carrying an initial identification all the way through to final vetted data. The journey comprises a chain of subjects linked by classifications connected by project-specific rules governing exposure and consensus. Every region annotated is eventually either deleted by consensus or further annotated with data entered by several hands and, potentially, approved by several reviewers. The final subjects that emerge represent singular assertions about the data contained in a document validated by between three and 25 people.

In the case of Emigrant City specifically, individual bond records are represented as subjects. When participants mark those records up, they produce “mark” subjects, which appear in Transcribe. In the Transcribe workflow, other contributors transcribe the text they see, which are combined with others’ transcriptions as “transcribe” subjects. If there’s any disagreement among the transcriptions, those transcribe subjects appear in Verify where additional classifications are added by other contributors as votes for one or another transcription. But this is just the configuration that made sense for Emigrant City. Scribe lays the groundwork to support other configurations.

Is it working?

I sure hope so! In any case, the classifications are mounting for Emigrant City. As of this writing we’ve gathered 227,638 classifications comprising marks, transcriptions, and verifications from nearly 3,000 contributors. That’s about 76 classifications each, on average, which is certainly encouraging as we assess the stickiness of the interface.

We’ve had to adjust a few things here and there. Bugs have surfaced that weren’t apparent before testing at scale. Most issues have been patched and data seems to be flowing in the right directions from one workflow to the next. We’ve already collected complete, verified data for several documents.

Reviewing each of these documents, I’ve been heartened by the willingness of a dozen strangers spread across the US, Europe, and Australia to meditate on some scribbles in a 120-year-old mortgage record. I see them plugging away when I’m up at 2 a.m. looking for a safe time to deploy fixes.

What’s next?

As touched on above, Scribe is primarily a prototype of a grammar for describing community transcription projects in general. The concepts underlying Scribe formed over a several-month collaboration between remote teams. We built the things we needed as we needed them. The codebase is thus a little confusing in areas, reflecting several mental right turns when we found the way forward required an additional configuration item or chain of communication. So one thing I’d like to tackle is reining in some of the areas that have drifted from the initial elegance of the model. The notion that subjects and workflows could be rearranged and chained in any configuration has been a driving idea, but in practice the system supports only a few arrangements.

An increasingly pressing desire, however, is developing an interface to explore and vet the data assembled by the system. We spent a lot of time developing the parts that gather data, but perhaps not enough on interfaces to analyze it. Because we’ve reduced document transcription to several disconnected tasks, the process of reassembling the resultant data into a single cohesive whole is complicated. That complexity requires a sophisticated interface to understand how we arrived at a document’s final set of assertions from the chain of contributions that produced it. Luckily we now have a lot of contributions around which to build that interface.

Most importantly, the code is now out in the wild, along with live projects that rely on it. We’re already grateful for the tens of thousands of contributions people have made on the transcription and verification front, and we’d likewise be immensely grateful for any thoughts or comments on the framework itself. Let us know in the comments, or directly via GitHub, and thanks for helping us get this far.

Also, check out the other community transcription efforts built on Scribe. Measuring the Anzacs collects first-hand accounts from New Zealanders in WWI. Coming soon, “Old Weather: Whaling” gathers Arctic ships’ logs from the late 19th and early 20th centuries.

District Dispatch: Fair Use highlighted at Re:Create conference

Mon, 2015-11-23 22:04

In his opening remarks at the November 17 Re:Create conference, Public Knowledge President & CEO Gene Kimmelman shared his thoughts about fair use as a platform for today’s creative revolution and as a key to how knowledge is shared in today’s society. That set the tone for the dynamic discussion of copyright policy and law that followed, the cohesive focus behind the Re:Create coalition.

Panel Moderator Mike Masnick, founder, Techdirt and CEO, Copia Institute, poses a question to panelists (L. to R.): Casey Rae, CEO, Future of Music; Julie Samuels, executive director, Engine; Howard University Law Professor Lateef Mtima, founder and director, Institute for Intellectual Property and Social Justice; and Greta Peisch, international trade counsel, Senate Finance Committee.

“Yes, it’s important for creators to have a level of protection for their work,” Eli Lehrer, president of the R Street Institute, said, “but that doesn’t mean government should have free rein. The founding fathers wanted copyright to be limited but they also wanted it to support the growth of science and the arts.” He went on to decry how copyright has been “taken over by special interests and crony capitalism. We need a vibrant public domain to support true creation,” he said, “and our outdated copyright law is stifling the advancement of knowledge and new creators in the digital economy.”

Three panels of experts brought together by the Re:Create Coalition then proceeded to critique pretty much every angle of copyright law and the role of the copyright office. They also discussed the potential for modernization of the U.S. Copyright Office, whether the office should stay within the Library of Congress or move, and the prospects for reform of the copyright law. The November 17 morning program was graciously hosted by Washington, D.C.’s Martin Luther King, Jr. Memorial Library.

Cory Doctorow, author and advisor to the Electronic Frontier Foundation, believes audiences should have the opportunity to interact with artists/creators. He pointed to Star Wars as an example. Because fans and audiences have interacted and carried the theme and impact forward, Star Wars continues to be a big cultural phenomenon, despite long pauses between new parts in the series. As Michael Weinberg, general counsel and intellectual property (IP) expert at Shapeways, noted, there are certain financial benefits in “losing control,” i.e. the value of the brand is being augmented by audience interaction, thus adding value to the product. Doctorow added that we’ve allowed copyright law to become entertainment copyright law, thus “fans get marginalized by the heavyweight producers.”

On the future of the copyright office panel, moderator Michael Petricone, senior vice president, government affairs, Consumer Technology Association (CTA), said we need a quick and efficient copyright system, and “instead of fighting over how to slice up more pieces of the pie, let’s focus on how to make the pie bigger.”

Jonathan Band, counsel to the Library Copyright Alliance (LCA), said the Copyright Office used to just manage the registration process, but then 1) the volume of works multiplied, 2) some people registered and others didn’t, so registrations no longer reflect who is actually using the system, and 3) the office didn’t have the resources to keep up with the huge volume of things being created. This “perfect storm,” he said, is not going to improve without important changes, such as modernizing its outdated and cumbersome record-keeping, but the Office also needs additional resources to address the “enormous black hole of rights.” Laura Moy, senior policy counsel at the Open Technology Institute (OTI), agreed that this is a big problem, because many new creators don’t have the resources or the legal counsel to help them pursue copyright searches and registration.

All the panelists agreed that it makes no sense to move the Copyright Office out of the Library of Congress, as a few have proposed. Matt Schruers, vice president for law & policy, Computer & Communications Industry Association (CCIA), urged more robust record-keeping, incentives to get people to register, and steps to mitigate costs. He said “we need to look at what the problems are, and fix them where they are. A lot of modernization can be done in the Office where it is, instead of all the cost of moving it and setting it up elsewhere.”  Band strongly agreed. “Moving it elsewhere wouldn’t solve the issue/cost of taking everything digital. Moving the Office just doesn’t make sense.” Moy suggested there are also some new skills and expertise that are needed, such as someone with knowledge of IP and its impacts on social justice.

Later in the program, panelists further batted around the topic of fair use. For Casey Rae, CEO of the Future of Music Coalition, fair use is often a grey area of copyright law because it depends on how fair use is interpreted by the courts. In the case of remixes, for example, the court, after a lengthy battle, ruled in favor of 2 Live Crew’s parody of “Oh, Pretty Woman,” establishing that commercial parodies can qualify for fair use status. Lateef Mtima, professor of law, Howard University, and founder and director of the Institute for Intellectual Property and Social Justice, cited the Lenz v. Universal case that not only ruled in favor of the mom who had posted video on YouTube of her baby dancing to Prince’s “Let’s Go Crazy,” but established that fair use is a right, warning those who consider issuing a takedown notice to “ignore it at your own peril.”

When determining fair use, Greta Peisch, international trade counsel, Senate Finance Committee, said “Who do you trust more to best interpret what is in the best interests of society, the courts or Congress?” The audience response clearly placed greater confidence in the courts on that question. And Engine Executive Director Julie Samuels concluded that “fair use is the most important piece of copyright law—absolutely crucial.”

In discussing the future of copyright reform, Rae said there’s actually very little data on how revising the laws will impact the creative community and their revenue streams. He said legislation can easily be created based on assumptions without the data to back it up, so he urged more research. But he also implied that the music industry (sound recording and music studios) needs to do a better job of explaining its narrative, i.e., going to policymakers with data in hand and real-life stories to share.

Mtima is optimistic that society is making progress in better understanding how the digital age has opened up the world for sharing knowledge and expanding literacy (what he called society’s Gutenberg moment). At first, he says, there was resistance to change. But as content users have made more and more demands for access to content, big content providers are recognizing the need to move away from the old model of controlling and “monetizing copies.” New models are developing and there’s recognition that opening access is often expanding the value of the brand.

Re:Create’s ability to focus on such an important area of public policy as copyright is the reason the coalition has attracted a broad and varied membership. It remains an important forum for discourse among creators, advocates, thinkers and consumers.

The post Fair Use highlighted at Re:Create conference appeared first on District Dispatch.

Cynthia Ng: Better than Christmas Morning: Finding Your Motivation

Mon, 2015-11-23 19:32
This was originally posted on The Pastry Box on October 1. Unfortunately, for some reason it was not tweeted about, so I didn’t see that it had been published. Anyway, here it is re-published. Enjoy. People frequently ask me whether I like or enjoy the work that I do. In theory, I’m helping 10% of … Continue reading Better than Christmas Morning: Finding Your Motivation

DPLA: George Washington Had a Killer “Craft” Beer Recipe

Mon, 2015-11-23 18:00

What better time than Fall for a new craft beer recipe? This one, in particular, has a unique origin story—and it starts with Founding Father and first US President George Washington.

The recipe was found written in a notebook that Washington kept during the French and Indian War, digitized and available through The New York Public Library. The notebook entries, which begin in June 1757, put a 25-year-old Washington at Fort Loudoun in Winchester, Virginia, where he served as a colonel in the Virginia Regiment militia. Washington’s experience in the militia, where he served as an ambassador, led expeditions, and defended Virginia against French and Indian attacks, gave him a military and political savvy that helped shape his leadership of the Continental Army during the Revolutionary War.

The notebook gives a unique, day-to-day view into Washington’s time in the military. Its entries include his notes for “Sundry things to be done in Williamsburg” and lists of supplies (with pages marked with cross-hatched x’s once the items were done). Washington outlines memos and letters, including to the Governor of Virginia and the Speaker of the House of Burgesses. He describes his horses, too—Nelly, Jolly, Ball, Jack, Rock, Woodfin, Prince, Buck, Diamond, and Crab—with illustrations of their brand marks.

Excerpt on making small beer from George Washington’s notebook as a Virginia colonel. Courtesy New York Public Library.

Among these notes, on the final page of the book, is Washington’s recipe for “small beer.” This type of beer is thought to have had a low alcohol content and low quality, and is believed to have been regularly given to soldiers in the British Army. While other, higher-quality alcohol was for the rich, who could afford the luxury, small beer was typically for paid servants. Other alcohol rations, like rum and later whiskey, were given to both slaves and employees at Mount Vernon on a weekly basis.

The small beer recipe, transcribed below, makes provisions for the types of conditions Washington or others may have needed for wartime preparation, outside of a more stable brewery. The directions require little time or ingredients, and include additional steps to take depending on the weather.

Take a large Sifter full of Bran Hops to your Taste — Boil these 3 hours. Then strain out 30 Gall. into a Cooler put in 3 Gallons Molasses while the Beer is scalding hot or rather drain the molasses into the Cooler. Strain the Beer on it while boiling hot let this stand til it is little more than Blood warm. Then put in a quart of Yeast if the weather is very cold cover it over with a Blanket. Let it work in the Cooler 24 hours then put it into the Cask. leave the Bung open til it is almost done working — Bottle it that day Week it was Brewed.

For Washington, beer was a favorite drink (though he enjoyed a higher quality than that described in his notebook). It was typically on the menu for dinners at Mount Vernon, and a bottle of beer was given to servants daily. Washington even brewed his own beer on the estate, in what Mount Vernon historians believe to be sizeable quantities.

An illustration of Washington’s estate, Mount Vernon. Courtesy of The New York Public Library.

In 1797, he started a whiskey distillery, too, making use of the plantation’s grain, which produced up to 12,000 gallons a year. While his distillery was a successful business venture for Washington, he himself wasn’t a fan of whiskey, and preferred his customary mug of beer each night at dinner.

Washington’s notebook was digitized as part of The New York Public Library’s Early American Manuscripts Project, which is looking to digitize 50,000 pages of material. These unique documents give a new perspective on life in the colonies and during the Revolutionary War, on a large and small scale. Besides the digitized papers of Founding Fathers (like Washington, Thomas Jefferson, Alexander Hamilton and James Madison), there are collections of diaries, business papers, and other fascinating colonial material.

A portrait of George Washington as a colonel during the French and Indian War. Courtesy of The New York Public Library.

ACRL TechConnect: The Library as Research Partner

Mon, 2015-11-23 16:00

As I typed the title for this post, I couldn’t help but think “Well, yeah. What else would the library be?” Instead of changing the title, however, I want to actually unpack what we mean when we say “research partner,” especially in the context of research data management support. In the most traditional sense, libraries provide materials and space that support the research endeavor, whether it be in the physical form (books, special collections materials, study carrels) or the virtual (digital collections, online exhibits, electronic resources). Moreover, librarians are frequently involved in aiding researchers as they navigate those spaces and materials. This aid is often at the information seeking stage, when researchers have difficulty tracking down references, or need expert help formulating search strategies. Libraries and librarians have less often been involved at the most upstream point in the research process: the start of the experimental design or research question. As one considers the role of the Library in the scholarly life-cycle, one should consider the ways in which the Library can be a partner with other stakeholders in that life-cycle. With respect to research data management, what is the appropriate role for the Library?

In order to achieve effective research data management (RDM), planning for the life-cycle of the data should occur before any data are actually collected. In circumstances where there is a grant application requirement that triggers a call to the Library for data management plan (DMP) assistance, this may be possible. But why are researchers calling the Library? Ostensibly, it is because the Library has marketed itself (read: its people) as an expert in the domain of data management. It has most likely done this in coordination with the Research Office on campus. Even more likely, it did this because no one else was. It may have done this as a response to the National Science Foundation (NSF) DMP requirement in 2011, or it may have just started doing this because of perceived need on campus, or because it seems like the thing to do (which can lead to poorly executed hiring practices). But unlike monographic collecting or electronic resource acquisition, comprehensive RDM requires much more coordination with partners outside the Library.

Steven Van Tuyl has written about the common coordination model of the Library, the Research Office, and Central Computing with respect to RDM services. The Research Office has expertise in compliance and Central Computing can provide technical infrastructure, but he posits that there could be more effective partners in the RDM game than the Library. That perhaps the Library is only there because no one else was stepping up when DMP mandates came down. Perhaps enough time has passed, and RDM and data services have evolved enough that the Library doesn’t have to fill that void any longer. Perhaps the Library is actually the *wrong* partner in the model. If we acknowledge that communities of practice drive change, and intentional RDM is a change for many of the researchers, then wouldn’t ceding this work to the communities of practice be the most effective way to stimulate long lasting change? The Library has planted some starter seeds within departments and now the departments could go forth and carry the practice forward, right?

Well, yes. That would be ideal for many aspects of RDM. I personally would very much like to see the intentional planning for, and management of, research data more seamlessly integrated into standard experimental methodology. But I don’t think that by accomplishing that, the Library should be removed as a research partner in the data services model. I say this for two reasons:

  1. The data/information landscape is still changing. In addition to the fact that more funders are requiring DMPs, more research can benefit from using openly available (and well described – please make it understandable) data. While researchers are experts in their domain, the Library is still the expert in the information game. At its simplest, data sources are another information source. The Library has always been there to help researchers find sources; this is another facet of that aid. More holistically, the Library is increasingly positioning itself to be an advocate for effective scholarly communication at all points of the scholarship life-cycle. This is a logical move as the products of scholarship take on more diverse and “nontraditional” forms.

Some may propose that librarians who have cultivated RDM expertise can still provide data seeking services, but perhaps they should not reside in the Library. Would it not be better to have them collocated with the researchers in the college or department? Truly embedded in the local environment? I think this is a very interesting model that I have heard some large institutions may want to explore more fully. But I think my second point is a reason to explore this option with some caution:

2. Preservation and access. Libraries are the experts in the preservation and access of materials. Central Computing is a critical institutional partner in terms of infrastructure and determining institutional needs for storage, porting, computing power, and bandwidth but – in my experience – are happy to let the long-term preservation and access service fall to another entity. Libraries (and archives) have been leading the development of digital preservation best practices for some time now, with keen attention to complex objects. While not all institutions can provide repository services for research data, the Library perspective and expertise is important to have at the table. Moreover, because the Library is a discipline-agnostic entity, librarians may be able to more easily imagine diverse interest in research data than the data producer. This can increase the potential vehicles for data sharing, depending on the discipline.

Yes, RDM and data services are reaching a place of maturity in academic institutions where many Libraries are evaluating, or re-evaluating, their role as a research partner. While many researchers and departments may be taking a more proactive or interested position with RDM, it is not appropriate for Libraries to be removed from the coordinated work that is required. Libraries should assert their expertise, while recognizing the expertise of other partners, in order to determine effective outreach strategies and resource needs. Above all, Libraries must set scope for this work. Do not be deterred by the increased interest from other campus entities to join in this work. Rather, embrace that interest and determine how we all can support and strengthen the partnerships that facilitate the innovative and exciting research and scholarship at an institution.

Islandora: Islandora CLAW Community Sprint 001 - Complete!

Mon, 2015-11-23 14:15

Back in September, the Islandora community completed its first volunteer sprint, a maintenance sprint on Islandora 7.x-1.x that cleaned up 38 tickets in advance of the 7.x-1.6 release (November 4th). For our second sprint (and likely for all sprints in the future), we moved over to Islandora's future and did our work on Islandora 7.x-2.x (also known as CLAW). With CLAW being very new to the vast majority of the community, we put the focus on knowledge sharing and exploring the new stack, with a lot of user documentation and discussion tickets for new folks to dig their teeth into. A whopping 17 sprinters signed up:

  • Nick Ruest
  • Jared Whiklo
  • Melissa Anez
  • Kim Pham
  • Diego Pino
  • Brad Spry
  • Caleb Derven
  • Lingling Jiang
  • Danny Lamb
  • Don Moses
  • Lydia Z
  • Luke Wesley
  • Kelli Babcock
  • Chul Yoon
  • Sunny Lee
  • Peter Murray
  • Nigel Banks

We got started on Monday, November 2nd with a live demo of CLAW provided by Nick Ruest and Danny Lamb, which has been recorded for anyone who'd like to take it in on their own time.

We then held a virtual meeting to discuss how we'd approach the sprint and hand out tickets. As with our previous sprint, we mixed collaboration with solo work, coordinating via the #islandora IRC channel on freenode and with discussion on GitHub issues and pull requests. We stayed out of JIRA this time, doing all of our tracking with issues right in the CLAW GitHub repo. In the end, we closed nine tickets, and extra kudos are due to Donald Moses from UPEI, who was the sprint MVP with his wonderful user documentation.

The existing CLAW committers also tackled some more technical issues, such as Solr provisioning in Vagrant and setting up the CLAW Vagrant to deploy easily on DigitalOcean or Amazon Web Services.

Nick Ruest put together a pretty awesome visualization of work on CLAW so far, where you can see the big burst of activity in November from our sprint.

You should also note that even in the early days of the project, activity on the code is usually followed up by activity on the documentation - that's a deliberate approach to make documenting CLAW an integral part of developing CLAW, so that when you are ready to adopt it, you'll find a rich library of technical, installation, and user documentation to support you.

With the success of the first two sprints, we are going to start going monthly. The next sprint will be December 7th - 18th and we are looking for volunteers to sign up. This sprint will have a more technical focus, concentrating on improvements in a single area of the stack: PHP Services. We're especially looking for some developers who'd like to contribute to help us reach our goals. That said, there is always a need for testers and reviewers, so don't be afraid to sign up even if you are not a developer.

PHP Services description: Have the majority of RESTful services pulled out of the CMS context and exposed so that Drupal hooks or the Event system can interact with them. We've already implemented two (images and collections) in Java, and we'd like to start by porting those over. These services will handle operations on PCDM objects and object types. There are lots of different ways to do this (Silex, Slim, Phalcon, Symfony, etc.), but the core idea is maintaining these as a separate layer.

What about Chullo?

Chullo will be the heart of the microservices. If everything is written properly, code reuse will allow individual services to be a thin layer that exposes the Chullo code in a particular context.

DuraSpace News: Updated VIVO Website Tells VIVO Story

Mon, 2015-11-23 00:00

Winchester, MA  The VIVO Project has launched a new website focused on telling the VIVO story, and simplifying access to all forms of information regarding VIVO.

Short videos tell the VIVO story—how VIVO is connecting data to provide an integrated view of the scholarly work of an organization, how VIVO uses open standards to share data, and how VIVO is used to discover patterns of collaboration and work within and between organizations.

Nicole Engard: Bookmarks for November 22, 2015

Sun, 2015-11-22 20:30

Today I found the following resources and bookmarked them on Delicious.

  • NumFOCUS Foundation: NumFOCUS promotes and supports the ongoing research and development of open-source computing tools through educational, community, and public channels.

Digest powered by RSS Digest

The post Bookmarks for November 22, 2015 appeared first on What I Learned Today....

Related posts:

  1. Non-Profit Organizations for FLOSS Projects
  2. 10 Secrets to Sustainable Open Source Communities
  3. What you can learn today

Ed Summers: Seminar 11

Sun, 2015-11-22 05:00

This week we focused on issues of privacy with Jessica Vitak after reading Palen & Dourish (2003), Vitak & Kim (2014) and Smith, Dinev, & Xu (2011). Of course privacy is a huge topic to tackle in a couple hours. But even this brief introduction was useful, and really made me start to think about how important theoretical frameworks are for the work I would like to do around appraisal in web archives.

Notions of privacy predate our networked world, but they are clearly bound up in, and being transformed by, the information technologies being deployed today. We spent a bit of time talking about Danah Boyd’s idea of context collapse, which social media technologies often engender or (perhaps) afford. Jessica used the wedding as a prototypical example of context collapse happening in a non-networked environment: extended family and close friends from both sides (bride and groom) are thrown into the same space to interact.

I’m not entirely clear on whether it’s possible to think of a technology affording privacy. Is privacy an affordance? I got a bit wrapped around the axle about answering this question because the notion of affordances has been so central to our seminar discussions this semester. I think that’s partly because of the strength of the Human Computer Interaction Lab here at UMD. Back in week 4 we defined affordances as a particular relationship between an object and a human (really any organism) that allows the human to perform some action. Privacy feels like it is more of a relation between humans and other humans, but perhaps that’s largely the result of it being an old concept that is being projected into a networked world of the web, social media and big data. Computational devices certainly have roles to play in our privacy, and if we look closer perhaps they always have.

Consider a door with a lock. Let’s say it was the door to your bedroom in a house you are renting with some friends. Imagine you want to get some peace and quiet to read a book. You can go into your room and close the door. The door affords some measure of privacy. But if you are getting changed and want to prevent someone from accidentally coming into your room, you can choose to lock the door. The lock affords another measure of privacy. This doesn’t seem too much of a stretch to me. When I asked in class whether privacy was an affordance I got the feeling that I was barking up the wrong tree. So I guess there’s more for me to unpack here.

One point I liked in the extensive literature review that Smith et al. (2011) provide was the distinction between normative and descriptive privacy research. Normative privacy research focuses on ethical commitments, or “ought” statements about the way things should be, whereas descriptive privacy research focuses on what is, and can be further broken down into purely descriptive or empirically descriptive research. I think the purely descriptive line of research interests me the most, because privacy itself seems like an extremely complex topic that isn’t amenable to the way things should be, or positivist thinking. The authors basically admit this themselves early in the paper:

General privacy as a philosophical, psychological, sociological, and legal concept has been researched for more than 100 years in almost all spheres of the social sciences. And yet, it is widely recognized that, as a concept, privacy “is in disarray [and n]obody can articulate what it means” (Solove (2006), p. 477).

Privacy has so many facets, and is so contingent on social and cultural dynamics, that I can’t help but wonder how useful it is to think about it in abstract terms. But privacy is such an important aspect of the work I’m sketching out around social media and Web archives that it is essential that I spend significant time following some of these lines of research backwards and forwards. In particular I want to follow up on the work of Irwin Altman and Sandra Petronio, who helped shape communication privacy management theory, as well as Helen Nissenbaum, who has done work bridging these ideas into online spaces (Nissenbaum, 2009; Brunton & Nissenbaum, 2015). I’ve also had MacNeil (1992) on my to-read list for a while since it specifically addresses privacy in the archive.

Maybe there’s an independent study in my future centered on privacy?


Brunton, F., & Nissenbaum, H. (2015). Obfuscation: A user’s guide for privacy and protest. MIT Press.

MacNeil, H. (1992). Without consent: The ethics of disclosing personal information in public archives. Scarecrow Press.

Nissenbaum, H. (2009). Privacy in context: Technology, policy, and the integrity of social life. Stanford University Press.

Palen, L., & Dourish, P. (2003). Unpacking privacy for a networked world. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 129–136). Association for Computing Machinery.

Smith, H. J., Dinev, T., & Xu, H. (2011). Information privacy research: An interdisciplinary review. MIS Quarterly, 35(4), 989–1016.

Solove, D. J. (2006). A taxonomy of privacy. University of Pennsylvania Law Review, 477–564.

Vitak, J., & Kim, J. (2014). You can’t block people offline: Examining how Facebook’s affordances shape the disclosure process. In Proceedings of the 17th ACM conference on computer supported cooperative work & social computing (pp. 461–474). Association for Computing Machinery.
