Open Knowledge Foundation: Daystar University student journos learn about tracking public money through Open Data

planet code4lib - Wed, 2017-03-15 15:00

This blog is part of the event report series on International Open Data Day 2017. On Saturday 4 March, groups from around the world organised over 300 events to celebrate, promote and spread the use of open data. 44 events received additional support through the Open Knowledge International mini-grants scheme, funded by SPARC, the Open Contracting Program of Hivos, Article 19, the Hewlett Foundation and the UK Foreign & Commonwealth Office. This event was supported through the mini-grants scheme under the Open Contracting and tracking public money flows theme.

Before March 4, 2017, I had grand plans for the Open Data Day event. The idea was to bring together journalists from our student paper Involvement, introduce them to the concept of open data, then have them look at data surrounding the use of public monies. From there, we’d see what stories could emerge. 

Then Stephen Abbott Pugh from Open Knowledge International linked me up with Catherine Gicheru, the lead for Code for Kenya, which is affiliated with the open data and civic technology organisation Code for Africa. The event took a wonderfully new turn.

Prestone Adie, data analyst at ICT Authority, started us off with an explanation of open data and gave us interesting links to sites such as the Kenya National Bureau of Statistics, the Kenya Open Data Portal, and Kenya’s Ethics and Anti-Corruption Commission. He went further by pointing us to some interesting blogs. One was by a data analyst who uses his knowledge and expertise to post about topics as varied as mall tickets and fashion vloggers; another crowdsources information about bad roads in Kenya.

It was a prime teachable moment, and I jumped in to emphasise how good writing is not restricted to journalism students. Data scientists and self-confessed nerds are in on the game too, and doing some pretty provocative storytelling in the process.

We took a refreshments break, where Catherine and Florence Sipalla surprised us with delicious branded muffins, giving all participants a sugar rush that sustained us through the second session.

Catherine and Florence, who works as a communication consultant and trainer, walked us through what Code for Kenya is doing: using massive amounts of data to tell stories that keep our public officials accountable. Among the tools they’ve developed is PesaCheck, which enables citizens to verify the numbers that our leaders provide.

We then planned to have our students meet to come up with story ideas using these tools. I’m looking forward to what they will produce.

Here is what one of them said about the event:

As a journalist, the life blood of information today is data. The more you have it in your story, the more credible and evidence-based your story will look. Such a conference will inspire young journalists to rethink how they write their stories. Data to me is inescapable. – Abubaker Abdullahi

Prestone Adie’s list of Open Data sources
  1. Health facilities datasets
  2. Water and sanitation datasets
  3. Humanitarian datasets
  4. Africa open datasets
  5. Kenya tender awards
  6. Data on parliament activities
  7. Laws made by the Kenyan parliament
  8. Agricultural datasets
  9. Commission on Revenue Allocation datasets
  10. Environmental data
  11. Historical photos

David Rosenthal: SHA1 is dead

planet code4lib - Wed, 2017-03-15 15:00
On February 23rd a team from CWI Amsterdam (where I worked in 1982) and Google Research published The first collision for full SHA-1, marking the "death of SHA-1". Using about 6500 CPU-years and 110 GPU-years, they created two different PDF files with the same SHA-1 hash. SHA-1 is widely used in digital preservation, among many other areas, despite having been deprecated by NIST through a process starting in 2005 and becoming official by 2012.

There is an accessible report on this paper by Dan Goodin at Ars Technica. These collisions have already caused trouble for systems in the field, for example for Webkit's Subversion repository. Subversion and other systems use SHA-1 to deduplicate content; files with the same SHA-1 are assumed to be identical. Below the fold, I look at the implications for digital preservation.
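The deduplication assumption that bit Subversion can be sketched as a tiny content-addressed store. This is an illustration of the general pattern, not Subversion's actual code; the class and method names are made up. It shows exactly where a colliding pair of files breaks the "same SHA-1 means same bytes" assumption:

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: files with equal SHA-1 are assumed identical."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha1(data).hexdigest()
        # If this digest is already present, the new content is silently
        # assumed to be the same bytes. A colliding file (same SHA-1,
        # different bytes) would be dropped here without any error.
        self.blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self.blobs[digest]

store = DedupStore()
key = store.put(b"original document")
# Storing shattered-2.pdf after shattered-1.pdf would return the same key,
# and get() would keep returning the first file's bytes.
```

With the shattered PDFs, `put()` on the second file would be a silent no-op, which is precisely the trouble Webkit's repository ran into.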

SHA-1 Collision (source)

Technically, what the team achieved is a collision via an identical-prefix attack. Two different files generating the same SHA-1 hash is a collision. In their identical-prefix attack, they carefully designed the start of a PDF file. This prefix contained space for a JPEG image. They created two files, each of which started with the prefix but was followed by different PDF text. For each file, they computed a JPEG image that, when inserted into the prefix, caused the two files to collide.

Conducted using Amazon's "spot pricing" of otherwise idle machines, the team's attack would cost about $110K. This attack is less powerful than a chosen-prefix attack, in which a second colliding file is created for an arbitrary first file. 2012's Flame malware used a chosen-prefix attack on MD5 to hijack the Windows update mechanism.

As we have been saying for more than a decade, the design of digital preservation systems must start from a threat model. Clearly, there are few digital preservation systems whose threat model currently includes external or internal evil-doers willing to spend $110K on an identical-prefix attack. Such an attack would, in any case, require persuading the preservation system to ingest a file of the attacker's devising with the appropriate prefix.

But the attack, and the inevitability of better, cheaper techniques leading to chosen-prefix attacks in the future, illustrate the risks of systems that use stored hashes to verify integrity. These systems are vulnerable because a chosen-prefix attack allows content to be changed without causing a hash mismatch.

The LOCKSS technology is an exception. LOCKSS boxes do not depend on stored hashes and are thus not vulnerable to identical-prefix or even chosen-prefix attacks on their content. The system does store hashes (currently SHA-1), but uses them only as hints to raise the priority of polls on content if the hashes don't match. The polling system is currently configured to use SHA-1, but each time it prepends different random nonces to the content that is hashed, mitigating these attacks.

If an attacker replaced a file in a LOCKSS box with a SHA-1 colliding file, the next poll would not be given higher priority because the hashes would match. But when the poll took place the random nonces would ensure that the SHA-1 computed would not be the collision hash. The damage would be detected and repaired. The hashes computed during each poll are not stored, they are of no value after the time for the poll expires. For details of the polling mechanism, see our 2003 SOSP paper.
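The nonce mechanism can be sketched roughly as follows. This is an illustration of the idea, not the actual LOCKSS polling protocol, and the names are hypothetical. The point is that an attacker who precomputed a SHA-1 collision on the raw content cannot predict the digest that will actually be compared, because a fresh random nonce is prepended before hashing:

```python
import hashlib
import os

def poll_hash(content: bytes, nonce: bytes) -> str:
    # Each poll prepends a fresh random nonce before hashing, so a
    # collision precomputed on the bare content is useless: the attacker
    # would have needed to know the nonce before creating the collision.
    return hashlib.sha1(nonce + content).hexdigest()

nonce = os.urandom(32)              # fresh per poll, never stored
good = b"preserved content"
bad = b"attacker's replacement"

# Honest replicas hashing the same content with the same nonce agree...
assert poll_hash(good, nonce) == poll_hash(good, nonce)
# ...while tampered content is detected, even if the bare SHA-1 digests
# of good and bad happened to collide.
assert poll_hash(good, nonce) != poll_hash(bad, nonce)
```

Because the nonce changes every poll, the hashes have no value after the poll expires, which is why they are not stored.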

There are a number of problems with stored hashes as an integrity preservation technique. In Rick Whitt on Digital Preservation I wrote:
The long timescale of digital preservation poses another problem for digital signatures; they fade over time. Like other integrity check mechanisms, the signature attests not to the digital object, but to the hash of the digital object. The goal of hash algorithm design is to make it extremely difficult with foreseeable technology to create a different digital object with the same hash, a collision. They cannot be designed to make this impossible, merely very difficult. So, as technology advances with time, it becomes easier and easier for an attacker to substitute a different object without invalidating the signature. Because over time hash algorithms become vulnerable and obsolete, preservation systems depending for integrity on preserving digital signatures, or even just hashes, must routinely re-sign, or re-hash, with a more up-to-date algorithm.

When should preservation systems re-hash their content? The obvious answer is "before anyone can create collisions", which raises the question of how the preservation system can know the capabilities of the attackers identified by the system's threat model. Archives whose threat model leaves out nation-state adversaries are probably safe if they re-hash and re-sign as soon as the open literature shows progress toward a partial break, as it did for SHA-1 in 2005.

The use of stored hashes for integrity checking has another, related problem. There are two possible results from re-computing the hash of the content and comparing it with the stored hash:
  • The two hashes match, in which case either:
    • The hash and the content are unchanged, or
    • An attacker has changed both the content and the hash, or
    • An attacker has replaced the content with a collision, leaving the hash unchanged.
  • The two hashes differ, in which case:
    • The content has changed and the hash has not, or
    • The hash has changed and the content has not, or
    • Both content and hash have changed.
The stored hashes are made of exactly the same kind of bits as the content whose integrity they are to protect. The hash bits are subject to all the same threats as the content bits. In effect, the use of stored hashes has reduced the problem of detecting change in a string of bits to the previously unsolved problem of detecting change in a (shorter) string of bits.

Traditionally, this problem has been attacked by the use of Merkle trees, trees in which each parent node contains the hash of its child nodes. Notice that this technique does not remove the need to detect change in a string of bits, but by hashing hashes it can reduce the size of the bit string.
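A Merkle root computation can be sketched in a few lines (this sketch uses SHA-256 for illustration, and the function names are hypothetical). Each level of the tree hashes pairs of child hashes until a single root remains, so only that short root needs separate protection:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Reduce a list of content blocks to a single root hash."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level.append(level[-1])
        # Each parent hashes the concatenation of its two children.
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

blocks = [b"block-0", b"block-1", b"block-2"]
root = merkle_root(blocks)
# Any change to any block changes the root.
assert merkle_root([b"block-0", b"block-1", b"TAMPERED"]) != root
```

Note that the root is still just a string of bits whose integrity must be protected, which is the residual problem the text describes.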

The possibility that the hash algorithm used for the Merkle tree could become vulnerable to collisions is again problematic. If a bogus sub-tree could be synthesized that had the same hash at its root as a victim sub-tree, the entire content of the sub-tree could be replaced undetectably.

One piece of advice for preservation systems using stored hashes is, when re-hashing with algorithm B because previous algorithm A has been deprecated, to both keep the A hashes and verify them during succeeding integrity checks. It is much more difficult for an attacker to create files that collide for two different algorithms. For example, using the team's colliding PDF files:

$ sha1sum *.pdf
38762cf7f55934b34d179ae6a4c80cadccbb7f0a shattered-1.pdf
38762cf7f55934b34d179ae6a4c80cadccbb7f0a shattered-2.pdf
$ md5sum *.pdf
ee4aa52b139d925f8d8884402b0a750c shattered-1.pdf
5bd9d8cabc46041579a311230539b8d1 shattered-2.pdf

Clearly, keeping the A hashes is pointless unless they are also verified. The attacker will have ensured that the B hashes match, but will probably not have expended the vastly greater effort to ensure that the A hashes also match.
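That advice can be sketched as a dual-fingerprint check. Here SHA-1 stands in for the deprecated algorithm A and SHA-256 for its replacement B; the function names are hypothetical. Verification checks every stored digest, so an attacker who matched one algorithm would almost certainly fail the other:

```python
import hashlib

def fingerprint(data: bytes) -> dict:
    """Record digests under both the old (A) and new (B) algorithms."""
    return {
        "sha1": hashlib.sha1(data).hexdigest(),      # deprecated algorithm A
        "sha256": hashlib.sha256(data).hexdigest(),  # replacement algorithm B
    }

def verify(data: bytes, stored: dict) -> bool:
    # Check every stored digest, not just the newest one: crafting a file
    # that collides under two independent algorithms at once is vastly
    # harder than colliding under either alone.
    current = fingerprint(data)
    return all(current[alg] == digest for alg, digest in stored.items())

stored = fingerprint(b"archived object")
assert verify(b"archived object", stored)
assert not verify(b"archived object, tampered", stored)
```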

Stored hashes from an obsolete algorithm are in practice adequate to detect random bit-rot, but as we see they can remain useful even against evil-doers. Archives whose threat model does not include a significant level of evil-doing are unlikely to survive long in today's Internet.

D-Lib: Open Access to Scientific Information in Emerging Countries

planet code4lib - Wed, 2017-03-15 12:13
Article by Joachim Schöpfel, University of Lille, GERiiCO Laboratory

D-Lib: Broken-World Vocabularies

planet code4lib - Wed, 2017-03-15 12:13
Article by Daniel Lovins, New York University (NYU) Division of Libraries; Diane Hillmann, Metadata Management Associates LLC

D-Lib: The Landscape of Research Data Repositories in 2015: A re3data Analysis

planet code4lib - Wed, 2017-03-15 12:13
Article by Maxi Kindling, Stephanie van de Sandt, Jessika Rücknagel and Peter Schirmbacher, Humboldt-Universität zu Berlin, Berlin School of Library and Information Science (BSLIS), Germany; Heinz Pampel, Paul Vierkant and Roland Bertelmann, GFZ German Research Centre for Geosciences, Section 7.4 Library and Information Services (LIS), Germany; Gabriele Kloska and Frank Scholze, Karlsruhe Institute of Technology (KIT), KIT Library, Germany; Michael Witt, Purdue University Libraries, West Lafayette, Indiana, USA

D-Lib: Workshop Report: CAQDAS Projects and Digital Repositories' Best Practices

planet code4lib - Wed, 2017-03-15 12:13
Workshop Report by Sebastian Karcher and Christiane Page, Syracuse University

D-Lib: ReplicationWiki: Improving Transparency in Social Sciences Research

planet code4lib - Wed, 2017-03-15 12:13
Article by Jan H. Hoeffler, University of Göttingen, Germany

D-Lib: Research Data Challenges

planet code4lib - Wed, 2017-03-15 12:13
Editorial by Laurence Lannom, CNRI

Open Knowledge Foundation: OK Belgium welcomes Dries van Ransbeeck as new project coordinator… and other updates from quarter 4

planet code4lib - Wed, 2017-03-15 11:00

This blog post is part of our on-going Network series featuring updates from chapters across the Open Knowledge Network and was written by the Open Knowledge Belgium team.

A lot has happened over the past few months at Open Knowledge Belgium. First, we welcomed Dries Van Ransbeeck as the new project coordinator. His previous experience ranges from data modelling and civic engagement to crowdsourcing. He also has a keen interest in open innovation and in the power of many intrinsically motivated individuals contributing to projects with social and societal impact, serving the interests of the many rather than the happy few.

Dries’ mission is to bring Open Knowledge and Open Data to a level playing field, where people with all sorts of backgrounds, technical and non-technical, can use, reuse and create knowledge in a sustainable way. Read more about Dries here. We also moved to a new office space in Brussels and welcomed three new interns: Chris, Umut and myself [Sarah], who will be helping the team for the next few months. Below are the latest updates on our activities:

Open Belgium Conference

Our annual conference was held on March 6 in Brussels under the theme ‘Open Cities and Smart Data’. There were talks and discussions about how open data can contribute to smart urban development, the rise of smart applications, and the shift from raising the quantity of data to raising its quality. About 300 stakeholders from industry, research, government and citizen groups were expected at the conference to discuss various open efforts in Belgium. More will be shared later.

Open Education

An interesting kickoff meeting about Open Education was held in February to discuss the possibilities of opening up educational data in Belgium. This bottom-up action is needed to make things possible and to keep the discussion alive. Business owners, data providers, data users and problem owners sat together and discussed the possibilities of open educational resources (OER) and open educational practices.

While students and staff want information that is up-to-date and easy to find, most colleges are unwilling to open up their data. This is problematic, because opening up educational data to build applications would make things a lot easier for students as well as for the colleges themselves. Another issue identified is that every educational institution works with a different database.

It was therefore interesting to discuss an open data standard for colleges and universities. We can address these problems by giving educational institutions more concrete information about what data they should open up and what the consequences are. This working group could thus contribute to the discussion on the possibilities of Open Education, create extra pressure on colleges and universities to open up their data, and provide them with more information.

OpenStreetMap Belgium

The biggest achievement for OSM Belgium in 2016 was co-organising SOTM (State of the Map), the yearly international conference on OpenStreetMap, in Brussels. Our OpenStreetMap community is growing, and thanks to the help of many enthusiastic volunteers, SOTM was a great success.

For 2017, OSM Belgium plans to formalise its organisation by setting up a membership and some very basic governance rules. By doing this, they want to provide some structure and support for future projects, while OSM will remain the open community it has always been. The main goal is to communicate better about ongoing projects so as to attract sponsorships for the new year. They’re also collaborating more closely with other organisations that share the same goals.

Open Badges Belgium

We have also recently started a new working group that wants to help spread the use of Open Badges in Belgium. An Open Badge is a digital credential with which you can showcase the talents you have acquired and share them with the labour market. Badges are visual tokens of achievement, or of another trust relationship, issued by training centres, companies and selection agencies, and they are shareable across the web.

Open Badges are more detailed than a CV as they can be used in all kinds of combinations, creating a constantly evolving picture of a person’s lifelong learning. To learn more about how Open Badges work, watch our introductory video here.


Oasis

Oasis is the acronym for ‘Open Applications for Semantically Interoperable Services’. It is a cooperation between the city of Ghent and the region of Madrid to increase the accessibility of public services and public transport. Both cities work together and publish linked open data to prove that new technologies can lead to economies of scale, such as the creation of cross-country applications.

To read up on the Open Knowledge Belgium team and our activities, please visit our website or follow us on Twitter, Facebook and LinkedIn.


Ed Summers: IS

planet code4lib - Wed, 2017-03-15 04:00

This week’s readings were focused on Interaction Sociolinguistics (IS) which is a field of discourse analysis that is situated at the intersection of anthropology, sociology and linguistics. At a high level Gordon (2011) defines IS this way:

IS offers theories and methods that enable researchers to explore not only how language works but also to gain insights into the social processes through which individuals build and maintain relationships, exercise power, project and negotiate identities, and create communities.

IS typically uses recordings obtained through ethnography to identify signaling mechanisms or contextual cues. Breakdowns in communication happen when participants don’t share contextualization conventions, which can contribute to social problems such as stereotyping and differential access. Looking at the role of discourse in creating and reinforcing social problems was a specific theme in Gumperz’s work. It seems much in keeping with the goals of CDA, to interrogate power relationships, but perhaps without such an explicit theoretical framework, like what Marx or Foucault provide.

If this sounds similar to previous weeks’ focus on Ethnography of Communication and Critical Discourse Analysis that’s because, well, they are pretty similar. But there are some notable differences. The first is something Gordon (2011) highlights: things start with John Gumperz. Gumperz was trained as a linguist, but his research and work brought him into close contact with some of the key figures in sociology and anthropology at the time.

Gumperz’s work grew out of developing “replicable methods of qualitative analysis that account for our ability to interpret what participants intend to convey in everyday communicative practices”. He drew on and synthesized several areas of theoretical and methodological work:

  • structural linguistics: ideas of communicative competence, when (and when not) to speak, what to talk about, who to talk to, what manner to talk in, subconscious/automatic speech, and regional linguistic diversity.
  • ethnography of communication: the use of participant observation and interviewing, and data collection as “thick description” (from Geertz).
  • ethnomethodology: the nature of social interaction and the background knowledge needed to participate; Garfinkeling experiments where the researcher breaks social norms in order to discover unknown social norms.
  • conversation analysis: interaction order, frames, face saving, how conversation represents and creates social organization, and a focus on “micro features of interaction” while also allowing for cultural context (from Goffman).

Gumperz developed the idea of contextualization cues and how indexical signs offer a way of discovering how discourse is framed. Bateson calls these metamessages, or messages about how to interpret messages.

He also established the concept of conversational inference, which is how people assess what other people say in order to create meaning. It is an idea that bears a lot of resemblance to Grice’s idea, from CA, of the cooperative principle, and how implicatures are sent by following or breaking maxims.

The idea of indirectness, linguistic politeness and face saving from Robin Lakoff also factors into IS, as do the choices speakers make (rate, pitch, amplitude) that affect an utterance’s interpretation. Tannen’s IS work also demonstrated how cultural background, race, ethnicity, gender and sexual orientation were factors in conversational style. IS admits generative grammar (Chomsky) as a theory, but does not limit the study of language to just grammar, and allows for issues of context and culture to have a role. I basically think of IS as saying “yes, and” to CA: it recognizes all the patterns and structures that CA identifies, but doesn’t limit work to just the text, and provides a framework where context is relevant and important to understanding.

The other article we read by Gordon took a deep dive into an empirical study that uses IS methods (Gordon & Luke, 2012). The authors examine email correspondence between school counselors in training and their supervising professor. They make the point that not much work has been done on supervision in the medium of email. Specifically, they examine how identity development can be seen as ongoing face negotiation. Their research draws on work around face negotiation and politeness theory from Goffman (1967), as well as Arundale (2006)’s idea of face as a co-constructed understanding of self in relation to others.

Specifically their work is centered on the notion that face, politeness and identity are interconnected:

  • face is the social value a person can claim by the way people receive that person’s verbal and nonverbal acts.
  • they are connected through Lave and Wenger (1991)’s idea of community of practice.
  • “The notion of face is crucial to understanding how novices develop professional identities within a community of practice.” (p. 114)
  • politeness strategies are employed in online communications (email)

They collected the email of 8 (6F, 2M) participants, who sent at least one email per week over 14 weeks to their supervisor. These participants were learning how to be school counselors. The authors used data-driven discourse analysis with IS and computer-mediated discourse analysis (Herring, 2004). Among their findings, they discovered that:

  • constructed dialogue (or reported speech) and metadiscourse are used to raise face, which they argue is part of identity construction
  • first-person plural pronouns (we, us and our) are used to create shared alignment, as well as to give advice while saving face
  • discourse markers (Schiffrin, 1988) are used to structure ideas, meanings and interaction; for example, “that being said” is used by the supervisors to offer criticism while also saving face
  • repetition is used to construct conversational worlds and establish communities of practice; it is possible that it’s used more in email because previous word usage can easily be recalled and copied, as compared with spoken words

Tannen (2007) seems to be cited a fair bit in this paper as a source for these patterns. Perhaps it could be a good source of types of patterns to look for in my own data? I particularly like the angle on community of practice, which is something I’m looking to explore in my own research into web archiving.

Ironically it is another book by Tannen (2005) that is included as the next set of readings–specifically two chapters that analyze a set of conversations that happen over a Thanksgiving dinner. The rich description of this conversation (in which the author is a participant) offers many methodological jumping off points for IS work. Tannen does a pretty masterful job of weaving the description of the conversation, often with detailed transcription, with reflections from her and the other participants, and observations from discourse analysis. It is clear that she recorded the conversations and then reviewed them with participants afterwards. Here’s a list of some of the methodological tools she used when analyzing the conversation; there were a lot!

  • conversations as a continuous stream that are punctuated into separate events that respond to each other (Bateson, 1972)

  • machine gun questions: rapid-fire questions which are meant to show enthusiasm and interest, but can lead to defensiveness.

  • latching: when responses follow directly on from each other (Schenkein, 1978)

  • dueting: jointly holding one side of the conversation (Falk, 1979)

  • buffering talk: for example “and all that” which can be used to save face when positioning.

  • back channel responses, which serve meta-conversational purposes when the mode of communication is mostly one way (Yngve, 1970)

  • deviations from answer/question as adjacency pair (Sacks, Schegloff, & Jefferson, 1974)

  • sharing in conversational form and pacing is participation, even when specific knowledge is not being exhibited

  • formulaic or repeated phrases

  • shared revelations: to create personal connections

  • reduced syntactic form: “You survive?” instead of “Do you survive?”

  • intonational contours (interesting mode of illustration)

  • metaphorical code switching (Blom & Gumperz, 1972)


Generally speaking I enjoyed the readings this week, in particular the piece by Tannen, which does a really nice job of exhibiting the various discourse features and explaining their significance in understanding the Thanksgiving dinner conversation. The ultimate realization of cultural differences, which explained why some of the conversations played out the way they did and why they were remembered by participants in particular ways, seemed to be what made this an IS study. The fact that this contextual information (nationality, ethnicity) has a place when understanding the language seems like an important distinction for IS work. It also speaks to making the analysis relevant–one isn’t merely identifying patterns in the discourse but also casting those patterns in a light where greater insight is achieved. This seems like an important dimension for research to have. Even the Gordon & Luke (2012) piece seemed to draw some pretty sound inferences and conclusions about the research. This speaks to the pragmatist in me.

I also liked the mode of data collection and analysis since it seemed to strike a good balance between the detail and rigor of Conversational Analysis and the openness to context and thick description offered by Ethnography of Communication. I will admit to feeling a bit overwhelmed with the number of discourse features that were covered, and worry that I wouldn’t be able to draw on them as successfully as the authors. But I guess this must come with practice.

With regards to my own research, the discussion of IS got me wondering if it might be fruitful to examine my interviews for segments where participants talked about how they understood web archiving processes like crawling, or external systems like a CMS. Specifically I’d like to see where their own experience and knowledge was shaped or formed by a community of practice. Thinking of GWU’s understanding of the data center, or Stanford’s idea about how a particular CMS worked, or NCSU’s understanding of DNS. I’m still really interested in this idea of a community of practice, and it seems like using discourse as a window into this realm might be something other folks have done before. What are the methods that could best yield insights into this in my data?


Arundale, R. B. (2006). Face as relational and interactional: A communication framework for research on face, facework, and politeness. Journal of Politeness Research. Language, Behaviour, Culture, 2(2), 193–216.

Bateson, G. (1972). Steps to an ecology of mind: Collected essays in anthropology, psychiatry, evolution, and epistemology. University of Chicago Press.

Blom, J.-P., & Gumperz, J. J. (1972). Social meaning in linguistic structure: Code-switching in Norway. In J. J. Gumperz & D. Hymes (Eds.), Directions in sociolinguistics. Holt, Rinehart & Winston.

Falk, J. L. (1979). The duet as a conversational process (PhD thesis). Princeton University.

Goffman, E. (1967). Interaction ritual: Essays in face to face behavior. Doubleday.

Gordon, C. (2011). Gumperz and interactional sociolinguistics. In R. Wodak, B. Johnstone, & P. E. Kerswill (Eds.), The SAGE handbook of sociolinguistics. Sage Publications.

Gordon, C., & Luke, M. (2012). Discursive negotiation of face via email: Professional identity development in school counseling supervision. Linguistics and Education, 23(1), 112–122.

Herring, S. C. (2004). Online communication: Through the lens of discourse. Internet Research Annual, 1, 65–76.

Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 696–735.

Schenkein, J. (Ed.). (1978). Studies in the organization of conversational interaction. Academic Press.

Schiffrin, D. (1988). Discourse markers. Cambridge University Press.

Tannen, D. (2005). Conversational style: Analyzing talk among friends (2nd ed.). Oxford University Press.

Tannen, D. (2007). Talking voices: Repetition, dialogue, and imagery in conversational discourse (Vol. 26). Cambridge University Press.

District Dispatch: NLLD 2017 keynote announced

planet code4lib - Tue, 2017-03-14 18:10

We are happy to announce the keynote speaker for National Library Legislative Day 2017! Hina Shamsi, director of the ACLU National Security Project, will be joining us in Washington, D.C. on May 1, 2017. The National Security Project is dedicated to ensuring that U.S. national security policies and practices are consistent with the Constitution, civil liberties, and human rights. According to the ACLU website, Shamsi has:

litigated cases upholding the freedoms of speech and association, and challenging targeted killing, torture, unlawful detention, and post-9/11 discrimination against racial and religious minorities.

Her work includes a focus on the intersection of national security and counter-terrorism policies with international human rights and humanitarian law. She previously worked as a staff attorney in the ACLU National Security Project and was the acting director of Human Rights First’s Law & Security Program. She also served as senior advisor to the U.N. Special Rapporteur on Extrajudicial Executions. You can find her on Twitter @HinaShamsi and on the ACLU blog.

In addition to a review of current legislation and issue briefs, provided by ALA Washington staff, we will also be joined by the team from The Campaign Workshop for an hour of advocacy training. Christian Zabriskie, Executive Director of Urban Librarians Unite, and Jennifer Manley, Managing Director of Prime Strategies NYC, will also be leading a 30-minute breakout session called “The Political Dance.” Other afternoon sessions will be announced as the schedule is finalized.

If you are interested in library advocacy and are unfamiliar with National Library Legislative Day, you can find out more by visiting our website or reading previous articles about the event. Online registration will remain open until April, and registrations are accepted at the door.

Photo Credit: Adam Mason

For folks looking for funding, check out the WHCLIST award, which is still accepting submissions until April 2nd. You can also check out, which offers librarians the chance to fund-raise for Professional Development events, collection development projects, and other library needs.

Unable to join us in D.C. in May? Register to participate virtually – in May, we’ll send you a list of talking points, background information, and other resources, so that you can call, email, or Tweet at your Members of Congress about legislative issues that are important to you and your patrons. We’ll also send you a link to our webcast, so you can watch our keynote speaker and the issue briefs live!

The post NLLD 2017 keynote announced appeared first on District Dispatch.

Jonathan Rochkind: “Polish”; And, What makes well-designed software?

planet code4lib - Tue, 2017-03-14 16:34

Go check out Schneem’s post on “polish”. (No, not the country).

Polish is what distinguishes good software from great software. When you use an app or code that clearly cares about the edge cases and how all the pieces work together, it feels right. Unfortunately, this is the part of the software that most often gets overlooked, in favor of more features or more time on another project…

…When we say something is “polished” it means that it is free from sharp edges, even the small ones. I view polished software to be ones that are mostly free from frustration. They do what you expect them to and are consistent.…

…In many ways I want my software to be boring. I want it to harbor few surprises. I want to feel like I understand and connect with it at a deep level and that I’m not constantly being caught off guard by frustrating, time stealing, papercuts.

I definitely have experienced the difference between working with and on a project that has this kind of ‘polish’ and, truly, experiencing a deep-level connection with the code that lets me be crazy effective with it — and working on or with projects that don’t have this.  And on projects that started out with it, but lost it! (An irony is that it takes a lot of time, effort, skill, and experience to design an architecture that seems like the only way it would make sense to do it, obvious, and, as schneems says, “boring”!)

I was going to say “We all have experienced the difference…”, but I don’t know if that’s true. Have you?

What do you think one can do to work towards a project with this kind of “polish”, and keep it there?  I’m not entirely sure, although I have some ideas, and so does schneems. Tolerating edge-case bugs is a contraindication — and even though I don’t really believe in ‘broken windows theory’ when it comes to neighborhoods, I think it does have an application here. Once the maintainers start tolerating edge case bugs and sloppiness, it sends a message to other contributors, a message of lack of care and pride. You don’t put in the time to make a change right unless the project seems to expect, deserve, and accommodate it.

If you don’t even have well-defined enough behavior/architecture to have any idea what behavior is desirable or undesirable, what’s a bug– you’ve clearly gone down a wrong path incompatible with this kind of ‘polish’, and I’m not sure if it can be recovered from. A Fred Brooks “Mythical Man Month” quote I think is crucial to this idea of ‘polish’: “Conceptual integrity is central to product quality.”  (He goes on to say that having an “architect” is the best way to get conceptual integrity; I’m not certain, I’d like to believe this isn’t true because so many formal ‘architect’ roles are so ineffective, but I think my experience may indeed be that a single or tight team of architects, formal or informal, does correlate…).

There’s another Fred Brooks quote that now I can’t find and I really wish I could cause I’ve wanted to return to it and meditate on it for a while, but it’s about how the power of a system is measured by what you can do with it divided by the number of distinct architectural concepts. A powerful system is one that can do a lot with few architectural concepts.  (If anyone can find this original quote, I’ll buy you a beer or a case of it).

I also know you can’t do this until you understand the ‘business domain’ you are working in — programmers as interchangeable cross-industry widgets is a lie. (‘business domain’ doesn’t necessarily mean ‘business’ in the sense of making money, it means understanding the use-cases and needs of your developer users, as they try to meet the use cases and needs of their end-users, which you need to understand too).

While I firmly believe in general in the caution against throwing out a system and starting over, a lot of this caution is about losing the domain knowledge encoded in the system (really, go read Joel’s post). But if the system was originally architected by people (perhaps past you!) who (in retrospect) didn’t have very good domain knowledge (or the domain has changed drastically?), and you now have a team (and an “architect”?) that does, and your existing software is consensually recognized as having the opposite of the quality of ‘polish’, and is becoming increasingly expensive to work with (“technical debt”) with no clear way out — that sounds like a time to seriously consider it. (Although you will have to be willing to accept it’ll take a while to get feature parity, if those were even the right features.)  (Interestingly, Fred Brooks was I think the originator of the ‘build one to throw away’ idea that Joel is arguing against. I think both have their place, and the place of domain knowledge is a crucial concept in both.)

All of these are vague, hand-wavy ideas rather than easy-to-follow directions; I don’t have any easy-to-follow directions, or know if any exist.

But I know that the first step is being able to recognize “polish”, a well-designed, parsimoniously architected system that feels right to work with and lets you effortlessly accomplish things with it.  Which means having experience with such systems. If you’ve only worked with ball-of-twine, difficult-to-work-with systems, you don’t even know what you’re missing or what is possible or what it looks like. You’ve got to find a way to get exposure to good design to become a good designer, and this is something we don’t know how to do as well with computer architecture as with classic design (design school consists of exposure to design, right?).

And the next step is desiring and committing to building such a system.

Which also can mean pushing back on or educating managers and decision-makers.  The technical challenge is already high, but the social/organizational challenge can be even higher.

Because it is harder to build such a system than to not, designing and implementing good software is not easy; it takes care, skill, and experience.  Not every project deserves or can have the same level of ‘polish’. But if you’re building a project meant to meet complex needs, to be used by a distributed (geographically and/or organizationally) community, and to hold up for years, this is what it takes. (Whether that’s a polished end-user UX, or developer-user UX, which means API, or both, depending on the nature of this distributed community.)

Filed under: General

Islandora: Report from a release perspective: Islandora 7.x-1.9RC1 VM is available

planet code4lib - Tue, 2017-03-14 15:57

Good day dearest Islandora Folks,

Our Islandora 7.x-1.9 Release Candidate One virtual machine and the updated islandora_vagrant GitHub branch are available for immediate download and testing.

Summoning all Testers, Documenters, Maintainers, Committers and their friends: give this humble virtualised (or, American, virtualized) box of dreams a test. Before you do, give the guide on how to verify, document, or test an Islandora release another look.

What to expect: a working, clean, vanilla Islandora 7.x-1.9 VM. What is new: not much, really.

How to use: passwords, URLs, and other questions are answered here.

VirtualBox: download the OVA file (3.4 GB; md5 hash of that OVA file: 6kpTJwCyWNXLG17ExmuUrw==) and import it as usual into your VirtualBox app.

Vagrant + VirtualBox as provider: open your favourite terminal and execute (won't go into too much detail here):

 git clone -b 7.x-1.9
 cd islandora_vagrant
 vagrant up

  (coffee, black, no sugar, and look out for some warnings... will open some Jira tickets today =)
  (when finished)

 vagrant ssh

Enjoy 7.x-1.9 on fire ;) Remember: if it's not on fire, it's not an RC1.

Please don't hesitate to reach out to me or Melissa Anez if you have questions (email, IRC or Skype). This week is our release call (Thursday 16, 3PM ADT / 2PM NYC time); please join us if you signed up for a release role or have ideas, questions or concerns about what is happening here.

We are here to help and to facilitate this being another successful and trusty Islandora release.

Thanks again for making Islandora happen.

Library of Congress: The Signal: A Library of Congress Lab: More Use and More Users of Digital Collections

planet code4lib - Tue, 2017-03-14 14:15

Mass digitization — coupled with new media, technology and distribution networks — has transformed what’s possible for libraries and their users. The Library of Congress makes millions of items freely available on its own website and on other public sites like HathiTrust and DPLA. Incredible resources — like digitized historic newspapers from across the United States, the personal papers of Rosa Parks and Sigmund Freud, and archived web sites of U.S. election candidates — can be accessed anytime and anywhere by researchers, Congress and the general public.

The National Digital Initiatives division of the Library of Congress seeks to facilitate even more use of the Library’s digital collections. Emerging disciplines — like data science, data journalism and digital humanities that take advantage of new computing tools and infrastructure — provide a model for creating new levels of access to library collections. Visualizing historical events and relationships on maps, with network diagrams and analysis of thousands of texts for the occurrence of words and phrases are a few examples of what’s possible. NDI is actively exploring how to support these and other kinds of interactions with the Library’s vast digital holdings.
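At toy scale, the kind of word-and-phrase analysis mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Library’s tooling; the corpus, document names, and search terms below are all invented:

```python
import re
from collections import Counter

# A miniature invented corpus standing in for thousands of digitized
# newspaper pages.
texts = {
    "paper-1861-04-15": "War news from the capital. The capital is quiet.",
    "paper-1861-04-16": "News of war spreads; the capital prepares.",
    "paper-1865-04-10": "Peace at last. The war is over.",
}

def occurrences(term, corpus):
    """Count case-insensitive whole-word occurrences of `term` per document."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return Counter({doc: len(pattern.findall(body))
                    for doc, body in corpus.items()})

print(occurrences("war", texts))
print(occurrences("capital", texts))
```

The same counting logic, run over a real digitized collection and plotted against publication dates, is the basis of the kind of trend visualization the paragraph describes.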

A visualization of links between web sites extracted from an October 2015 Library of Congress crawl of news site feeds. This diagram was created as part of the demonstration pilot for the Library of Congress Lab report.

Michelle Gallinger and Daniel Chudnov were asked by NDI to study how libraries and other research centers have developed services that use computational analysis, design and engagement to enable new kinds of discovery and outreach. Their report, Library of Congress Lab (PDF), was just released. For the report, they interviewed researchers and managers of digital scholarship labs and worked with Library staff on a pilot project that demonstrated how the collections could be used in data analysis. This work resulted in concrete recommendations to the Library on how to approach setting up a Lab at the Library of Congress. These recommendations could also be helpful to other organizations who may be thinking of establishing their own centers for digital scholarship and engagement.

Michelle, Dan, thanks for the report, and thank you for talking with me more about it. How do you think digital labs are addressing a need or a gap in how digital collections are served by libraries and archives?

Michelle Gallinger

Michelle: The value proposition for digital collections has always been their usefulness to researchers, scholars, scientists, artists, as well as others. However, use was limited in the past because substantial computational analysis was something that an individual needed a great deal of specialized knowledge to pursue. That’s changing now. Tools have become more ubiquitous and labs have been established to support users in their analysis of digital collections. Where labs are supporting the work of users to delve deeply into the digital collections, we’re seeing computational analysis being used as another tool in areas of scholarship that haven’t benefited from it in the past. We are seeing that the support labs provide helps address the pent-up demand in a wide variety of fields to use digital content in meaningful ways.  And as this computational work is published, it’s creating new demand for additional support.

Dan: We were particularly impressed by the breadth of answers to this question shared by the colleagues we interviewed who lead and support digital scholarship services in Europe, Canada and the U.S. They have each molded their skills and services to fit these new and unique combinations of service demands coming from their own communities.  In university settings, labs fill a growing role supporting teaching and learning with workshops and consultations for younger students, graduate students, and early-career researchers alike.  In labs connected with large collections, they are enabling advanced researchers to perform large-scale computational techniques and finding ways — based on the services they are providing to scholars — to rethink and revise institutional workflows to enable more innovative uses of collections.  Each of these success stories represents a need- or a services-gap filled and presents an opportunity to consider doing more at our respective institutions.

Why do you think this is a good time for the Library of Congress to consider establishing a Lab?

Michelle: It’s a great time to be engaged in addressing the needs of scholars to work with digital collections. As I mentioned before, there really is a demand from users for support in performing digital scholarship. The Library of Congress receives regular requests for this support and it’s my opinion the number of those requests will continue to grow. Concepts of “big data” and data analytics have permeated society. Everyone knows about it, everyone wants to be working with digital scholarship techniques and tools. A Lab is an opportunity for the Library of Congress to start addressing these requests for support with routine workflows, regular access permissions, consistent legal counsel and predictable guidelines. This support not only helps further the transformative influence of digital scholarship, it also makes the Library of Congress more efficient and able to respond and serve the needs of its 21st century scholars.

Dan Chudnov

Dan: As Michelle highlights, better tools and increased demand to work with much greater volumes of materials have changed the equation.  The pilot project we performed, working with Library of Congress Web Archive collections not directly available to the public, demonstrated this well.  We used a third-party cloud services platform to securely transfer and process several terabytes of data from the Library to the cloud.  Using tools included in the cloud services platform for cluster computing, we defined access controls for this data where it was stored, then automated file format transformations, extracted focused derivative data, and ran parallel algorithms on a cluster with two dozen virtual machines performing network analysis on a quarter of a billion web links.  Once the extracted data was ready, it took less than five minutes to run a half-dozen of these queries over the entire dataset, and after just a few minutes more to verify the results, we shut the cluster down, having spent no more than a few dollars to rent that computing power for under an hour.  Back in the early 2000s, I worked in a medical informatics research center and helped to support cluster computing there with expensive, custom-designed racks full of fickle servers that gobbled up power and taxed our building cooling systems beyond reason.  Today, any ambitious high school student or not-yet-funded junior researcher can perform that same scale of computation and more, much more easily, all for the price of a cup of coffee.  To do this, they need the kinds of support Michelle describes: tool training, a solid legal framework with reasonable guidelines and routine workflows for enabling access, all of which the Library of Congress is ideally suited to develop and deliver right now.
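The network-analysis step Dan describes — counting links between sites in derivative data extracted from a web archive — can be sketched at toy scale in plain Python. The pilot itself ran parallel jobs on a cloud cluster; the URLs below are invented, and this is a minimal sketch of the counting logic only:

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical derivative data: (source URL, target URL) pairs of the kind
# extracted from archived web pages. The pilot processed hundreds of
# millions of links; these few records are made up for illustration.
links = [
    ("http://news-a.example/story1", "http://wire.example/feed"),
    ("http://news-a.example/story2", "http://wire.example/feed"),
    ("http://news-b.example/post", "http://wire.example/feed"),
    ("http://news-b.example/post", "http://news-a.example/story1"),
]

def site(url):
    """Reduce a URL to its host so links aggregate at the site level."""
    return urlparse(url).netloc

# Count site-to-site edges, ignoring a page's links to its own site.
edges = Counter(
    (site(src), site(dst)) for src, dst in links if site(src) != site(dst)
)

# In-degree per site: how often other sites link to it.
in_degree = Counter()
for (_, dst), n in edges.items():
    in_degree[dst] += n

print(edges.most_common())
print(in_degree.most_common())
```

At pilot scale the same group-and-count would run as a distributed job across a cluster, but the shape of the computation is the same.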

How could a Lab help to serve audiences beyond the typical scholarly or academic user?

Michelle: I loved [the new Librarian of Congress] Dr. Hayden’s quote in the recent New Yorker article when she asked herself: “How can I make this library that relevant, and that immediate?” I think a Lab supporting digital scholarship will help her achieve that vision of increasing the relevance and immediacy of the Library of Congress. The Lab offers a new way for users to access and get support in analyzing the Library’s digital collections. But it is also an opportunity for the Library to reach out to underrepresented groups and engage with those groups in new ways — coding, analytics, scholarly networks, and more. Unique perspectives help the Lab in its efforts to transform how the Library’s digital collections are used. The Lab becomes a controlled access point for users that might not be able to get to the Library in person.

One of the reasons Dan and I think that the Lab should have an open-ended name (rather than something more specific like “Digital Scholars Lab” or “Digital Research Lab”) is that we both feel strongly that the Lab should be as inclusive as possible. A specific name encourages a small group of people who identify with that name to come. Researchers look at a research lab. Scholars look to a scholarly lab. But a really transformative Lab environment gives anyone the tools to use digital collections for their work — whether that’s scholarship, research, data analytics, art, history, social science, creative expression, or anything else they can imagine. We think that there is significant value to making the Lab a space where anyone can imagine working — even if they aren’t a typical Library of Congress researcher. Everyone should be able to see themselves at the lab, engaging with the Library of Congress digital collections in a myriad of ways.

Dan: I agree on all counts.  That focus from Dr. Hayden resonates with something we heard from a scholar at the Collections as Data event last fall, that the sheer size of Library of Congress collections can sometimes overwhelm. Anyone approaching LC collections for the first time should be able to find and work with material at a scale that meets their needs and abilities. It is most important to provide access to collections and services at a ‘human scale’, whether that means one item at a time, or millions of items at a time, or some scale in between which best fits the needs of the individual coming to the Library.  For example, UCLA’s Miriam Posner engages humanities students with collections at the scale of a few thousand items, which challenges them to use automated tools and techniques but is still small enough that they can “get to know” the materials over the course of a project.  Another critical aspect of this focus is representation.  To make the Library relevant and immediate, anyone visiting its collections should be able to see themselves and to recognize stories of people like them reflected and featured among digital collections, at every scale.  The breadth and variety of collections at the Library of Congress reflects our wonderfully diverse culture, and that means all of us and all of our histories.

What other opportunities do you see in establishing a Lab at the Library of Congress?  

Michelle: The Library of Congress is a powerful convener. It has always been able to get people to come together around a table and talk through controversial or challenging topics — from copyright restrictions to stewardship responsibilities and many others. The Lab community is still emerging. There are some extraordinarily strong players that have a lot to share and there are a lot of opportunities for labs that haven’t yet been developed. The Library of Congress could provide valuable leadership by convening the full spectrum of this community to make sure that emerging successes are circulated and pitfalls are documented. It could really help move things to another level.

Dan: I agree, the possibilities of building communities around opening up access to digital collections, connecting students with collections and subject expertise across institutions, and convening practitioners to share what works by building networks of potential collaborators across disciplines and distances are compelling.  We heard from many people that public goodwill toward the Library of Congress is strong, which affords that ability to draw people with mutual interests together.  When the Library puts an event together, people will travel great distances and tune in from all over the net, as the recent #asdata event demonstrated. Similarly, when Library staff show up and participate in community initiatives and events, people take notice and take their contributions to heart.  A Lab at the Library of Congress could be a great new conduit for this kind of leadership, amplifying the great service innovations of many great peer institutions while assembling a mix of services that fit the unique possibilities and constraints at LC.

Thank you both again for the time and effort you put into the report (PDF). NDI is excited to work toward establishing a Library of Congress Lab in the coming year, we’ll keep you all posted on our progress.


Subscribe to code4lib aggregator