Feed aggregator

Ed Summers: CDA

planet code4lib - Wed, 2017-03-08 05:00

This week we read three chapters drawn from Rogers (2011) that center on the subject of Critical Discourse Analysis:

  • Critical Approaches to Discourse Analysis by Rebecca Rogers
  • Figured Worlds and Discourses of Masculinity: Being a Boy in a Literacy Classroom by Josephine Marsh and Jayne C. Lammers.
  • Learning as Social Interaction: Interdiscursivity in a Teacher and Researcher Study Group by Cynthia Lewis and Jean Ketter

The Rogers piece offers an overview of Critical Discourse Analysis (CDA) that is oriented to the field of education. And the second two papers are examples of applying or using CDA as a theory and method to answer a particular research problem.

Like other types of discourse analysis CDA studies the use of language but with specific attention to how language is used to construct social practices, and also their relation to power. Rogers points out that there is some distinction to be made between uppercase Critical Discourse Theory, which is in the tradition of Norman Fairclough’s work, and critical discourse theory, which is a bigger tent that includes a variety of theories associated with critical inquiry into language practices. Rogers also makes the point that really any analysis of language can fit under this umbrella:

Because language is a social practice and because not all social practices are created and treated equally, all analyses of language are inherently critical.

Systemic functional linguistics (Halliday, 1978) serves as a foundation for CDA. It is a socio-semiotic theory that emphasizes choice and meaning makers as agents who make decisions about the social functions of their language use. But the origins of CDA can be traced backwards to Bakhtin, DuBois, Spender, Foucault, and Wittgenstein. Other seminal books were Fowler, Hodge, Kress, & Trew (1979) and Kress (1982). It also draws on critical social theory which generally rejects the ideas that:

  • social practices represent reality
  • truth is the result of science and logic
  • truth is neutral (it doesn’t reflect particular interests)
  • individualism (this idea wasn’t fleshed out)

In CDA as in critical social theory, both theory and practice are mutually dependent and inform each other. It is important that analyses are connected to a theory of the social world and a theory of language.

Rogers interviewed three prominent individuals who have helped formulate and use critical discourse analysis: Gee, Fairclough and Kress. She emphasizes that there is a lot of crosstalk and cross-fertilization of ideas between them, and that they aren’t meant to be representative of categories or schools of CDA work.

Gee is particularly interested in how language use is about relationships, identities and figured worlds. He has a framework of seven building tasks that are dimensions of language use: significance, activities, identities, relationships, politics, connections, and sign systems and knowledge.

For Norman Fairclough language both reproduces and transforms social structures. His work is grounded in Marx, Foucault and Bakhtin – as well as systemic functional linguistics and ethnography of communication. He uses tools such as orders of discourse, interdiscursivity and dialectics to analyze discourse.

And lastly Gunther Kress looks at other ways of making meaning than text. He is interested in how images, body language, color, movement and space/time are used along with texts to shape discourse.

Gee, Fairclough and Kress all got together as part of the New London Group and agreed on “design as a way of describing grammar as a social entity”. What this means wasn’t entirely clear to me on reading this chapter. But it seems like an interesting idea to follow up on. In CDA problems of text and context that have troubled discourse analysis are resolved by focusing instead on people and events.

Figured Worlds and Discourses of Masculinity

The Marsh and Lammers chapter adapts Gee’s guidelines for CDA to examine how the constructions of masculinity of Chavo (an 18-year-old Mexican-American) shaped his participation in literacy activities, and in turn what it meant to be a boy in the literacy classroom. Chavo was one of 21 individuals who participated in a “narrative inquiry”, after which the authors focused on Chavo in order to a) complicate simplistic notions of male stereotypes and b) show how CDA can help us understand the multiple Discourses that inform our beliefs about masculinity.

The analytic tools they drew on from Gee were:

  • discourses (looked at word usage in different contexts)
  • social languages (how a particular identity is enacted)
  • situated meanings
  • figured worlds (this was the focus of the analysis) - what is normal and typical within a particular discourse

Marsh and Lammers used interview data, but also observational data such as when Marsh happened to hear Chavo say a class “sucked” to a friend. It seems like having access to spoken language outside the interview would be important for examining situated meanings.

They found snippets that seemed to speak to how Chavo felt when participating in the literacy classroom, and to constructions of masculinity. Then they organized transcripts into lines/stanzas as defined by Gee (2005). Once Chavo was identified as a subject they interviewed other people around him, including his mother and his honors humanities teacher.

In the interviews with his mother, the analyst really focuses on what the mother says about her son, specifically about how he was at different points in his life. The analyst paid close attention to the tone that was used when she spoke about Chavo’s literacy. This made me think that she must have been listening to the recorded segments again, and not relying too much on the transcripts once stanzas of interest were found.

She also focused on the mother’s use of “reported speech”, or when the speaker uses quotes of people (could be themselves) speaking at another time. These were important moments to focus on because they emphasized and supported her beliefs.

Descriptive nominalizations figured into the analysis of the teacher’s speech. Named groups like Literature Kid and Humanities Kid were important signposts to how the teacher constructed groups of students. Observational data figures pretty heavily into the analysis of Chavo–not just spoken language, but also affect and behavior in the classroom. This recalls Kress’ approach where more than just text is important.

The authors categorized I Statements into cognitive or action from Gee (2005). The cognitive statements allowed Chavo to express his opinions, knowledge and experiences of literacy. The actions provided insight into how Chavo made decisions about his identity.

I was struck by the authorial voice employed in this article. It’s important for her to reflect on her own identity and positionality relative to her interview subjects and the people she is observing. She is very up front about having children who were peers of Chavo for example. This felt a bit awkward for some reason, but also appropriate. I was left wondering why she chose to interview the mother and not the father, since masculinity was such a central idea to the research.

Learning as Social Interaction

Lewis and Ketter look at identities in practice and communities of practice from Wenger (1999) and specifically how the idea of interdiscursive demands relates to Fairclough’s (1992) ideas about interdiscursivity, which is the presence or trace of one discourse within another. Wenger argues that boundaries between discourses are interesting places to focus analysis because they are where new knowledge is produced and identities can be transformed.

The authors chose salient transcripts from their ethnographic study that were relevant to their research question about learning and social interaction. The focus was on segments where established Discourses were sustained or disrupted. The researchers also position their own experience as participants in identifying these discourses.

The transcripts were broken down into topical episodes, which were a series of turns that related to a topic or theme. These episodes were then examined to identify discourses, where discourses are “systematic clusters of themes, statements, ideas, and ideologies [that] come into play in the text”. Then they coded for categories of discourse.

A close analysis of genre and voice was performed on the episodes, where Chouliaraki & Fairclough (1999) define genre as “the language (and other semiosis) tied to a particular social activity” and voice as “the sort of language used for a particular category of people and closely linked to their identity”: language that speaks to who they are and what they do. Attention was paid to where fixed discourses are most likely interrupted, which is when more dialogic conversations occur (Bakhtin, 1981). Attention was also paid to features such as:

  • intensifiers
  • repetition
  • pronoun usage
  • experiential value (Fairclough)
  • personal story

Self-reflection figured prominently in their own analysis, where they examine their own authority and teacherly moves, even when they were trying to insulate themselves from those moves.


These readings reminded me a lot of my independent study last semester where I researched practice theory. These CDA readings mentioned communities of practice a great deal, more so than other discourse analysis methods we’ve surveyed so far. In a lot of ways what I was trying to do in my recent study of appraisal in web archives was to get at what the emerging community of practice looks like when it comes to deciding how to archive content from the web. So it might be useful to do a crash course in communities of practice as it relates to CDA, to see if there is a useful method for me to study my interview data.

Specifically I’d like to follow up on Gee’s Seven Building Tasks, which sound like they could provide some useful pointers and structure. Also I feel like I need to get a better sense of the specific features that are examined like intensifiers, repetition, pronoun usage, story, etc. Perhaps these can be found in Gee (2011). It’s funny my friend Trevor was independently recommending the book to me recently since he used it in his own dissertation work.

The method used by Lewis and Ketter where they took themes from their interviews and then used their analysis of them to select some interviews for closer analysis sounds like something I could map onto my own research. I already have done a thematic analysis of my 28 interviews, but it could be fruitful to take a deep dive into some of the salient interviews to see how CDA can help with the analysis. What are the Discourses of web archiving work that are at play, and how are they operationalized in the speech? Can they help me speak to how communities of practice are being formed?

I also like CDA (and EC) because it could provide a framework for looking at collection development policies and the interviews together. Specifically the idea of interdiscursivity that was mentioned could help see what relationship there might be between how archivists talk about collecting from the web in relation to any written policies their organization has about it. The little I was reading about Bakhtin made me want to learn more – it seems like he could provide a bit of a stronger tie to the humanities if I wanted to position my work with web archives more in the digital humanities space.

Finally I was reminded of Proferes’ (2015) work (here at UMD currently) on technical discourse or communicative practices involving or about technology. His dissertation and previous work might provide some useful reference points for methods and theories from discourse analysis that are useful for studying the situated technology of web archives. I should follow up with Nick about that sometime if I can bend his ear.


Bakhtin, M. M. (1981). The dialogic imagination: Four essays. (M. Holquist, Ed.). University of Texas Press.

Chouliaraki, L., & Fairclough, N. (1999). Discourse in late modernity: Rethinking critical discourse analysis. Edinburgh University Press.

Fairclough, N. (1992). Discourse and social change. Oxford: Polity press.

Fowler, R., Hodge, R., Kress, G., & Trew, T. (1979). Language and control. Routledge & Kegan Paul London.

Gee, J. P. (2011). An introduction to discourse analysis: Theory and method. Routledge.

Halliday, M. A. K. (1978). Language as social semiotic. London: Arnold.

Proferes, N. J. (2015). Informational power on Twitter: A mixed-methods exploration of user knowledge and technological discourse about information flows (PhD thesis). University of Wisconsin, Milwaukee.

Rogers, R. (2011). An introduction to critical discourse analysis in education. Routledge.

Wenger, E. (1999). Communities of practice: Learning, meaning, and identity. Cambridge University Press.

District Dispatch: Google: libraries get youth excited about CS

planet code4lib - Tue, 2017-03-07 22:00

As part of our participation in Teen Tech Week, today we are pleased to share a post about our collaboration with Google on Libraries Ready to Code. This post by Hai Hong, a Computer Science Education Program Manager at Google, introduces the partnership between ALA’s Office for Information Technology Policy (OITP) and Google to equip librarians with the skills, support, and resources they need to provide CS learning opportunities for youth through libraries. It also highlights the reasons behind the Ready to Code initiative and calls attention to the unique and integral role libraries have in ensuring youth have equitable access to CS learning opportunities. You can find the original post on Google’s blog, The Keyword.

Google’s Hai Hong and his son programming in Scratch at the Mountain View Public Library, California. Photo credit: Maria Hong

I grew up in a library. Well, sort of. My family arrived in the United States as refugees when I was a toddler and, living in a community without many resources or youth programs, my parents were unsure of what to do with me outside of school. This was especially true in the summers, so they would take me to our local library several times a week — it was free and air-conditioned. I was a regular at every literacy program and summer reading competition, and it was through these programs that I honed my reading skills and developed a love of learning that have guided me ever since.

Now, decades after those visits to our local branch, I’m excited to share that Google is partnering with the American Library Association (ALA) on Libraries Ready to Code, a new project to help librarians across the U.S. inspire youth to explore computer science (CS). This work builds on previous Google support for library programs, including Wi-Fi hotspot lending.

Libraries are, and have always been, at the heart of communities throughout the country. They play a unique role in education, inspiring youth (and adults!) to be lifelong learners. More than just a place to borrow books, libraries provide access to critical knowledge, workforce skills, and opportunities to become civically engaged. As the world changes, libraries have adapted with new services, media and tools. They promote digital inclusion—providing free access to digital content, hardware, software, and high-speed Internet.

Photo credit: Westchester Square Public Library (Bronx, NY)

And, increasingly, libraries are recognizing the importance of exposing youth to CS and computational thinking (CT) skills—arguably, the “new literacy” of the 21st century. “Libraries and library staff can create opportunities for youth to gain basic exposure and a basic interest in coding. From there, with support and mentorship from librarians and staff, they can develop long term engagement and possibly computer science as an envisioned future,” says Crystle Martin, Secretary of the Young Adult Library Services Association (YALSA).

While 40 percent of U.S. schools offer CS courses that include programming, access is not universal and demographic disparities exist. Libraries can help broaden that access: with more than 100,000 public and school libraries in the U.S., including in rural and lower-income areas, more than 306 million Americans live within a service area. But to expand access to CS, we need to provide librarians with the resources and understanding to curate and implement programs that suit their communities’ needs.

Through Libraries Ready to Code, Google and ALA will help equip librarians with skills to provide CS learning opportunities like Google’s CS First program, which New York Public Libraries are already using for NY coding clubs. The project will support university faculty at Library and Information Science (LIS) schools in redesigning their tech and media courses. They’ll integrate content on facilitating CS activities and teaching CT, and after the courses are evaluated, we’ll share these model courses with LIS schools nationally.

Ready to Code isn’t intended to transform librarians into expert programmers or computer scientists. Rather, we want to provide them with the knowledge and skills to do what they do best: empower youth to learn, create, problem solve, and develop the skills needed in tomorrow’s workforce—all while having fun, of course.

The next time you’re at your local library, find out if they are Ready to Code. If not, they can visit the Ready to Code website to learn about resources to get started and request more information. Meanwhile, I’ll be at my local branch with my five-year-old, checking out books and learning to code together.

The post Google: libraries get youth excited about CS appeared first on District Dispatch.

Islandora: iCampEU Registration Now Open!

planet code4lib - Tue, 2017-03-07 17:05

Join Islandora for a three-day camp hosted on the campus of TU Delft in the Netherlands! Our first iCamp outside of North America since 2015, this event will consist of two days of conference-style sessions and one day of hands-on training in a workshop with tracks for either developers or front-end administrators. You can sign up here and book a room at an iCamp discount from our list of recommended hotels and travel suggestions.

District Dispatch: ALA adds name to keep the net neutral

planet code4lib - Tue, 2017-03-07 14:19

Today, the American Library Association (ALA) joined a diverse group of consumer, media, technology, library, arts, content creators, civil liberties, and civil rights advocates urging federal lawmakers to oppose legislation and regulatory actions that would threaten net neutrality and roll back the important protections put in place by the Federal Communications Commission in 2015, and to continue to enforce the Open Internet Order as it stands. In a letter to Federal Communications Commission Chairman Ajit Pai, Senate Commerce Committee Chairman John Thune (R-SD) and Ranking Member Bill Nelson (D-FL), we joined other organizations to emphasize that continued economic, social, and political growth and innovation are dependent on an open and accessible internet.

FCC Building in Washington, D.C.

Equitable access to information is a core principle for libraries. And the rules put in place in 2015–and affirmed by an appeals court– ensure the strongest possible protections for equitable access to online information, applications and services for all. The letter we co-signed today is in response to statements by Chairman Pai and news reports that indicate strong net neutrality protections may be in danger. ALA will continue to be vigilant on this issue and more action may be needed soon. We ask library advocates to stand by.

You can read the coalition letter or visit the Battle for the Net website for more information.

The post ALA adds name to keep the net neutral appeared first on District Dispatch.

State Library of Denmark: Visualising Netarchive Harvests

planet code4lib - Tue, 2017-03-07 12:52


An overview of website harvest data is important for both research and development operations in the netarchive team at Det Kgl. Bibliotek. In this post we present a recent frontend visualisation widget we have made.

From the SolrWayback Machine we can extract an array of dates of all harvests of a given URL. These dates are processed in the browser into a data object containing the years, months, weeks and days to enable us to visualise the data. Furthermore, the data is given an activity level from 0 to 4.
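As a rough illustration of this processing step, the sketch below groups an array of harvest dates by year, month, ISO week and day. It is written in Python rather than the browser-side JavaScript the widget actually uses, and the function name `bucket_harvests`, the sample dates and the shape of the returned dictionary are all assumptions made for this example, not the netarchive team's data model.

```python
from collections import Counter
from datetime import date


def bucket_harvests(harvest_dates):
    """Count harvests per year, per (year, month), per ISO (year, week) and per day."""
    buckets = {
        "years": Counter(),
        "months": Counter(),
        "weeks": Counter(),
        "days": Counter(),
    }
    for d in harvest_dates:
        # ISO weeks can belong to a different year than the calendar date,
        # so the ISO year is kept together with the week number.
        iso_year, iso_week, _ = d.isocalendar()
        buckets["years"][d.year] += 1
        buckets["months"][(d.year, d.month)] += 1
        buckets["weeks"][(iso_year, iso_week)] += 1
        buckets["days"][d] += 1
    return buckets


# Hypothetical harvest dates for a single URL.
dates = [date(2009, 1, 5), date(2009, 1, 5), date(2009, 1, 20), date(2009, 3, 2)]
buckets = bucket_harvests(dates)
print(buckets["months"][(2009, 1)])  # number of harvests in January 2009
```

A single pass over the dates keeps the work proportional to the number of harvests, which matters when everything is computed on the client.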

The high-level overview seen below is the year-month graph. Each cell is coloured based on the activity level relative to the number of harvests in the most active month. For now we use a linear calculation so gray means no activity, activity level 1 is 0-25% of the most active month, and level 4 is 75-100% of the most active month. As GitHub fans, we have borrowed the activity level colouring from the user commit heat map.
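The linear calculation can be sketched as follows. This is a minimal Python illustration, not the widget's own JavaScript: the function name `activity_level` and the sample counts are hypothetical, and the handling of exact boundaries (a month at exactly 25% of the maximum falls into level 1) is an assumption, since the post does not say which side the boundaries belong to.

```python
def activity_level(count, max_count):
    """Map a harvest count to a level 0-4, linear relative to the busiest month."""
    if count <= 0 or max_count <= 0:
        return 0  # gray cell: no activity
    # Integer ceiling of 4 * count / max_count, capped at 4.
    level = (4 * count + max_count - 1) // max_count
    return min(level, 4)


# Hypothetical harvest counts per (year, month).
monthly = {(2009, 1): 3, (2009, 2): 0, (2009, 3): 12, (2009, 4): 7}
busiest = max(monthly.values())
levels = {month: activity_level(n, busiest) for month, n in monthly.items()}
print(levels)
```

Using integer arithmetic avoids floating-point surprises at the 25%, 50% and 75% boundaries.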


We can visualise a more detailed view of the data as either a week-day view of the whole year, or as a view of all days since the first harvest. Clicking one of these days reveals all the harvests for the given day, with links back to SolrWayback to see a particular harvest.


In the graph above we see the days of all weeks of 2009 as vertical rows. The same visualisation can be made for all harvest data for the URL, as seen below (cut off before 2011, for this blog post).


There are both advantages and disadvantages to using the browser-based data calculation. One of the main advantages is a very portable frontend application. It can be used with any backend application that outputs an array of dates. The initial idea was to make the application usable for several different in-house projects. The drawback to this approach is, of course, scalability. Currently the application processes 25,000 dates in about 3-5 seconds on the computer used to develop the application (a 2016 quad core Intel i5).

The application uses the frontend library VueJS and only one other dependency, the date-fns library. It is completely self-contained and it is included in a single script tag, including styles.

Ideas for further development.

We would like to expand this to also include both:

  1. multiple URLs, which would be nice for resources that have changed domain, subdomain or protocol over time, e.g. the URL, and could be used for the Danish newspaper Politiken.
  2. domain visualisation for all URLs on a domain. A challenge here will of course be the resources needed to process the data in the browser. Perhaps a better calculation method must be used – or a kind of lazy loading.

District Dispatch: Now accepting Patterson copyright award nominations

planet code4lib - Mon, 2017-03-06 22:57

The nomination period is open for the L. Ray Patterson Award, an American Library Association-sponsored honor that recognizes particular individuals or groups who “embody the spirit of the U.S. Copyright law as voiced by the framers of our Constitution: to advance the knowledge of science and useful arts.” Nominations will be accepted through May 15, 2017.

Appropriate nominees for the Patterson Award are persons or groups who — over a stretch in time — have made significant and consistent contributions in the areas of academia, law, politics, public policy, libraries or library education to the pursuit of copyright principles as outlined below.

The award is named after L. Ray Patterson, a key legal figure who explained and justified the importance of the public domain and fair use. He helped articulate that copyright law was negatively shifting from its original purpose and overly favoring rights of copyright holders. His book, The Nature of Copyright: A Law of Users’ Rights, is the definitive book on the constitutional underpinnings of copyright and the critical importance of the public domain.

Please include illustrative examples of how your nominee has contributed to the pursuit of the fundamental tenets of copyright law. Nominees who have worked or collaborated with libraries will be given special consideration.

Send letters of nomination outlining a candidate’s qualifications for this award to:

Carrie Russell, Director, Program on Public Access to Information
ALA Office for Information Technology Policy
1615 New Hampshire Avenue NW, First Floor
Washington, D.C. 20009

Submissions can also be emailed to Carrie at For more information, visit

The post Now accepting Patterson copyright award nominations appeared first on District Dispatch.

Harvard Library Innovation Lab: LIL Talks: Ania

planet code4lib - Mon, 2017-03-06 22:27

We were fortunate to have Harvard scholar (and LIL friend) Ania Aizman talk to us about Anarchism. She clarified what it was, discussed some of its different branches and how they overlap with familiar groups/events like the Occupy movement.

We discussed “mic checks” and dug into the emergence of anarchism in Russian history. Her absorbing talk took us right to the end of our available time – thanks Ania!

District Dispatch: Teen Tech Week: computational literacy

planet code4lib - Mon, 2017-03-06 19:38

Today we kick off Teen Tech Week with a guest post by OITP Youth & Technology Fellow, Chris Harris. Chris is the director of the School Library System for the Genesee Valley Educational Partnership in rural western New York. Read further to build your understanding and appreciation of “computational thinking.” Follow #readytocode and #TTW17 for posts throughout the week.

What does it mean to be computer literate? When Seymour Papert first began teaching computer science to children with LOGO in the 1970s and 80s it was a very holistic concept. Daniel Watt, one of the co-developers of the LOGO language, defined computer literacy as being able to program as well as use a computer. Over time, the definition began to change. In 1990, educational computer researcher Margaret Niess dismissed programming as being an outdated skill unnecessary for teachers to master. By 2000, a survey seeking to rank the importance of computer competencies for teachers in Kentucky did not even include programming as an option. As the researchers there put it, the goal of the ISTE standards (pdf) from 1998 was to “focus on student knowledge and student use of technology rather than what the teacher needs to know about technology and to be able to do with technology”.

But now, things are changing. Initiatives such as and the Hour of Code have realized significant penetration into K-12 classrooms and libraries since their launch in 2013 with a reported 497,950 teachers registered to teach the introductory courses. Additional domestic policy work has been undertaken by the Computer Science Teachers Association, formed in 2004 as an extension of the Association for Computing Machinery to provide support and advocacy for teachers of computer science. In late 2016, a new K-12 Computer Science Framework was released by a collaborative headed by the Association for Computing Machinery and the Computer Science Teachers Association. The framework provides a roadmap for possible adoption in states or local districts that includes a strong focus on computational thinking.

Chris Harris

But what is computational thinking? The concept of computational thinking extends back to Papert’s first use of the term to describe a way of thinking deeply about the abilities of a computer to work and solve problems. More recently, the term was adopted by Jeannette Wing in her seminal article (pdf) defining a different approach to the field of computer science that sought to identify “a universally applicable attitude and skill set” that everyone should learn. Modern definitions of computational thinking focus on four concepts: (a) decomposition, or breaking down a problem into parts; (b) pattern recognition, or the ability to interpret data; (c) abstraction, or an understanding of generalized principles; and (d) algorithm design, or the creation of explicit directions for work.

To understand computational thinking in libraries, the ALA Libraries Ready to Code program created a working definition of computational thinking. For libraries, computational thinking refers to a set of problem-solving and automation skills foundational to computer science though also transferable to many fields and applicable to college and career readiness. It is a way of analyzing and breaking down (deconstructing) problems into solvable units, using the power of computing to solve those problems, and creating personal meaning by processing information and creating connections to transform data into understanding.

Computational thinking is a lens for understanding and viewing the foundational aspects of computer science separated from the application of computer science in writing code. This was described by media theorist Douglas Rushkoff as invalidating the metaphor about learning programming as being like having to be a mechanic to drive a car. The real comparison, Rushkoff argued, was that a lack of computational thinking relegated a person to being a passenger in the car instead of the driver. For Rushkoff computational thinking “is the only way to truly know what is going on in a digital environment, and to make willful choices about the roles we play”.

So how might you infuse computational thinking into your library programs? Start by thinking about some of the core aspects of computational thinking: breaking down and analyzing problems, finding ways to solve the problems, and thinking about how to create an algorithmic or computational solution to the problem. Even in early childhood programs, you can engage children and parents with computational thinking by focusing on sequencing and conditional logic like If/Then statements. Play a little game: IF you are wearing green, THEN stand up. IF you like apples, THEN sit down. For older students, consider enriching and extending lessons by including problem identification and planning and writing pseudocode.
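For older students who are ready to move from pseudocode to a real language, the stand-up/sit-down game could be written out along these lines. This is only a sketch in Python: the `respond` function, the rule list and the two example children are hypothetical names invented for illustration, and any beginner-friendly language would serve equally well.

```python
def respond(child, rules):
    """Return the actions a child performs, checking each IF/THEN rule in order."""
    actions = []
    for condition, action in rules:
        if condition(child):  # IF the condition holds for this child...
            actions.append(action)  # ...THEN they perform the action.
    return actions


# Each rule pairs a condition (IF) with an action (THEN).
rules = [
    (lambda c: c["wearing"] == "green", "stand up"),
    (lambda c: "apples" in c["likes"], "sit down"),
]

# Hypothetical participants.
ana = {"wearing": "green", "likes": {"apples", "pears"}}
ben = {"wearing": "blue", "likes": {"bananas"}}

print(respond(ana, rules))
print(respond(ben, rules))
```

The order of the rules is the sequencing part of the exercise, and each `if` is the conditional part: Ana matches both rules in turn, while Ben matches neither and stays put.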

In the end, libraries must work to become ready to code to maintain leadership as a key institution helping children and youth prepare for a digital future. While I am not suggesting that every library should be teaching coding, or even that every student should learn coding, it is critical that we are all working towards a common understanding of basic computational literacy. We must all understand computer programming and code to thrive.

This post was developed from material in preparation for dissertation work by Christopher Harris, doctoral candidate at St. John Fisher College, Rochester, NY.


  1.  Watt, D. H. “Computer Literacy: What Should Schools Be Doing about It.” Classroom Computer News 1.2 (1980): 1–26.
  2. Niess, Margaret L. “Preparing Computer-Using Educators for the 90s.” Journal of Computing in Teacher Education 7.2 (1990): 11–14.
  3. Scheffler, Frederick L., and Joyce P. Logan. “Computer Technology in Schools: What Teachers Should Know and Be Able to Do.” Journal of Research on Computing in Education 31.3 (1999): 305.
  4. Ibid, p. 310.
  6. Wing, Jeannette M. “Computational Thinking.” Commun. ACM 49.3 (2006): 33–35. ACM Digital Library. p. 33.
  8. Youth & Technology Policy Forum Listening Session, ALA Midwinter, 2016
  9. Douglas Rushkoff. Program or be Programmed (2011). p. 8.

The post Teen Tech Week: computational literacy appeared first on District Dispatch.

Library of Congress: The Signal: Women’s History Month Wikipedia Edit-a-thon

planet code4lib - Mon, 2017-03-06 16:59

This is a guest post from Sarah Osborne Bender, Director of the Betty Boyd Dettre Library and Research Center at the National Museum of Women in the Arts.

I graduated from library school in 2001, just months after Wikipedia was launched. So as a freshly minted information professional, it is no surprise that I fell in with the popular skepticism of the time: How could you trust an “encyclopedia” written by your next-door neighbor or your local barista? What about entries on movie stars, which might be written by their publicists?

Art+Feminism Wikipedia Edit-a-thon from 2015 at the National Museum of Women in the Arts.

It was my participation four years ago in the Women in the Arts Meetup & Edit-a-thon at the Archives of American Art that turned me into a Wikipedian and a true believer in Wikipedia as an essential and democratic resource. The event was organized for Women’s History Month and efforts focused on “notable women artists and art-world figures.” After an orientation to the policies and editing practices of Wikipedia, I happily dove into improving the brief article for Edith Halpert. Halpert, a gallery owner and collector in mid-century New York, propelled the careers of some of America’s most important modernist painters. Using both online and print resources, I took a relatively skeletal entry and turned it into a more complete overview of her life and legacy, learning quite a lot about Halpert along the way. With experienced Wikipedians at my side, I had guidance on keeping my tone neutral, adding new sections, and finding an appropriate image to upload to the info box. The editing process required great focus and was engrossing. I felt a rush when I clicked “Save changes” at the end, knowing that anyone who looked up Edith Halpert would read what I had just contributed.

On Saturday, March 11, the National Museum of Women in the Arts will hold its fourth annual Art+Feminism Wikipedia Edit-a-thon to train new editors and address the low representation of women artists on Wikipedia. According to a Wikipedia survey of its users, only 10% of editors are female. And it is common knowledge that women artists have never been equally represented in gallery shows, museum collections, and art history texts. Art+Feminism is a worldwide effort to address these issues. Last year, more than 2,500 participants in 28 countries at over 175 venues worked to improve articles about women in the arts. The movement keeps growing, and the motivation to represent women and reliable resources may never have been greater.

LITA: LITA Forum Assessment Report

planet code4lib - Mon, 2017-03-06 16:29

Re-envisioning LITA Forum

In 2016, the Library and Information Technology Association charged a Forum Assessment & Alternatives Task Force with assessing the impact of LITA Forum; evaluating its position within the library technology conference landscape; and recommending next steps. The goal of this work is to ensure that the time and resources spent on Forum are highly beneficial to both the membership and the division.


Over the course of 2016, the Forum Assessment Task Force explored the impact and perceptions of the LITA Forum among LITA members and library technologists in general. The Task Force combined data already gathered from existing surveys with its own survey, a matrix mapping exercise, and a series of focus groups to understand how the LITA Forum is perceived, identify strengths and weaknesses, and make recommendations for the Forum’s continued success.

Summary of Findings

Generally, more than 40% of LITA members identify as an administrator, with nearly 40% of membership working in an academic library and 17% in public libraries. Of participants who typically attend Forum, approximately 50% identify as a librarian, 16% as an administrator, and 9% as IT, with 68% working in 4-year academic libraries and 13% as public librarians.

Administrators are more likely to be LITA members who do not attend Forum. Most attendees are early-career librarians, and Forum attracts a larger share of 4-year academic librarians than LITA membership as a whole. Attendees come to Forum to keep up with technology trends, network, and learn new things. Cost is a major impediment to attendance, as is location, and these two factors are frequently intertwined. When individuals decide how to spend travel and professional development funds, LITA Forum is frequently not high on their conference lists; conference content is also very important in deciding whether to attend. Keynotes are not important to attendees, and there is interest in having sessions recorded for future use as well as in a virtual, streaming option for those unable to travel. Networking and informal interaction are strengths of Forum. There is also a belief that administrators grow out of Forum, so the Forum’s audience needs to be better defined.


The key characteristics of the Forum that should be preserved include keeping the Forum’s size relatively small; focusing on networking and allowing time for conversation and collaboration; and maintaining an emphasis on trends for the future, through which attendees can learn about new things coming rather than exclusively about case studies of what other libraries have been doing.

  • Define Forum’s identity and rename it to reflect identity
  • Major features of Forum need to have guidelines to planning and extensive documentation
  • Involve public librarians & reach out to them as central participants in the Forum
  • Examine outsourcing conference logistics and support
  • Conference length should be two full days, on a Friday and Saturday
  • Develop an attendee mentor program
  • Make some subcommittees permanently a part of the Forum planning committee
  • LITA staff work to improve relationships with vendors
  • Investigate having Forum rotate among a small number of cities
  • Investigate and implement more virtual keynotes and/or presentations
  • Develop conference topics early, have specific tracks
  • Make changes to presentations to vary the formats, ensure good speakers/presentations, and have them include more interaction
  • Have up to half of presentations be invited
  • Reduce the number of keynotes

Committee Membership:

The Task Force met virtually from Midwinter 2016 through Midwinter 2017 and included ten members (listed below), with liaisons to the following LITA committees:

  • Membership
  • Assessment
  • Forum Planning 2016
  1. Jenny Taylor (Co-Chair), University of Illinois at Chicago
  2. Ken Varnum (Co-Chair), University of Michigan Library
  3. Laureen Cantwell, Colorado Mesa University
  4. Melody Condron, University of Houston Libraries
  5. Cinthya Ippoliti, Oklahoma State University
  6. Jenny Levine, LITA Executive Director
  7. Hong Ma, Loyola University of Chicago
  8. Christine Peterson, Amigos Library Services
  9. Whitni Watkins, Analog Devices
  10. Tiffany Williams, South Holland Public Library
  11. Mark A. Beatty, Staff Support

For the complete report, please go to:

*Plus/minus photo courtesy of Mitchell Eva, Noun Project

Terry Reese: MarcEdit Update (all versions); post-mortem analysis of the website (continued) downtime

planet code4lib - Mon, 2017-03-06 05:06
*** Update: as of 11:10 AM EST, 3/6/2017, it appears all resources are back online and available.  Many thanks to Darryl, a helpful support agent who found whatever magic setting was keeping the DNS from refreshing. ***

What a long weekend this has been.  If you are a MarcEdit user, you are probably aware that since late Thursday (around 11 pm EST), the MarcEdit website has shifted between being up and down, and is currently (as of writing) completely gone.  I have Bluehost to thank for that; though more on that below.  More importantly, if you are a MarcEdit user, you may have found yourself having problems with the application.  On Windows, if you have plugins installed, you will find that the program starts, and then quits.  On a Mac, if you have automatic updates enabled or plugins installed, you see the same thing…the program starts, and then quits.  I’ve received a lot of email about this (+2k as of right now), and have answered a number of questions individually, via the listserv (on Friday) and via Twitter – but generally, this whole past couple of days has been a real pain in the ass.  So, let’s start with the most important thing that folks want to know – how do you get past the crashing?  Well, you have a couple of options.

Update MarcEdit

I know, I know – how can you update MarcEdit when the MarcEdit website isn’t resolving?  Well, the website is there – what isn’t resolving is the DNS (more on why below).  The MarcEdit website is a subdomain on my primary domain.  It points to a folder on that domain, and that folder is still there, and can be accessed directly via the direct path to the file (rather than through the dedicated DNS entry).  So, where can you get the updates?  Directly, the links to the files are as follows:

You can see the change logs for each of these versions, if you access the links directly:

My hope is that sometime between 12 am and 8 am on 3/6/2017, the DNS entry will reset and the site will come back to life.  If it does, the problems folks are having with MarcEdit will resolve themselves, users will be prompted to download the updates referenced above, and I can put this unpleasantness behind me.  However, I will admit that my confidence in my webhost is a little shaken, so I’m not confident that everything will be back to normal by morning – and if it isn’t, the links above should get users to the update that will correct the issues that we are seeing.

Disabling AutoChecking

If you are in a position where you cannot update MarcEdit using one of the links above, and the website has not come back up yet – you will need to take a couple of steps to keep the automatic checks from running and causing the unhandled exception.  I’ve noted this on the listserv, but I’ll document the process here:

  • On Windows:
    The process that is causing the crash is the plugin autochecking code.  This code will only run if you have installed a plugin.  To disable autochecking, you will want to remove your plugins.  The easiest way to do that is to go to your User Application folder, find the plugins directory, and do one of two things: 1) delete your plugins (assuming you’ll just put them back later) or 2) rename the plugins folder to plugins.bak and create an empty plugins folder.  The next time you restart MarcEdit, it should function normally.  You will be able to re-enable your plugins when MarcEdit’s website comes back up; but the only permanent solution will be to update your version of MarcEdit.
  • On MacOS:
    The process that is causing the crash is the plugin autochecking and the application autochecking.  You will need to disable the parts that apply to you.  As with the Windows instructions – this is a temporary measure – the permanent fix would be to update MarcEdit using the website (if it is live), or directly using one of the links above.
    Disable autoupdate: You will need to open the config.xml file in the application user directory.  This is found on a macOS system at: /Users/your_user_name/marcedit/configs – open the config.xml file.  Find the <settings></settings> XML block.  You are looking for the <autoupdate> element.  If it’s not present, autoupdating is automatically enabled.  Add the following snippet into the <settings> block.  <autoupdate>0</autoupdate>  This will disable automatic update checking.
    Disable plugin checking: You will find the plugins folder (assuming you’ve downloaded a plugin) in the application user space.  This is found at: /Users/your_user_name/marcedit/plugins – as with the Windows instructions, either delete the contents of the folder, or rename the folder as plugins.bak and create a new empty folder in its place.  Doing those two steps will disable the automatic checking, and allow you to use MarcEdit normally.
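
For anyone who would rather script these temporary steps than do them by hand, here is a minimal Python sketch. It is not part of MarcEdit, and it rests on assumptions: the function name `disable_autochecks` is made up for illustration, the directory layout (a `plugins` folder and `configs/config.xml` under one application user directory) follows the macOS paths given above, and the exact nesting of `<settings>` inside config.xml is a guess, so the code looks for it at any depth.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def disable_autochecks(app_dir: Path) -> None:
    """Apply both temporary workarounds to a MarcEdit user directory.

    app_dir is assumed to be the application user directory,
    e.g. /Users/your_user_name/marcedit on macOS.
    """
    # 1) Park the plugins folder as plugins.bak and leave an empty one behind.
    plugins = app_dir / "plugins"
    backup = app_dir / "plugins.bak"
    if plugins.is_dir() and not backup.exists():
        plugins.rename(backup)
    # If plugins.bak already exists, the current plugins folder is left alone.
    plugins.mkdir(exist_ok=True)

    # 2) Set <autoupdate>0</autoupdate> inside the <settings> block.
    config = app_dir / "configs" / "config.xml"
    if not config.is_file():
        return
    # Keep a copy first: ElementTree drops comments and original formatting.
    shutil.copy2(config, config.with_name(config.name + ".bak"))
    tree = ET.parse(str(config))
    root = tree.getroot()
    settings = root if root.tag == "settings" else root.find(".//settings")
    if settings is None:
        return  # unexpected layout; edit the file by hand instead
    node = settings.find("autoupdate")
    if node is None:
        node = ET.SubElement(settings, "autoupdate")
    node.text = "0"
    tree.write(str(config), encoding="utf-8", xml_declaration=True)
```

On a Mac this would be called as `disable_autochecks(Path.home() / "marcedit")`. To undo it later, delete the empty plugins folder, rename plugins.bak back to plugins, and restore config.xml from config.xml.bak.
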
So what the hell happened?

Well, there are two answers to this question – and it was the convergence of these two issues that caused the current unpleasantness.

On the MarcEdit-side

I really wish that I could say that this was all on my webhost, but I can’t.  MarcEdit is a large, complicated application, and in order to keep it running well, the tool has to do a lot of housekeeping and data validation.  To ensure that all these tasks don’t cause the application to grind to a halt – I thread these processes.  Generally, with each function that runs in its own separate thread, I include a try{…}catch{…}finally{…} block to handle exceptions that might pop up.  Additionally, I have a global exception handler that handles errors that fall outside of these blocks and protects the program from crashing.  So, what happened here?  The problem is the threading.  By running these functions in their own threads, they fall outside of MarcEdit’s global exception handler.  This is why it is important that each of these thread blocks has its own exception handling, and that it be very good.  And, for whatever reason, I didn’t provide explicit handling of connection issues in the checkplugins code.  This is why the website being down has impacted the program.
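
The pattern described here can be illustrated with a short Python sketch. MarcEdit is not written in Python, so this is an analogy and not its actual code: `check_plugins`, `unreachable_site`, and the `results` dictionary are hypothetical names. The point is only that a background task handles connection failures inside its own thread and does not count on a handler that lives elsewhere in the program.

```python
import threading
import urllib.error


def check_plugins(fetch_manifest, results):
    """Background task: look up plugin updates without letting errors escape."""
    try:
        results["plugins"] = fetch_manifest()
        results["error"] = None
    except (urllib.error.URLError, OSError) as exc:
        # DNS failures, refused connections, and timeouts all land here,
        # so an unreachable website degrades to "no updates found".
        results["plugins"] = []
        results["error"] = str(exc)
    finally:
        # Runs on success and on failure, so callers can always tell
        # that the check has finished.
        results["done"] = True


def unreachable_site():
    # Hypothetical stand-in for the real network call while the site is down.
    raise urllib.error.URLError("name resolution failed")


results = {}
worker = threading.Thread(target=check_plugins, args=(unreachable_site, results))
worker.start()
worker.join()
print(results["plugins"], results["error"])
```

The worker records the failure and finishes normally, which is the behavior the plugin check lacked when the website stopped resolving.
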

On the Website side

I have used Bluehost as my website provider for years.  They have been a trusted partner and while I’ve had the occasional blip, this is the first time that I can honestly say that I’ve felt like they’ve completely dropped the ball.  The problem here started with a planned server update.  Bluehost notified me that there would be a brief period of inconsistent access while they updated some server components.  This is routine stuff, and should have amounted to no more than a few minutes of downtime.  I host my website on their cloud infrastructure, which is distributed across multiple nodes – I figured no big deal, this is exactly what a cloud infrastructure was designed for – if they have a problem at one data center – you just enable access through another one.  Well, that’s not how it happened.  When I got up on Friday, there was an ominous and cryptic message from Bluehost letting me know that there was some trouble with the update, and they would be restoring access to service throughout the day.  Why did this affect all the nodes?  What actually happened?  Who knows.  All I know is that all Friday, all content was inaccessible and contacting Bluehost was impossible.  When things didn’t come back up by 7 pm EST, I stayed on hold for 4 hours before giving up and hoping things would look better in the morning.

Saturday morning…things didn’t look better.  Access to the administrative dashboard had been restored, but the database server was inaccessible and all subdomains were broken.  I finally got ahold of support around noon, after waiting on hold for another 3 hours – and they filed a ticket, and asked me to check back in an hour because they were going to reset the DNS.  By Saturday evening, the database server was alive again, my second hosted domain was at least visible, but the other subdomains still didn’t work.  I waited on hold again for another 2 hours, and got a hold of another person from support.  This time, we talked about what I was seeing in my DNS record and the Subdomain editing tools.  Things weren’t right.  They said they would file another ticket, I thanked them, and went to bed.

Sunday morning – things are moving in the right direction.  All content on the domain is accessible, the blog is accessible, and the secondary domain mostly works.  My other subdomains have all disappeared.  I contacted support again – they re-established the subdomains, manually edited the DNS record…and this is where we stand.  At this point, I believe that once the DNS propagates, the last of my subdomains (including the MarcEdit site) will become live again.  At least that is my hope.  I’m really, really, really hoping that I don’t have to take this up with support again.

As I noted above, this is the first time that I can honestly say I have been really unhappy with the way Bluehost has handled an outage.  These things happen (they shouldn’t), but I understand it.  I work in IT, I’ve been in the position that they are in now; but I find that the best course of action is to own it, and communicate with your customers proactively, and often.  What has disappointed me the most from Bluehost has been the black hole of information.  Unless I contact support, they aren’t saying anything – and their front line support has no way to check the status of my tickets because they were escalated to their internal support units.  So, I’m basically just twiddling my thumbs until someone finally decides to let me know what’s going on, or things just start working.

In Summary

So, to summarize – there is a fix, and you can access it at the links above or you can disable the offending code by following the instructions above.  When will the website be live?  Your guess is as good as mine at this point, but hopefully soon.  Long-term: I have a GitHub account that I use to host code, and have set up a page at: to host information about those projects.  Given the pain this has caused, I will likely invest some time in setting up a mirror of the domain over there, so that if something comes up, users can still get access to the resources that they need to work.

If you have questions, feel free to let me know.


Harvard Library Innovation Lab: LIL Talks: Adam

planet code4lib - Mon, 2017-03-06 03:39

Adam shared two different topics on February 24, 2017 — Mardi Gras and how to be deposed

Adam grew up in New Orleans and it was clear from his talk that the gravity of MG still pulled on him. 

Adam reviewed the history of the yearly celebration and highlighted the fascinating tradition of the social orgs that fuel the celebration – the krewes

The thing that stuck with me a week later as I reflect on Adam’s talk: Mardi Gras is different things to different people. For wild, party-seeking spring breakers, it’s one thing. And for families that march as high school band members, and for community leaders (in and far away from the French Quarter) that network by shaking ten thousand hands, it’s another thing.

Hard Right Turn — It’s a two for one talk today!!

Adam also used his experience as a practicing litigator to instruct us on how to behave when being deposed. Fascinating!! Adam shared his guidelines — something like,  1/ tell the truth  2/ take your time when responding to the question  3/ only respond to the question by being focused in your response

We watched three entertaining and engrossing depositions — Joe Jamail, Lil Wayne (oh, I wish he had a library rhyme. please, please, please toss us a bone Lil wayne!!), Donald Trump — and enjoyed king cake and coffee!

Galen Charlton: On triviality

planet code4lib - Sun, 2017-03-05 15:59

Just now I read a blog post by a programmer whose premise was that it would be “almost trivial” to do something — and proceeded to roll my eyes.

However, it then occurred to me to interrogate my reaction a little. Why u so cranky, Galen?

On the one hand, the technical task in question, while certainly not trivial in the sense that it would take an inexperienced programmer just a couple minutes to come up with a solution, is in fact straightforward enough. Writing new software to do the task would require no complex math, or even any math beyond arithmetic. It could reasonably be done in a variety of commonly known languages, and there are several open source projects in the problem space that could be used to either build on or crib from. There are quite a few potential users of the new software, many of whom could contribute code and testing, and the use cases are generally well understood.

On the other hand (and one of the reasons why I rolled my eyes), the relative ease of writing the software masks, if not the complexity of implementing it, the effort that would be required to do so. The problem domain would not be well served by a thrown-over-the-wall solution; it would take continual work to ensure that configurations would continue to work and that (more importantly) the software would be as invisible as possible to end users. Sure, the problem domain is in crying need of a competitor to the current bad-but-good-enough tool, but new software is only the beginning.

Why? Some things that are not trivial, even if the coding is:

  • Documentation, particularly on how to switch from BadButGoodEnough.
  • Community-building, with all the emotional labor entailed therein.

On the gripping hand: I nonetheless can’t completely dismiss appeals to triviality. Yes, calling something trivial can overlook the non-coding work required to make good software actually succeed. It can sometimes hide a lack of understanding of the problem domain; it can also set the coder against the user when the user points out complications that would interfere with ease of coding. The phrase “trivial problem” can also be a great way to ratchet up folks’ imposter syndrome.

But, perhaps, it can also encourage somebody to take up the work: if a problem is trivial, maybe I can tackle it. Maybe you can too. Maybe coming up with an alternative to BadButGoodEnoughProgram is within reach.

How can we better talk about such problems — to encourage folks to both acknowledge that often the code is only the beginning, while not loading folks down with so many caveats and considerations that only the more privileged among us feel empowered to make the attempt to tackle the problem?

Cynthia Ng: Yet Another Tips Post on Job Applications

planet code4lib - Sat, 2017-03-04 20:18
There is so much literature out there already on how to write job applications (namely cover letters and resumes) that I wasn’t sure I was going to write this post, but based on the job applications that I was looking over, I’m almost amazed at how many glaring errors people still make. Applying for jobs …

Hydra Project: The Digital Repository of Ireland joins Hydra

planet code4lib - Sat, 2017-03-04 14:11

We are delighted to announce Hydra’s latest formal Partner, the Digital Repository of Ireland (DRI).  DRI is Ireland’s national trusted digital repository for Humanities, Social Sciences and Cultural data. The repository links together and preserves both historical and contemporary data held by Irish institutions, providing a central internet access point and interactive multimedia tools. As a national e-infrastructure for the future of education and research in the humanities and social sciences, DRI is available for use by the public, students and scholars.

DRI Director, Natalie Harrower, said: “While DRI has been involved and welcomed in the HydraSphere for many years now, we feel that the time is right to ‘step up’ and become Hydra Partners. We look forward to contributing more to the community and to helping Hydra grow in Europe and further afield.”

DRI became a partner at the end of 2016 but, for some reason, whilst we announced the fact on our mailing lists we failed to do so on this blog.  Our sincere apologies to the team at DRI for the omission.

Jonathan Rochkind: idea i won’t be doing anytime soon of the week: replace EZProxy with nginx

planet code4lib - Sat, 2017-03-04 03:03

I think it would be possible, maybe even trivial, to replace EZProxy with nginx, writing code in lua using OpenResty.  I don’t know enough nginx (or lua!) to be sure, but some brief googling suggests to me the tools are there, and it wouldn’t be too hard. (It would be too hard for me in C, I don’t know C at all. I don’t know lua either, but it’s a “higher level” language without some of the C pitfalls). nginx is supposed to be really good at proxying, right? That was its original main use case, although as a ‘reverse proxy’; I don’t know of any reason it wouldn’t work well as a forward proxy too, handling the EZProxy-style load of lots and lots of sessions to lots and lots of sites.  Maybe. nginx is a workhorse.

You could probably even make it API-compatible with EZProxy on both the client and config file ends.

The main motivation of this is not to get something for ‘free’, EZProxy’s price is relatively reasonable. But EZProxy is very frustrating in some ways and lacks the configurability, dynamic extension points, precision of logging and rate limiting, etc., that many often want. And EZProxy development seems basically… well, in indefinite maintenance mode.  So the point would be to get a living application again that evolves to meet our actual present sophisticated needs.

I def won’t be working on this any time soon, but it sure would be a neat project. My present employer is more of a museum/archive/cultural institution, without the use cases of an academic or public library, so we don’t have much EZProxy need.

(As an aside, it is an enduring frustration and disappointment to me that the library community doesn’t seem much interested in putting developer ‘innovation’ resources towards, well, serving the research, teaching, and learning needs of scholars and students, which I thought was actually the primary mission of these institutions. The vast majority of institutional support for innovative development is just in the archival/institutional repository domain. Which is also important, but generally not to the day-to-day work of (eg) academic library patrons…. for some reason, actually improving services to patrons isn’t a priority for most administrators/institutions, I’m not really sure why).

Filed under: General

David Rosenthal: The Amnesiac Civilization: Part 1

planet code4lib - Fri, 2017-03-03 21:00
Those who cannot remember the past are condemned to repeat it
George Santayana: Life of Reason, Reason in Common Sense (1905)

Who controls the past controls the future. Who controls the present controls the past.
George Orwell: Nineteen Eighty-Four (1949)

Santayana and Orwell correctly perceived that societies in which the past is obscure or malleable are very convenient for ruling elites and very unpleasant for the rest of us. It is at least arguable that the root cause of the recent inconveniences visited upon ruling elites in countries such as the US and the UK was inadequate history management. Too much of the population correctly remembered a time in which GDP, the stock market and bankers' salaries were lower, but their lives were less stressful and more enjoyable.

Two things have become evident over the past couple of decades:
  • The Web is the medium that records our civilization.
  • The Web is becoming increasingly difficult to collect and preserve in order that the future will remember its past correctly.
This is the first in a series of posts on this issue. I start by predicting that the problem is about to get much, much worse. Future posts will look at the technical and business aspects of current and future Web archiving. This post is shorter than usual to focus attention on what I believe is an important message.

In a 2014 post entitled The Half-Empty Archive I wrote, almost as a throw-away:
The W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it.

The link was to a post by Cory Doctorow in which he wrote:
We are Huxleying ourselves into the full Orwell.

He clearly understood some aspects of the problem caused by DRM on the Web:
Everyone in the browser world is convinced that not supporting Netflix will lead to total marginalization, and Netflix demands that computers be designed to keep secrets from, and disobey, their owners (so that you can’t save streams to disk in the clear).

Two recent developments got me thinking about this more deeply, and I realized that neither I nor, I believe, Doctorow comprehended the scale of the looming disaster. It isn't just about video and the security of your browser, important as those are. Here it is in as small a nutshell as I can devise.

Almost all the Web content that encodes our history is supported by one or both of two business models: subscription, or advertising. Currently, neither model works well. Web DRM will be perceived as the answer to both. Subscription content, not just video but newspapers and academic journals, will be DRM-ed to force readers to subscribe. Advertisers will insist that the sites they support DRM their content to prevent readers running ad-blockers. DRM-ed content cannot be archived.

Imagine a world in which archives contain no subscription and no advertiser-supported content of any kind.

LITA: Meet the Book Pirates

planet code4lib - Fri, 2017-03-03 16:09

They’re not interested in stealing, nor are they based at sea. They’re not drinking lots of rum (that I know of), and there’s not a parrot in sight.

No, these pirates apply their swashbuckling spirit to promoting children’s literacy. The Book Pirates, or Buecherpiraten, as they’re called in their native German, is a charitable organization based out of Luebeck in northern Germany. They use the combined powers of digital publishing and self publishing to empower children and young people ages 3 to 19 to tell their own story, in their own mother tongue.

Anyone is welcome to create their own original picture book, with original artwork and story, and the Book Pirates will publish it on their website. Then, anyone can download the book for free by selecting their first language, second language, and preferred format, currently offering smartphone/tablet, ring-bound book, and regular book formats. Some translations even have audiobook options available.

The Book Pirates launched the 1001 Languages project in April 2016 to expand their language offerings beyond their six base languages, for which they have professional translators (Arabic, English, French, Mandarin, Russian, and Spanish), and their own language, German, to encompass as many languages as possible: Serbian, Finnish, Ukrainian, Frisian, Farsi, Dari, Siswati, Danish, and Tigrinya, to name a few.

These translations are sourced from native speakers who volunteer to translate and proofread. Each translation is confirmed and proofread by a second native speaker to ensure accuracy.


While the Book Pirates and their partners host workshops to help the picture book making process, submissions are welcome from any group of children and young people. The project has found particular success among refugee populations. Young refugees enjoy the book making process, and the end result can be read by their parents in their mother tongue. Additionally, that same story can be read independently by the child in their new language, which assists in the language acquisition process.

The group’s home base is a children’s book center in Luebeck, where visitors can curl up with a good book in their courtyard, on a hammock, or amongst their colorful and craft-friendly rooms. The center spearheads and hosts numerous other children’s literacy initiatives. Book Pirates could make a great program for libraries with young writers, especially because the digital publishing aspect educates them on the technical aspects involved. To learn more,  to submit picture books from your children’s group, and to get involved as a translator, visit their website, which is available in all of their base languages.

David Rosenthal: Notes from FAST17

planet code4lib - Fri, 2017-03-03 16:00
As usual, I attended Usenix's File and Storage Technologies conference. Below the fold, my comments on the presentations I found interesting.

Kimberly Keeton's keynote, Memory-Driven Computing, reported on HPE's experience in their "Machine" program of investigating the impact of Storage Class Memory (SCM) technology on system architecture. HPE has built, and open-sourced, a number of components of a future memory-based architecture. They have also run simulations on a 240-core 12TB DRAM machine, which show the potential for very large performance improvements in a broad range of applications processing large datasets. It is clearly important to research these future impacts before they hit.

The environment they have built so far does show the potential performance gains from a load/store interface to persistent storage. It is, however, worth noting that DRAM is faster than SCMs, so the gains are somewhat exaggerated. But Keeton also described the "challenges" that result from the fact that a load/store interface doesn't provide a lot of services that we have come to depend upon from persistent storage, including access control, resilience to failures, encryption and naming. The challenges basically reduce to the fact that the potential performance comes from eliminating the file system and I/O stack. But replacing the necessary services that the file system and I/O stack provide will involve creating a service stack under the load/store interface, which will have hardware and performance costs. It will also be very hard, since the finer granularity of the interface obscures information that the stack needs.
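
To make the contrast between the two programming models concrete, here is a small Python sketch. It is only an analogy and has nothing to do with HPE's actual software: it uses `mmap` on an ordinary temporary file, so every byte still passes through the file system and page cache, and the file name and offsets are arbitrary choices for illustration.

```python
import mmap
import os
import tempfile

# Create a 4 KiB file to stand in for a region of persistent storage.
path = os.path.join(tempfile.mkdtemp(), "store.bin")
with open(path, "wb") as f:
    f.write(bytes(4096))

# I/O-stack style: each update is an explicit seek and write call.
with open(path, "r+b") as f:
    f.seek(128)
    f.write(b"via write()")

# Load/store style: map the file and update it like ordinary memory.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mem:
        mem[256:265] = b"via store"
        mem.flush()  # durability still has to be requested explicitly

with open(path, "rb") as f:
    data = f.read()
print(data[128:139], data[256:265])
```

Even in this toy version, naming and access control come from `open(path, ...)`, that is, from the file system. Those are among the services that would have to be rebuilt underneath a pure load/store interface.
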

Last year's Test of Time Award went to a NetApp paper from 2004. This year's award went to a NetApp paper from 2002. Not content with that, one of the two Best Paper awards went to Algorithms and Data Structures for Efficient Free Space Reclamation in WAFL by Ram Kesavan et al from NetApp and UW Madison. This was an excellent paper describing how the fix they developed for a serious scaling problem that impacted emerging workloads for NetApp filers created two problems. First, it slightly degraded performance on legacy workloads, i.e. for the majority of their customers. Second, and much worse, in certain very rare cases it caused the filer to go catatonic for about 30 seconds. The story of how they came up with a second solution that fixed both is interesting.

A big theme was how to deal with devices, such as Shingled Magnetic Recording (SMR) hard drives and flash SSDs, that do expensive things like garbage collection autonomously. Murphy's Law suggests that they will do them at inconvenient times.

Now that SMR disks are widely available, the technology rated an entire session. Abutalib Aghayev and co-authors from CMU, Google and Northeastern presented Evolving Ext4 for Shingled Disks. They report:
For non-sequential workloads, [drive-managed SMR] disks show bimodal behavior: After a short period of high throughput they enter a continuous period of low throughput.

We introduce ext4-lazy, a small change to the Linux ext4 file system that significantly improves the throughput in both modes. We present benchmarks on four different drive-managed SMR disks from two vendors, showing that ext4-lazy achieves 1.7-5.4x improvement over ext4 on a metadata-light file server benchmark. On metadata-heavy benchmarks it achieves 2-13x improvement over ext4 on drive-managed SMR disks as well as on conventional disks.

This work is impressive; a relatively small change caused a really significant performance improvement. We use Seagate's 8TB drive-managed SMR disks in our low-cost LOCKSS box prototype. We need to investigate whether our workload triggers this behavior and, if so, try their fix.

Among the many papers on flash, one entire session was devoted to Open-Channel SSDs, an alternate architecture for flash storage providing a lower-level interface that allows the Flash Translation Layer to be implemented on the host. Open-Channel SSD hardware is now available, and Linux is in the process of releasing a subsystem that uses it. LightNVM: The Linux Open-Channel SSD Subsystem by Matias Bjørling and co-authors from CNEX Labs, Inc. and IT University of Copenhagen described the subsystem and its multiple levels in detail. From their abstract:
We present our experience building LightNVM, the Linux Open-Channel SSD subsystem. We introduce a new Physical Page Address I/O interface that exposes SSD parallelism and storage media characteristics. LightNVM integrates into traditional storage stacks, while also enabling storage engines to take advantage of the new I/O interface. Our experimental results demonstrate that LightNVM has modest host overhead, that it can be tuned to limit read latency variability and that it can be customized to achieve predictable I/O latencies.

Other papers in the session described how wear-leveling of Open-Channel SSDs can be handled in this subsystem, and how it can be used to optimize the media for key-value caches.

It is frequently claimed that modern file systems have engineered away the problem of fragmentation and the resulting performance degradation as the file system ages, at least in file systems that aren't nearly full. File Systems Fated for Senescence? Nonsense, Says Science! by Alex Conway et al from Rutgers, UNC, Stony Brook, MIT and Farmingdale State College demonstrated that these claims aren't true. They showed that repeated git pull invocations on a range of modern file systems caused throughput to decrease by factors of 2 to 30 despite the fact that they were never more than 6% full:
Traditional file systems employ heuristics, such as collocating related files and data blocks, to avoid aging, and many file system implementors treat aging as a solved problem. ... However, this paper describes realistic as well as synthetic workloads that can cause these heuristics to fail, inducing large performance declines due to aging. ... BetrFS, a file system based on write-optimized dictionaries, exhibits almost no aging in our experiments. ... We present a framework for understanding and predicting aging, and identify the key features of BetrFS that avoid aging.

Bharath Kumar Reddy Vangoor and co-authors from IBM Almaden and Stony Brook looked at the performance characteristics of a component of almost every approach to emulation for digital preservation in To FUSE or Not to FUSE: Performance of User-Space File Systems. From their abstract:
Nowadays, user-space file systems are often used to prototype and evaluate new approaches to file system design. Low performance is considered the main disadvantage of user-space file systems but the extent of this problem has never been explored systematically. ... In this paper we analyze the design and implementation of the most widely known user-space file system framework—FUSE—and characterize its performance for a wide range of workloads. ... Our experiments indicate that depending on the workload and hardware used, performance degradation caused by FUSE can be completely imperceptible or as high as –83% even when optimized; and relative CPU utilization can increase by 31%.

I haven't heard reports of FUSE causing performance problems in emulation systems, but this issue is something to keep an eye on as usage of these systems increases.

I really liked High Performance Metadata Integrity Protection in the WAFL Copy-on-Write File System by Harendra Kumar and co-authors from UW Madison and NetApp. If you're NetApp and have about 250K boxes in production at customers, all the common bugs have been found. The ones that haven't yet been found happen very rarely, but they can have a big impact not just on the customers, whose systems are mission-critical, but also within the company, because they are very hard to diagnose and replicate.

The paper described three defensive techniques that catch malign attempts to change metadata and panic the system before the metadata is corrupted. After this kind of panic there is no need for a lengthy fsck-like integrity check of the file system, so recovery is fast. Deployment of the two low-overhead defenses reduced the incidence of recoveries by a factor of three! Over the last five years 83 systems have been protected from 17 distinct bugs. This implies that the annual probability of one of these bugs occurring in a system is about 0.007%, which gives you some idea of how rare they are.
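
The 0.007% figure can be checked with a back-of-the-envelope calculation. The sketch below assumes the installed base stayed at roughly 250K systems for all five years, which is a simplification and not a number taken from the paper.

```python
# Rough check of the annual per-system probability of hitting one of these bugs.
# Assumption: ~250K systems in production for the whole five-year period.
protected_systems = 83      # systems protected over the period
years = 5
installed_base = 250_000    # approximate number of boxes at customers

system_years = installed_base * years
annual_probability = protected_systems / system_years

print(f"{annual_probability:.4%}")  # prints 0.0066%
```

That is about 0.0066%, which rounds to the 0.007% quoted above.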

vNFS: Maximizing NFS Performance with Compounds and Vectorized I/O by Ming Chen and co-authors from Stony Brook University, IBM Research-Almaden and Ward Melville High School (!) described a client-side library that exports a vectorized I/O interface, allowing multiple operations on multiple files to be aggregated. The NFS 4.1 protocol supports aggregated (compound) operations, but they aren't much used because the POSIX I/O interface doesn't support aggregation. The result is that operations are serialized, and each operation incurs a network round trip. Applications that were ported to use the aggregating library showed really large performance improvements, up to two orders of magnitude, running against file systems exported by unmodified NFS 4.1 servers.
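
To see why aggregation matters so much, here is a toy latency model. It is only an illustration: the function names and all the numbers are hypothetical, it is not the vNFS API, and it ignores server processing and data transfer time, so it shows the best case.

```python
# Toy model: network round trips for serialized operations vs. compounds.
# All names and numbers are hypothetical; only round-trip time is counted.

def serialized_latency(num_ops: int, rtt_ms: float) -> float:
    """Each operation waits for its own network round trip."""
    return num_ops * rtt_ms

def compound_latency(num_ops: int, rtt_ms: float, ops_per_compound: int) -> float:
    """Operations are batched; each batch costs a single round trip."""
    batches = -(-num_ops // ops_per_compound)  # ceiling division
    return batches * rtt_ms

ops, rtt = 1000, 0.5
print(serialized_latency(ops, rtt))      # 500.0
print(compound_latency(ops, rtt, 100))   # 5.0
```

With 100 operations per compound the model gives a 100x reduction in time spent waiting on the network, which is the same order as the best improvements reported.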

The Work In Progress session featured some interesting talks:
  • On Fault Resilience of File System Checkers by Om Rameshwar Gatla and Mai Zheng of New Mexico State described fault injection in file system checkers. It was closely related to the paper by Ganesan et al on fault injection in distributed storage systems, so I discussed it in my post on that paper.
  • 6Stor: A Scalable and IPv6-centric Distributed Object Storage System by Ruty Guillaume et al discussed the idea of object storage using the huge address space of IPv6 to assign an IPv6 address to both each object's metadata and its data. At first sight, this idea seems nuts, but it sort of grows on you. Locating objects in distributed storage requires a process that is somewhat analogous to the network routing that is happening underneath.
  • Enhancing Lifetime and Performance of Non-Volatile Memories through Eliminating Duplicate Writes by Pengfei Zuo et al from Huazhong University of Science and Technology and Arizona State University made the interesting point that, because non-volatile memories like flash and SCMs retain data when power is removed, they need encryption just as hard disks do. This causes write amplification - modifying part of the plaintext causes the entire ciphertext to change.
  • A Simple Cache Prefetching Layer Based on Block Correlation by Juncheng Yang et al from Emory University showed how using temporal correlations among blocks to drive pre-fetching could improve cache hit rates.
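
The last idea, correlation-driven prefetching, is easy to sketch. The code below is not the authors' algorithm; it is a minimal illustration that assumes an LRU cache, learns which block most often follows each block, and prefetches that one block. The trace and cache size are hypothetical.

```python
from collections import Counter, OrderedDict, defaultdict

def hit_rate(trace, cache_size, prefetch):
    """Replay a block trace through an LRU cache and return the hit rate."""
    cache = OrderedDict()               # block -> None, kept in LRU order
    successors = defaultdict(Counter)   # block -> counts of the block seen next
    hits, prev = 0, None

    def touch(block):
        cache[block] = None
        cache.move_to_end(block)        # mark as most recently used
        while len(cache) > cache_size:
            cache.popitem(last=False)   # evict the least recently used block

    for block in trace:
        if block in cache:
            hits += 1
        touch(block)
        if prev is not None:
            successors[prev][block] += 1    # learn the temporal correlation
        if prefetch and successors[block]:
            likely_next, _ = successors[block].most_common(1)[0]
            touch(likely_next)              # bring the predicted block in early
        prev = block
    return hits / len(trace)

# Hypothetical trace: a loop over 8 blocks, twice the size of the 4-block cache.
trace = list(range(8)) * 50
print(hit_rate(trace, 4, prefetch=False))
print(hit_rate(trace, 4, prefetch=True))
```

A looping trace larger than the cache is the worst case for plain LRU, since every access misses. Once the successor table has seen one pass, the prefetching variant hits on nearly every access.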

OCLC Dev Network: DEVCONNECT Keynote Speaker Announced

planet code4lib - Fri, 2017-03-03 16:00

OCLC is pleased to announce Jennifer Vinopal as the DEVCONNECT opening keynote speaker. Jennifer’s address will focus on how IT projects advance organizational change.

