Feed aggregator

Hydra Project: Connect 2017 dates confirmed

planet code4lib - Sun, 2016-12-18 09:38

We are pleased to be able to confirm the dates for Hydra Connect 2017.

HC2017 will take place from Monday 11th to Thursday 14th September hosted by Northwestern University.  Please reserve the dates in your calendar.  Connect 2016 saw some 260 people from almost 90 institutions meet in Boston for a very successful event; help make Connect 2017 even bigger and better!

Mark E. Phillips: Removing leading or trailing white rows from images

planet code4lib - Sat, 2016-12-17 15:09

At the library we are working on a project to digitize television news scripts from KXAS, the NBC affiliate from Fort Worth. These scripts were read on the air during the broadcast and are a great entry point into a vast collection of film and tape that is housed at the UNT Libraries.

To date we’ve digitized and made available over 13,000 of these scripts.

In looking at workflows we noticed that sometimes the scanners and scanning software would leave several rows of white pixels at the leading or trailing end of the image.

It is kind of hard to see that because this page has a white background so I’ll include a closeup for you.  I put a black border around the image to help the white stand out a bit.

Detail of leading white edge

One problem with these white rows is that they happen some of the time but not all of the time. Another problem is that the number of white lines isn’t uniform; it varies from image to image when it occurs. The final problem is that it is not consistently at the top or at the bottom of the image. It could be at the top, the bottom, or both.

Probably the best solution to this problem is going to be getting different control software for the scanners that we are using. But that won’t help the tens of thousands of these images that we have already scanned and that we need to process.

Trimming white line

Manual

There are a number of ways that we can approach this task.  First we can do what we are currently doing which is to have our imaging students open each image and manually crop them if needed.  This is very time consuming.


There is a tool in Photoshop that can sometimes be useful for this kind of work. It is the “Trim” tool. Here is the dialog box you get when you select this tool.

Photoshop Trim Dialog Box

This allows you to select if you want to remove from the top or bottom (or left or right). The tool wants you to select a place on the image to grab a color sample and then it will try to trim off rows of the image that match that color.

Unfortunately this wasn’t an ideal solution because you still had to know if you needed to crop from the top or bottom.


Imagemagick tools have an option called “trim” that does a very similar thing to the Photoshop Trim tool.  It is well described on this page.

By default the trim option here will remove edges around the whole image that match a pixel value.  You are able to adjust the specificity of the pixel color if you add a little blur but it isn’t an ideal solution either.

A little Python

My next thing to look at was to use a bit of Python to identify the number of rows in an image that are white.

With this script you feed it an image filename and it will return the number of rows from the top of the image that are at least 90% white.

The script will convert the incoming image into a grayscale image, and then line by line count the number of pixels that have a pixel value greater than 225 (so a little white all the way to white white).  It will then count a line as “white” if more than 90% of the pixels on that line have a value of greater than 225.

Once the script reaches a row that isn’t white, it ends and returns the number of white lines it has found.  If the first row of the image is not a white row it will immediately return with a value of 0.
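The script itself is not reproduced here, so the following is only a minimal sketch of the logic described above, not the original code. The function names are hypothetical, and the loader assumes the third-party Pillow library is installed; the counting function works on any sequence of grayscale rows.

```python
def count_leading_white_rows(rows, threshold=225, fraction=0.9):
    """Count consecutive rows, starting at the first row, that are mostly white.

    rows: iterable of rows, each a sequence of 0-255 grayscale values.
    A row counts as white when more than `fraction` of its pixels
    have a value greater than `threshold`.
    """
    count = 0
    for row in rows:
        white = sum(1 for value in row if value > threshold)
        # Stop at the first row that is not white (or is empty).
        if len(row) == 0 or white <= fraction * len(row):
            break
        count += 1
    return count


def load_gray_rows(path):
    """Hypothetical loader: read an image file as rows of grayscale bytes.

    Assumes Pillow is available; it is imported lazily so the counter
    above can be used and tested without it.
    """
    from PIL import Image

    with Image.open(path) as img:
        gray = img.convert("L")  # 8-bit grayscale
        width, height = gray.size
        data = gray.tobytes()  # one byte per pixel, row by row
    return [data[y * width:(y + 1) * width] for y in range(height)]
```

With Pillow installed, `count_leading_white_rows(load_gray_rows("scan.tif"))` would give the number of white rows at the top of a (hypothetical) file named scan.tif.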

The next step is to go back to Imagemagick, but this time use the -chop flag to remove the number of rows from the image that the previous script specified.

mogrify -chop 0x15 UNTA_AR0787-010-1959-06-14-07_01.tif

We tell mogrify to chop off the first fifteen rows of the image with the 0x15 value. This is a width-by-height geometry: zero columns and fifteen rows of pixels are removed, starting from the top edge by default.

Here is what the final image looks like without the leading white edge.

Corrected image

In order to count the rows from the bottom you have to adjust the script in one place.  Basically you reverse the order of the rows in the image so  you work from the bottom first.  This allows you to apply the same logic to finding white rows as we did before.

You have to adjust the Imagemagick command as well so that you are chopping the rows from the bottom of the image and not the top.  You do this by specifying -gravity in the command.

mogrify -gravity South -chop 0x15 UNTA_AR0787-010-1959-06-14-07_01.tif

With a little bit of bash scripting these scripts can be used to process a whole folder full of images and instructions can be given to only process images that have rows that need to be removed.
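The batch step is described here as bash scripting; as a rough alternative sketch in Python, the helper below builds the mogrify commands for one image and skips images that need no trimming. The function names and the dry-run default are assumptions made for illustration, and the row counts are assumed to come from a counting script like the one described earlier. It uses South, ImageMagick's compass-style gravity name for the bottom edge.

```python
import subprocess


def build_chop_commands(path, top_rows=0, bottom_rows=0):
    """Return the mogrify argument lists needed to trim one image.

    An empty list means the image has no white rows to remove.
    """
    commands = []
    if top_rows > 0:
        # Default gravity chops from the top edge.
        commands.append(["mogrify", "-chop", f"0x{top_rows}", path])
    if bottom_rows > 0:
        # Gravity South makes -chop remove rows from the bottom edge.
        commands.append(
            ["mogrify", "-gravity", "South", "-chop", f"0x{bottom_rows}", path]
        )
    return commands


def trim_image(path, top_rows=0, bottom_rows=0, dry_run=True):
    """Run (or, with dry_run=True, only return) the commands for one image."""
    commands = build_chop_commands(path, top_rows, bottom_rows)
    for command in commands:
        if not dry_run:
            # mogrify edits the file in place, so work on copies when testing.
            subprocess.run(command, check=True)
    return commands
```

Because both counts are measured on the untouched image, chopping the top first does not change how many rows have to come off the bottom.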

This combination of a small Python script to gather image information and then passing that info on to Imagemagick has been very useful for this project and there are a number of other ways that this same pattern can be used for processing images in a digital library workflow.

If you have questions or comments about this post,  please let me know via Twitter.

Library of Congress: The Signal: New FADGI Guidelines for Embedded Metadata in DPX Files

planet code4lib - Fri, 2016-12-16 17:04

The Federal Agencies Digitization Guidelines Initiative Audio-Visual Working Group is pleased to announce that its new draft publication, Embedding Metadata in Scanned Motion Picture Film Files: Guideline for Federal Agency Use of DPX Files, is available for public comment.

FADGI’s Embedded Metadata Guidelines for DPX Files

The Digital Picture Exchange format typically stores image-only data from scanned motion picture film or born-digital images created by a camera that produces a DPX output. Each DPX file represents a single image or frame in a sequence of a motion picture or video data stream. As a structured raster image format, DPX is intended to carry only picture or imagery data, with corresponding sound carried in a separate format, typically the WAVE format.  In practice, this means that a single digitized motion picture film will consist of a sequence of tens of thousands of individual DPX files, each file corresponding to a frame of scanned film with sequentially numbered file names, as well as a separate audio file for sound data.

Film reel. Photo courtesy of emma.buckley on Flickr. CC BY-ND 2.0.

This document is limited in scope to embedded-metadata guidelines and does not look to define other technical characteristics of what a DPX file might carry, including image tonal settings, aspect ratios, bit depths, color models and resolution.  Recommended capture settings are defined for a variety of source material in the companion FADGI document, Digitizing Motion Picture Film: Exploration of the Issues and Sample SOW.

The new guidelines define FADGI implementations for embedded metadata in DPX file headers, including core fields defined by the SMPTE ST268 family of standards as well as selected fields Strongly Recommended, Recommended or Optional for FADGI use. The non-core fields take advantage of existing header structures as well as define new metadata fields for the User Defined fields to document, among other things, digitization-process history.

For all metadata fields addressed in this guideline, FADGI has definitions, including those for fields that do not have a definition in SMPTE ST268. For all core fields, the FADGI use complies with the SMPTE use. In non-core fields, especially those with no definition from SMPTE, FADGI defines the use. One example is the Creator field, which FADGI proposes to be used for the name of the institution or entity responsible for the creation, maintenance and preservation of this digital item. This use aligns with other FADGI embedded-metadata guidelines for BWF and MXF AS-07 files.

Metadata has wings. Photo courtesy of Gideon Burton on Flickr. CC BY-SA 2.0

FADGI draws inspiration from the EBU R98-1999: Format for the <CodingHistory> Field in Broadcast Wave Format (PDF) document for defining a use for field 76, User Defined Data Header, to summarize data on the digitizing process, including signal-chain specifics and other elements. The Digitization Process History field employs a defined string variable for each parameter of the digitization process: the first line documents the source film reel, the second line contains data on the capture process and the third line of data records information on the storage of the file.

Initiated in spring 2016, this project is led by the FADGI Film Scanning subgroup with active participation from the Smithsonian’s National Museum of African American History and Culture (NMAAHC), the National Archives and Records Administration (NARA), NASA’s Johnson Space Center and the Library of Congress, including Digital Collections and Management Services, the Packard Campus for Audio-Visual Conservation and the American Folklife Center. The fast pace of this project marks a new work ethos for the always-collaborative FADGI, moving from initial concept to draft publication in under eight months. This iterative nature allows the working group to be more agile in its project management and to get stakeholder feedback early in the project.

At the 2016 Association of Moving Image Archivists annual conference in Pittsburgh, FADGI members Criss Kovac from NARA and Blake McDowell from NMAAHC presented a well-received poster (PDF) and handout (PDF) on the project.

FADGI welcomes comments on the draft guidelines through February 28, 2017.

Jonathan Rochkind: Getting full-text links from EDS API

planet code4lib - Thu, 2016-12-15 17:54

The way full-text links are revealed (or at least, um, “signalled”) in the EBSCO EDS API  is… umm… both various and kind of baroque.

I paste below a personal communication from an EBSCO developer containing an overview. I post this as a public service, because this information is very unclear and/or difficult to figure out from actual EBSCO documentation, and would also be very difficult to figure out from observation-based reverse-engineering, because there are so many cases.

Needless to say, it’s also pretty inconvenient to develop clients for, but so it goes.

There are a few places to look for full-text linking in our API.  Here’s an overview:

a.       PDF FULL TEXT: If the record has {Record}.FullText.Links element, and the Link elements inside have a Type element that equals “pdflink”, then that means there is a PDF available for this record on the EBSCOhost platform.  The link to the PDF document does not appear in the SEARCH response, but the presence of a pdflink-type Link should be enough to prompt the display of a PDF Icon.  To get to the PDF document, the moment the user clicks to request the full text link, first call the RETRIEVE method with the relevant identifiers (accession number and database ID from the Header of the record), and you will find a time-bombed link directly to the PDF document in the same place (FullText.Links[i].Url) in the detailed record.  You should not display this link on the screen because if a user walks away and comes back 15 minutes later, the link will no longer work.  Always request the RETRIEVE method when a user wants the PDF document.

b.       HTML FULL TEXT: If the record has {Record}.FullText.Text.Availability element, and that is equal to “1”, then that means the actual content of the article (the full text of it) will be returned in the RETRIEVE method for that item.  You can display this content to the user any way you see fit.  There is embedded HTML in the text for images, internal links, etc.

c.       EBOOK FULL TEXT: If the record has {Record}.FullText.Links element, and the Link elements inside have a Type element that equals “ebook-epub” or “ebook-pdf”, then that means there is an eBook available for this record on the EBSCOhost platform.  The link to the ebook does not appear in the SEARCH response, but the presence of an ebook-type Link should be enough to prompt the display of an eBook Icon.  To get to the ebook document, the moment the user clicks to request the full text link, first call the RETRIEVE method with the relevant identifiers (accession number and database ID from the Header of the record), and you will find a time-bombed link directly to the ebook in the same place (FullText.Links[i].Url) in the detailed record.  You should not display this link on the screen because if a user walks away and comes back 15 minutes later, the link will no longer work.  Always request the RETRIEVE method when a user wants the ebook.

d.       856 FIELD FROM CATALOG RECORDS: I don’t think you are dealing with Catalog records, right?  If not, then ignore this part.  Look in {Record}.Items.  For each Item, check the Group element.  If it equals “URL”, then the Data element will contain a link we found in the 856 Field from their Catalog, along with the link label in the Label element.

e.       EBSCO SMARTLINKS+: These apply if the library subscribes to a journal via EBSCO Journal Service.  They are direct links to the publisher platform, similar to the custom link.  If the record has {Record}.FullText.Links element, and the Link elements inside have a Type element that equals “other”, then that means there is a direct-to-publisher-platform link available for this record.  The link to the PDF document does not appear in the SEARCH response, but the presence of an other-type Link should be enough to prompt the display of a Full Text Icon.  To get to the link, the moment the user clicks to request the full text link, first call the RETRIEVE method with the relevant identifiers (accession number and database ID from the Header of the record), and you will find a time-bombed link directly to the document in the same place (FullText.Links[i].Url) in the detailed record.  You should not display this link on the screen because if a user walks away and comes back 15 minutes later, the link will no longer work.  Always request the RETRIEVE method when a user wants the link.

f.        FULL TEXT CUSTOMLINKS: Look in the {Record}.FullText.CustomLinks element.  For each element there, you’ll find a URL in the Url element, a label in the Text element, and an icon if provided in the Icon element.

g.       Finally, we have NON-FULLTEXT CUSTOMLINKS that point to services like ILLiad, the catalog, or other places that will not end up at full text.  You’ll find these at {Record}.CustomLinks.  For each element there, you’ll find a URL in the Url element, a label in the Text element, and an icon if provided in the Icon element.
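To make cases a through g easier to follow, here is a rough sketch of how a client might classify a single record. It assumes the record has already been parsed into a Python dict whose keys mirror the element names in the overview above; the function name and the returned labels are made up for illustration and are not part of the EDS API.

```python
def full_text_signals(record):
    """List which of cases a-g apply to one parsed EDS record (a dict)."""
    signals = []
    full_text = record.get("FullText") or {}
    link_types = {link.get("Type") for link in full_text.get("Links") or []}

    if "pdflink" in link_types:                      # case a
        signals.append("pdf")
    availability = (full_text.get("Text") or {}).get("Availability")
    if str(availability) == "1":                     # case b
        signals.append("html")
    if link_types & {"ebook-epub", "ebook-pdf"}:     # case c
        signals.append("ebook")
    items = record.get("Items") or []
    if any(item.get("Group") == "URL" for item in items):  # case d
        signals.append("catalog-856")
    if "other" in link_types:                        # case e
        signals.append("smartlinks")
    if full_text.get("CustomLinks"):                 # case f
        signals.append("fulltext-customlink")
    if record.get("CustomLinks"):                    # case g
        signals.append("other-customlink")
    return signals


# Cases whose real URL only appears, time-limited, in the RETRIEVE response.
NEEDS_RETRIEVE_ON_CLICK = {"pdf", "ebook", "smartlinks"}
```

A search-results page could use the returned labels to decide which icons to show, and consult `NEEDS_RETRIEVE_ON_CLICK` to know when to call RETRIEVE at click time instead of rendering a URL.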

One further variation (there are probably more I haven’t discovered yet): for case ‘d’ above, the ... object sometimes has a bare URL in it as described above, but other times has escaped text which, when unescaped, becomes the source for a weird XML node that has the info you need in it. It’s not clear if this varies due to configuration, due to the ILS that EDS is connecting to, or something else. In my case, I inexplicably see it sometimes change from request to request.

District Dispatch: Robert Bocher re-appointed to Universal Service Administrative Company Board of Directors

planet code4lib - Thu, 2016-12-15 17:42

Washington, D.C. – The American Library Association (ALA) applauded Robert (Bob) Bocher’s re-appointment (pdf) by the Federal Communications Commission to fill the library seat on the Board of Directors of the Universal Service Administrative Company (USAC). USAC administers the $3.9 billion E-rate program in addition to the other programs that are part of the Universal Service Fund (USF) on behalf of the FCC. Mr. Bocher was previously endorsed by ALA and appointed in 2015 by the FCC to finish the term of Anne Campbell, who had held the seat for several terms and served on the Executive Committee. Mr. Bocher will begin his first full 3-year term in January 2017.

OITP Fellow and USAC Board Member Bob Bocher

“ALA is pleased Bob will continue to lend his expertise on behalf of libraries during his renewed tenure on the USAC Board,” Alan Inouye, director of ALA’s Office for Information Technology Policy (OITP), said. “Bob has a solid reputation across the library community for his keen knowledge and understanding of the E-rate program from both the applicant and policy perspectives. He is also extremely knowledgeable on related telecommunication and broadband issues. We are confident Bob will be a strong advocate for libraries as USAC continues to refine the implementation of major changes in the E-rate program as a result of the FCC’s major reforms of the program in 2014.”

Mr. Bocher brings a wealth of experience to the USAC Board and a long history of advocating for library technology and broadband capacity issues. He began his work with the E-rate program during its inception for the Wisconsin Department of Public Instruction as its state-wide E-rate support manager. During that time he also was an original member of ALA’s E-rate task force. Bocher continues to serve the library community as a Senior Fellow for OITP and remains in his position on the E-rate task force.

When asked about his immediate and longer term goals for his 3-year term, Mr. Bocher said, “In the immediate short-term, we need to focus our attention and resources on getting decisions made for all the 2016 funding requests. We are almost half-way through the 2016 program year and some libraries (and schools) are still waiting to know if their requests will be funded. Looking over the next several months, we have to focus on making certain the rollout of the Form 471 application window for 2017 will proceed in a smoother manner than what we experienced last year when libraries needed a separate filing window deadline. One of the key aspects of this will be making certain that the library (and school) information in the E-rate Productivity Center (EPC) is accurate and up-to-date.”

For more information about the 19-member USAC Board, including the Schools and Libraries Committee, see

The post Robert Bocher re-appointed to Universal Service Administrative Company Board of Directors appeared first on District Dispatch.

LITA: Jobs in Information Technology: December 15, 2016

planet code4lib - Thu, 2016-12-15 17:28

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

EBSCO Information Services, Library Services Engineer, Ipswich, MA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Equinox Software: Burnham Joins Bibliomation with Equinox Support

planet code4lib - Thu, 2016-12-15 15:34


Duluth, Georgia–December 15, 2016

The partnership between Equinox and Bibliomation continues to grow as Burnham Library was successfully migrated to Evergreen in mid-November.  Burnham is a single branch library located in Bridgewater, Connecticut.  Burnham serves just over 1,100 patrons with over 31,000 holdings.  This migration follows Milford Library’s successful migration in September.

Rogan Hamby, Equinox Data and Project Analyst, worked on this project and remarked, “It was great partnering with Bibliomation on this project and wonderful to see the confidence and enthusiasm that the Burnham staff displayed during their migration.  No one undertakes a migration lightly but they made it seem effortless.  I always love seeing libraries like Burnham join the larger Evergreen community.”

Sandra Neary, Burnham Library Director, added, “I only wish we could have become a part of Bibliomation and Evergreen well before now. This has been such a positive experience in every aspect.  Questions answered in a flash, every detail attended to in such a timely manner.  Evergreen and Bibliomation are a match made in heaven! We are so happy to have made this move and look forward to a long and happy relationship.”

About Equinox Software, Inc.

Equinox was founded by the original developers and designers of the Evergreen ILS. We are wholly devoted to the support and development of open source software in libraries, focusing on Evergreen, Koha, and the FulfILLment ILL system. We wrote over 80% of the Evergreen code base and continue to contribute more new features, bug fixes, and documentation than any other organization. Our team is fanatical about providing exceptional technical support. Over 98% of our support ticket responses are graded as “Excellent” by our customers. At Equinox, we are proud to be librarians. In fact, half of us have our ML(I)S. We understand you because we *are* you. We are Equinox, and we’d like to be awesome for you. For more information on Equinox, please visit

About Bibliomation

Bibliomation is Connecticut’s largest library consortium. Sixty public libraries and seventeen schools share an Evergreen system with centralized cataloging and a shared computer network. Bibliomation is Connecticut’s only open source consortium.  Members enjoy benefits beyond the ILS. Bibliomation offers top-notch cataloging, expert IT support, discounts on technology and library products, and regular trainings and workshops on a variety of topics. Non-members can take advantage of Bibliomation’s services as well. BiblioTech, OverDrive, and a wide range of consulting services are available.  For more information on Bibliomation, please visit

About Evergreen

Evergreen is an award-winning ILS developed with the intent of providing an open source product able to meet the diverse needs of consortia and high transaction public libraries. However, it has proven to be equally successful in smaller installations including special and academic libraries. Today, over 1400 libraries across the US and Canada are using Evergreen including NC Cardinal, SC Lends, and B.C. Sitka.
For more information about Evergreen, including a list of all known Evergreen installations, see

HangingTogether: Gifts for Librarians and Archivists, 2016 edition

planet code4lib - Wed, 2016-12-14 22:19

We asked, you answered, and as usual we got a lot of great suggestions from our readers. Thank you! Crowdsourcing helps to distinguish ours from other, similar gift guides, and we are pleased to keep up our tradition.

We received a few whimsical gift ideas, including a request for a time machine, which would help with (among other things) the ability to chip away at a backlog, to avoid or revisit poor acquisition or accessioning moments, and the opportunity to ask record creators questions that would help with description. Of course an owner’s manual would be the perfect companion stocking stuffer!

On the more practical side, why not prepare to start the new year with a Levenger Circa note-taking/organizing system? These come in different styles, are expandable, and can be customized to fit your needs. (Another reader notes that there is a generic version available through Office Depot.)
Or what about a novelty measuring tape? One reader has a cat measuring tape and has gifted a giraffe measuring tape. I quite like this robot version.

For cold weather, cold stacks, or even workplaces with aging and unpredictable HVAC systems, what about a heated mouse? Be the envy of all your colleagues who are still making do with the fingerless gloves we recommended a few years ago.

Another reader suggests a book-alike cover for a tablet or similar; one example is the BookBook.

Speaking of coffee, what about a double-walled insulated mug? For archivists who won’t admit to preservation colleagues that they indulge–very carefully–in drinking tea/coffee at their desks. Shh!

Library nostalgia is always in style: readers let us know about a library stamp t-shirt and a library card mug, both of which hearken back to the good old days before barcodes and automated circulation (does anyone really miss those times?). If you are looking for more goods in this area, both OutOfPrintClothing and UncommonGoods seem to have a lot of selection, from socks and scarves to tote bags.

North Avenue Candles in Pittsburgh offers a series of “banned books” candles. Each candle is a tribute to a banned book and comes with a side label that explains why the book was banned. Available via Etsy.

And if you need to take a break from your heated mouse, why not curl up by the light of your banned book candle, sipping tea from your library nostalgia mug and read a good book? For example this 1989 classic about the thrill of finding original source materials that was recently translated into English:

1001 |a Farge, Arlette.
24010 |a Goût de l’archive. |l English
24514 |a The allure of the archives / |c Arlette Farge ; translated by Thomas Scott-Railton ; foreword by Natalie Zemon Davis.
264 1 |a New Haven : |b Yale University Press, |c [2013]

Want to read but not into archives?  Looking for something to stir your soul? “How can you go wrong with a laureate? Bob Dylan: The Lyrics 1961-2012 includes the full range of his Homeric talents.”

Finally, something we can all relate to — concerns about money and the budget. One reader had a suggestion for those concerned about threats to public library funding in the United States: make a donation to EveryLibrary and then present some homemade goodies or a bottle of something special with a note saying, “I made a donation in your name to save public libraries.” A direct donation to the library of your choice also works.

Thanks to Rob Jensen, Jill Tatem, Peter K, Rebecca Bryant, Betty Shankle, Anonymous, Roy Tennant, Steve Smith, Amy, Cynthia Van Ness, Kate Bowers, and Mike Furlough for their contributions!

About Merrilee Proffitt


Jonathan Rochkind: Maybe beware of microservices

planet code4lib - Wed, 2016-12-14 20:02

In a comment on my long, um,  diatribe last year about linked data, Eric Hellman suggested “I fear the library world has no way to create a shared technology roadmap that can steer it away from dead ends that at one time were the new shiny,” and I responded “I think there’s something to what you suggest at the end, the slow-moving speed of the library community with regard to technology may mean we’re stuck responding with what seemed to be exciting future trends…. 10+ years ago, regardless of how they’ve worked out since. Perhaps if that slow speed were taken into account, it would mean we should stick to well established mature technologies, not “new shiny” things which we lack the agility to respond to appropriately.”

I was reminded of this recently when running across a blog post about “Microservices”, which I also think were very hyped 5-10 years ago, but lately are approached with a lot more caution in the general software engineering industry, as a result of hard-earned lessons from practice.

Sean Kelly, in Microservices? Please, Don’t does write about some of the potential advantages of microservices, but as you’d expect from the title, mainly focuses on pitfalls engineers have learned through working with microservice architectures. He warns:

When Should You Use Microservices?

“When you’re ready as an engineering organization.”

I’d like to close by going over when it could be the right time to pivot to this approach (or, if you’re starting out, how to know if this is the right way to start).

The single most important step on the path to a solid, workable approach to microservices is simply understanding the domain you’re working in. If you can’t understand it, or if you’re still trying to figure it out, microservices could do more harm than good. However, if you have a deep understanding, then you know where the boundaries are and what the dependencies are, so a microservices approach could be the right move.

Another important thing to have a handle on is your workflows – specifically, how they might relate to the idea of a distributed transaction. If you know the paths each category of request will make through your system and you understand where, how, and why each of those paths might fail, then you could start to build out a distributed model of handling your requests.

Alongside understanding your workflows is monitoring your workflows. Monitoring is a subject greater than just “Microservice VS Monolith,” but it should be something at the core of your engineering efforts. You may need a lot of data at your fingertips about various parts of your systems to understand why one of them is underperforming, or even throwing errors. If you have a solid approach for monitoring the various pieces of your system, you can begin to understand your systems behaviors as you increase its footprint horizontally.

Finally, when you can actually demonstrate value to your engineering organization and the business, then moving to microservices will help you grow, scale, and make money. Although it’s fun to build things and try new ideas out, at the end of the day the most important thing for many companies is their bottom line. If you have to delay putting out a new feature that will make the company revenue because a blog post told you monoliths were “doing it wrong,” you’re going to need to justify that to the business. Sometimes these tradeoffs are worth it. Sometimes they aren’t. Knowing how to pick your battles and spend time on the right technical debt will earn you a lot of credit in the long run.

Now, I think many library and library industry development teams actually are pretty okay at understanding the domain and workflows. With the important caveat that ours tend to end up so complex (needlessly or not), that they can be very difficult to understand, and often change — which is a pretty big caveat, for Kelly’s warning.

But monitoring?  In library/library industry projects?  Years (maybe literally a decade) behind the software industry at large.  Which I think is actually just a pointer to a general lack of engineering capabilities (whether skill or resource based) in libraries (especially) and the library industry (including vendors, to some extent).

Microservices are a complicated architecture. They are something to do not only when there’s a clear benefit you’re going to get from them, but when you have an engineering organization that has the engineering experience, skill, resources, and coordination to pull off sophisticated software engineering feats.

How many library engineering organizations do you think meet that?  How many library engineering organizations can even be called ‘engineering organizations’?

Beware, when people are telling you microservices are the new thing or “the answer”. In the industry at large, people and organizations have been burned by biting off more than they can chew in a microservice-based architecture, even starting with more sophisticated engineering organizations than most libraries or many library sector vendors have.


Cynthia Ng: Notes from Front Line Technology Meetup

planet code4lib - Wed, 2016-12-14 19:55

Big thanks to West Vancouver Memorial Library for hosting their 3rd annual Front Line Technology Meetup. Heard about a lot of cool projects.

West Vancouver Public Library: International Games Day at Your Library (Kevin)
website has lots of great resources
board games, video games, etc. into library
this year’s was Creative Mode in Minecraft
volunteers …

Open Knowledge Foundation: Nobody wants to become an activist!

planet code4lib - Wed, 2016-12-14 13:43
Hacking the Entry Point to Digital Participation

During the recent Ultrahack 2016 tournament in Helsinki (one of the biggest hackathons in Europe for the development of ideas and software), we formed a team called Two Minutes for My City that participated in the #hackthenation track. Our initial idea to improve the visualization of municipal decision processes evolved into a prototype of a mobile app that uses gamification and personalisation to encourage citizens to schedule time to follow decisions made by municipal offices. We got into the Ultrahack finals, with our prototype being one of the three best projects in #hackthenation.

Using municipal open data

Helsinki has been a leader of open data in the Nordic countries and celebrated five years of successful Helsinki Region Infoshare, the open data portal of the city, in May 2016. For querying municipal decisions there is the dedicated OpenAhjo API, which has two interfaces built on it: the city’s own official portal, which is still in beta, and the “rivalling” portal developed by Open Knowledge Finland.

Although the official portal is more polished, it offers a view that is mostly useful for bureaucrats who already know the context of the decisions made and doesn’t provide any means of citizen participation. The independently developed portal is more robust and has implemented user registration and feedback down to the level of individual sections of every document. However, neither of the portals offers a proper citizen-centric user experience for working with the decisions.

This is the setup we started with. Since two members of the team already had experience of visualizing municipal decision processes from another hackathon in Tartu, the initial plan was to develop these visualization ideas further in Helsinki. However, after a night of heated discussions about principles of user experience and civic participation, we ended up taking one more step back toward the common citizen.

And this is the experience we want to share with you, pitching-at-a-hackathon style.

Politics is boring

Although municipal politics is the level of politics closest to citizens, people are not very interested in it. And they have a point, because local politics consists mostly of piles of rather uninteresting documents, meetings, protocols etc. So we don’t really want to force people to change their mind, because they are right. Most of the time, municipal politics is really boring.

But this doesn’t mean that people don’t care about the actual decisions made at a local level. At least some of these decisions relate to things we have direct experience with, be it something from our neighborhood or related to our hobbies, occupation etc. Local issues have a high potential of being the subject of informed opinions compared to global or state level issues.

Although it might not be very glamorous to follow and discuss issues decided at the municipal offices, we still usually have opinions on them if they are presented to us in a meaningful form. And about some things we have strong opinions and we really want to be heard.

The problem is that we want to participate in decision making on our terms. And these are not the terms of the city’s officials. Be aware that they are not the terms of dedicated activists either. But most of us would still be quite okay with dedicating one or two minutes to the city, let’s say once a week. So we built a prototype of a mobile app that lets you do exactly that.

Scheduling participation

Let’s say it’s Monday morning and you are waiting for a bus. Or you are in a traffic jam on Friday evening? Time to give your two minutes for the city? Based on your place of residence, the public services you use and the subject matters you are subscribed to, we present you with the most important issues brought up in the city last week.

With a simple and intuitive user interface you can skim through about three to a dozen issues and give your immediate reaction to each of them. This might be the equivalent of clicking a button like “yes, continue with that”, “this is important, take it seriously”, or giving your two words like “too expensive” or even something more nuanced like “take a break” or “damn hipsters”.

Maybe these are the right words and the rest of the people agree with you. But most of the time you will probably just skim through the issues without making any extra effort. Yet this is already valuable feedback to municipal offices about which things people have opinions on or care about: they will know about the “shit” before it hits the fan and they need to start finding excuses for bad decisions.

Making an impact

For most of us this is just staying informed at a very basic level about things we can relate to. We don’t want to get involved too much, because it’s a lot of hassle and we have our own lives too. But sometimes issues come up that make us willing to take action.

This is a rare moment where the roles of a peaceful citizen and an activist start to overlap. Being an activist is not an easy job and it takes lots of skills to be a successful community activist. This is the moment when lots of us would hesitate. Should I do something? What should I do? Will it have any impact? Since I don’t have all the information, will officials consider me a troublemaker? What will my friends and colleagues think of me if this becomes public?

For that occasion we have listed the most bulletproof ways of taking the issue further. First of all you can start a discussion on the issue of your concern, which may end up as a petition or a public meeting with people with similar concerns, but there are also options to make an information request or poke the official responsible in some other meaningful way. You may not want to waste your time organizing a public protest in front of the city hall when the proper way to solve the issue would be to contact the official responsible and propose your amendment to the issue discussed!

Encouraging activism

The mobile application is not meant to solve all problems of democratic participation at once. We don’t want to force people into becoming dedicated activists if they don’t want to. But we want to make this option available at the exact moment when it’s needed and provide the relevant links and instruction manuals. Besides that, people will also see how other citizens and notorious activists have used these instruments to take things to the next level and when they have been successful.

This way the proposed model of participation on our own terms takes the role of providing civic education. Our simple mobile application is meant to be a gateway to more advanced participation portals, be it portals for freedom of information requests, discussing policy issues on digital forums, starting petitions or organizing public protests.

By hacking the citizens’ motivation to participate in municipal decision-making processes we aim to solve the specific problem of the entry point to digital participation. This is the missing link between our daily lives and politics at the level closest to us.

Improve participation globally

We based our prototype on OpenAhjo API, which provides open data about the decision processes of the city of Helsinki. However, a new version of the decision API is being developed by 6aika, the cooperation body of six cities in the Helsinki region. This means that in the near future our solution could be easily adapted to automatically cover municipal decision making processes from all of these cities.

We are also promoting the standards developed in the Helsinki region in Tallinn and Tartu, the two largest cities in Estonia, so we could scale the use of our application across the Gulf of Finland.

However, the problem of adequate entry point to digital participation tools is a global issue, which all of the e-democracy experiments from Buenos Aires to Tallinn struggle with. So if we hit the spot with our application, we might be able to boost participation experiments all over the world.

We intend to solve the specific problem with entry point to digital participation for good. We do it by doing four things:

  • We break the wall of bureaucracy for citizens
  • We hack the motivation using gamification
  • We personalize the citizens’ perspective to municipal decisions
  • We provide a simple yet meaningful overview of things going on in the city

Our gateway to participation needs to be backed by more advanced portals and dashboards provided by municipalities or civil society organizations. Open data is a prerequisite that enables us to do that in a proper and scalable manner.

We will make a huge impact by combining the most simple and intuitive entry point to municipal issues with the advanced features the activists are experimenting with.

We just need to build and deploy the application with the support of at least one city to start pushing the whole package all around the world.

This is our two cents for making democracy great again. Now it’s your turn!

Hydra Project: Tentative dates for Hydra Connect 2017

planet code4lib - Wed, 2016-12-14 09:25

Just a heads-up to the community that we have tentative dates for Hydra Connect 2017 – it is likely to take place Monday 11th to Thursday 14th September.  You might want to put these dates in your calendar…  Confirmation of dates and venue to follow as soon as we have them.

Karen Coyle: All the (good) books

planet code4lib - Wed, 2016-12-14 00:56
I have just re-read Ray Bradbury's Fahrenheit 451, and it was better than I had remembered. It holds up very well for a book first published in 1953. I was reading it as an example of book worship, as part of my investigation into examples of an irrational love of books. What became clear, however, is that this book does not describe an indiscriminate love, not at all.

I took note of the authors and the individual books that are actually mentioned in Fahrenheit 451. Here they are (hopefully a complete list):

Authors: Dante, Swift, Marcus Aurelius, Shakespeare, Plato, Milton, Sophocles, Thomas Hardy, Ortega y Gasset, Schweitzer, Einstein, Darwin, Gandhi, Gautama Buddha, Confucius, Thomas Love Peacock, Thomas Jefferson, Lincoln, Tom Paine, Machiavelli, Christ, Bertrand Russell.

Books: Little Black Sambo, Uncle Tom's Cabin, the Bible, Walden

I suspect that by the criteria with which Bradbury chose his authors, he himself, merely an author of popular science fiction, would not have made his own list. Of the books, the first two were used to illustrate books that offended.
"Don't step on the toes of the dog lovers, the cat lovers, doctors, lawyers, merchants, chiefs, Mormons, Baptists, Unitarians, second-generation Chinese, Swedes, Italians, Germans, Texans, Brooklynites, Irishmen, people from Oregon or Mexico."
"Colored people don't like Little Black Sambo. Burn it. White people don't feel good about Uncle Tom's Cabin. Burn it. Someone's written a book on tobacco and cancers of the lungs? The cigarette people are weeping? Burn the book. Serenity, Montag."
The other two were examples of books that were being preserved.

Bradbury was a bit of a social curmudgeon, and in terms of books decidedly a traditionalist. He decried the dumbing down of American culture, with digests of books (perhaps prompted by the Reader's Digest brand, which began in 1950), then "digest-digests, digest-digest-digests," then with books being reduced to one or two sentences, and television keeping people occupied but without any perceptible content. (Although he pre-invents a number of recognizable modern technologies, such as earbuds, he fails to anticipate the popularity of writers like George R. R. Martin and other writers of brick-sized tomes.)

Fahrenheit 451 is not a worship of books, but of their role in preserving a certain culture. The "book-people" who each had memorized a book or a chapter hoped to see those become the canon once the new "dark ages" had ended. This was not a preservation of all books but of a small selection of books. That is, of course, exactly what happened in the original dark ages, although the potential corpus then was much smaller: only those texts that had been carefully copied and preserved, and in small numbers, were available for distribution once printing technology became available. Those manuscripts were converted to printed texts, and the light came back on in Europe, albeit with some dark corners un-illuminated where texts had been lost.

Another interesting author on the topic of preservation, but less well-known, is Louis-Sébastien Mercier, writing in 1772 in his utopian novel of the future, Memoirs of the Year Two Thousand Five Hundred.*  In his book he visits the King's Library in that year to find that there is only a small cabinet holding the entire book collection. He asks the librarians whether some great fire had destroyed the books, but they answered instead that it was a conscious selection.
"Nothing leads the mind farther astray than bad books; for the first notions being adopted without attention, the second become precipitate conclusion; and men thus go on from prejudice to prejudice, and from error to error. What remained for us to do, but to rebuild the structure of human knowledge?" (v. 2, p. 5)
The selection criteria eliminated commentaries ("works of envy or ignorance") but kept original works of discovery or philosophy. These people also saw a virtue in abridging works to save the time of the reader. Not all works that we would consider "classics" were retained:
"In the second division, appropriated to the Latin authors, I found Virgil, Pliny, and Titus Livy entire; but they had burned Lucretius, except some poetic passages, because his physics they found false, and his morals dangerous." (v. 2, p. 9)
In this case, books are selectively burned because they are considered inferior, a waste of the reader's time or tending to lead one in a less than moral direction. Although Mercier doesn't say so, he is implying a problem of information overload.

In Bradbury's book the goal was to empty the minds of the population, make them passive, not thinking. Mercier's world was gathering all of the best of human knowledge, perhaps even re-writing it, as Paul Otlet proposed. (More on him in a moment.) Mercier's year 2500 world eliminated all the works of commentary on other works, treating them like unimportant rantings on today's social networks. Bradbury also did not mention secondary sources; he names no authors of history (although we don't know how he thought of Bertrand Russell, as philosopher or also a historian) or works of literary criticism.

Both Bradbury and Mercier would be considered well-read. But we are all like the blind men and the elephant. We all operate based on the information we have. Bradbury and Mercier each had very different minds because they had been informed by what they had read. For the mind it is "you are what you see and read." Mercier could not have named Thoreau and Bradbury did not mention any French philosophers. Had they each saved a segment of the written output of history their choices would have been very different with little overlap, although they both explicitly retain Shakespeare. Their goals, however, run in parallel, and in both cases the goal is to preserve those works that merit preserving so that they can be read now and in the future.  

In another approach to culling the mass of books and other papers, Kurt Vonnegut, in his absurdist manner, addressed the problem as one of information overload:
"In the year Ten Million, according to Koradubian, there would be a tremendous house-cleaning. All records relating to the period between the death of Christ and the year One Million A.D. would be hauled to dumps and burned. This would be done, said Koradubian, because museums and archives would be crowding the living right off the earth. The million-year period to which the burned junk related would be summed up in history books in one sentence, according to Koradubian: Following the death of Jesus Christ, there was a period of readjustment that lasted for approximately one million years." (Sirens of Titan, p. 46)

While one hears often about a passion for books, some disciplines rely on other types of publications, such as journal articles and conference papers. The passion for books rarely includes these except occasionally by mistake, such as the bound journals that were scanned by Google in its wholesale digitization of library shelves, and the aficionados of non-books are generally limited to specific forms, such as comic books. In the late 19th and early 20th century, Belgian Paul Otlet, a fascinating obsessive whose lifetime and interests coincided with those of our own homegrown bibliographic obsessive, Melvil Dewey, began work leading to his creation of what was intended to be a universal bibliography that included both books and journal articles, as well as other publications. Otlet's project was aimed at all knowledge, not just that contained in books, and his organization solicited books and journals from European and North American learned societies, especially those operating in scientific areas. As befits a project with the grandiose goal of cataloging all of the world's information, Otlet named it the Mundaneum. Otlet represents another selection criterion, because his Mundaneum appears to have been limited to academic materials and serious works; at the least, there is no mention of fiction or poetry in what I have read on the topic.

Among Otlet's goals was to pull out information buried in books and bring related bits of information together. He called the result of this a Biblion. This Biblion sounds somewhat related to the abridgments and re-gatherings of information that Mercier describes in his book. It also sounds like what motivated the early encyclopedists. To Otlet, the book format was a barrier, since his goal was not the preservation of the volumes themselves, but was to be a centralized knowledge base.

So now we have a range of book preservation goals, from all the books to all the good books, and then to the useful information in books. Within the latter two we see that each selection represents a fairly limited viewpoint that would result in a loss of a large number of the books and other materials that are held in research libraries today. For those of us in libraries and archives, the need is to optimize quality without being arbitrary, and at the same time to serve a broad intellectual and creative base. We won't be as perfect as Otlet or as strict as the librarians in the year 2500, but hopefully our preservation practices will be more predictable than the individual choices made by Bradbury's "human books."

* In the original French, the title referred to the year 2440 ("L'An 2440, rêve s'il en fut jamais"). I have no idea why it was rounded up to 2500 in the English translation.
Works cited or used
Bradbury, Ray. Fahrenheit 451. New York: Ballantine, 1953

Mercier, Louis-Sébastien. Memoirs of the year two thousand five hundred, London, Printed for G. Robinson, 1772 (HathiTrust copy)

Vonnegut, Kurt. The Sirens of Titan. New York: Dial Press Trade Paperbacks, 2006

 Wright, Alex. Cataloging the World: Paul Otlet and the Birth of the Information Age. New York, NY : Oxford University Press, 2014.

DuraSpace News: Fedora, Museums, and the Web

planet code4lib - Wed, 2016-12-14 00:00

Austin, TX  The annual Museums and the Web Conference (MW2017) will be held April 19-22, 2017 in Cleveland, Ohio featuring advanced research and exemplary applications of digital practice of interest to museums, galleries, libraries, science centers, and archives. Formal and informal networking events are scheduled throughout the event to bring attendees–webmasters, social media managers, educators, curators, librarians, designers, directors, scholars, consultants, programmers, analysts, publishers and developers–together.

David Rosenthal: The Medium-Term Prospects for Long-Term Storage Systems

planet code4lib - Tue, 2016-12-13 16:00
Back in May I posted The Future of Storage, a brief talk written for a DARPA workshop of the same name. The participants were experts in one or another area of storage technology, so the talk left out a lot of background that a more general audience would have needed. Below the fold, I try to cover the same ground but with this background included, which makes for a long post.

This is an enhanced version of a journal article that has been accepted for publication in Library Hi Tech, with images that didn't meet the journal's criteria, and additional material reflecting developments since submission. Storage technology evolution can't be slowed down to the pace of peer review.
What is Long-Term Storage?
Storage Hierarchy (image: Public Domain)
The storage of a computer system is usually described as a hierarchy. Newly created or recently accessed data resides at the top of the hierarchy in relatively small amounts of very fast, very expensive media. As it ages, it migrates down the hierarchy to larger, slower and cheaper media.

Long-term storage implements the base layers of the hierarchy, often called "bulk" or "capacity" storage. Most discussions of storage technology focus on the higher, faster layers, which these days are the territory of all-flash arrays holding transactional databases, search indexes, breaking news pages and so on. The data in those systems is always just a cache. Long-term storage is where old blog posts, cat videos and most research datasets spend their lives.
What Temperature is Your Data?
If everything is working as planned, data in the top layers of the hierarchy will be accessed much more frequently, be "hotter", than data further down. At scale, this effect can be extremely strong.

Muralidhar et al Figure 3
Subramanian Muralidhar and a team from Facebook, USC and Princeton have an OSDI paper, f4: Facebook's Warm BLOB Storage System, describing the warm layer between Facebook's Haystack hot storage layer and their cold storage layers. Section 3 describes the behavior of BLOBs (Binary Large OBjects) of different types in Facebook's storage system. Each type of BLOB contains a single type of immutable binary content, such as photos, videos, documents, etc. The rates for different types of BLOB drop differently, but all 9 types have dropped by 2 orders of magnitude within 8 months, and all but 1 (profile photos) have dropped by an order of magnitude within the first week, as shown in their Figure 3.

The Facebook data make two really strong arguments for hierarchical storage architectures at scale:
  • That significant kinds of data should be moved from expensive, high-performance hot storage to cheaper warm and then cold storage as rapidly as feasible.
  • That the I/O rate that warm storage should be designed to sustain is so different from that of hot storage, at least 2 and often many more orders of magnitude, that attempting to re-use hot storage technology for warm and even worse for cold storage is futile.
The argument that the long-term bulk storage layers will need their own technology is encouraging, because (see below) there isn't going to be enough of the flash media that are taking over the performance layers to hold everything.

But there is a caveat. Typical at-scale systems such as Facebook's do show infrequent access to old data. This used to be true in libraries and archives. But the advent of data mining and other "big data" applications means that increasingly scholars want not to access a few specific items, but instead to ask statistical questions of an entire collection. The implications of this change in access patterns for long-term storage architectures are discussed below.
How Long is the "Medium Term"?
Iain Emsley's talk at PASIG2016 on planning the storage requirements of the 1PB/day Square Kilometer Array mentioned that the data was expected to be used for 50 years. How hard a problem is planning with this long a horizon? Looking back 50 years can provide a clue.

(Image: Public Domain)
In 1966 disk technology was about 10 years old; the IBM 350 RAMAC was introduced in 1956. The state of the art was the IBM 2314. Each removable disk pack stored 29MB on 11 platters with a 310KB/s data transfer rate. That is roughly equivalent to 60MB/rack. Every day, the SKA would have needed to add nearly 17M racks, covering about 10 square kilometers.
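The rack count above can be checked with a few lines of arithmetic. This is only a sketch: it takes the 1PB/day and 60MB/rack figures from the text, assumes decimal units (1PB = 10^15 bytes), and derives the implied floor area per rack from the 10 square kilometer figure rather than from any IBM specification.

```python
# Back-of-the-envelope check of the 1966 rack estimate.
# Figures from the text: 1 PB/day of SKA data, about 60 MB per rack of IBM 2314.
# Assumption: decimal units (1 PB = 10**15 bytes, 1 MB = 10**6 bytes).

PB = 10**15
MB = 10**6

ska_bytes_per_day = 1 * PB
bytes_per_rack = 60 * MB

racks_per_day = ska_bytes_per_day / bytes_per_rack

# Implied floor area per rack if one day's racks cover about 10 square kilometers.
area_m2 = 10 * 1_000_000
m2_per_rack = area_m2 / racks_per_day

print(f"racks per day: {racks_per_day / 1e6:.1f}M")
print(f"implied floor area per rack: {m2_per_rack:.2f} m^2")
```

The result is a little under 17 million racks per day, which is consistent with the "nearly 17M" in the text.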

R. M. Fano's 1967 paper The Computer Utility and the Community reports that for MIT's IBM 7094-based CTSS:
"the cost of storing in the disk file the equivalent of one page of single-spaced typing is approximately 11 cents per month."
It would have been hard to believe a projection that in 2016 it would be more than 7 orders of magnitude cheaper.

(Image by Erik Pitti, CC BY 2.0.)
The state of the art in tape storage was the IBM 2401, the first nine-track tape drive, storing 45MB per tape with a 320KB/s maximum transfer rate. That is roughly equivalent to 45MB/rack of accessible data.

A 1966 data management plan would have been correct in predicting that 50 years later the dominant media would be "disk" and "tape", and that disk's lower latency would carry a higher cost per byte. But it's hard to believe that any more detailed predictions about the technology would be correct. The extraordinary 30-year history of 30-40% annual cost per byte decrease of disk media, their Kryder rate, had yet to start.

Although disk and tape are 60-year-old technologies, a 50-year time horizon may seem too long to be useful. But a 10-year time horizon is definitely too short to be useful. Storage is not just a technology, but also a multi-billion dollar manufacturing industry dominated by a few huge businesses, with long, hard-to-predict lead times.

Seagate 2008 roadmap
Disk technology shows how hard it is to predict lead times. Here is a Seagate roadmap slide from 2008 predicting that the then (and still) current technology, perpendicular magnetic recording (PMR), would be replaced in 2009 by heat-assisted magnetic recording (HAMR), which would in turn be replaced in 2013 by bit-patterned media (BPM).

In 2016, the trade press is reporting that:
"Seagate plans to begin shipping HAMR HDDs next year."
ASTC 2016 roadmap
Here is a recent roadmap from ASTC showing HAMR starting in 2017 and BPM in 2021. So in 8 years HAMR has gone from next year to next year, and BPM has gone from 5 years out to 5 years out. The reason for this real-time schedule slip is that as technologies get closer and closer to the physical limits, the difficulty and above all cost of getting from lab demonstration to shipping in volume increases exponentially.

A recent TrendFocus report suggests that the industry is preparing to slip the new technologies even further:
"The report suggests we could see 14TB PMR drives in 2017 and 18TB SMR drives as early as 2018, with 20TB SMR drives arriving by 2020."
Here, the medium term is loosely defined as the next couple of decades, or 2-3 times the uncertainty in industry projections.
What Is The Basic Problem of Long-Term Storage?
The fundamental problem is not storing bits safely for the long term, it is paying to store bits safely for the long term. With an unlimited budget an unlimited amount of data could be stored arbitrarily reliably indefinitely. But in the real world of limited budgets there is an inevitable tradeoff between storing more data, and storing the data more reliably.

Historically, this tradeoff has not been pressing, because the rate at which the cost per byte of storage dropped (the Kryder rate) was so large that if you could afford to keep some data for a few years, you could afford to keep it "forever". The incremental cost would be negligible. Alas, this is no longer true.

Cost vs. Kryder rate
Here is a graph from a model of the economics of long-term storage I built back in 2012 using data from Backblaze and the San Diego Supercomputer Center. It plots the net present value of all the expenditures incurred in storing a fixed-size dataset for 100 years against the Kryder rate. As you can see, at the 30-40%/yr rates that prevailed until 2010, the cost is low and doesn't depend much on the precise Kryder rate. Below 20%, the cost rises rapidly and depends strongly on the precise Kryder rate.
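The shape of that curve can be illustrated with a much cruder model than the one described above. The sketch below is not the 2012 model. It assumes, hypothetically, that storing the dataset costs a fixed amount in the first year, that this annual cost then falls at the Kryder rate, and that future spending is discounted at a constant interest rate. The function name `endowment` and the 2% discount rate are illustrative choices, not values from the text.

```python
def endowment(kryder_rate, years=100, first_year_cost=1.0, discount_rate=0.02):
    """Net present value of paying to store a fixed-size dataset for `years`.

    Illustrative assumptions only: the cost of a year of storage drops by
    `kryder_rate` every year, and money is discounted at `discount_rate`.
    """
    total = 0.0
    for t in range(years):
        # Cost of storing the dataset during year t, in year-t money.
        cost_in_year_t = first_year_cost * (1.0 - kryder_rate) ** t
        # Discount that cost back to the present.
        total += cost_in_year_t / (1.0 + discount_rate) ** t
    return total


if __name__ == "__main__":
    for pct in (40, 30, 20, 10, 5):
        npv = endowment(pct / 100)
        print(f"Kryder rate {pct:2d}%/yr -> NPV = {npv:5.2f} x first-year cost")
```

Under these assumptions the total stays within a narrow band at 30-40% per year and climbs steeply as the rate falls below 20%, which is the qualitative behavior the graph shows.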

2014 cost/byte projection
As it turned out, we were already well below 20%. Here is a 2014 graph from Preeti Gupta of UC Santa Cruz plotting $/GB against time. The red lines are projections at the industry roadmap's 20% and a less optimistic 10%. It shows three things:
  • The slowing started in 2010, before the floods hit Thailand.
  • Disk storage costs in 2014, two and a half years after the floods, were more than 7 times higher than they would have been had Kryder's Law continued at its usual pace from 2010, as shown by the green line.
  • If the industry projections pan out, as shown by the red lines, by 2020 disk costs per byte will be between 130 and 300 times higher than they would have been had Kryder's Law continued.
The total cost of delivering on a commitment to store a fixed-size dataset for the long term depends strongly on the Kryder rate, especially in the first decade or two. Industry projections of the rate have a history of optimism, and are vulnerable to natural disasters, industry consolidation, and so on. We aren't going to know the cost, and the probability is that it is going to be a lot more expensive than we expect.
How Much Long-Term Storage Do We Need?
Lay people reading the press about storage (a typical example is Lauro Rizatti's recent article in EE Times entitled Digital Data Storage is Undergoing Mind-Boggling Growth) believe two things:
  • per byte, storage media are getting cheaper very rapidly (Kryder's Law), and
  • the demand for storage greatly exceeds the supply.
These two things cannot both be true. If the demand for storage greatly exceeded the supply, the price would rise until supply and demand were in balance.

In 2011 we actually conducted an experiment to show that this is what happens. We nearly halved the supply of disk drives by flooding large parts of Thailand including the parts where disks were manufactured. This flooding didn't change the demand for disks, because these parts of Thailand were not large consumers of disks. What happened? As Preeti Gupta's graph shows, the price of disks immediately nearly doubled, choking off demand to match the available supply, and then fell slowly as supply recovered.

So we have two statements. The first is "per byte, storage media are getting cheaper very rapidly". We can argue about exactly how rapidly, but there are decades of factual data recording the drop in cost per byte of disk and other storage media. So it is reasonable to believe the first statement. Anyone who has been buying computers for a few years can testify to it.

The second is "the demand for storage greatly exceeds the supply". The first statement is true, so this has to be false. Why do people believe it? The evidence for the excess of demand over supply in Rizatti's article is a graph with blue bars labeled "demand" overwhelming orange bars. The orange bars are labeled "output", which appears to represent the total number of bytes of storage media manufactured each year. This number should be fairly accurate, but it overstates the amount of newly created information stored each year for many reasons:
  • Newly manufactured media does not instantly get filled. There are delays in the distribution pipeline - for example I have nearly half a terabyte of unwritten DVD-R media sitting on a shelf. This is likely to be a fairly small percentage.
  • Some media that gets filled turns out to be faulty and gets returned under warranty. This is likely to be a fairly small percentage.
  • Some of the newly manufactured media replaces obsolete media, so isn't available to store newly created information.
  • Because of overhead from file systems and so on, newly created information occupies more bytes of storage than its raw size. This is typically a small percentage.
  • If newly created information does actually get written to a storage medium, several copies of it normally get written. This is likely to be a factor of about two.
  • Some newly created information exists in vast numbers of copies. For example, my iPhone 6 claims to have 64GB of storage. That corresponds to the amount of newly manufactured storage medium (flash) it consumes. But about 8.5GB of that is consumed by a copy of iOS, the same information that consumes 8.5GB in every iPhone 6. Between October 2014 and October 2015 Apple sold 222M iPhones. So those 8.5GB of information are replicated 222M times, consuming about 1.9EB of the storage manufactured in that year.
The mismatch between the blue and orange bars is much greater than it appears.
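
The iPhone arithmetic in the last bullet can be checked directly. This is a minimal sketch, assuming decimal units (1GB = 10^9 bytes, 1EB = 10^18 bytes); the variable names are illustrative and not from the article.

```python
# Figures are taken from the text; decimal units are an assumption.
GB = 10**9
EB = 10**18

ios_copy_bytes = 8.5 * GB       # one copy of iOS on each iPhone 6
iphones_sold = 222_000_000      # October 2014 to October 2015

# Every phone sold carries its own copy of the same information.
replicated_eb = ios_copy_bytes * iphones_sold / EB

print(f"{replicated_eb:.2f} EB of flash holding identical copies of iOS")
```

The result is just under 1.9EB, which matches the figure quoted above.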

What do the blue bars represent? They are labeled "demand" but, as we have seen, the demand for storage depends on the price. There's no price specified for these bars. The caption of the graph says "Source: Recode", which I believe refers to a 2014 article by Rocky Pimentel entitled Stuffed: Why Data Storage Is Hot Again. (Really!). Based on the IDC/EMC Digital Universe report, Pimentel writes:
The total amount of digital data generated in 2013 will come to 3.5 zettabytes (a zettabyte is 1 with 21 zeros after it, and is equivalent to about the storage of one trillion USB keys). The 3.5 zettabytes generated this year will triple the amount of data created in 2010. By 2020, the world will generate 40 zettabytes of data annually, or more than 5,200 gigabytes of data for every person on the planet.

The operative words are "data generated". Not "data stored permanently", nor "bytes of storage consumed". The numbers projected by IDC for "data generated" have always greatly exceeded the numbers actually reported for storage media manufactured in a given year, which in turn, as discussed above, exaggerate the capacity added to the world's storage infrastructure.

The assumption behind "demand exceeds supply" is that every byte of "data generated" in the IDC report is a byte of demand for permanent storage capacity. Even in a world where storage was free there would still be much data generated that was never intended to be stored for any length of time, and would thus not represent demand for storage media.

WD results

In the real world data costs money to store, and much of that money ends up with the storage media companies. This provides another way of looking at the idea that Digital Data Storage is Undergoing Mind-Boggling Growth. What does it mean for an industry to have Mind-Boggling Growth? It means that the companies in the industry have rapidly increasing revenues and, normally, rapidly increasing profits.

Seagate results

The graphs show the results for the two companies that manufacture the bulk of the storage bytes each year. Revenues are flat or decreasing, and profits are decreasing for both companies. These do not look like companies faced with insatiable demand for their products; they look like mature companies facing increasing difficulty in scaling their technology.

For a long time, discussions of storage have been bedevilled by the confusion between IDC's projections for "data generated" and the actual demand for storage media. The actual demand will be much lower, and will depend on the price.

Does Long-Term Storage Need Long-Lived Media?

Every few months there is another press release announcing that some new, quasi-immortal medium such as 5D quartz or stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but a study of the market for them revealed that no-one would pay the relatively small additional cost. The drives currently marketed for "archival" use have a shorter warranty and a shorter MTBF than enterprise drives, so they're not expected to have long service lives.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center racks or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.
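
The compounding behind these figures can be sketched as follows. The sketch assumes the Kryder rate is read as the fraction by which the space (or cost) per byte falls each year; the function name is hypothetical.

```python
def relative_space(kryder_rate: float, years: int) -> float:
    """Fraction of the original space needed to hold the same data after
    `years`, assuming space per byte falls by `kryder_rate` every year."""
    return (1.0 - kryder_rate) ** years


slow = relative_space(0.10, 10)   # 10%/yr for a decade
fast = relative_space(0.30, 10)   # 30%/yr for a decade

print(f"10%/yr: same data in {slow:.2f} of the space")
print(f"30%/yr: {1 / fast:.0f} times as much data in the same space")
```

Under that reading, 10%/yr leaves about 0.35 of the space after a decade, roughly the one third quoted above, and 30%/yr gives a factor of about 35, the same order as the thirty-fold figure.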

The reason why disks are engineered to have a 5-year service life is that, at 30-40% Kryder rates, they were going to be replaced within 5 years simply for economic reasons. But, if Kryder rates are going to be much lower going forward, the incentives to replace drives early will be much less, so a somewhat longer service life would make economic sense for the customer. From the disk vendor's point of view, a longer service life means they would sell fewer drives. Not a reason to make them.

Additional reasons for skepticism include:
  • Our research into the economics of long-term preservation demonstrates the enormous barrier to adoption that accounting techniques pose for media that have high purchase but low running costs, such as these long-lived media.
  • Since the big problem in digital preservation is not keeping bits safe for the long term but paying for keeping bits safe for the long term, an expensive solution to a sub-problem can actually make the overall problem worse, not better.
  • These long-lived media are always off-line media. In most cases, the only way to justify keeping bits for the long haul is to provide access to them (see Blue Ribbon Task Force). The access latency scholars (and general Web users) will tolerate rules out off-line media for at least one copy.
  • Thus at best these media can be off-line backups. But the long access latency for off-line backups has led the backup industry to switch to on-line backup with de-duplication and compression. So even in the backup space long-lived media will be a niche product.
  • Off-line media need a reader. Good luck finding a reader for a niche medium a few decades after it faded from the market - one of the points Jeff Rothenberg got right two decades ago.
Since at least one copy needs to be on-line, and since copyability is an inherent property of being on-line, migrating data to a new medium is not a big element of the total cost of data ownership. Reducing migration cost by extending media service life thus doesn't make a big difference.

Does Long-Term Storage Need Ultra-Reliable Media?

The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. But current media are many orders of magnitude too unreliable for the task ahead, so you can't:
  • Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
  • Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly.
Thus replication, and error detection and recovery, are required features of a long-term storage system regardless of the medium it uses. Even if you could ignore failures, it wouldn't make economic sense. As Brian Wilson, CTO of Backblaze points out, in their long-term storage environment:
Double the reliability is only worth 1/10th of 1 percent cost increase. ... Moral of the story: design for failure and buy the cheapest components you can.

Eric Brewer made the same point in his 2016 FAST keynote. For availability and resilience against disasters Google needs geographic diversity, so they have replicas from which to recover. Spending more to increase media reliability makes no sense; the media are already reliable enough. The systems that surround the drives have been engineered to deliver adequate reliability despite the current unreliability of the drives, thus engineering away the value of more reliable drives.

How Much Replication Do We Need?

Facebook's hot storage layer, Haystack, uses RAID-6 and replicates data across three data centers, using 3.6 times as much storage as the raw data. The next layer down, Facebook's f4, uses two fault-tolerance techniques:
  • Within a data center it uses erasure coding with 10 data blocks and 4 parity blocks. Careful layout of the blocks ensures that the data is resilient to drive, host and rack failures at an effective replication factor of 1.4.
  • Between data centers it uses XOR coding. Each block is paired with a different block in another data center, and the XOR of the two blocks stored in a third. If any one of the three data centers fails, both paired blocks can be restored from the other two.
The result is fault-tolerance to drive, host, rack and data center failures at an effective replication factor of 2.1, reducing overall storage demand from Haystack's factor of 3.6 by nearly 42% for the vast bulk of Facebook's data. Erasure-coding everything except the hot storage layer seems economically essential.
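
The replication factors above, and the way XOR coding recovers a lost block, can be illustrated with a short sketch. The block contents and function name are hypothetical; only the factors come from the text.

```python
# Effective replication factors, using the figures quoted above.
haystack = 3.6                   # RAID-6 plus three data centers
rs_factor = (10 + 4) / 10        # 10 data + 4 parity blocks -> 1.4
xor_factor = 3 / 2               # two blocks plus their XOR, across three sites
f4 = rs_factor * xor_factor      # combined factor within and between sites
saving = 1 - f4 / haystack       # fraction of Haystack's storage avoided

print(f"f4 factor {f4:.1f}, saving {saving:.1%} versus Haystack")


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


block_a = b"block in data center A"
block_b = b"block in data center B"
parity = xor_blocks(block_a, block_b)    # stored in a third data center

# If data center A is lost, its block is rebuilt from the other two.
recovered_a = xor_blocks(block_b, parity)
assert recovered_a == block_a
```

The same call with `block_a` and `parity` rebuilds `block_b`, which is why the loss of any one of the three data centers is survivable.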

The f4 paper also makes a point worth noting about heterogeneity as a way of avoiding correlated failures:
We recently learned about the importance of heterogeneity in the underlying hardware for f4 when a crop of disks started failing at a higher rate than normal. In addition, one of our regions experienced higher than average temperatures that exacerbated the failure rate of the bad disks. This combination of bad disks and high temperatures resulted in an increase from the normal ~1% AFR to an AFR over 60% for a period of weeks. Fortunately, the high-failure-rate disks were constrained to a single cell and there was no data loss because the buddy and XOR blocks were in other cells with lower temperatures that were unaffected.

Current Technology Choices

Fontana 2016 analysis

Robert Fontana of IBM has an excellent overview of the roadmaps for tape, disk, optical and NAND flash (PDF) through the early 2020s. These are the only media technologies currently shipping in volume. Given the long lead times for new storage technologies, no other technology will significantly impact the bulk storage market before then.

Tape

Historically, tape was the medium of choice for long-term storage. Its basic recording technology lags hard disk by many years, so it has a much more credible technology road-map than disk. The reason is that the bits on the tape are much larger. Current hard disks are roughly 1000Gbit/in², while tape is projected to be roughly 50Gbit/in² in 6 years' time.

But tape's importance is fading rapidly. There are several reasons:
  • Tape is a very small market in unit terms:
    Just under 20 million LTO cartridges were sent to customers last year. As a comparison let's note that WD and Seagate combined shipped more than 350 million disk drives in 2015; the tape cartridge market is less than 0.00567 per cent [sic; 20 million is about 5.7 per cent of 350 million] of the disk drive market in unit terms
  • In effect there is now a single media supplier per technology, raising fears of price gouging and supply vulnerability. The disk market has consolidated too, but there are still two very viable suppliers plus another. Hard disk market share is:
    split between the three remaining HDD companies with Western Digital’s market share at 42%, Seagate’s at 37% and Toshiba at 21%.
  • The advent of data-mining and web-based access to archives make the long access latency of tape less tolerable.
  • The robots that, at scale, access the tape cartridges have a limited number of slots. To maximize the value of each slot it is necessary to migrate data to new, higher-capacity cartridges as soon as they appear. This has two effects. First, it makes the long service life of tape media less important. Second, it consumes a substantial fraction of the available bandwidth.
As an off-line medium, tape's cost and performance are determined by the ratio between the number of media (slots), which sets the total capacity of the system at a given cartridge technology, and the number of drives, which sets the access bandwidth. It can appear very inexpensive at a high media/drive ratio, but the potential bandwidth of the drives is likely to be mostly consumed with migrating old cartridges to new, higher-capacity ones. This is an illustration of the capacity vs. bandwidth tradeoffs explored in Steven Hetzler and Tom Coughlin's Touch Rate: A metric for analyzing storage system performance.

Optical

Like tape, optical media (DVD and Blu-ray) are off-line media whose cost and performance are determined by the media/drive ratio in their robots. They have long media life and some other attractive properties that mitigate some threats: immunity from electromagnetic pulse effects, and most are physically write-once.
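
The effect of the media/drive ratio on migration can be sketched with some simple arithmetic. All the numbers below (cartridge count, cartridge capacity, drive throughput) are hypothetical, and the function is an illustration rather than the Touch Rate metric itself.

```python
def migration_days(cartridges: int, cartridge_tb: float,
                   drives: int, drive_mb_per_s: float) -> float:
    """Days needed to read every cartridge once with all drives streaming
    continuously. Ignores load, seek and robot time, so it is a lower bound."""
    total_mb = cartridges * cartridge_tb * 1_000_000    # decimal TB -> MB
    seconds = total_mb / (drives * drive_mb_per_s)
    return seconds / 86_400


# Hypothetical library: the same 1,000 cartridges behind 4 or 40 drives.
for drives in (4, 40):
    days = migration_days(1_000, 6.0, drives, 300.0)
    print(f"{1_000 // drives:>4} cartridges per drive: {days:.1f} days to read them all")
```

With these made-up figures, the library with ten times the media/drive ratio needs ten times as long just to read its contents once, and during that time the drives are unavailable for anything else.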

Recently, Facebook and Panasonic have provided an impressive example of the appropriate and cost-effective use of optical media. The initial response to Facebook's announcement of their prototype Blu-ray cold storage system focused on the 50-year life of the disks, but it turns out that this isn't the interesting part of the story. Facebook's problem is that they have a huge flow of data that is accessed rarely but needs to be kept for the long-term at the lowest possible cost. They need to add bottom tiers to their storage hierarchy to do this.

The first tier they added to the bottom of the hierarchy stored the data on mostly powered-down hard drives. Some time ago a technology called MAID (Massive Array of Idle Drives) was introduced but didn't make it in the market. The idea was that by putting a large cache in front of the disk array, most of the drives could be spun-down to reduce the average power draw. MAID did reduce the average power draw, at the cost of some delay from cache misses, but in practice the proportion of drives that were spun-down wasn't as great as expected so the average power reduction wasn't as much as hoped. And the worst case was about the same as a RAID, because the cache could be thrashed in a way that caused almost all the drives to be powered up.

Facebook's design is different. It is aimed at limiting the worst-case power draw. It exploits the fact that this storage is at the bottom of the storage hierarchy and can tolerate significant access latency. Disks are assigned to groups in equal numbers. One group of disks is spun up at a time in rotation, so the worst-case access latency is the time needed to cycle through all the disk groups. But the worst-case power draw is only that for a single group of disks and enough compute to handle a single group.
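
The trade between worst-case latency and worst-case power in this rotation scheme can be sketched as below. The group count, time slot and per-disk power are hypothetical values chosen only to show the shape of the trade-off.

```python
def worst_case(groups: int, minutes_per_group: float,
               disks_per_group: int, watts_per_disk: float):
    """Return (worst-case wait in minutes, worst-case disk power in watts)
    when exactly one group is spun up at a time, in strict rotation."""
    latency = groups * minutes_per_group        # upper bound: one full cycle
    power = disks_per_group * watts_per_disk    # only one group draws power
    return latency, power


# Hypothetical cold-storage cell: 12 groups of 100 disks, 30 minutes each.
latency, power = worst_case(12, 30.0, 100, 8.0)
all_spinning = 12 * 100 * 8.0                   # power if every disk were spun up

print(f"worst-case wait {latency:.0f} min, power {power:.0f} W "
      f"instead of {all_spinning:.0f} W")
```

Adding groups raises the latency bound linearly but leaves the power bound unchanged, which is what lets the power and cooling be sized for a single group.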

Why is this important? Because of the synergistic effects knowing the maximum power draw enables. The power supplies can be much smaller, and because the access time is not critical, need not be duplicated. Because Facebook builds entire data centers for cold storage, the data center needs much less power and cooling, and doesn't need backup generators. It can be more like cheap warehouse space than expensive data center space. Aggregating these synergistic cost savings at data center scale leads to really significant savings.

Nevertheless, this design has high performance where it matters to Facebook, in write bandwidth. While a group of disks is spun up, any reads queued up for that group are performed. But almost all the I/O operations to this design are writes. Writes are erasure-coded, and the shards all written to different disks in the same group. In this way, while a group is spun up, all disks in the group are writing simultaneously providing huge write bandwidth. When the group is spun down, the disks in the next group take over, and the high write bandwidth is only briefly interrupted.

Next, below this layer of disk cold storage Facebook implemented the Blu-ray cold storage that drew such attention. It has 12 Blu-ray drives for an entire rack of cartridges holding 10,000 100GB Blu-ray disks managed by a robot. When the robot loads a group of 12 fresh Blu-ray disks into the drives, the appropriate amount of data to fill them is read from the currently active hard disk group and written to them. This scheduling of the writes allows for effective use of the limited write capacity of the Blu-ray drives. If the data are ever read, a specific group has to be loaded into the drives, interrupting the flow of writes, but this is a rare occurrence. Once all 10,000 disks in a rack have been written, the disks will be loaded for reads infrequently. Most of the time the entire Petabyte rack will sit there idle.

It is this careful, organized scheduling of the system's activities at data center scale that enables the synergistic cost reductions of cheap power and space. It may be true that the Blu-ray disks have a 50-year lifetime but this isn't what matters. No-one expects the racks to sit in the data center for 50 years; at some point before then they will be obsoleted by some unknown new, much denser and more power-efficient cold storage medium (perhaps DNA).

    Disk

    Exabytes shipped

    It is still the case, as it has been for decades, that the vast majority of bytes of storage shipped each year are hard disk. Until recently, these disks went into many different markets: desktops, laptops, servers, digital video recorders, and storage systems from homes to data centers. Over the last few years flash has been increasingly displacing hard disk from most of these markets, with the sole exception of bulk storage in data centers ("the cloud").

    The shrinking size of the magnetic domains that store data on the platters of hard disks, and the fairly high temperatures found inside operating drives, mean that the technology is approaching the superparamagnetic limit at which the bits become unstable. HAMR is the response, using materials which are more resistant to thermal instability but which therefore require heating before they can be written. The heat is supplied by a laser focused just ahead of the magnetic write head. As we saw above, the difficulty and cost of this technology transition have been consistently under-estimated. The successor to HAMR, BPM, is likely to face even worse difficulties. Disk areal density will continue to improve, but much more slowly than in its pre-2010 heyday. Vendors are attempting to make up for the slow progress in areal density in two ways:
    • Shingling, which means moving the tracks so close together that writing a track partially overwrites the adjacent track. Very sophisticated signal processing allows the partially overwritten data to be read. Shingled drives come in two forms. WD's drives expose the shingling to the host, requiring the host software to be changed to treat them like append-only media. Seagate's drives are device-managed, with on-board software obscuring the effect of shingling, at the cost of greater variance in performance.
    • Helium, which replaces air inside the drive, allowing the heads to fly lower and thus more platters to fit in the same form factor. WD's recently announced 12TB drives have 8 platters. Adding platters adds cost, so does little to increase the Kryder rate.

    WD unit shipments

    Disk manufacturing is a high-volume, low-margin business. The industry was already consolidating before the floods in Thailand wiped out 40% of the world's disk manufacturing capacity. Since this disruption, the industry has consolidated to two-and-a-half suppliers. Western Digital has a bit over 2/5 of the market, and Seagate has a bit under 2/5. Toshiba is a distant third with 1/5, raising doubts about their ability to remain competitive in this volume market.

    Seagate unit shipments

    The effect of flash displacing hard disk from many of its traditional markets can be seen in the unit volumes for the two major manufacturers. Both have seen decreasing total unit shipments for more than 2 years. Economic stress on the industry has increased. Seagate plans a 35% reduction in capacity and 14% layoffs. WDC has announced layoffs. The most recent two quarters have seen a slight recovery:
    Hard disk drives shipments have had several quarters of declining shipments since the most recent high in 2014 and the peak of 653.6 million units in 2010 (before the 2011 Thailand floods). Last quarter and likely this quarter will see significant HDD shipment increases, partly making up for declining shipments in the first half of 2016.

    The spike in demand is for high-end capacity disks and is causing supply chain difficulties for component manufacturers:
    Trendfocus thinks [manufacturers] therefore won't invest heavily to meet spurts of demand. Instead, the firm thinks, suppliers will do their best to juggle disk-makers' demands.

    “This may result in more tight supply situations like this in the future, but ultimately, it is far better off to deal with tight supply conditions than to deal with over-supply and idle capacity issues” says analyst John Kim

    Reducing unit volumes reduces the economies of scale underlying the low cost of disk, slowing disk's Kryder rate, and making disk less competitive with flash. Reduced margins from this pricing pressure reduce the cash available for investment in improving the technology, further reducing the Kryder rate. This looks like the start of a slow death spiral for disk.

    Flash

    Flash as a data storage technology is almost 30 years old. Eli Harari filed the key enabling patent in 1988, describing multi-level cells, wear-leveling and the Flash Translation Layer. Flash has yet to make a significant impact on the lower levels of the storage hierarchy. If flash is to displace disk from these lower levels, massive increases in flash shipments will be needed. There are a number of ways flash manufacturers could increase capacity.

    They could build more flash fabs, but this is extremely expensive. If there aren't going to be a lot of new flash fabs, what else could the manufacturers do to increase shipments from the fabs they have?

    The traditional way of delivering more chip product from the same fab has been to shrink the chip technology. Unfortunately, shrinking the technology from which flash is made has bad effects. The smaller the cells, the less reliable the storage and the fewer times it can be written, as shown by the vertical axis in this table:
    Write endurance vs. cell size

    Both in logic and in flash, the difficulty in shrinking the technology further has led to 3D, stacking layers on top of each other. 64-layer flash is in production, allowing manufacturers to go back to larger cells with better write endurance.

    Flash has another way to increase capacity. It can store more bits in each cell, as shown in the horizontal axis of the table. The behavior of flash cells is analog; the bits are the result of signal-processing in the flash controller. By improving the analog behavior by tweaking the chip-making process, and improving the signal processing in the flash controller, it has been possible to move from 1 (SLC) to 2 (MLC) to 3 (TLC) bits per cell. Because 3D has allowed increased cell size (moving up the table), TLC SSDs are now suitable for enterprise workloads.

    Back in 2009, thanks to their acquisition of M-Systems, SanDisk briefly shipped some 4 (QLC) bits per cell memory (hat tip to Brian Berg). But up to now the practical limit has been 3. As the table shows, storing more bits per cell also reduces the write endurance (and the reliability).
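
The cost of each extra bit per cell can be made concrete with a small sketch. It relies only on the fact that storing n bits requires distinguishing 2^n charge levels; the function names are illustrative.

```python
def levels_needed(bits_per_cell: int) -> int:
    """Number of distinct charge levels the controller must tell apart."""
    return 2 ** bits_per_cell


def relative_capacity(bits_per_cell: int, baseline_bits: int = 3) -> float:
    """Capacity relative to a baseline (TLC by default) for the same cell count."""
    return bits_per_cell / baseline_bits


for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4)]:
    print(f"{name}: {levels_needed(bits):>2} levels, "
          f"{relative_capacity(bits):.2f}x the capacity of TLC")
```

Going from TLC to QLC doubles the number of levels to be distinguished (8 to 16) while adding only one third more capacity, which is one way to see why endurance and reliability fall as bits per cell rise.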

    Lam Research NAND roadmap

    As more and more layers are stacked the difficulty of the process increases, and it is currently expected that 64 layers will be the limit for the next few years. This is despite the normal optimism from the industry roadmaps.

    Beyond that for the near term manufacturers expect to use die-stacking. That involves taking two (or potentially more) complete 64-layer chips and bonding one on top of the other, connecting them by Through Silicon Vias (TSVs). TSVs are holes through the chip substrate containing wires.

    Although adding 3D layers does add processing steps, and thus some cost, it merely lengthens the processing pipeline. It doesn't slow the rate at which wafers can pass through and, because each wafer contains more storage, it increases the fab's output of storage. Die-stacking, on the other hand, doesn't increase the amount of storage per wafer, only per package. It doesn't increase the fab's output of bytes.

    It is only recently that sufficient data has become available to study the reliability of flash at scale in data centers. The behavior observed differs in several important ways from that of hard disks in similar environments with similar workloads. But since the workloads typical of current flash usage (near the top of the hierarchy) are quite unlike those of long-term bulk storage, the relevance of these studies is questionable.

    QLC will not have enough write endurance for conventional SSD applications. So will there be enough demand for manufacturers to produce it, and thus increase their output by a third relative to TLC?

    Cloud systems such as Facebook's use tiered storage architectures in which re-write rates decrease rapidly down the layers. Because most re-writes would be absorbed by higher layers, it is likely that QLC-based SSDs would work well at the bulk storage level despite only a 500 write cycle life. To do so they would likely need quite different software in the flash controllers.

    Flash vs. Disk

    Probably, at some point in the future flash will displace hard disk as the medium for long-term storage. There are two contrasting views as to how long this will take.

    Fontana EB shipped

    First, there is the conventional wisdom, as expressed by the operators of cloud services and the disk industry and supported by the graph showing how few exabytes of flash are shipped in comparison to disk. Note also that total capacity manufactured annually is increasing linearly, not exponentially as many would believe.

    Although flash is displacing disk from markets such as PCs, laptops and servers, Eric Brewer's fascinating keynote at the 2016 FAST conference started from the assertion that in the medium term the only feasible technology for bulk data storage in the cloud was spinning disk.

    NAND vs. HDD capex/TB

    The argument is that flash, despite its many advantages, is and will remain too expensive for the bulk storage layer. The graph of the ratio of capital expenditure per TB of flash and hard disk shows that each exabyte of flash contains about 50 times as much capital as an exabyte of disk. Fontana estimates that last year flash shipped 83EB and hard disk shipped 565EB. For flash to displace hard disk immediately would need 32 new state-of-the-art fabs at around $9B each or nearly $300B in total investment.
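
The scale of that investment follows from the quoted figures. In this sketch the shipment, fab count and fab cost numbers are from the text; the growth factor and the output per fab are derived here and are implications of those figures, not numbers Fontana states.

```python
flash_eb = 83                # flash shipped last year (EB)
disk_eb = 565                # hard disk shipped last year (EB)
new_fabs = 32
cost_per_fab_billion = 9     # approximate cost of one state-of-the-art fab ($B)

# Flash output would have to cover its own shipments plus all of disk's.
growth_needed = (flash_eb + disk_eb) / flash_eb
investment_billion = new_fabs * cost_per_fab_billion
implied_eb_per_fab = disk_eb / new_fabs     # derived, not stated in the text

print(f"flash output must grow {growth_needed:.1f}x")
print(f"investment: ${investment_billion}B")
print(f"implied output per new fab: {implied_eb_per_fab:.1f} EB/yr")
```

The total comes to $288B, the "nearly $300B" of the text, to multiply flash output almost eight-fold.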

    But over the next 4 years Fontana projects NAND flash shipments will grow to 400EB/yr versus hard disk shipments of perhaps 800EB/yr. So there will be continued gradual erosion of hard disk market share.

    Second, the view from the flash advocates. They argue that the fabs will be built, because they are no longer subject to conventional economics. The governments of China, Japan, and other countries are stimulating their economies by encouraging investment, and they regard dominating the market for essential chips as a strategic goal, something that justifies investment. They are thinking long-term, not looking at the next quarter's results. The flash companies can borrow at very low interest rates, so even if they do need to show a return, they only need to show a very low return.

    If the fabs are built, and if QLC becomes usable for bulk storage, the increase in supply will increase the Kryder rate of flash. This will increase the trend of storage moving from disk to flash. In turn, this will increase the rate at which disk vendors' unit shipments decrease. In turn, this will decrease their economies of scale, and cause disk's Kryder rate to go negative. The point at which flash becomes competitive with disk moves closer in time.

    The result would be that the Kryder rate for bulk storage, which has been very low, would get back closer to the historic rate sooner, and thus that storing bulk data for the long term would be significantly cheaper. But this isn't the only effect. When Data Domain's disk-based backup displaced tape, greatly reducing the access latency for backup data, the way backup data was used changed. Instead of backups being used mostly to cover media failures, they became used mostly to cover user errors.

    Similarly, if flash were to displace disk, the access latency for stored data would be significantly reduced, and the way the data is used would change. Because it is more accessible, people would find more ways to extract value from it. The changes induced by reduced latency would probably significantly increase the perceived value of the stored data, which would itself accelerate the turn-over from disk to flash.

    If we're not to fry the planet, oil companies cannot develop many of the reserves they carry on their books; they are "stranded assets". Both views of the future of disk vs. flash involve a reduction in the unit volume of drives. The disk vendors cannot raise prices significantly, since doing so would accelerate the reduction in unit volume. Thus their income will decrease, and with it their ability to finance the investments needed to get HAMR and then BPM into the market. The longer they delay these investments, the more difficult it becomes to afford them. Thus it is possible that HAMR and likely that BPM will be "stranded technologies", advances we know how to build, but never actually deploy in volume.

    Future Technology Choices

    Simply for business reasons, namely the time and the cost of developing the necessary huge manufacturing capacity, it is likely that disk and flash will dominate the bulk storage market for the medium term. Tape and optical will probably fill the relatively small niche for off-line media. Nevertheless, it is worth examining the candidate technologies being touted as potentially disrupting this market.

    Storage Class Memories

    As we saw above, and just like hard disk's magnetic domains, flash cells are approaching the physical limits at which they can no longer retain data. 3D and QLC have delayed the point at which flash density slows, but they are one-time boosts to the technology. Grouped under the name Storage Class Memories (SCMs), several technologies that use similar manufacturing techniques to those that build DRAM (Dynamic Random Access Memory) are competing to be the successor to flash. Like flash but unlike DRAM, they are non-volatile, so are suitable as a storage medium. Like DRAM but unlike flash, they are random- rather than block-access and do not need to be erased before writing, eliminating a cause of the tail latency that can be a problem with flash (and hard disk) systems.

    Typically, these memories are slower than DRAM so are not a direct replacement but are intended to form a layer between (faster) DRAM and (cheaper) flash. But included in this group is a more radical technology that claims to be able to replace DRAM while being non-volatile. Nantero uses carbon nanotubes to implement memories that they claim:
    will have several thousand times faster rewrites and many thousands of times more rewrite cycles than embedded flash memory.

    NRAM, non-volatile RAM, is based on carbon nanotube (CNT) technology and has DRAM-class read and write speeds. Nantero says it has ultra-high density – but no numbers have been given out – and it's scalable down to 5nm, way beyond NAND.

    This is the first practicable universal memory, combining DRAM speed and NAND non-volatility, better-than-NAND endurance and lithography-shrink prospects.

    Fujitsu is among the companies to have announced plans for this technology.

    Small volumes of some forms of SCM have been shipping for a couple of years. Intel and Micron attracted a lot of attention with their announcement of 3D XPoint, an SCM technology. Potentially, SCM is a better technology than flash, 1000 times faster than NAND, 1000 times the endurance, and 100 times denser.

    SSD vs NVDIMM

    However, the performance of initial 3D XPoint SSD products disappointed the market by being only about 8 times faster than NAND SSDs. In Why 3D XPoint SSDs Will Be Slow, Jim Handy (The SSD Guy) explained that this is a system-level issue. The system overheads of accessing an SSD include the bus transfer, the controller logic and the file system software. They:
    account for about 15 microseconds of delay.  If you were to use a magical memory that had zero delays, then its bar would never get any smaller than 15 microseconds.

    In comparison, the upper bar, the one representing the NAND-based SSD, has combined latencies of about 90 microseconds, or six times as long.

    Thus for SCMs' performance to justify the large cost increase, they need to eliminate these overheads:
    Since so much of 3D XPoint’s speed advantage is lost to these delays The SSD Guy expects for the PCIe interface to contribute very little to long-term 3D XPoint Memory revenues.  Intel plans to offer another implementation, shipping 3D XPoint Memory on DDR4 DIMMs.  A DIMM version will have much smaller delays of this sort, allowing 3D XPoint memory to provide a much more significant speed advantage.

    DIMMs are the form factor of DRAM; the system would see DIMM SCMs as memory not as an I/O device. They would be slower than DRAM but non-volatile. Exploiting these attributes requires a new, non-volatile memory layer in the storage hierarchy. Changing the storage software base to implement this will take time, which is why it makes sense to market SCMs initially as SSDs despite these products providing much less than the potential performance of the medium.
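
    The ceiling implied by Handy's figures can be worked out directly. The sketch below is only an illustration: the function name is made up for this example, and the only inputs are the two latencies quoted above (about 15 microseconds of system overhead and about 90 microseconds in total for a NAND SSD).

```python
def max_speedup(baseline_total_us: float, overhead_us: float, media_us: float = 0.0) -> float:
    """Speedup over the baseline device when only the media latency changes.

    The system overhead (bus, controller, file system) is paid regardless
    of how fast the storage medium itself is.
    """
    return baseline_total_us / (overhead_us + media_us)


# Figures quoted from Handy's article.
NAND_SSD_TOTAL_US = 90.0
SYSTEM_OVERHEAD_US = 15.0

# Even a "magical memory" with zero media latency cannot beat this ratio.
ceiling = max_speedup(NAND_SSD_TOTAL_US, SYSTEM_OVERHEAD_US)
print(f"Best possible speedup behind the SSD interface: {ceiling:.1f}x")
```

    However fast the medium, the fixed overhead caps the gain behind the SSD interface at about six times, which is why moving SCM onto the memory bus matters.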

    Like flash, SCMs leverage much of the semiconductor manufacturing technology. Optimistically, one might expect SCM to impact the capacity market sometime in the late 2030s. SCMs have occupied the niche for a technology that exploits semiconductor manufacturing. A technology that didn't would find it hard to build the manufacturing infrastructure to ship the thousands of exabytes a year the capacity market will need by then.
    Exotic Optical Media

    Those who believe that quasi-immortal media are the solution to the long-term storage problem have two demonstrable candidates, each aping a current form factor. One is various forms of robust DVD, such as the University of Southampton's 5D quartz DVDs, or Hitachi's fused silica glass, claimed to be good for 300 million years. The other is using lasers to write data on to the surface of steel tape.

    Both are necessarily off-line media, with the limited applicability that implies. Neither is shipping in volume, nor has any prospect of doing so. Thus neither will significantly impact the bulk storage market in the medium term.

    The latest proposal for a long-term optical storage medium is diamond, lauded by Abigail Beall in the Daily Mail as Not just a girl's best friend: Defective DIAMONDS could solve our data crisis by storing 100 times more than a DVD:
    'Without better solutions, we face financial and technological catastrophes as our current storage media reach their limits,' co-first author Dr Siddharth Dhomkar wrote in an article for The Conversation. 'How can we store large amounts of data in a way that's secure for a long time and can be reused or recycled?'.

    Speaking to the future real-world practicality of their innovation, Mr. Jacob Henshaw, co-first author said: 'This proof of principle work shows that our technique is competitive with existing data storage technology in some respects, and even surpasses modern technology in terms of re-writability.' In reality, the researchers reported that:
    "images were imprinted via a red laser scan with a variable exposure time per pixel (from 0 to 50 ms). Note the gray scale in the resulting images corresponding to multivalued (as opposed to binary) encoding. ... Information can be stored and accessed in three dimensions, as demonstrated for the case of a three-level stack. Observations over a period of a week show no noticeable change in these patterns for a sample kept in the dark. ... readout is carried out via a red laser scan (200 mWat 1 ms per pixel). The image size is 150 × 150 pixels in all cases."So, at presumably considerable cost, the researchers wrote maybe 100K bits at a few milliseconds per bit, read them back a week later at maybe a few hundred microseconds per bit without measuring an error rate. Unless the "financial and technological catastrophes" are more than two decades away, diamond is not a solution to them.
    DNA

    Nature recently featured a news article by Andy Extance entitled How DNA could store all the world's data, which claimed:
    If information could be packaged as densely as it is in the genes of the bacterium Escherichia coli, the world's storage needs could be met by about a kilogram of DNA.

    The article is based on research at Microsoft that involved storing 151KB in DNA. The research is technically interesting, starting to look at fundamental DNA storage system design issues. But it concludes (my emphasis):
    DNA-based storage has the potential to be the ultimate archival storage solution: it is extremely dense and durable. While this is not practical yet due to the current state of DNA synthesis and sequencing, both technologies are improving at an exponential rate with advances in the biotechnology industry[4].

    The paper doesn't claim that the solution is at hand any time soon. Reference 4 is a two-year-old post to Rob Carlson's blog. A more recent post to the same blog puts the claim that:
    both technologies are improving at an exponential rate

    in a somewhat less optimistic light. It may be true that DNA sequencing is getting cheaper very rapidly. But already the cost of sequencing (read) was insignificant in the total cost of DNA storage. What matters is the synthesis (write) cost. Extance writes:
    A closely related factor is the cost of synthesizing DNA. It accounted for 98% of the expense of the $12,660 EBI experiment. Sequencing accounted for only 2%, thanks to a two-millionfold cost reduction since the completion of the Human Genome Project in 2003.

    The rapid decrease in the read cost is irrelevant to the economics of DNA storage; if it were free it would make no difference. Carlson's graph shows that the write cost, the short DNA synthesis cost (red line), is falling more slowly than the gene synthesis cost (yellow line). He notes:
    But the price of genes is now falling by 15% every 3-4 years (or only about 5% annually).

    A little reference checking reveals that the Microsoft paper's claim that:
    both technologies are improving at an exponential rate

    while strictly true is deeply misleading. The relevant technology is currently getting cheaper more slowly than hard disk or flash memory! And since this has been true for around two decades, making the necessary 3-4 fold improvement just to keep up with the competition is going to be hard.

    Decades from now, DNA will probably be an important archival medium. But the level of hype around the cost of DNA storage is excessive. Extance's article admits that cost is a big problem, yet it finishes by quoting Goldman, lead author of a 2013 paper in Nature whose cost projections were massively over-optimistic. Goldman's quote is possibly true but again deeply misleading:
    "Our estimate is that we need 100,000-fold improvements to make the technology sing, and we think that's very credible," he says. "While past performance is no guarantee, there are new reading technologies coming onstream every year or two. Six orders of magnitude is no big deal in genomics. You just wait a bit."Yet again the DNA enthusiasts are waving the irrelevant absolute cost decrease in reading to divert attention from the relevant lack of relative cost decrease in writing. They need an improvement in relative write cost of at least 6 orders of magnitude. To do that in a decade means halving the relative cost every year, not increasing the relative cost by 10-15% every year.

    Journalists like Beall writing for mass-circulation newspapers can perhaps be excused for merely amplifying the researchers' hype, but Extance, writing for Nature, should be more critical.
    Storage Media or Storage Systems?

    As we have seen, the economics of long-term storage mean that the media to be used will have neither the service life nor the reliability needed. These attributes must therefore be provided by a storage system whose architecture anticipates that media and other hardware components will be replaced frequently, and that data will be replicated across multiple media.

    The system architecture surrounding the media is all the more important given two findings from research into production use of large disk populations:
    What do we want from a future bulk storage system?
    • An object storage fabric.
    • With low power usage and rapid response to queries.
    • That maintains high availability and durability by detecting and responding to media failures without human intervention.
    • And whose reliability is externally auditable.
    At the 2009 SOSP David Anderson and co-authors from C-MU presented FAWN, the Fast Array of Wimpy Nodes. It inspired me to suggest, in my 2010 JCDL keynote, that the cost savings FAWN realized without performance penalty by distributing computation across a very large number of very low-power nodes might also apply to storage.

    The following year Ian Adams and Ethan Miller of UC Santa Cruz's Storage Systems Research Center and I looked at this possibility more closely in a Technical Report entitled Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes. We showed that it was indeed plausible that, even at then-current flash prices, the total cost of ownership over the long term of a storage system built from very low-power system-on-chip technology and flash memory would be competitive with disk while providing high performance and enabling self-healing.

    Two subsequent developments suggest we were on the right track. First, Seagate's announcement of its Kinetic architecture and Western Digital's subsequent announcement of drives that ran Linux. Drives have on-board computers that perform command processing, internal maintenance operations, and signal processing. These computers have spare capacity, which both approaches use to delegate computation from servers to the storage media, and to get IP communication all the way to the media, as DAWN suggested. IP to the medium is a great way to future-proof the drive interface.

    FlashBlade hardware

    Second, although flash remains more expensive than hard disk, since 2011 the gap has narrowed from a factor of about 12 to about 6. Pure Storage recently announced FlashBlade, an object storage fabric composed of large numbers of blades, each equipped with:
    • Compute: 8-core Xeon system-on-a-chip, and Elastic Fabric Connector for external, off-blade, 40GbitE networking,
    • Storage: NAND storage with 8TB or 52TB of raw capacity and on-board NV-RAM with a super-capacitor-backed write buffer plus a pair of ARM CPU cores and an FPGA,
    • On-blade networking: PCIe card to link compute and storage cards via a proprietary protocol.
    FlashBlade clearly isn't DAWN. Each blade is much bigger, much more powerful and much more expensive than a DAWN node. No-one could call a node with an 8-core Xeon, 2 ARMs, and 52TB of flash "wimpy", and it'll clearly be too expensive for long-term bulk storage. But it is a big step in the direction of the DAWN architecture.

    DAWN exploits two separate sets of synergies:
    • Like FlashBlade, DAWN moves the computation to where the data is, rather than moving the data to where the computation is, reducing both latency and power consumption. The further data moves on wires from the storage medium, the more power and time it takes. This is why Berkeley's Aspire project's architecture is based on optical interconnect technology, which when it becomes mainstream will be both faster and lower-power than wires. In the meantime, we have to use wires.
    • Unlike FlashBlade, DAWN divides the object storage fabric into a much larger number of much smaller nodes, implemented using the very low-power ARM chips used in cellphones. Because the power a CPU needs tends to grow faster than linearly with performance, the additional parallelism provides comparable performance at lower power.
    So FlashBlade currently exploits only one of the two sets of synergies. But once Pure Storage has deployed this architecture in its current relatively high-cost and high-power technology, re-implementing it in lower-cost, lower-power technology should be easy and non-disruptive. They have done the harder of the two parts.
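
    The second synergy rests on simple arithmetic, sketched below under loudly stated assumptions: per-node power is modelled as k * performance ** alpha with alpha greater than 1, the workload is assumed to split perfectly across nodes, and fixed per-node overheads are ignored. The function name, the exponent of 2 and the node counts are hypothetical and are not taken from the DAWN report.

```python
def total_power(total_perf: float, nodes: int, alpha: float, k: float = 1.0) -> float:
    """Power drawn by `nodes` equal nodes that together deliver total_perf.

    Assumes each node draws k * perf ** alpha, where alpha > 1 models power
    growing faster than linearly with performance (a hypothetical model).
    """
    per_node_perf = total_perf / nodes
    return nodes * k * per_node_perf ** alpha


ALPHA = 2.0  # hypothetical exponent, chosen only for illustration

one_big = total_power(64.0, nodes=1, alpha=ALPHA)      # one powerful node
many_small = total_power(64.0, nodes=64, alpha=ALPHA)  # 64 wimpy nodes

print(f"One big node: {one_big:.0f} units; 64 small nodes: {many_small:.0f} units")
```

    With the same aggregate performance, the total power in this model scales as nodes ** (1 - alpha), so more and smaller nodes draw less. Real systems would give some of that back to networking and per-node overheads.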
    Predictions

    Although predictions are always risky, it seems appropriate to conclude with some. The first is by far the most important:
    1. Increasing technical difficulty and decreasing industry competition will continue to keep the rate at which the per-byte cost of bulk storage media decreases well below pre-2010 levels. Over a decade or two this will cause a very large increase in the total cost of ownership of long-term data.
    2. No new media will significantly impact the bulk storage layer of the hierarchy in the medium term. It will be fought out between tape, disk, flash and conventional optical. Media that would impact the layer in this time-frame would already be shipping in volume, and none are.
    3. Their long latencies will confine tape and optical to really cold data, and thus to a small niche of the market.
    4. Towards the end of the medium term, storage class memories will start to push flash down the hierarchy into the bulk storage layer. 
    5. Storage system architectures will migrate functionality from the servers towards the storage media.
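
    The first prediction can be illustrated with a deliberately crude model. It assumes a fixed amount of data is paid for each year at that year's per-byte media price, and that the price falls by a constant fraction annually. The 40% and 10% rates, the 20-year horizon and the function name are hypothetical values picked for illustration, not figures from this post.

```python
def cumulative_media_cost(first_year_cost: float, annual_decline: float, years: int) -> float:
    """Total media spend for keeping a fixed dataset for `years` years.

    Year t costs first_year_cost * (1 - annual_decline) ** t, i.e. the
    per-byte price falls by a constant fraction each year.
    """
    return sum(first_year_cost * (1.0 - annual_decline) ** t for t in range(years))


# Hypothetical rates: a rapid historical decline versus a much slower one.
fast = cumulative_media_cost(1.0, annual_decline=0.40, years=20)
slow = cumulative_media_cost(1.0, annual_decline=0.10, years=20)

print(f"Fast decline: {fast:.2f}x the first-year cost")
print(f"Slow decline: {slow:.2f}x the first-year cost")
print(f"Ratio: {slow / fast:.1f}")
```

    In this toy model the slower decline makes two decades of storage several times more expensive for the same data, before counting power, space, staff or migration.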
    Acknowledgements

    I'm grateful to Seagate, and in particular to Dave B. Anderson, for (twice) allowing me to pontificate about their industry, to Brian Berg for his encyclopedic knowledge of the history of flash, and Tom Coughlin for illuminating discussions and the first graph of exabytes shipped. This isn't to say that they agree with any of the above.

    LITA: #NoFilter: Social Media and Its Use in the Library

    planet code4lib - Tue, 2016-12-13 15:00

    Time and again we hear how useful social media is for library outreach. Use social media for advertising events in the library! Use social media to provide book recommendations! Use social media to alert patrons to library hours and services! Use social media to highlight collections!

    And yet for many libraries (academic, research, public, corporate, school, etc.), social media success remains elusive. We sit and grumble to our colleagues about how few followers we have on x or y platform or the lack of likes and shares our posts receive. We question whether the library is cool or hip enough with new generations of patrons. Then eventually we throw up our hands, stand on our desks, and boldly proclaim “I’m done with social media! I’m going back to papering the walls with flyers!”

    Dramatics aside, many of us have been in this predicament at some point. We question why we’re doing social media for our institutions. We wonder what its value is and what its long-term benefits are. We wonder if our content is exciting and intriguing, if we post too much or too little, if we should plan in advance or work more spontaneously, if we should have a social media team in place or leave all social media duties to one person, if we are using the right metrics to gauge effectiveness. The list goes on.

    Sharing my library’s collections on Tumblr with Oloch, our woolly mammoth book cart.

    It’s my intention over the next several months to explore some of these challenges and concerns pertaining to social media and its use in the library. As a social media contributor/administrator at my library, I have grappled with these issues firsthand. My colleagues and I set up a Tumblr blog (Othmeralia) for our library in January 2014 and a Pinterest page in March 2016. They weren’t instant successes by a long shot. Success was brought about through time as well as trial and error – and accompanied by countless conversations about content, best practices, management styles, promotion techniques, and metrics.

    The tips I’ve gleaned from my library’s social media experience and from my own reading will serve as the basis for this series, which will hopefully inform you about how to cultivate meaningful social media presences.

    Embracing the spirit of sharing that envelops the vast world of social media, I invite you to take part in the conversation as well. Share your institutional social media challenges, share your success stories, share your tips and tricks in the comments below.

    What social media are you using at your library? What types of content are you sharing? What’s your biggest challenge at present?

    HangingTogether: From Records to Things

    planet code4lib - Tue, 2016-12-13 13:00

    My colleague Jean Godby and I co-authored the article “From Records to Things: Managing the Transition from Legacy Library Metadata to Linked Data” for the December/January 2017 issue of the ASIS&T Bulletin (aka Bulletin of the Association for Information Science and Technology). We were asked to write it for the Bulletin’s “special section” on Information Standards. From the editor’s summary:

    A basic requirement for linked data is that records include structured and clear data about topics of interest or searched Things, formatted in ways that allow linking to other data. While linked data presents great potential for the library community, libraries’ existing digital knowledge is largely inaccessible, stuck in the increasingly obsolete MARC format, readable only by humans and certain library systems. To maximize the value of linked data using library content, important entities and relationships must be defined and made available, codings that are machine understandable must be adapted for linked data purposes, and persistent identifiers must be substituted for text.

    Last year, Jean and I had presented a TAI-CHI (Technical Advances for Innovation in Cultural Heritage Institutions) webinar along the same theme, “How You Can Make the Transition from MARC to Linked Data Easier”, where we offered examples of what metadata specialists can do now to make it easier to transform text strings in MARC data into the entity-“things” we later expose as linked data that others can consume.

    How about you? Have you changed your workflows to make your MARC data more “Thing”-like?

    About Karen Smith-Yoshimura

    Karen Smith-Yoshimura, senior program officer, works on topics related to creating and managing metadata with a focus on large research libraries and multilingual requirements.


    Galen Charlton: ALA and recognizing situations for what they are

    planet code4lib - Tue, 2016-12-13 02:15

    As I suspect is the case with many members, my relationship with the American Library Association runs hot and cold. On the one hand, like Soylent Green, ALA is people: I have been privileged to meet and work with many excellent folk through ALA, LITA, and ALCTS (though to complete the metaphor, sometimes I’ve seen ALA chew on people until they felt they had nothing left to give). There are folks among ALA members and staff whose example I hope to better emulate, including Andromeda Yelton, Deborah Caldwell-Stone, Keri Cascio, and Jenny Levine. I also wish that Courtney Young were ALA president now.

    And yet.

    For what follows, unfortunately I feel compelled to state my bona fides: yes, I have been and am active in ALA. I sign petitions; I grit my teeth each year and make my way through ballots that are ridiculously long; I have chaired interest groups — and started one; I’ve served on an ALA-level subcommittee; I helped organize a revenue-producing pre-conference. Of course, many people have rather more substantial records of service with ALA than I do, but I’ve paid my dues with more than just my annual membership check.

    To put it another way, the spitballs I’m about to throw are coming from a decent seat in orchestra left, not the peanut gallery.

    So, let’s consider the press releases.

    This one from 15 November, ALA offers expertise, resources to incoming administration and Congress:

    “The American Library Association is dedicated to helping all our nation’s elected leaders identify solutions to the challenges our country faces,” ALA President Julie Todaro said. “We are ready to work with President-elect Trump, his transition team, incoming administration and members of Congress to bring more economic opportunity to all Americans and advance other goals we have in common.”

    Or this one from 17 November, Libraries bolster opportunity — new briefs show how libraries support policy priorities of new administration:

    The American Library Association (ALA) released three briefs highlighting how libraries can advance specific policy priorities of the incoming Trump Administration in the areas of entrepreneurship, services to veterans and broadband adoption and use.

    In other words, the premier professional organization for U.S. librarians is suggesting that not only must we work with an incoming administration that is blatantly racist, fascist, and no friend of knowledge, we support his priorities?

    Hell no.

    Let’s pause to imagine the sounds of a record scratch followed by quick backpedaling.

    Although it appears that a website redesign has muddied the online archives, I note that ALA does not appear to have issued a press release expressing its willingness to work with Obama’s administration back in 2008. In fact, an opinion piece around that time (appropriately) expressed ALA’s expectations of the incoming Obama administration:

    During this time of transition in our nation’s leadership, the greatest challenge we face is getting our economy back on its feet. As our country faces the challenges and uncertainty of this time, the public library is one constant that all Americans, regardless of age or economic status, can count on, and it is incumbent on our leaders make it a priority to ensure America’s libraries remain open and ready to serve the needs of students, job seekers, investors, business people and others in the community who want information and need a place to get it.

    Note the politely-phrased implicit demand here: “Mr. President-Elect: we have shown our value; you must now work to bolster us.”

    This is how we should act with our political leaders: with the courage of our convictions.

    Of course, it was easy to do that with a president who was obviously not about to start tearing down public libraries.

    Consider this from Julie Todaro’s Q&A about the whole mess (emphasis mine):

    Why did we write the press releases in the first place?

    ALA often reaches out to constituents, advocates, and decision-makers – both proactively and reactively – to request actions, express our support for actions taken, request a decision-maker consider libraries in general, and request that libraries be considered for specific activities or purposes. My presidential initiative focuses on library professionals and library supporters as experts and on their expertise, and on the importance of various library initiatives in communities and institutions of all types and sizes – and on the importance of communicating this value to decision-makers. In making a strong case for the value of libraries – in any political environment – it is important to state that case from the perspective of the decision-maker. So, if a legislator or administrator is focused on the importance of small businesses and their effect on the community, for example, the strategy is to prepare a statement illustrating how libraries support small businesses within their community – and how they could be even more effective with supportive legislation, funding or other appropriate action. Our stories – combined with data – can be framed to align our vision with other visions – always within the framework of our values.

    Really? What on earth was that decision-maker’s perspective imagined to be?

    That of a normal businessman, fond of his tax cuts but not wholly bereft of a sense that some leavings from his financial empire ought to be sprinkled around for the public good, or at least the assuagement of a guilty conscience?

    That of a conservative, Republican, library board member, who might never vote to eliminate overdue fines but at least recognizes that a town is not complete without its library?

    That of an entrepreneur overfond of his technological toys, who at least might be shown that there are some things Google neither finds nor indexes?

    Such people might be reachable.

    A conman is not.

    A conman who explicitly denies the value of acquiring information. A conman who unapologetically names a white nationalist as his chief counselor. A conman whose Cabinet picks are nearly uniformly those who would pillage the departments they would lead. A conman who, unlike George W. Bush, has no known personal connection to libraries.

    A conman who cannot be bought off, even if ALA were to liquidate itself.

    Cowering before Trump will not save us; will not save libraries. I do not suggest that ALA should have pulled the tiger’s tail; in the face of fascism, such moral authority as we possess only works quietly. We are in for the long haul; consequently, it would have been appropriate, if not necessarily courageous, for ALA to have said nothing to the incoming administration.

    One of the things that appalls me about the press releases is the lack of foresight. There was no reason to expect that Trump would respect craven offerings, and it was entirely predictable that a significant portion of the membership would object to the attempt.

    Contrary to Naomi Schaefer Riley’s piece in the New York Post, libraries are not suddenly political. However, in its recent actions, ALA deserves the contempt she expresses: an organization that yanks two press releases is, at least, inept — inept beyond the normal slow pace of library decision-making.

    ALA desperately needs to do better. The political climate is unfriendly enough even before we consider creeping fascism: we should not plan on the survival of IMLS and LSTA nor on the Copyright Office remaining under the oversight of the Librarian of Congress. An administration that is hinting at a purge of EPA scientists who investigate climate change will not hesitate to suppress their writings. An administration that seeks to expand a registry of Muslims may not stop at demanding lists of library patrons who have checked out books in Dewey 297.

    And frankly, I expect libraries to lose a lot of battles on Capitol Hill, although I do think there is at least some hope that smart action in Washington, but particularly at the state level, might ameliorate some of the losses.

    But only if we recognize the situation for what it is. We face both the apotheosis of GOP efforts to diminish, dismantle, and privatize government services and a resurgence of unrestrained racism and white nationalism.

    I just hope that ALA will remain with me in resisting.

    Islandora: New and Awesome in Islandora

    planet code4lib - Mon, 2016-12-12 18:25

    As we approach the end of 2016, let's have a look at some of the tools that the Islandora community has built and shared this year. You can find these and more on the Islandora Awesome list, but in case you missed them when they came out, here are the highlights:

    Islandora On This Day - This tool from SFU's Mark Jordan is a utility module that queries Solr for objects whose date fields contain a month and day equal to the current day's. The module displays a gallery of thumbnails at /onthisday for the objects it finds in Solr.

    Islandora URL Redirector - This one from BCELN's Brandon Weigel is a migration module that preserves permalinks from the repository the objects were migrated from. When an incoming URL matches a defined pattern, the module looks up an object's old "permanent" URL from an identifier field and redirects the viewer to its new home in Islandora. This module can work in multiple ways:

    • Install on the new Islandora repository, if you kept the same domain name
    • Install on the old repository if it uses a Drupal front end
    • Build a new Drupal site using this module, and point your old domain name there
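
    The match, look up, redirect flow described for the URL Redirector can be sketched in a few lines. This is not the module's code (which is a Drupal module); it is a language-neutral illustration in Python, and the URL pattern, the in-memory identifier index, the example PID and the target path layout are all hypothetical stand-ins.

```python
import re
from typing import Optional

# Hypothetical pattern for permalinks served by the old repository.
OLD_URL_PATTERN = re.compile(r"^/old/item/(?P<old_id>\d+)$")

# Hypothetical stand-in for the identifier field lookup: old ID -> new object ID.
IDENTIFIER_INDEX = {"1234": "demo:42"}


def resolve_redirect(path: str) -> Optional[str]:
    """Return the new location for an old permalink, or None if nothing matches."""
    match = OLD_URL_PATTERN.match(path)
    if match is None:
        return None  # not an old-style permalink, so handle the request normally
    new_id = IDENTIFIER_INDEX.get(match.group("old_id"))
    if new_id is None:
        return None  # pattern matched, but no migrated object carries that identifier
    # Assumed layout of object URLs on the new site.
    return f"/islandora/object/{new_id}"


print(resolve_redirect("/old/item/1234"))
print(resolve_redirect("/about"))
```

    A web front end would answer a non-None result with an HTTP redirect to that location and fall through to normal handling otherwise.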

    Islandora Twitter Cards - Another offering from Brandon Weigel, this module adds Twitter-recognized meta tags to the HTML headers on Islandora objects based on MODS metadata. These tags enable Twitter to render Twitter Cards when Islandora content is tweeted. It can also handle Facebook meta tags.

    Islandora Managed Access - This one comes to us from Florida State University. It was created "to fit a use case where certain Islandora objects should show up in collection browsing and search results, but require a user to register an account in order to view said object so that repository administrators may keep track of who is looking at it (similar in concept to an archival reading room)." They are soliciting use cases to help refine how it works, so if you have feedback, please visit the Islandora listserv and let Bryan Brown know. 

