Feed aggregator

Open Library Data Additions: OL.120101.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.120101.meta.mrc 5284 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.121101.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.121101.meta.mrc 6896 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120901.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.120901.meta.mrc 6035 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120801.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.120801.meta.mrc 5760 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120701.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.120701.meta.mrc 5421 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

State Library of Denmark: Facet filtering

planet code4lib - Fri, 2015-04-10 21:24

In generation 2 of our net archive search we plan to experiment with real-time graphs: We would like to visualize links between resources and locate points of interest based on popularity. Our plan is to use faceting with Solr on 500M+ unique links per shard, which is a bit of a challenge in itself. To make matters worse, plain faceting does not fully meet the needs of the researchers. Some sample use cases for graph building are:

  1. The most popular resources that pages about gardening link to overall
  2. The most popular resources that pages on a given site link to externally
  3. The most popular images that pages on a given site link to internally
  4. The most popular non-Danish resources that Danish pages link to
  5. The most popular JavaScripts that all pages from a given year link to

Unfortunately, only the first one can be solved with plain faceting.

Blacklists & whitelists with regular expressions

The idea is to filter all viable term candidates through a series of blacklists and whitelists to check whether they should be part of the facet result or not. One flexible way of expressing conditions on Strings is with regular expressions. The main problem with that approach is that all the Strings for the candidates must be resolved, instead of only the ones specified by facet.limit.

Consider the whitelist condition .*wasp.*, which matches all links containing the word wasp. That is a pretty rare word overall, so if a match-all query is issued and the top 100 links with the wasp-requirement are requested, chances are that millions of terms must be resolved to Strings and checked before the top 100 allowed ones have been found. On the other hand, a search for gardening would likely have a much higher chance of wasp-related links and would thus require far fewer resolutions.
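
The principle can be sketched in a few lines of Python (a toy model, not the actual Java implementation inside Solr; filtered_facet and resolve_term are hypothetical names):

```python
import re

def filtered_facet(candidates, resolve_term, whitelist, limit):
    """Return the top `limit` facet entries whose resolved term matches the
    whitelist expression. `candidates` is assumed to be (ordinal, count)
    pairs already sorted by descending count; `resolve_term` maps an
    ordinal to its String, the expensive step described in the text."""
    allowed = re.compile(whitelist)
    result = []
    for ordinal, count in candidates:
        term = resolve_term(ordinal)  # resolved lazily, one candidate at a time
        if allowed.search(term):
            result.append((term, count))
            if len(result) == limit:  # stop as soon as enough terms are allowed
                break
    return result

# Toy data: 'wasp' is rare, so terms are resolved and rejected
# before the limit is reached.
terms = ["gardening/tips", "wasp/nest", "roses", "wasp/trap", "soil"]
candidates = [(i, 100 - i) for i in range(len(terms))]
print(filtered_facet(candidates, terms.__getitem__, r"wasp", 2))
```

The rarer the whitelisted pattern is within the result set, the more candidates must be resolved before the loop terminates, which is exactly the worst case described above.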

An extremely experimental (written today) implementation of facet filtering has been added to the pack branch of Sparse Faceting for Solr. Correctness testing has been performed, where testing means “tried it a few times and the results looked plausible”. Looking back at the cases in the previous section, facet filtering could be used to support them:

  1. The most popular resources that pages about gardening link to overall
  2. The most popular resources that pages on a given site link to externally: [^/]*example\.com
  3. The most popular images that pages on a given site link to internally: [^/]*example\.com/.*\.(gif|jpg|jpeg|png)$
  4. The most popular non-Danish resources that Danish pages link to
  5. The most popular JavaScripts that all pages from a given year link to
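
Reading the pattern in case 2 as a blacklist (exclude links to the site itself, keeping the external ones) and the pattern in case 3 as a whitelist, their effect can be verified with a quick sketch, where example.com stands in for "a given site":

```python
import re

blacklist = re.compile(r"[^/]*example\.com")                          # case 2
whitelist = re.compile(r"[^/]*example\.com/.*\.(gif|jpg|jpeg|png)$")  # case 3

links = [
    "http://example.com/about.html",
    "http://example.com/img/logo.png",
    "http://other.org/page.html",
]

external = [link for link in links if not blacklist.search(link)]
internal_images = [link for link in links if whitelist.search(link)]
print(external)         # only the other.org link survives the blacklist
print(internal_images)  # only the .png link passes the whitelist
```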

Some questions like “The most popular resources larger than 2MB in size linked to from pages about video” cannot be answered directly with this solution as they rely on the resources at the end of the links, not just the links themselves.

Always with the performance testing

Two things of interest here:

  1. Faceting on 500M+ unique values (5 billion+ DocValues references) on a 900GB single-shard index with 200M+ documents
  2. Doing the trick with regular expressions on top

Note the single-shard thing! The measurements should not be taken as representative for the final speed of the fully indexed net archive, which will be 50 times larger. As we get more generation 2 shards, the tests will hopefully be re-run.

As always, Sparse Faceting is helping tremendously with the smaller result sets. This means that averaging the measurements to a single number is highly non-descriptive: Response times vary from < 100ms for a few thousand hits to 5 minutes for a match-all query.

Performance testing used a single thread to issue queries with random words from a Danish dictionary. The Solr server was a 24 core Intel i7 machine (only 1 active core due to the unfortunate single-threaded nature of faceting) with 256GB of RAM (200GB free for disk cache) and SSDs. All tests were with previously unused queries. 5 different types of requests were issued:

  1. no_facet: as the name implies, just a plain search with no faceting
  2. sparse: Sparse Faceting on the single links-field with facet limit 25
  3. regexp_easy: Sparse Faceting with whitelist regular expression .*htm.* which is fairly common in links
  4. regexp_evil: Sparse Faceting with whitelist regular expression .*nevermatches.* effectively forcing all terms in the full potential result set to be resolved and checked
  5. solr: Vanilla Solr faceting

900GB, 200M+ docs, 500M+ unique values, 5 billion+ references

  • Sparse Faceting without regular expressions (purple) performs just as well with 500M+ values as it did with previous tests of 200M+ values.
  • Using a regular expression that allows common terms (green) has moderate impact on performance.
  • The worst possible regular expression (orange) has noticeable impact at 10,000 hits and beyond. At the very far end at match-all, the processing time was 10 minutes (versus 5 minutes for non-regular expression faceting). This is likely to be highly influenced by storage speed and be slower with more shards on the same machine.
  • The constant 2 second overhead of vanilla Solr faceting (yellow) is highly visible.

Worst-case processing times have always been a known weakness of our net archive search. Facet filtering exacerbates this. As this is tightly correlated to the result set size, which is fast to calculate, adding a warning like “This query is likely to take minutes to process” could be a usable bandage.

With that caveat out of the way, the data looks encouraging so far; the overhead for regular expressions was less than feared. Real-time graphs, or at least fill-the-coffee-cup-time graphs, seem doable, at the cost of 2GB of extra heap per shard to run the faceting request.

Additional notes (updated 2015-04-11)

@maxxkrakoa noted “@TokeEskildsen you wrote Solr facet is 1 thread. facet.threads can change that – but each field will only be able to use one thread each.“. He is right and it does help significantly for our 6 field faceting. For single field faceting, support for real multi-thread counting would be needed.

The simple way of doing multi-thread counting is to update multiple copies of the counter structure and merge them at the end. For a 500M+ field, that is likely to be prohibitive with regard to both memory and speed: The time used for merging the multiple counters would likely nullify the faster counter update phase. Some sort of clever synchronization or splitting of the counter space would be needed. No plans yet for that part, but it has been added to the “things to ponder when sleep is not coming” list.
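
A toy Python model of that simple scheme shows where the cost sits (illustration only, with tiny arrays; the real structures hold 500M+ counters):

```python
from concurrent.futures import ThreadPoolExecutor

def count_multithreaded(ordinals, num_values, threads=4):
    """Count ordinal occurrences with one counter copy per thread,
    merged at the end."""
    chunks = [ordinals[i::threads] for i in range(threads)]

    def count_chunk(chunk):
        local = [0] * num_values  # per-thread copy of the counter structure
        for ordinal in chunk:
            local[ordinal] += 1
        return local

    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_chunk, chunks))

    merged = [0] * num_values     # merge cost: threads * num_values updates,
    for partial in partials:      # paid even when the hit count is tiny
        for i, count in enumerate(partial):
            merged[i] += count
    return merged

print(count_multithreaded([0, 1, 1, 2, 1, 0], num_values=3, threads=2))
```

With num_values in the hundreds of millions, the final merge loop alone touches every counter once per thread, which is why the faster update phase is likely to be nullified.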

John Miedema: Cognitive computing. Computers already know how to do math, now they can read and understand.

planet code4lib - Fri, 2015-04-10 20:06

Cognitive computing extends the range of knowledge tasks that can be performed by computers and humans. It is characterized by the following:

  1. Life-world data. Operates on data that is large, varied, complex and dynamic, the stuff of daily human life.
  2. Natural questions. A question is more than a keyword query. A question embodies unstructured meaning. It may be asked in natural language. A dialog allows for refinement of questions.
  3. Reading and understanding. Computers already know how to do math. Cognitive computing provides the ability to read. Reading includes understanding context, nuance, and meaning.
  4. Analytics. Understanding is extended with statistics and reasoning. The system finds patterns and structures. It considers alternatives and chooses a best answer.
  5. Answers are structured and ordered. An answer is an “assembly,” a wiki-type summary, or a visualization such as a knowledge graph. It often includes references to additional information.

Cognitive computing is not artificial intelligence. Solutions are characterized by a partnership with humans:

  1. Taught rather than just programmed. Cognitive systems “borrow” from human intelligence. Computers use resources compiled from human knowledge and language.
  2. Learn from human interaction. A knowledge base is improved by feedback from humans. Feedback is ideally implicit in an interaction, or it may be explicit, e.g., thumbs up or down.

DPLA: DPLAfest 2015: Special Indy Activities and Attractions

planet code4lib - Fri, 2015-04-10 19:29

There’s lots to do in the Indianapolis area during DPLAfest 2015, just a week away! Here is a sampling of some of the excellent options.

Check out the Indiana Historical Society’s array of exhibitions (free for fest attendees!)

Did you know that you can get free admission to the Indiana Historical Society with your DPLAfest name badge? Simply show your DPLAfest name badge at the Historical Society welcome center and you’ll receive a wristband to explore the wonderful exhibits and activities inside:

  • Step into three-dimensional re-creations of historic photographs complete with characters come to life in You Are There, featuring That Ayres Look; 1939: Healing Bodies, Changing Minds; and 1904: Picture This.
  • Let the latest technology take you back in time on virtual journeys throughout the state in Destination Indiana.
  • Pull up a stool at a cabaret and immerse yourself in the music of Hoosier legend Cole Porter in the Cole Porter Room.
  • In the W. Brooks and Wanda Y. Fortune History Lab, get a behind-the-scenes, hands-on look at conservation and the detective work involved in history research.
  • Check out Lilly Hall, home of the Mezzanine Gallery, INvestigation Stations and Hoosier Legends.

Take a walking tour of the Indiana State Library’s stunning architecture

Join Indiana State Library staffer Steven Schmidt for a guided tour of the historic Indiana State Library. Please meet at least 5 minutes before the tour is slated to begin; see the DPLAfest schedule for more details.

Learn about the history and development of Indianapolis

If you’re interested in history, architecture, and anything in between, be sure to check out Indianapolis Historical Development: A Quick Overview on Saturday, 4/18 at 1:15 PM in IUPUI UL Room 1116. Led by William L. Selm, IUPUI Adjunct Faculty of the School of Engineering & Technology, this presentation will provide an “overview of the development of Indianapolis since its founding in 1821 as the capital city of Indiana in the center of the state. The factors that shaped the city will be presented as well as the buildings and monuments that are the products of these factors and forces.” Find out more about this session here.

Get inspired at the Indianapolis Museum of Art:

  • At the Innovative Museum Leaders Speaker Series on April 16, hear from Mar Dixon, the Founder of MuseuoMix UK, Museum Camp and Ask a Curator Day, London.
  • Peruse a number of interesting art exhibitions. This includes a Monuments Men-inspired look at the provenance research and ownership discussion surrounding one of the IMA’s European pieces, “Interior of Antwerp Cathedral.”

Learn more at The Eiteljorg Museum of American Indians and Western Art:

  • With a mission to inspire an appreciation and understanding of the cultures of the indigenous peoples of North America, the Eiteljorg Museum has a number of interesting offerings. See art from the American West, as well as an exhibit about the gold rush. 

Show off your sports side:

For other child (or child-at-heart) friendly options:

  • Visit the world’s largest children’s museum, the Children’s Museum of Indianapolis. And, yes, that’s Optimus Prime you’ll see parked out front–it’s part of a new Transformers exhibit, one of many fun options at the museum.
  • Explore at the Indianapolis Zoo, which now has a new immersive “Butterfly Kaleidoscope” conservatory, with 40 different butterfly species.

Don’t forget to take a second look at the DPLAfest schedule to make sure you don’t miss any of the exciting fest sessions! See you soon in Indianapolis!

Brown University Library Digital Technologies Projects: What is ORCID?

planet code4lib - Fri, 2015-04-10 15:46

ORCID is an open, non-profit initiative founded by academic institutions, professional bodies, funding agencies, and publishers to resolve authorship confusion in scholarly work.  The ORCID repository of unique scholar identification numbers will reliably identify and link scholars in all disciplines with their work, analogous to the way ISBN and DOI identify books and articles.

Brown is a member of ORCID which allows the University, among other things, to create ORCID records on behalf of faculty, students, and affiliated individuals; integrate authenticated ORCID identifiers into grant application processes; ingest ORCID data to maintain internal systems such as institutional repositories; and link ORCID identifiers to other IDs and registry systems.  ORCID identifiers will facilitate the gathering of publication, grant, and other data for use in Researchers@Brown profiles.  The library, with long experience in authority control, is coordinating this effort.

Brown University Library Digital Technologies Projects: What is OCRA?

planet code4lib - Fri, 2015-04-10 15:40

OCRA is a platform for faculty to request digital course reserves in all formats.  Students access digital course reserves via Canvas or at the standalone OCRA site.  Students access physical reserves in library buildings via Josiah.

Hydra Project: SAVE THE DATE: Hydra Connect 2015 – Monday, September 21st – Thursday, September 24th

planet code4lib - Fri, 2015-04-10 15:21

Hydra today announced the dates for Hydra Connect 2015:

Hydra Connect 2015
Minneapolis, Minnesota
Monday, September 21 – Thursday, September 24, 2015

Registration and lodging details will be available in early June 2015.

The four day event will be structured as follows:

  • Mon 9/21 – Workshops and leader-facilitated training sessions
  • Tue 9/22 – Main Plenary Session
  • Wed 9/23 – Morning Plenary Session, Afternoon Un-conference breakout sessions
  • Thu 9/24 – All-day Un-conference breakouts and workgroup sessions

We are also finalizing details for a poster session within the program and a conference dinner to be held on one of the main conference evenings.

Please mark your calendars and plan on joining us this September!

David Rosenthal: 3D Flash - not as cheap as chips

planet code4lib - Fri, 2015-04-10 15:00
Chris Mellor has an interesting piece at The Register pointing out that while 3D NAND flash may be dense, it’s going to be expensive.

The reason is the enormous number of processing steps per wafer - between 96 and 144 deposition layers for the three leading 3D NAND flash technologies. Getting non-zero yields from that many steps involves huge investments in the fab:
Samsung, SanDisk/Toshiba, and Micron/Intel have already announced +$18bn investment for 3D NAND.
  • Samsung’s new Xi’an, China, 3D NAND fab involves a +$7bn total capex outlay
  • Micron has outlined a $4bn spend to expand its Singapore Fab 10
This compares with Seagate and Western Digital’s capex totalling ~$4.3bn over the past three years.

Chris has this chart, from Gartner and Stifel, comparing the annual capital expenditure per TB of storage of NAND flash and hard disk. Each TB of flash contains at least 50 times as much capital as a TB of hard disk, which means it will be a lot more expensive to buy.

PS - "as cheap as chips" is a British usage.

Jonathan Rochkind: “Streamlining access to Scholarly Resources”

planet code4lib - Fri, 2015-04-10 14:14

A new Ithaka report, Meeting Researchers Where They Start: Streamlining Access to Scholarly Resources [thanks to Robin Sinn for the pointer], makes some observations about researcher behavior that many of us probably know, but that most of our organizations haven’t successfully responded to yet:

  • Most researchers work from off campus.
  • Most researchers do not start from library web pages, but from google, the open web, and occasionally licensed platform search pages.
  • More and more of researcher use is on smaller screens, mobile/tablet/touch.

The problem posed by the first two points is the difficulty in getting access to licensed resources. If you start from the open web, from off campus, and wind up at a paywalled licensed platform — you will not be recognized as a licensed user. Because you started from the open web, you won’t be going through EZProxy. As the Ithaka report says, “The proxy is not the answer… the researcher must click through the proxy server before arriving at the licensed content resource. When a researcher arrives at a content platform in another way, as in the example above, it is therefore a dead-end.”

Shibboleth and UI problems

Theoretically, Shibboleth federated login is an answer to some of that. You get to a licensed platform from the open web, you click on a ‘login’ link, and you have the choice to login via your university (or other host organization), using your institutional login at your home organization, which can authenticate you via Shibboleth to the third party licensed platform.

The problem here, as the Ithaka report notes, is that these Shibboleth federated login interfaces at our licensed content providers are terrible.

Most of them even use the word “Shibboleth” as if our patrons have any idea what this means. As the Ithaka report notes, “This login page is a mystery to most researchers. They can be excused for wondering “what is Shibboleth?” even if their institution is part of a Shibboleth federation that is working with the vendor, which can be determined on a case by case basis by pulling down the “Choose your institution” menu.”

Ironically, this exact same issue was pointed out in the NISO “Establishing Suggested Practices Regarding Single Sign-on” (ESPReSSO) report from 2011. The ESPReSSO report goes on to not only identify the problem but suggest some specific UI practices that licensed content providers could take to improve things.

Four years later, almost none have. (One exception is JStor, which actually acted on the ESPReSSO report, and as a result actually has an intelligible federated sign-on UI, which I suspect our users manage to figure out. It would have been nice if the Ithaka report had pointed out good examples, not just bad ones. edit: I just discovered JStor is actually currently owned by Ithaka, perhaps they didn’t want to toot their own horn.).

Four years from now, will the Ithaka report have had any more impact?  What would make it so?

There is one more especially frustrating thing to me regarding Shibboleth, that isn’t about UI.  It’s that even vendors that say they support Shibboleth, support it very unreliably. Here at my place of work we’ve been very aggressive at configuring Shibboleth with any vendor that supports it. And we’ve found that Shibboleth often simply stops working at various vendors. They don’t notice until we report it — Shibboleth is not widely used, apparently.  Then maybe they’ll fix it, maybe they won’t. In another example, Proquest’s shibboleth login requires the browser to access a web page on a four-digit non-standard port, and even though we told them several years ago that a significant portion of our patrons are behind a firewall that does not allow access to such ports, they’ve been uninterested in fixing/changing it. After all, what are we going to do, cancel our license?  As the several years since we first complained about this issue show, obviously not.  Which brings us to the next issue…

Relying on Vendors

As the Ithaka report notes, library systems have been effectively disintermediated in our researchers’ workflows. Our researchers go directly to third-party licensed platforms. We pay for these platforms, but we have very little control over them.

If a platform does not work well on a small screen/mobile device, there’s nothing we can do but plead. If a platform’s authentication system UI is incomprehensible to our patrons, likewise.

The Ithaka report recognizes this, and basically recommends that… we get serious when we tell our vendors to improve their UI’s:

Libraries need to develop a completely different approach to acquiring and licensing digital content, platforms, and services. They simply must move beyond the false choice that sees only the solutions currently available and instead push for a vision that is right for their researchers. They cannot celebrate content over interface and experience, when interface and experience are baseline requirements for a content platform just as much as a binding is for a book. Libraries need to build entirely new acquisitions processes for content and infrastructure alike that foreground these principles.

Sure. The problem is, this is completely, entirely, incredibly unrealistic.

If we were for real to stop “celebrating content over interface and experience”, and have that effected in our acquisitions process, what would that look like?

It might look like us refusing to license something with a terrible UX, even if it’s content our faculty need electronically. Can you imagine us telling faculty that? It’s not going to fly. The faculty wants the content even if it has a bad interface. And they want their pet database even if 90% of our patrons find it incomprehensible. And we are unable to tell them “no”.

Let’s imagine a situation that should be even easier. Let’s say we’re lucky enough to be able to get the same package of content from two different vendors with two different platforms. Let’s ignore the fact that “big deal” licensing makes this almost impossible (a problem which has only gotten worse since a D-Lib article pointed it out 14 years ago). Even in this fantasy land, where we could get the same content from two different platforms — let’s say one platform costs more but has a much better UX.  In this continued time of library austerity budgets (which nobody sees ending anytime soon), could we possibly pick the more expensive one with the better UX? Will our stakeholders, funders, faculty, deans, ever let us do that? Again, we can’t say “no”.

edit: Is it any surprise, then, that our vendors find business success in not spending any resources on improving their UX?  One exception again is JStor, which really has a pretty decent and sometimes outstanding UI.  Is the fact that they are a non-profit endeavor relevant? But there are other non-profit content platform vendors which have UX’s at the bottom of the heap.

Somehow we’ve gotten ourselves in a situation where we are completely unable to do anything to give our patrons what we know they need.  Increasingly, to researchers, we are just a bank account for licensing electronic platforms. We perform the “valuable service” of being the entity you can blame for how much the vendors are charging, the entity you require to somehow keep licensing all this stuff on smaller budgets.

I don’t think the future of academic libraries is bright, and I don’t even see a way out. Any way out would take strategic leadership and risk-taking from library and university administrators… that, frankly, institutional pressures seem to make it impossible for us to ever get.

Is there anything we can do?

First, let’s make it even worse — there’s a ‘technical’ problem that the Ithaka report doesn’t even mention that makes it even worse. If the user arrives at a paywall from the open web, even if they can figure out how to authenticate, they may find that our institution does not have a license from that particular vendor, but may very well have access to the same article on another platform. And we have no good way to get them to it.

Theoretically, the OpenURL standard is meant to address exactly this “appropriate copy” problem. OpenURL has been a very successful standard in some ways, but the ways it’s deployed simply stop working when users don’t start from library web pages: when they start from the open web, every place they end up has no idea what institution they belong to or their appropriate institutional OpenURL link resolver.
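
For concreteness, this is roughly what a link-resolver request looks like in OpenURL 1.0 (Z39.88-2004) KEV form; the resolver base URL is the institution-specific piece that a platform on the open web has no way of knowing (resolver.example.edu and the citation values here are made up):

```python
from urllib.parse import urlencode

# A minimal journal-article context object in key/encoded-value form.
citation = {
    "url_ver": "Z39.88-2004",
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",  # journal article format
    "rft.atitle": "Example article title",
    "rft.jtitle": "Journal of Examples",
    "rft.volume": "12",
    "rft.spage": "34",
    "rft.date": "2015",
}
resolver = "https://resolver.example.edu/openurl"   # institution-specific base
print(resolver + "?" + urlencode(citation))
```

The citation half travels fine; it is the resolver half that goes missing when the user arrives from the open web.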

I think the only technical path we have (until/unless we can get vendors to improve their UI’s, and I’m not holding my breath) is to intervene in the UI.  What do I mean by intervene?

The LibX toolbar is one example — a toolbar you install in your browser that adds institutionally specific content and links to web pages, links that can help the user authenticate against a platform arrived at via the open web, even links that can scrape the citation details from a page and help the user get to another ‘appropriate copy’ with authentication.

The problem with LibX specifically is that browser toolbars seem to be a technical dead-end.  It has proven pretty challenging to keep a browser toolbar working across browser versions. The LibX project seems more and more moribund — it may still be developed, but its documentation hasn’t kept pace, it’s unclear what it can do or how to configure it, and fewer browsers are supported. And especially as our users turn more and more to mobile (as the Ithaka report notes), they more and more often are using browsers in which plugins can’t be installed.

A “bookmarklet” approach might be worth considering, for targeting a wider range of browsers with less technical investment. Bookmarklets aren’t completely closed off in mobile browsers, although in many they are a pain in the neck for the user to add.

Zotero is another interesting example.  Zotero, as well as its competitors including Mendeley, can successfully scrape citation details from many licensed platform pages. We’re used to thinking of Zotero as ‘bibliographic management’, but once it has scraped those citation details, it can also send the user to the institutionally appropriate link resolver with those citation details — which is what can get the user to the appropriate licensed copy, in an authenticated way.  Here at my place of work we don’t officially support Zotero or Mendeley, and haven’t spent much time figuring out how to get the most out of even the bibliographic management packages we do officially support.

Perhaps we should spend more time with these, not just to support ‘bibliographic management’ needs, but as a method to get users from the open web to authenticated access to an appropriate copy.  And perhaps we should do other R&D in ‘bookmarklets’; in machine learning for citation parsing, so users can just paste a citation into a box (perhaps via bookmarklet) to get authenticated access to the appropriate copy; in anything else we can think of to:

Get the user from the open web to licensed copies.  To be able to provide some useful help for accessing scholarly resources to our patrons, instead of just serving as a checkbook. With some library branding, so they recognize us as doing something useful after all.


LITA: “Why won’t my document print?!” — Two Librarians in Training

planet code4lib - Fri, 2015-04-10 13:30

For this post, I am joined by a fellow student in Indiana University’s Information and Library Science Department, Sam Ott! Sam is a first year student, also working toward a dual-degree Master of Library Science and Master of Information Science, who has over three years of experience working in paraprofessional positions in multiple public libraries. Sam and I are taking the same core classes, but he is focusing his studies on public libraries instead of my own focus on academic and research libraries. With these distinct end goals in mind, we wanted to write about how the technologies we are learning in library school are helping cultivate our skills in preparation for future jobs.



On the academic library track, much of the technology training seems to be abstract and theory based, paired with practical training. There is a push for students to learn digital encoding practices, such as TEI/XML, and to understand how these concepts function within a digital library/archive. Website architecture and development also appear as core classes and electives as ways to complement theoretical classes.

Specializations offer a chance to delve deeper into the theory and practice of one of these aspects, for example, Digital Libraries, Information Architecture, and Data Science. The student chapter of the Association for Information Science and Technology (ASIS&T) offers workshops through UITS, in addition to the courses offered, to introduce and hone UNIX, XML/XSLT, and web portfolio development skills.


ALA Midwinter Meeting, 2015.

On the public library track, the technology training is limited to two core courses (Representation and Organization, plus one chosen technology requirement) and electives. While most of the coursework for public libraries is geared toward learning how to serve multiple demographics, studying Information Architecture can allow for greater exposure to relevant technologies. However, the student’s schedule is filled by the former, with less time for technological courses.

One reason I chose to pursue the Master of Information Science was to bridge what I saw as a gap in technology preparation for public library careers. The MIS has been extremely helpful in allowing me to learn best practices for system design and how people interact with websites and computers. However, these classes are still geared toward the skills needed for an academic librarian or industry employee, and lack the everyday technology skills a public librarian may need, especially if there isn’t an IT department available.


We’ve considered a few options for courses and workshops that could provide a hands-on approach to daily technology use in any library. Since many academic librarians focused on digital tools still staff the reference desk and interact with patrons, this information is vital for library students moving on to jobs. We imagine a course or workshop series that introduces students to common issues staff and patrons face with library technologies. The topics of this course could include: learning how to reboot and defragment computers, hooking up and using various audiovisual technologies such as projectors, and troubleshooting the dreaded printer problems.

The troubleshooting method we want to avoid.

As public and academic libraries embrace the evolving digital trends, staff will need to understand how to use and troubleshoot ranges of platforms, makerspaces, and digital creativity centers. Where better to learn these skills than in school!

But we aren’t quite finished. An additional aspect to the course or workshop would be allowing the students to shadow, observe, and learn from University Information Technology Services as they troubleshoot common problems across all platforms. This practical experience both observing and learning how to fix frequent and repeated issues would give students a well-rounded experiential foundation while in library school.

If you are a LITA blog reader working in a public library, which skills would you recommend students learn before taking the job? What kinds of technology-related questions are frequently asked at your institution?

Open Knowledge Foundation: OpenCon 2015 is launched

planet code4lib - Fri, 2015-04-10 11:47

This blog post is cross-posted from the Open Access Working Group blog.

Details of OpenCon 2015 have just been announced!

OpenCon 2015: Empowering the Next Generation to Advance Open Access, Open Education and Open Data will take place on November 14-16 in Brussels, Belgium, and bring together students and early career academic professionals from across the world to learn about the issues, develop critical skills, and return home ready to catalyze action toward a more open system for sharing the world’s information — from scholarly and scientific research, to educational materials, to digital data.

Hosted by the Right to Research Coalition and SPARC, OpenCon 2015 builds on the success of the first-ever OpenCon meeting last year, which convened 115 students and early career academic professionals from 39 countries in Washington, DC. More than 80% of these participants received full travel scholarships, provided through sponsorships from leading organizations, including the Max Planck Society, eLife, PLOS, and more than 20 universities.

“OpenCon 2015 will expand on a proven formula of bringing together the brightest young leaders across the Open Access, Open Education, and Open Data movements and connecting them with established leaders in each community,” said Nick Shockey, founding Director of the Right to Research Coalition. “OpenCon is equal parts conference and community. The meeting in Brussels will serve as the centerpiece of a much larger network to foster initiatives and collaboration among the next generation across OpenCon’s three issue areas.”

OpenCon 2015’s three day program will begin with two days of conference-style keynotes, panels, and interactive workshops, drawing both on the expertise of leaders in the Open Access, Open Education and Open Data movements and the experience of participants who have already led successful projects.

The third day will take advantage of the location in Brussels by providing a half-day of advocacy training followed by the opportunity for in-person meetings with relevant policymakers, including the European Parliament, the European Commission, embassies, and key NGOs. Participants will leave with a deeper understanding of the conference’s three issue areas, stronger skills in organizing local and national projects, and connections with policymakers and prominent leaders across the three issue areas.

Speakers at OpenCon 2014 included the Deputy Assistant to the President of the United States for Legislative Affairs, the Chief Commons Officer of Sage Bionetworks, the Associate Director for Data Science for the U.S. National Institutes of Health, and more than 15 students and early career academic professionals leading successful initiatives. OpenCon 2015 will again feature leading experts. Patrick Brown and Michael Eisen, two of the co-founders of PLOS, are confirmed for a joint keynote at the 2015 meeting.

“For the ‘open’ movements to succeed, we must invest in capacity building for the next generation of librarians, researchers, scholars, and educators,” said Heather Joseph, Executive Director of SPARC (The Scholarly Publishing and Academic Resources Coalition). “OpenCon is dedicated to creating and empowering a global network of young leaders across these issues, and we are eager to partner with others in the community to support and catalyze these efforts.”

OpenCon seeks to convene the most effective student and early career academic professional advocates—regardless of their ability to pay for travel costs. The majority of participants will receive full travel scholarships. Because of this, attendance is by application only, though limited sponsorship opportunities are available to guarantee a fully funded place at the conference. Applications will open on June 1, 2015.

In 2014, more than 1,700 individuals from 125 countries applied to attend the inaugural OpenCon. This year, an expanded emphasis will be placed on building the community around OpenCon and on satellite events. OpenCon satellite events are independently hosted meetings that mix content from the main conference with live presenters to localize the discussion and bring the energy of an in-person OpenCon event to a larger audience. In 2014, OpenCon satellite events reached hundreds of students and early career academic professionals in nine countries across five continents. A call for partners to host satellite events has now opened and is available at

OpenCon 2015 is organized by the Right to Research Coalition, SPARC, and a committee of student and early career researcher organizations from around the world.

Applications for OpenCon 2015 will open on June 1st. For more information about the conference and to sign up for updates, visit You can follow OpenCon on Twitter at @Open_Con or using the hashtag #opencon.

Hydra Project: DPLA joins the Hydra Partners

planet code4lib - Fri, 2015-04-10 09:28

We are delighted to announce that the Digital Public Library of America (DPLA) has become the latest formal Hydra Partner.  In their Letter of Intent Mark Matienzo, DPLA’s Director of Technology, writes of their “upcoming major Hydra project, generously funded by the IMLS, and in partnership with Stanford University and Duraspace, [which] focuses on developing an improved set of tools for content management, publishing, and aggregation for the network of DPLA Hubs. This, and other projects, will allow us to make contributions to other core components of the Hydra stack, including but not limited to Blacklight, ActiveTriples, and support for protocols like IIIF and ResourceSync. We are also interested in continuing to contribute our metadata expertise to the Hydra community to ensure interoperability across our communities.”

A warm welcome into the Partners for all our friends at the DPLA!

HangingTogether: Managing Metadata for Image Collections

planet code4lib - Thu, 2015-04-09 18:03

That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Naun Chew of Cornell and Stephen Hearn of the University of Minnesota. Focus group members manage a wide variety of image collections that present challenges for metadata management. In some cases, image collections that developed outside the library and its data models need to be integrated with other collections or into new search environments. Depending on the nature of the collection and its users, questions concerning identification of works, depiction of entities, chronology, geography, provenance, genre, and subjects (“of-ness” and “about-ness”) all present themselves; so do opportunities for crowdsourcing and interdisciplinary research.

Many describe their digital image resources at the collection level while selectively describing items. As much as possible, enhancements are done in batch. Some do authority work, depending on the quality of the accompanying metadata. Some libraries have disseminated metadata guidelines to help bring more consistency to the data.

Among the challenges discussed:

Variety of systems and schemas: Image collections created in different parts of the university, such as art or anthropology departments, serve different purposes and use different systems and schemas than those used by the library. The metadata often arrives in spreadsheets or as unstructured accompanying data, and metadata created by other departments frequently requires extensive editing. The situation is simpler when all digitization is handled through one center and the library does all of the metadata creation. Some libraries use Dublin Core for their image collections’ metadata and others use MODS (Metadata Object Description Schema). It was suggested that MODS be used in conjunction with MADS (Metadata Authority Description Schema).
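The spreadsheet-to-schema step described above can be sketched in a few lines. This is a minimal, hypothetical example: the column names and the choice of Dublin Core elements are assumptions for illustration, not any library's actual crosswalk.

```python
import csv
import io

# Hypothetical mapping from a department's spreadsheet columns
# to Dublin Core element names (an assumed crosswalk, for illustration).
COLUMN_TO_DC = {
    "Title of Work": "dc:title",
    "Photographer": "dc:creator",
    "Year Taken": "dc:date",
    "Keywords": "dc:subject",
}

def rows_to_dublin_core(csv_text):
    """Convert spreadsheet rows into Dublin Core-style records."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {}
        for column, dc_element in COLUMN_TO_DC.items():
            value = row.get(column, "").strip()
            if value:  # skip empty cells rather than emit blank elements
                record[dc_element] = value
        records.append(record)
    return records

sample = ("Title of Work,Photographer,Year Taken,Keywords\n"
          "Campus Gate,Jane Doe,1922,architecture\n")
print(rows_to_dublin_core(sample))
```

In practice the crosswalk table is where most of the editing effort discussed above goes: each contributing department needs its own mapping, and unmapped columns surface the "unstructured accompanying data" that still requires manual review.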

Duplicate metadata for different objects: There are cases where the metadata is identical for a set of drawings, even though there are slight differences in those drawings. Duplicating the metadata across similar objects is likely due to limited staff. Possibly the faculty could add more details. Brown University extended authorizations to photographers to add to the metadata accompanying their photographs without any problems.

Lack of provenance: A common challenge is receiving image collections with scanty metadata and no information about their provenance. For example, a researcher took OCR’ed text retrieved from HathiTrust and ended up with millions of images, but did not retain metadata indicating where the images came from. The challenge is to support both a specific purpose and group of users as well as large-scale discovery.

Maintaining links between metadata and images: How should libraries store images and keep them in sync with the metadata?  There may be rights issues from relying on a specific platform to maintain links between metadata and images. Where should thumbnails live?
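One routine safeguard for the sync problem raised above is a batch check that every metadata record has a matching image file and vice versa. The sketch below is a hypothetical illustration: it assumes record IDs match image filename stems (e.g. record "img0042" pairs with "img0042.tif"), which is a naming convention of my own, not one from the discussion.

```python
import os

def sync_report(metadata_ids, image_filenames, extensions=(".tif", ".jpg")):
    """Report metadata records lacking images and images lacking records.

    Assumes (as a convention for this sketch) that image filenames share
    a stem with their metadata record ID.
    """
    image_ids = set()
    for name in image_filenames:
        stem, ext = os.path.splitext(name)
        if ext.lower() in extensions:
            image_ids.add(stem)
    metadata_ids = set(metadata_ids)
    missing_images = sorted(metadata_ids - image_ids)   # records with no image
    orphan_images = sorted(image_ids - metadata_ids)    # images with no record
    return missing_images, orphan_images
```

Running a report like this regularly makes link rot visible early, whichever platform hosts the images and wherever the thumbnails end up living.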

Relating multiple views and versions of the same object: Multiple versions of the same object taken over time can be very useful for forensics. For example, Stanford relied on dated photographs to determine when its “Gates of Hell” sculpture had been damaged. Brown University decided to describe a “blob” of various images of the same thing in different formats and then have descriptions of the specific versions hanging off it. The systems used within the OCLC Research Library Partnership do not yet have a good way to structure and represent relationships among images, such as components of a piece.

Integrating collections from different sources: Stanford is considering ingesting images from a local art museum, many of which are images of a single object, so that scholars can study the object over time. They are wondering how to represent them in their discovery layer. MIT is trying to integrate metadata coming from different departments so that they can contribute to different aggregators, such as the DPLA. All involved get together to make sure that there is a shared understanding. Contributing images and having them live in an aggregated environment present new challenges.

Yale’s largest image collection is the Kissinger papers, with about 2 million scanned images. For much of the collection, metadata is very scanty. Meetings among the collection owner, metadata specialists, and systems staff try to resolve insufficient or questionable data and to come to a shared understanding. They store two copies of each image: TIFF (for preservation and on request) and JPEG (for everything else).

Managing relationships with faculty and curators: It’s important to ensure that faculty feel their needs are met. Collaboration is necessary among holders of the materials, metadata specialists and developers as all come from different perspectives.


Challenges of bringing together different images or versions of the same object in a large aggregation were explored by OCLC Research’s Europeana Innovation Pilots.  The pilots came up with a method for hierarchically structuring cultural objects at different similarity levels to find semantic clusters.
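As a very rough illustration of threshold-based semantic clustering (not the pilots' actual algorithm), the sketch below does greedy single-linkage clustering over title strings. The similarity function, difflib's sequence ratio, is a stand-in assumption; the pilots' real similarity measures and hierarchy construction differ.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude string similarity in [0, 1]; a stand-in for a real measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_at_threshold(titles, threshold):
    """Greedy single-linkage clustering: two titles share a cluster if
    any pair across the clusters meets the similarity threshold."""
    clusters = []
    for title in titles:
        merged = None
        for cluster in clusters:
            if any(similarity(title, member) >= threshold for member in cluster):
                if merged is None:
                    cluster.append(title)   # first matching cluster absorbs the title
                    merged = cluster
                else:
                    merged.extend(cluster)  # title bridges two clusters: merge them
                    cluster.clear()
        if merged is None:
            clusters.append([title])        # no match: start a new cluster
        clusters = [c for c in clusters if c]
    return clusters
```

Varying the threshold yields the "different similarity levels" idea: a high threshold separates near-duplicate captures of one object, while a lower one groups looser families of related objects into the same cluster.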


About Karen Smith-Yoshimura

Karen Smith-Yoshimura, program officer, works on topics related to renovating descriptive and organizing practices with a focus on large research libraries and area studies requirements.


DPLA: DPLA and Education: Findings and Recommendations from our Whiting Study

planet code4lib - Thu, 2015-04-09 17:09

During the last nine months, the Digital Public Library of America has been researching educational use with generous support from the Whiting Foundation. We’ve been learning from other online education resource providers in the cultural heritage world and beyond about what they have offered and how teachers and students have responded. We also convened focus groups of instructors in both K-12 and undergraduate higher education to hear about what they use and would like to see from DPLA. Today, we are proud to release our research findings and recommendations for DPLA’s future education work.

Preliminary feedback from educators suggested that DPLA was exciting as a single place to find content but occasionally overwhelming because of the volume of accessible material. In our focus groups, we learned that educators are eager to incorporate primary sources into instructional activities and independent student research projects, but we can better help them by organizing source highlights topically and giving them robust context. We also discovered how important it was to educators and students to be able to create their own primary-source-based projects with tools supported by DPLA. From other education projects, including many supported by our Hubs, we learned that sustainable education projects require teacher involvement, deep standards research, and specific outreach strategies. Based on this research, we recommend that DPLA and its teacher advocates build curated content for student use with teacher guidance, and that DPLA use its position at the center of a diverse mix of cultural heritage institutions to continue to facilitate conversations about educational use. We see this report as the beginning of a process of working with our many partners and educators to make large digital collections like DPLA more useful and used.

Péter Király: Seminar Programme: Göttingen Dialog in Digital Humanities (2015)

planet code4lib - Thu, 2015-04-09 17:02

Seminar Programme: Göttingen Dialog in Digital Humanities (2015)

The dialogs take place on Tuesdays at 17:00 during the summer semester (from April 21st until July 14th). The venue of the seminars, to be announced, is at the Göttingen Centre for Digital Humanities (GCDH). The centre's address is: Heyne-Haus, Papendiek 16, D-37073 Göttingen.

April 21
Yuri Bizzoni, Angelo Del Grosso, Marianne Reboul (University of Pisa, Italy)
Diachronic trends in Homeric translations

April 28
Stefan Jänicke, Judith Blumenstein, Michaela Rücker, Dirk Zeckzer, Gerik Scheuermann (Universität Leipzig, Germany)
Visualizing the Results of Search Queries on Ancient Text Corpora with Tag Pies

May 5
Jochen Tiepmar (Universität Leipzig, Germany)
Release of the MySQL based implementation of the CTS protocol

May 12
Patrick Jähnichen, Patrick Oesterling, Tom Liebmann, Christoph Kurras, Gerik Scheuermann, Gerhard Heyer (Universität Leipzig, Germany)
Exploratory Search Through Visual Analysis of Topic Models

May 19
Christof Schöch (Universität Würzburg, Germany)
Topic Modeling Dramatic Genre

May 26
Peter Robinson (University of Saskatchewan, Canada)
Some principles for making of collaborative scholarly editions in digital form

June 2
Jürgen Enge, Heinz Werner Kramski, Susanne Holl (HAWK Hildesheim, Germany)
»Arme Nachlassverwalter...« Challenges, findings, and proposed solutions in processing complex digital data collections

June 9
Daniele Salvoldi (Freie Universität Berlin, Germany)
A Historical Geographic Information System (HGIS) of Nubia based on the William J. Bankes Archive (1815-1822)

June 16
Daniel Burckhardt (HU Berlin, Germany)
Comparing Disciplinary Patterns: Gender and Social Networks in the Humanities through the Lens of Scholarly Communication

June 23
Daniel Schüller, Christian Beecks, Marwan Hassani, Jennifer Hinnell, Bela Brenger, Thomas Seidl, Irene Mittelberg (RWTH Aachen University, Germany, University of Alberta, Canada)
Similarity Measuring in 3D Motion Capture Models of Co-Speech Gesture

June 30
Federico Nanni (University of Bologna, Italy)
Reconstructing a website’s lost past - Methodological issues concerning the history of

July 7
Edward Larkey (University of Maryland, USA)
Comparing Television Formats: Using Digital Tools for Cross-Cultural Analysis

July 14
Francesca Frontini, Amine Boukhaled, Jean-Gabriel Ganascia (Laboratoire d’Informatique de Paris 6, Université Pierre et Marie Curie)
Mining for characterising patterns in literature using correspondence analysis: an experiment on French novels

As announced in the Call For Papers, the dialogs will take the form of a 45-minute presentation in English, followed by 45 minutes of discussion and student participation. Due to logistical and time constraints, the 2015 dialog series will not be video-recorded or live-streamed. A summary of the talks, together with photographs and, where available, slides, will be uploaded to the GCDH/eTRAP website. For this reason, presenters are encouraged, but not obligated, to prepare slides to accompany their papers. Please also consider that the €500 award for best paper will be awarded on the basis of both the quality of the paper *and* the delivery of the presentation.

Camera-ready versions of the papers must reach Gabriele Kraft at gkraft(at)gcdh(dot)de by April 30.

The papers will not be uploaded to the GCDH/eTRAP website but, as previously announced, published as a special issue of Digital Humanities Quarterly (DHQ). For this reason, papers must be submitted in an editable format (e.g. .docx or LaTeX), not as PDF files.

A small budget for travel cost reimbursements is available.

Everybody is welcome to join in.

If anyone would like to tweet about the dialogs, the Twitter hashtag of this series is #gddh15.

For any questions, do not hesitate to contact gkraft(at)gcdh(dot)de. For further information and updates, visit or

We look forward to seeing you in Göttingen!

The GDDH Board (in alphabetical order):
Camilla Di Biase-Dyson (Georg August University Göttingen)
Marco Büchler (Göttingen Centre for Digital Humanities)
Jens Dierkes (Göttingen eResearch Alliance)
Emily Franzini (Göttingen Centre for Digital Humanities)
Greta Franzini (Göttingen Centre for Digital Humanities)
Angelo Mario Del Grosso (ILC-CNR, Pisa, Italy)
Berenike Herrmann (Georg August University Göttingen)
Péter Király (Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen)
Gabriele Kraft (Göttingen Centre for Digital Humanities)
Bärbel Kröger (Göttingen Academy of Sciences and Humanities)
Maria Moritz (Göttingen Centre for Digital Humanities)
Sarah Bowen Savant (Aga Khan University, London, UK)
Oliver Schmitt (Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen)
Sree Ganesh Thotempudi (Göttingen Centre for Digital Humanities)
Jörg Wettlaufer (Göttingen Centre for Digital Humanities & Göttingen Academy of Sciences and Humanities)
Ulrike Wuttke (Göttingen Academy of Sciences and Humanities)

This event is financially supported by the German Ministry of Education and Research (No. 01UG1509).

