
SearchHub: Introducing Lucidworks View!

planet code4lib - Tue, 2016-04-12 16:14

Lucidworks is pleased to announce the release of Lucidworks View.

View is an extensible search interface designed to work with Fusion, allowing for the deployment of an enterprise-ready search front end with minimal effort. View has been designed to harness the power of Fusion query pipelines and signals, and provides essential search capabilities including faceted navigation, typeahead suggestions, and landing page redirects.

View showing automatic faceted navigation:

View showing typeahead query pipelines, and the associated config file on the right:

View is powered by Fusion, Gulp, AngularJS, and Sass, allowing for the easy deployment of a sophisticated and customized search interface. All visual elements of View can be easily configured using SCSS styling.

View is easy to customize; quickly change styling with a few edits:

Additional features:

  • Document display templates for common Fusion data sources.
  • Included templates are web, file, Slack, Twitter, Jira and a default.
  • Landing Page redirects.
  • Integrates with Fusion authentication.

Lucidworks View 1.0 is available for immediate download at http://lucidworks.com/products/view

Read the release notes or documentation, learn more on the Lucidworks View product page, or browse the source on GitHub.

The post Introducing Lucidworks View! appeared first on Lucidworks.com.

Mark E. Phillips: DPLA Descriptive Metadata Lengths: By Provider/Hub

planet code4lib - Tue, 2016-04-12 15:30

In the last post I took a look at the length of the description fields for the Digital Public Library of America as a whole.  In this post I wanted to spend a little time looking at these numbers on a per-provider/hub basis to see if there is anything interesting in the data.

I’ll jump right in with a table that shows all 29 of the providers/hubs that are represented in the snapshot of metadata that I am working with this time.  In this table you can see the minimum record length, max length, the number of descriptions (remember values can be multi-valued so there are more descriptions than records for a provider/hub),  sum (all of the lengths added together), the mean of the length and then finally the standard deviation.

provider                      min      max      count          sum     mean    stddev
artstor                         0    6,868    128,922     9,413,898    73.02    178.31
bhl                             0      100    123,472       775,600     6.28      8.48
cdl                             0    6,714    563,964    65,221,428   115.65    211.47
david_rumsey                    0    5,269    166,313    74,401,401   447.36    861.92
digital-commonwealth            0   23,455    455,387    40,724,507    89.43    214.09
digitalnc                       1    9,785    241,275    45,759,118   189.66    262.89
esdn                            0    9,136    197,396    23,620,299   119.66    170.67
georgia                         0   12,546    875,158   135,691,768   155.05    210.85
getty                           0    2,699    264,268    80,243,547   303.64    273.36
gpo                             0    1,969    690,353    33,007,265    47.81     58.20
harvard                         0    2,277     23,646     2,424,583   102.54    194.02
hathitrust                      0    7,276  4,080,049   174,039,559    42.66     88.03
indiana                         0    4,477     73,385     6,893,350    93.93    189.30
internet_archive                0    7,685    523,530    41,713,913    79.68    174.94
kdl                             0      974    144,202       390,829     2.71     24.95
mdl                             0   40,598    483,086   105,858,580   219.13    345.47
missouri-hub                    0  130,592    169,378    35,593,253   210.14   2325.08
mwdl                            0  126,427  1,195,928   174,126,243   145.60    905.51
nara                            0    2,000    700,948     1,425,165     2.03     28.13
nypl                            0    2,633  1,170,357    48,750,103    41.65    161.88
scdl                            0    3,362    159,681    18,422,935   115.37    164.74
smithsonian                     0    6,076  2,808,334   139,062,761    49.52    137.37
the_portal_to_texas_history     0    5,066  1,271,503   132,235,329   104.00     95.95
tn                              0   46,312    151,334    30,513,013   201.63    248.79
uiuc                            0    4,942     63,412     3,782,743    59.65    172.44
undefined_provider              0      469     11,436         2,373     0.21      6.09
usc                             0   29,861  1,076,031    60,538,490    56.26    193.20
virginia                        0      268     30,174       301,042     9.98     17.91
washington                      0    1,000     42,024     5,258,527   125.13    177.40

This table is very helpful to reference as we move through the post but it is rather dense.  I’m going to present a few graphs that I think illustrate some of the more interesting things in the table.
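Before moving on to the graphs, here is a minimal sketch of how per-provider statistics like those in the table could be computed with pandas. This is not the workflow used for the post (that analysis ran through Solr); it assumes a hypothetical CSV export with one row per description and columns named provider and desc_length.

```python
import pandas as pd

# Hypothetical export: one row per description field in the DPLA snapshot.
df = pd.read_csv("dpla_description_lengths.csv")

# min, max, count, sum, mean, and standard deviation per provider/hub,
# mirroring the columns of the table above.
stats = (
    df.groupby("provider")["desc_length"]
      .agg(["min", "max", "count", "sum", "mean", "std"])
      .round(2)
)
print(stats)
```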

Average Description Length

The first is to just look at the average description length per provider/hub to see if there is anything interesting in there.

Average Description Length by Hub

For me I see that there are several bars that are very small on this graph, specifically for the providers bhl, kdl, nara, undefined_provider, and virginia. I also noticed that david_rumsey has the highest average description length at roughly 450 characters. Following david_rumsey is getty at about 300, and then mdl, missouri-hub, and tn, which are all at about 200 characters on average.

One thing to keep in mind from the previous post is that the average description length for the whole DPLA was 83.32 characters, so many of the hubs were over that, and some were significantly over that number.

Mean and Standard Deviation by Partner/Hub

I think it is also helpful to look at the standard deviation in addition to just the average; that way you get a sense of how much variability there is in the data.

Description Length Mean and Stddev by Hub

There are a few providers/hubs that I think stand out from the others in this chart. First, david_rumsey has a stddev just short of double its average length. The mwdl and missouri-hub providers have a very high stddev compared to their averages. For this dataset, it appears that these partners have a huge range in their description lengths compared to the others.

There are a few that have a relatively small stddev compared to the average length.  There are just two partners that actually have a stddev lower than the average, those being the_portal_to_texas_history and getty.

Longest Description by Partner/Hub

In the last blog post we saw that there was a description that was over 130,000 characters in length.  It turns out that there were two partner/hubs that had some seriously long descriptions.

Longest Description by Hub

Remember the earlier chart that showed the average and the stddev next to each other for each provider/hub, where we saw a pretty large stddev for missouri-hub and mwdl? The chart above shows why: both of these hubs have descriptions of over 120,000 characters.

There are six providers/hubs that have some seriously long descriptions: digital-commonwealth, mdl, missouri-hub, mwdl, tn, and usc. I could be wrong, but I have a feeling that descriptions that long probably aren’t that helpful for users and are most likely the full text of the resource making its way into the metadata record. We should remember, “metadata is data about data”… not the actual data.

Total Description Length of Descriptions by Provider/Hub

Total Description Length of All Descriptions by Hub

Just for fun I was curious how the total lengths of the description fields per provider/hub would look on a graph; those really large numbers are hard to hold in your head.

It is interesting to note that hathitrust, which has the most records in the DPLA, doesn’t contribute the most description content; in fact the most is contributed by mwdl. Looking into the sourcing of these records explains why: the majority of the records in the hathitrust set come from MARC records, which typically don’t have the same notion of “description” that records from digital libraries and formats like Dublin Core have, while mwdl is an aggregator of digital library content and has quite a bit more description content per record.

Other providers/hubs of note are georgia, mdl, smithsonian, and the_portal_to_texas_history which all have over 100,000,000 characters in their descriptions.

Closing for this post

Are there other aspects of this data that you would like me to take a look at?  One idea I had was to try to determine, on a provider/hub basis, what might count as “too long” for a given provider based on some method of outlier detection. I’ve done the work for this but don’t know enough about the mathy parts to know whether it is relevant to this dataset or not.
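One common outlier-detection approach (not necessarily the one used for this work) is Tukey's IQR fence. Below is a rough sketch of how it could be applied per provider/hub, reusing the hypothetical CSV export of description lengths assumed earlier; the column names are assumptions, not part of the DPLA data.

```python
import pandas as pd

# Hypothetical export: one row per description with "provider" and "desc_length" columns.
df = pd.read_csv("dpla_description_lengths.csv")

def too_long_threshold(lengths):
    # Tukey's rule: anything above Q3 + 1.5 * IQR is flagged as an outlier.
    q1, q3 = lengths.quantile([0.25, 0.75])
    return q3 + 1.5 * (q3 - q1)

# One "too long" threshold per provider/hub.
thresholds = df.groupby("provider")["desc_length"].apply(too_long_threshold)
print(thresholds.sort_values(ascending=False))
```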

I have about a dozen more metrics that I want to look at for these records so I’m going to have to figure out a way to move through them a bit quicker otherwise this blog might get a little tedious (more than it already is?).

If you have questions or comments about this post, please let me know via Twitter.

Evergreen ILS: Statement on North Carolina House Bill 2

planet code4lib - Tue, 2016-04-12 15:04

Due to the recent passage of House Bill 2 in North Carolina, the Evergreen Oversight Board, on behalf of the Evergreen Project, has released the following statement. Our main concern is that our conference is a safe and inclusive space for all Evergreen Community members. While other organizations have cancelled their conferences in North Carolina over this matter, our conference was simply too imminent to move or cancel without significant harm to the Project and its members. Please feel free to contact the Evergreen Oversight Board if you have any questions about this statement.

Sincerely,
Grace Dunbar
Chair, Evergreen Oversight Board

Per the Evergreen Project’s Event Code of Conduct, Evergreen event organizers are dedicated to providing a harassment-free experience for everyone, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion. We do not tolerate harassment of event participants in any form. It is now important to reemphasize that commitment.

In particular, the Evergreen Oversight Board is disappointed that the North Carolina General Assembly has chosen to pass legislation that nullifies city ordinances that protect LGBT individuals from discrimination, including one such ordinance passed by the Raleigh City Council. Since the 2016 Evergreen Conference is to be held in Raleigh, the Oversight Board is taking the following steps to protect conference attendees:

  • We are working with the conference venue and hotels to ensure that their staff and organizations will not discriminate on the basis of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion.
  • Members of the Evergreen Safety Committee will be available at the conference to advocate for and assist conference attendees, including accompanying attendees to and from the conference venues.
  • The Evergreen Project, via its fiscal agent Software Freedom Conservancy, will refund in full registrations from individuals who feel that they can no longer safely attend the conference. We’re committed to processing refunds requested on this basis, but it will take us some time to process them. Please be patient if you request a refund.

Library of Congress: The Signal: Expanding NDSA Levels of Preservation

planet code4lib - Tue, 2016-04-12 15:01

This is a guest post by Shira Peltzman from the UCLA Library.

Shira Peltzman. Photo by Alan Barnett.

Last month Alice Prael and I gave a presentation at the annual Code4Lib conference in which I mentioned a project I’ve been working on to update the NDSA Levels of Digital Preservation so that it includes a metric for access. (You can see the full presentation on YouTube, starting at the 1:24:00 mark.)

For anyone who is unfamiliar with NDSA Levels, it’s a tool that was developed in 2012 by the National Digital Stewardship Alliance as a concise and user-friendly rubric to help organizations manage and mitigate digital preservation risks. The original version of the Levels of Digital Preservation includes four columns (Levels 1-4) and five rows. The columns/levels range in complexity, from the least you can do (Level 1) to the most you can do (Level 4). Each row represents a different conceptual area: Storage and Geographic Location, File Fixity and Data Integrity, Information Security, Metadata, and File Formats. The resulting matrix contains a tiered list of concrete technical steps that correspond to each of these preservation activities.

It has been on my mind for a long time to expand the NDSA Levels so that the table includes a means of measuring an organization’s progress with regard to access. I’m a firm believer in the idea that access is one of the foundational tenets of digital preservation. It follows that if we are unable to provide access to the materials we’re preserving, then we aren’t really doing such a great job of preserving those materials in the first place.

When it comes to digital preservation, I think there’s been an unfortunate tendency to give short shrift to access, to treat it as something that can always be addressed in the future. In my view, the lack of any access-related fields within the current NDSA Levels reflects this.

Of course I understand that providing access can be tricky and resource-intensive in general, but particularly so when it comes to born-digital. From my perspective, this is all the more reason why it would be useful for the NDSA Levels to include a row that helps institutions measure, build, and enhance their access initiatives.

While some organizations use NDSA Levels as a blueprint for preservation planning, other organizations — including the UCLA Library where I work — employ NDSA Levels as a means to assess compliance with preservation best practices and identify areas that need to be improved.

In fact, it was in this vein that the need originally arose for a row in NDSA Levels explicitly addressing access. After suggesting that we use NDSA Levels as a framework for our digital preservation gap analysis, it quickly became apparent to me that its failure to address Access would be a blind spot too great to ignore.

Providing access to the material in our care is so central to UCLA Library’s mission and values that failing to assess our progress/shortcomings in this area was not an option for us. To address this, I added an Access row to the NDSA Levels designed to help us measure and enhance our progress in this area.

My aims in crafting the Access row were twofold: First, I wanted to acknowledge the OAIS reference model by explicitly addressing the creation of Dissemination Information Packages (which in turn necessitated mentioning other access-related terms like Designated Community, Representation Information and Preservation Description Information). This resulted in the column feeling rather jargon-heavy, so eventually I’d like to adjust this so that it better matches the tone and language of the other columns.

Second, I tried to remain consistent with the model already in place. That meant designing the steps for each column/level so that they are both content agnostic and system agnostic and can be applied to various collections or systems. For the sake of consistency I also tried to maintain the sub-headings for each column/level (i.e., “protect your data,” “know your data,” “monitor your data,” and “repair your data”), even though some have questioned their usefulness in the past; for more on this, see the comments at the bottom of Trevor Owens’ blog post.

While I’m happy with the end result overall, these categories map better in some instances than in others. I welcome feedback from you and the digital preservation community at large about how they could be improved. I have deliberately set the permissions to allow anyone to view/edit the document, since I’d like for this to be something to which the preservation community at large can contribute.

Fortunately, NDSA Levels was designed to be iterative. In fact, in a paper titled “The NDSA Levels of Digital Preservation: An Explanation and Uses,” published shortly after NDSA Levels’ debut, its authors solicited feedback from the community and acknowledged future plans to revise the chart. Tools like this ultimately succeed because practitioners push for them to be modified and refined so that they can better serve the community’s needs. I hope that enough consensus builds around some of the updates I proposed for them to eventually become officially incorporated into the next iteration of the NDSA Levels if and when it is released.

My suggested updates are in the last row of the Levels of Preservation table below, labeled Access. If you have any questions please contact me: Shira Peltzman, Digital Archivist, UCLA Library, speltzman@library.ucla.edu | (310) 825-4784.

LEVELS OF PRESERVATION

Column headings: Level One (Protect Your Data), Level Two (Know Your Data), Level Three (Monitor Your Data), Level Four (Repair Your Data).

Storage and Geographic Location
  • Level One: Two complete copies that are not collocated. For data on heterogeneous media (optical disks, hard drives, etc.) get the content off the medium and into your storage system.
  • Level Two: At least three complete copies. At least one copy in a different geographic location. Document your storage system(s) and storage media and what you need to use them.
  • Level Three: At least one copy in a geographic location with a different disaster threat. Obsolescence monitoring process for your storage system(s) and media.
  • Level Four: At least 3 copies in geographic locations with different disaster threats. Have a comprehensive plan in place that will keep files and metadata on currently accessible media or systems.

File Fixity and Data Integrity
  • Level One: Check file fixity on ingest if it has been provided with the content. Create fixity info if it wasn’t provided with the content.
  • Level Two: Check fixity on all ingests. Use write-blockers when working with original media. Virus-check high risk content.
  • Level Three: Check fixity of content at fixed intervals. Maintain logs of fixity info; supply audit on demand. Ability to detect corrupt data. Virus-check all content.
  • Level Four: Check fixity of all content in response to specific events or activities. Ability to replace/repair corrupted data. Ensure no one person has write access to all copies.

Information Security
  • Level One: Identify who has read, write, move, and delete authorization to individual files. Restrict who has those authorizations to individual files.
  • Level Two: Document access restrictions for content.
  • Level Three: Maintain logs of who performed what actions on files, including deletions and preservation actions.
  • Level Four: Perform audit of logs.

Metadata
  • Level One: Inventory of content and its storage location. Ensure backup and non-collocation of inventory.
  • Level Two: Store administrative metadata. Store transformative metadata and log events.
  • Level Three: Store standard technical and descriptive metadata.
  • Level Four: Store standard preservation metadata.

File Formats
  • Level One: When you can give input into the creation of digital files, encourage use of a limited set of known open file formats and codecs.
  • Level Two: Inventory of file formats in use.
  • Level Three: Monitor file format obsolescence issues.
  • Level Four: Perform format migrations, emulation and similar activities as needed.

Access (proposed new row)
  • Level One: Determine designated community [1]. Ability to ensure the security of the material while it is being accessed; this may include physical security measures (e.g. someone staffing a reading room) and/or electronic measures (e.g. a locked-down viewing station, restrictions on downloading material, restricting access by IP address, etc.).
  • Level Two: Ability to identify and redact personally identifiable information (PII) and other sensitive material. Have publicly available catalogs, finding aids, inventories, or collection descriptions so that researchers can discover material.
  • Level Three: Create Submission Information Packages (SIPs) and Archival Information Packages (AIPs) upon ingest [2]. Ability to generate Dissemination Information Packages (DIPs) on ingest [3]. Store Representation Information and Preservation Description Information [4].
  • Level Four: Have a publicly available access policy. Ability to provide access to obsolete media via its native environment and/or emulation.

[1] Designated Community essentially means “users”; the term comes from the Reference Model for an Open Archival Information System (OAIS).
[2] The Submission Information Package (SIP) is the content and metadata received from an information producer by a preservation repository. An Archival Information Package (AIP) is the set of content and metadata managed by a preservation repository, and organized in a way that allows the repository to perform preservation services.
[3] A Dissemination Information Package (DIP) is distributed to a consumer by the repository in response to a request, and may contain content spanning multiple AIPs.
[4] Representation Information refers to any software, algorithms, standards, or other information that is necessary to properly access an archived digital file. Or, as Preservation Metadata and the OAIS Information Model puts it, “A digital object consists of a stream of bits; Representation Information imparts meaning to these bits.” Preservation Description Information refers to the information necessary for adequate preservation of a digital object: for example, Provenance, Reference, Fixity, Context, and Access Rights Information.

Access Conference: Peer Reviewers Wanted!

planet code4lib - Tue, 2016-04-12 13:22

Interested in helping out with Access 2016? Looking to gain some valuable professional experience? The Access 2016 program committee is looking for volunteers to serve as peer-reviewers!

If you’re interested, send us an email at accesslibcon@gmail.com by Friday, April 22, attaching a copy of your current CV and answers to the following:

  • Name
  • Current Position (student reviewers are also welcome!)
  • Institution
  • Have you attended Access before?
  • Have you presented at Access before?
  • Have you been a peer reviewer for Access before?

Questions or comments? Drop us a line at accesslibcon@gmail.com.

DuraSpace News: Registration for the VIVO 2016 Conference is Now Open!

planet code4lib - Tue, 2016-04-12 00:00

From the VIVO 2016 Conference organizers

Join us in beautiful Denver, Colorado, August 17 to 19 for the VIVO 2016 Conference. To reserve your hotel room at the VIVO conference discount, book now before rooms sell out.

David Rosenthal: Brewster Kahle's "Distributed Web" proposal

planet code4lib - Mon, 2016-04-11 20:21
Back in August last year Brewster Kahle posted Locking the Web Open: A Call for a Distributed Web. It consisted of an analysis of the problems of the current Web, a set of requirements for a future Web that wouldn't have those problems, and a list of pieces of current technology that he suggested could be assembled into a working if simplified implementation of those requirements layered on top of the current Web. I meant to blog about it at the time, but I was busy finishing my report on emulation.

Last November, Brewster gave the EE380 lecture on this topic (video from YouTube or Stanford), reminding me that I needed to write about it. I still didn't find time to write a post. On 8th June, Brewster, Vint Cerf and Cory Doctorow are to keynote a Decentralized Web Summit. I encourage you to attend. Unfortunately, I won't be able to, and this has finally forced me to write up my take on this proposal. Follow me below the fold for a brief discussion; I hope to write a more detailed post soon.

I should start by saying that I agree with Brewster's analysis of the problems of the current Web, and his requirements for a better one. I even agree that the better Web has to be distributed, and that developing it by building prototypes layered on the current Web is the way to go in the near term. I'll start by summarizing Brewster's set of requirements and his proposed implementation, then point out some areas where I have concerns.

Brewster's requirements are:
  • Peer-to-Peer Architecture to avoid the single points of failure and control inherent in the endpoint-based naming of the current Web.
  • Privacy to disrupt the current Web's business model of pervasive, fine-grained surveillance.
  • Distributed Authentication for Identity to avoid the centralized control over identity provided by Facebook and Google.
  • Versioning to provide the memory the current Web lacks.
  • Easy payment mechanism to provide an alternate way to reward content generators.
There are already a number of attempts at partial implementations of these requirements, based as Brewster suggests on JavaScript, public-key cryptography, blockchain, Bitcoin, and Bittorrent. An example is IPFS (also here). Pulling these together into a coherent and ideally interoperable framework would be an important outcome of the upcoming summit.

Thinking of these as prototypes, exploring the space of possible features, they are clearly useful. But we have known the risks of allowing what should be prototypes to become "interim" solutions since at least the early 80s. The Alto "Interim" File Server (IFS) was designed and implemented by David R. Boggs and Ed Taft in the late 70s. In 1977 Ed wrote:
The interim nature of the IFS should be emphasized. The IFS is not itself an object of research, though it may be used to support other research efforts such as the Distributed Message System. We hope that Juniper will eventually reach the point at which it can replace IFS as our principal shared file system.

Because IFS worked well enough for people at PARC to get the stuff they needed done, the motivation to replace it with Juniper was never strong enough. The interim solution became permanent. Jim Morris, who was at PARC at the time, and who ran the Andrew Project at C-MU on which I worked from 1983-85, used IFS as the canonical example of a "success disaster", something whose rapid early success entrenches it in ways that cause cascading problems later.

And in this case the permanent solution is at least as well developed as the proposed "interim" one. For at least the last decade, rather than build a “Distributed Web”, Van Jacobson and many others have been working to build a “Distributed Internet”. The Content-Centric Networking project at Xerox PARC, which has become the Named Data Networking (NDN) project spearheaded by UCLA, is one of the NSF’s four projects under the Future Internet Architecture Program. Here is a list of 68 peer-reviewed papers published in the last 7 years relating to NDN.

By basing the future Internet on the name of a data object rather than the location of the object, many of the objectives of the “Distributed Web” become properties of the network infrastructure rather than something implemented in some people’s browsers.

Another way of looking at this is that the current Internet is about moving data from one place to another, NDN is about copying data. By making the basic operation in the net a copy, caching works properly (unlike in the current Internet). This alone is a huge deal, and not just for the Web. The Internet is more than just the Web, and the reasons for wanting to be properly “Distributed” apply just as much to the non-Web parts. And Web archiving should be, but currently isn't, about persistently caching selected parts of the Web.

I should stress that I believe that implementing these concepts initially on top of IP, and even on top of HTTP, is a great and necessary idea; it is how NDN is being tested. But doing so with the vision that eventually IP will be implemented on top of a properly “Distributed” infrastructure is also a great idea; IP can be implemented on top of NDN. For a detailed discussion of these ideas see my (long) 2013 blog post reviewing the 2012 book Trillions.

There are other risks in implementing Brewster's requirements using JavaScript, TCP/IP, the blockchain and the current Web:
  • JavaScript poses a fundamental risk, as we see from Douglas Crockford's attempt to define a "safe" subset of the language. It isn't clear that it is possible to satisfy Brewster's requirements in a safe subset of JavaScript, even if one existed. Allowing content from the Web to execute in your browser is a double-edged sword; it enables easy implementation of new capabilities, but if they are useful they are likely to pose a risk of being subverted.
  • Implementing anonymity on top of a communication infrastructure that explicitly connects endpoints turns out to be very hard. Both Tor and Bitcoin users have been successfully de-anonymized.
  • I have written extensively about the economic and organizational issues that plague Bitcoin, and will affect other totally distributed systems, such as the one Brewster wants to build. It is notable that Postel's Law (RFC 793), or the Robustness Principle, has largely prevented these problems from affecting the communication infrastructure level that NDN addresses.
So there are very good reasons why this way of implementing Brewster's requirements should be regarded as creating valuable prototypes, but we should be wary of the Interim File System effect. The Web we have is a huge success disaster. Whatever replaces it will be at least as big a success disaster. Let's not have the causes of the disaster be things we knew about all along.

Mark E. Phillips: DPLA Description Field Analysis: Yes there really are 44 “page” long description fields.

planet code4lib - Mon, 2016-04-11 16:00

In my previous post I mentioned that I was starting to take a look at the descriptive metadata fields in the metadata collected and hosted by the Digital Public Library of America.  That last post focused on records, how many records had description fields present, and how many were missing.  I also broke those numbers into the Provider/Hub groupings present in the DPLA dataset to see if there were any patterns.

Moving on, the next thing I wanted to look at was data related to each instance of the description field. I parsed each description field, calculated a variety of statistics for it, and then loaded that into my current data analysis tool, Solr, which acts as my data store and my full-text index.

After about seven hours of processing I ended up with 17,884,946 description fields from the 11,654,800 records in the dataset. You will notice that we have more descriptions than we do records; this is because a record can have more than one instance of a description field.
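As a sketch of what that indexing pass might look like, here is a minimal example using the pysolr client. The field name desc_length_i comes from the post; the core name, the record export format, and the provider_s field are assumptions for illustration.

```python
import json
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/dpla")  # assumed local core

docs = []
with open("dpla_records.jsonl") as handle:  # assumed bulk export, one JSON record per line
    for line in handle:
        record = json.loads(line)
        # A record may carry zero or more description values.
        for i, desc in enumerate(record.get("description", [])):
            docs.append({
                "id": "{}_{}".format(record["id"], i),
                "provider_s": record.get("provider", ""),
                "desc_length_i": len(desc),  # number of characters in this description
            })

solr.add(docs)  # index the per-description documents
```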

Let's take a look at a few of the high-level metrics.

Cardinality

I first wanted to find out the cardinality of the lengths of the description fields. When I indexed each of the descriptions, I counted the number of characters in the description and saved that as an integer in a field called desc_length_i in the Solr index. Once it was indexed, it was easy to retrieve the number of unique values for length that were present. There are 5,287 unique description lengths in the 17,884,946 descriptions that we are analyzing. This isn’t too surprising or meaningful by itself, just a bit of description of the dataset.
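One way to pull those distinct lengths (and the count of descriptions at each length, which also feeds the graphs below) back out of Solr is a plain field facet over desc_length_i. A sketch, assuming the same local core as above:

```python
import requests

params = {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.field": "desc_length_i",
    "facet.limit": -1,      # return every distinct length value
    "facet.mincount": 1,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/dpla/select", params=params).json()

# Solr returns the facet as a flat [value, count, value, count, ...] list.
pairs = resp["facet_counts"]["facet_fields"]["desc_length_i"]
lengths = dict(zip(pairs[0::2], pairs[1::2]))
print("distinct description lengths:", len(lengths))
```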

I tried to make a few graphs to show the lengths and how many descriptions had what length.  Here is what I came up with.

Length of Descriptions in dataset

You can barely see the blue line; the problem is that the zero-length descriptions number over 4 million while the longest lengths occur only once each.

Here is a second try using a log scale for the x axis.

Length of Descriptions in dataset (x axis log)

This reads a little better, I think: you can see a dive down from the zero lengths and then a spike back up at about 10 characters.

One more graph to see what we can see, this time a log-log plot of the data.

Length of Descriptions in dataset (log-log)

Average Description Lengths

Now that we are finished with the cardinality of the lengths,  next up is to figure out what the average description length is for the entire dataset.  This time the Solr StatsComponent is used and makes getting these statistics a breeze.  Here is a small table showing the output from Solr.

min   max       count        missing   sum             sumOfSquares        mean    stddev
0     130,592   17,884,946   0         1,490,191,622   2,621,904,732,670   83.32   373.71

Here we see that the minimum length for a description is zero characters (a record without a description present has a length of zero for that field in this model). The longest description in the dataset is 130,592 characters long. The total number of characters present in the dataset is nearly one and a half billion. Finally, the number we were after, the average length of a description, turns out to be 83.32 characters.
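For reference, a StatsComponent request like the one behind the table above can be made with a single query; a minimal sketch, again assuming a local core named dpla and the desc_length_i field:

```python
import requests

params = {
    "q": "*:*",
    "rows": 0,
    "stats": "true",
    "stats.field": "desc_length_i",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/dpla/select", params=params).json()

# min, max, count, missing, sum, sumOfSquares, mean, and stddev come back together.
stats = resp["stats"]["stats_fields"]["desc_length_i"]
print(stats["mean"], stats["stddev"])
```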

For those that might be curious what 84 characters (I rounded up instead of down) of description looks like, here is an example.

Aerial photograph of area near Los Angeles Memorial Coliseum, Los Angeles, CA, 1963.

So not a horrible looking length for a description.  It feels like it is just about one sentence long with 13 “words” in this sentence.

Long descriptions

Jumping back a bit to look at the length of the longest description field, that description is 130,592 characters long. If you assume that the average single-spaced page is 3,000 characters long, this description field is 43.5 pages long. The reader of this post who has spent time with aggregated metadata will probably say “looks like someone put the full-text of the item into the record”. If you’ve spent some serious (or maybe not that serious) time in the metadata mines (trenches?) you would probably mumble something like “ContentDM grumble grumble”, and you would be right on both accounts. Here is the record on the DPLA site with the 130,592-character-long description – http://dp.la/item/40a4f5069e6bf02c3faa5a445656ea61

The next thing I was curious about was the number of descriptions that were “long”. To answer this I am going to allow myself a bit of back-of-the-envelope freedom to decide what “long” is for a description field in a metadata record. (In future blog posts I might be able to answer this with different analysis of the data, but this hopefully will do for today.) For now I’m going to arbitrarily decide that anything over 325 characters in length is considered “too long”.

Descriptions: Too Long and Not Too Long

Looking at that pie chart, 5.8% of the descriptions are “too long” based on my ad-hoc metric from above. This 5.8% of the descriptions makes up 708,050,671 characters, or 48% of the 1,490,191,622 characters in the entire dataset. I bet if you looked a little harder you would find that the description field gets very close to the 80/20 rule, with 20% of the descriptions accounting for 80% of the overall description length.
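The same split can be counted directly in Solr with a pair of facet queries over desc_length_i; a sketch, using the 325-character cut-off from above (the core URL is still an assumption):

```python
import requests

params = {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.query": [
        "desc_length_i:[326 TO *]",   # "too long" by the ad-hoc metric
        "desc_length_i:[0 TO 325]",   # everything else
    ],
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/dpla/select", params=params).json()

# Counts for each range come back under facet_queries.
print(resp["facet_counts"]["facet_queries"])
```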

Short descriptions

Now that we’ve worked with long descriptions, the next thing we should look at are the number of descriptions that are “short” in length.

There are 4,113,841 records that don’t have a description in the DPLA dataset. This means that for this analysis 4,113,841 (23%) of the descriptions have a length of 0. There are 2,041,527 (11%) descriptions that have a length between 1 and 10 characters. Below is the breakdown of these ten counts; you can see that there is a surprising number (777,887) of descriptions that have a single character as their descriptive contribution to the dataset.

Descriptions 10 characters or less

There is also an interesting spike at ten characters in length where suddenly we jump to over 500,000 descriptions in the DPLA.

So what?

Now that we have the average length of a description in the DPLA dataset, the number of descriptions we consider “long”, and the number we consider “short”, I think the very next question that gets asked is “so what?”

I think there are four big reasons that I’m working on this kind of project with the DPLA data.

One is that the DPLA is the largest aggregation of descriptive metadata in the US for digital resources in cultural heritage institutions. This is important because you get to look at a wide variety of data input rules, practices, and conversions from local systems to an aggregated metadata system.

Secondly, this data is licensed CC0 and is available in a bulk data format, so it is easy to grab the data and start working with it.

Thirdly, there haven’t been that many studies on descriptive metadata like this that I’m aware of. OCLC will publish analysis of their MARC catalog data from time to time, and the research on IMLS-funded metadata that was happening at the GSLIS at UIUC isn’t going on anymore (great work to look at, by the way), so there really aren’t that many discussions about using large-scale aggregations of metadata to understand the practices in place in cultural heritage institutions across the US. I am pretty sure that there is work being carried out across the Atlantic with the Europeana datasets that are available.

Finally I think that this work can lead to metadata quality assurance practices and indicators for metadata creators and aggregators about what may be wrong with their metadata (a message saying “your description is over a page long, what’s up with that?”).

I don’t think there are many answers so far in this work, but I feel it is moving us in the direction of a better understanding of our descriptive metadata world in the context of these large aggregations of metadata.

If you have questions or comments about this post, please let me know via Twitter.

District Dispatch: Involve young people in library advocacy

planet code4lib - Mon, 2016-04-11 14:00

Guest post by Katie Bowers, Campaigns Director for the Harry Potter Alliance.

The Harry Potter Alliance (HPA) is an organization that uses the power of story to turn fans into heroes. Each spring, the HPA hosts an annual worldwide literacy campaign known as “Accio Books“. Started in 2009, Accio Books began as a book drive where HPA members donated over 30,000 books to the Agahozo Shalom Youth Village in Rwanda. Today, we’ve donated over 250,000 books to communities in need around the world.

The HPA runs Accio Books every year because we believe that the power of story should be accessible to everyone. That’s why we’re so excited to be using our unique model to help young people advocate for libraries on National and Virtual Library Legislative Day (VLLD) this year! We’ve invited our members to call, write, and even visit their lawmakers in support of the ALA’s asks. Our members are excited: we have already received 475 pledges to send owls (letters and calls) to Washington on May 3.

Want to get young people at your library excited for VLLD? The HPA has created a guide for library staff on using pop culture to help young people send their own owls to Washington. The first step? Start an HPA chapter! Chapters are fun, youth-led, and entirely free to start. As a chapter, library staff and your organizers get access to more free resources and lots of support from HPA staff and volunteers.

Once you have your chapter, talk with members about why they love libraries, and how they feel about issues like accessibility, copyright, and technology. Brainstorming parallels from beloved stories can be a great way to have this conversation. For example, adequate funding is a major concern for libraries. Imagine – if the Hogwarts library had its funding cut then Hermione, Harry and Ron might never have learned about the Sorcerer’s Stone! Use storytelling to help make VLLD understandable and relatable, and then ask your chapter what they want to do to support libraries.

HPA members have come up with all sorts of creative ideas, from hosting letter writing socials and call centers to creating Youtube videos and Tumblr photosets celebrating libraries. You might be asking yourself, “Why would young people care?” Through HPA campaigns, young people made 3,000 phone calls for marriage equality, donated 250,000 books to libraries and literacy organizations worldwide, and organized over 20,000 Youtube video creators and fans to advocate for net neutrality. The truth is, young people want to make a difference, and advocacy can give them that chance. Virtual Library Legislative Day makes Washington advocacy available to anyone, but you and the HPA can bring the inspiration and excitement!

You can check out the full VLLD resource on the HPA’s website, along with ideas for bringing Accio Books to your library and celebrating World Book Night! You can also learn how to help our chapter in Masaka, Uganda stock the shelves of a brand new school library. 

Editor’s note: If you plan to participate in VLLD this year, let us know!

The post Involve young people in library advocacy appeared first on District Dispatch.

District Dispatch: 3D printing companies chime in on online copyright issues

planet code4lib - Fri, 2016-04-08 23:20

Image from Pixabay

The Digital Millennium Copyright Act (DMCA): even if you don’t know exactly what it is, if you work in Library Land, chances are you’ve at least heard it talked about in passing at some point in your career. There’s a reason for that: Libraries are all about public access to content, and the DMCA is a federal statute that includes a provision that offers recourse to those facing challenges to the dissemination of content online by sometimes-over-zealous rights holders.

Section 512 of the DMCA gives online service providers (OSPs) – entities that offer services online, including libraries that provide public access computers – safe harbor against liability for the copyright-infringing actions of their users. This safe harbor comes with a caveat. To enjoy it, OSPs must comply with a notice and take down process in dealing with content that rights holders identify as infringing. The process includes the opportunity for users (many of which are utilizing OSPs’ services to post content they themselves created) to push back against rights holders’ infringement claims.

The process is a win-win, in that it looks out for the legal and economic interests of OSPs and OSP users alike. Or maybe it’s a win-win-win…I think it’s fair to say the process makes a victor of rights holders, too, seeing as it reduces the likelihood they’ll have to expend resources fighting alleged infringers of their copyrights in court.

You may be wondering…Why bring this up now?

Current trends in online infringement disputes reveal that, as useful as the DMCA’s notice-and-take down process is, it at least needs to be re-examined, and at most needs to be re-imagined, to function properly in the modern marketplace. Nowhere are these trends clearer than in the online market for 3D designs and 3D-printed products. So, when the U.S. Copyright Office recently solicited comments assessing how well copyright functions in the online environment, 3D printing firms that host user-generated content – Shapeways, MakerBot and MakerBot’s parent company, Stratasys – joined several other online merchants in offering their opinions on the efficacy of the DMCA’s notice and take down process as currently constituted (Shapeways is the world’s largest online 3D printing marketplace, and  MakerBot is a prominent consumer-grade 3D printer manufacturer that also offers a marketplace for sharing 3D designs).

The crux of these companies’ comments is that the notice and take down process is being widely used in a way that strips OSPs like Shapeways and MakerBot of the infringement immunity it was designed to provide them, and in turn, prevents posters of content that lives on these sorts of platforms from standing up to rights holders.

What are the specifics of this critique?

Rights holders are widely issuing take down notices to OSPs that describe trademark grievances, or some combination of trademark and copyright grievances. This is especially true among OSPs in the 3D printing marketplace. Shapeways reports that 76% of the copyright take down notices it receives are paired with trademark complaints.

The problem with this is that copyright and trademark are two discrete systems of protection. Broadly speaking, the former protects works of creative expression, while the latter protects words, phrases, symbols, and more that help consumers identify a product’s origin and distinguish it from those of other sellers. The section 512 notice and take down process applies exclusively to copyright. It has nothing to do with trademark. Why would it? It’s pursuant to the Digital Millennium Copyright Act, after all – not the Digital Millennium Trademark Act (no such Act exists). So, anytime Shapeways or MakerBot receives a take down notice that even touches on trademark, they know that they are not immune from liability related to the content described in the notice.

As a result, they are all but compelled to comply with such a notice for fear of liability – and because 512 doesn’t address trademark infringement, the alleged infringer has no recourse to challenge the take down.  In essence, take down turns into “stay down.” This scenario is at best annoying, and at worst devastating to the alleged infringer. He or she may just post content for fun, but also may make a living selling content online.

Stratasys, MakerBot and similar firms also note that even if they were willing to take the massive risk of re-posting content that was the subject of a notice mentioning trademark, they’d then risk receiving a repeat notice, which would open them up to legal ramifications for not implementing a reasonable repeat infringer policy.

As you can probably tell, this is all a bit messy. But, the upshot is that it would behoove legislators, regulators and the courts to study the DMCA’s 512 notice-and-take down provision with a view to creating policy that establishes similar checks and balances between OSPs, OSP users and rights holders under trademark law. Without such policy, OSPs will yield fewer economic opportunities for online merchants, and less overall opportunity for the dissemination of content online.

The post 3D printing companies chime in on online copyright issues appeared first on District Dispatch.

Nicole Engard: Bookmarks for April 8, 2016

planet code4lib - Fri, 2016-04-08 20:30

Today I found the following resources and bookmarked them on Delicious.

Digest powered by RSS Digest

The post Bookmarks for April 8, 2016 appeared first on What I Learned Today....

Related posts:

  1. Digital Cameras – I’m up for Suggestions
  2. First Big Present
  3. Google Homepage Themes

SearchHub: Apache Solr 6 Is Released! Here’s What’s New:

planet code4lib - Fri, 2016-04-08 20:11

Happy Friday – Apache Solr 6 just released! 

From the official announcement:

“Solr 6.0.0 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

“See the CHANGES.txt

“Solr 6.0 Release Highlights:

  • Improved defaults for “Similarity” used in Solr, in order to provide better default experience for new users.

  • Improved “Similarity” defaults for users upgrading: DefaultSimilarityFactory has been removed, implicit default Similarity has been changed to SchemaSimilarityFactory, and SchemaSimilarityFactory has been modified to use BM25Similarity as the default for field types that do not explicitly declare a Similarity.

  • Deprecated GET methods for schema are now accessible through the bulk API. The output has less details and is not backward compatible.

  • Users should set useDocValuesAsStored=”false” to preserve sort order on multi-valued fields that have both stored=”true” and docValues=”true”.

  • Read the full list of highlights

Want a walk-through of what’s new? Here’s Cassandra Targett’s webinar:

The post Apache Solr 6 Is Released! Here’s What’s New: appeared first on Lucidworks.com.

Patrick Hochstenbach: Brush Inking Exercise

planet code4lib - Fri, 2016-04-08 19:08
Filed under: Comics, Doodles, portaits Tagged: art, brush, comic, illustration, ink, new mexico, portrait, shiprock, sktchy

Evergreen ILS: Tell Us About Your Evergreen Library / Consortium

planet code4lib - Fri, 2016-04-08 17:18

The Evergreen Outreach Committee is asking anyone who can to respond to the first annual community survey. If you are a member of a consortium we only need one response for the entire consortium.

The information is for an annual report that we will present at the conference each year. Additionally, the information will help us have a better image of the community as a whole. We will also use it to create materials aimed at creating an accurate picture of our community.

We have kept the survey short and simple, and it should only take a few moments to complete. Thank you in advance.

The survey can be accessed here: https://www.surveymonkey.com/r/5MZWBYY.

OCLC Dev Network: Working with Graphs Without a SPARQL Endpoint

planet code4lib - Fri, 2016-04-08 14:00

Learn about working with Graphs without a SPARQL endpoint.

LITA: Top Strategies to Win that Technology Grant: Part 1

planet code4lib - Fri, 2016-04-08 12:00

Do you remember the time when you needed to write your first research paper in MLA or APA format?  The long list of guidelines, including properly formed in-text citations and a References or Works Cited page, seemed like learning a new language.  The same holds true when approaching an RFP (Request for Proposal) and writing a grant proposal.  Unfortunately with grants, most of us are in the dark without guidance.  I am here to say, don’t give up.

Get Familiar with the Grant Writing Process and Terms
Take free online courses, such as the National Network of Libraries of Medicine Grants and Proposal Writing course (Note: you do not have to be a medical librarian to take advantage of this free course) or WebJunction’s archived webinar – Winning Library Grants, presented by Stephanie Gerding. Read a few books from the American Library Association (ALA).  Browse the list below.  This is a sure way to begin to demystify the topic.

Change the Free Money, Shopping Spree Thinking
I have failed at grant writing many times because I started writing a list of “toys” I wanted.  I would begin browsing stores online and pictured awesome technology I wanted.  Surely my patrons would enjoy them too.  I never thought, will my patrons need this technology?  Will they use it?  As MacKellar & Gerding state in their books, funders want to help people.  Learning about the community you serve is step one before you start your shopping list or even writing your grant proposal.

Write Your Proposal in Non-Expert, Jargon-Free, Lay Language
Some professionals may have the tendency, as they excitedly share their project, to go into tech vocabulary.  This is a sure way to lose some of the grant planning or awarding committee members who may not be familiar with tech terms or a particular area of technology.  Be mindful of the words you use to explain your technology needs.  The main goal of a proposal is to make all parties feel included and a part of the game plan.

Start Small and Form Partnerships
To remove the daunting feeling you may have of writing a proposal, find community partners or colleagues who can assist in making the process enjoyable.  For example, a library can participate in grant proposals spawned by others. What better way to represent our profession than to become the researcher for a grant group?  Research is our secret weapon.  The master researcher for the grant may add some items that help fund library equipment, staff, or materials in support of the project request.  It may not be a grant proposal from the library, but a component may help the library in support of that initiative.  Another idea is to divide the grant proposal process into sections or phases among staff members.  As you know, each of us has strengths that fit into a phase of a grant proposal.  Tap into those strengths and divide the work needed to get that funding.

Create SMART Outcomes and Objectives
Ensure that outcomes and objectives are SMART, which stands for Specific, Measurable, Achievable, Realistic, and Time-Bound.  How will you know if the project is succeeding or has been a success?  Also, it is helpful to see how your technology grant request correlates with your library’s and/or institution’s technology plan.

Grants are a great way to receive recognition from peers, administration, and the community you serve.  For those in academia, this is a wonderful way to grow as a professional, add to your curriculum vitae and collect evidence towards a future promotion.  It can even become enjoyable.  Once you mastered writing MLA or APA papers, didn’t you want to write more papers?  Come to think of it, forget about my research paper and grant writing analogy.

Find future posts on technology grant writing tips on our LITA blog.

Open Knowledge Foundation: International open data day report from Yaounde Cameroon

planet code4lib - Fri, 2016-04-08 11:36

Open Data Day 2016 was successfully hosted and celebrated in Cameroon by the netsquared Yaoundé community. The theme of the day was ‘Empowering Cameroonians to accelerate open data’, and it brought together 90 participants.

The event was hosted at the Paraclete Institute in Yaoundé and brought together multiple stakeholders and students to empower them in advancing open data in this part of the world.

The event started at 3pm with a theoretical session and ended with a practical workshop at 7pm.

The theoretical session was held to share with participants the basic concept of open data, its importance, and how it could be accelerated. This was demonstrated through a PowerPoint presentation from panel members, who shared examples of the impact of open data on government intermediaries, education and agriculture in strengthening citizen engagement, as well as the importance of releasing data sets.

This event helped encourage participants to use open data for local content development in Cameroon, showing how data could be made available for everyone to use, especially government data.

The key concepts were technologies that could be used for smart visualization of data, and how data could be made available in a database for everyone to use in order to encourage innovative collaboration. We also discovered that most data has not been made accessible in Cameroon. In order to encourage innovation, transparency, and collaboration, we need to advance the open data movement in Cameroon.

The practical workshop empowered participants to blog about data and to share it for reuse. Data can be distributed on platforms such as internet database websites, using blogg.com and other blogging sites like simplesite.com.

We also helped them understand that research data must be made available for people to reuse and distributed for everyone to visualize. We also showed them how they can make their data available socially, teaching participants that they can share data from blogs to other communication platforms or social media platforms like Facebook, Twitter and Google.

The event was appreciated by every participant.

Open Knowledge Foundation: Open Data Day Spain – Towards IODC 16

planet code4lib - Fri, 2016-04-08 10:56

This post was written by Adolfo Anton Bravo from OK Spain.

Open Data Day in Spain is not something exceptional anymore. Five years after the first Open Data Day was born in Canada, nine Spanish cities adopted the celebration in 2016 by organizing various local events. It is not a coincidence that Spain will host the next International Open Data Conference 2016 in October, given the good health of its data communities, in spite of its poor results in the Open Data Index. Open data in Spain is definitely a growing seed.

Alicante, Barcelona –with two events–, Bilbao, Girona, Granada, Madrid, Pamplona, Valencia, and Zaragoza were the cities that held activities to celebrate Open Data Day.

Open Knowledge Spain took part in the organization of the event in Madrid, and created a website to announce all of the activities that were going to be held in Spain, including the International Open Data Conference, whose Call for Proposals had just opened for applications.

 

Overview of the events

In alphabetical order: Barcelona celebrated Open Data Day twice. apps4citizen organized a gathering where people deliberated about the importance of personal data, transparency, the knowledge acquisition process, and the various results that may be reached from the interpretation of data. A week later, Procomuns.net organized a data visualization contest on Commons Collaborative Economies in the P2P value project.

In Bilbao, the event was run by MoreLab DeustoTech-Internet, the Faculty of Engineering at the University of Deusto, and the Bilbao city council. The group focused on the scope of the movement in general and on linked open data in particular. The participants split into working groups with the objective of designing and implementing fast and easy applications that link and use open data.

The Girona Municipal Archive and the Center for Research and Image Distribution organized the event in Girona; their theme revolved around the documentary heritage data that included 125 archives and collections, 31 inventories, and 75 catalogues.

In Granada, the Free Software Office at the University of Granada organised a hackathon with eight candidate projects from March 4 to March 7. The projects looked at various topics, from traffic to gender bias.

The Medialab-Prado data journalism group, Open Knowledge Spain, and Open Data Institute (ODI) Madrid organised a hackathon where three teams of people from different backgrounds, such as developers, journalists, programmers, statisticians, and citizens, worked to open data in different areas: city light pollution, asbestos, and glass parliaments.

Pamplona took the opportunity to present the open technological platform FIWARE, an initiative for developers or entrepreneurs to use open data for innovative applications. FINODEX is the first European accelerator that is already funding projects that reuse open data with FIWARE technology.

The first OpenDatathon ETSINF – UPV took place in Valencia and was organised by the Higher Technical School of Computer Engineering at the Polytechnic University of Valencia, MUGI (the Master’s degree in Information Management), and the DataUPV Group. 16 teams participated, with the objective of supporting, promoting and disseminating the use of open data, especially among the members of the university. It was supported by the Department of Transparency, Social Responsibility, Participation and Cooperation at the Valencia Regional Government, Inndea Foundation, Cátedra Ciudad de Valencia at UPV, and the private companies BigML and Everis.

The Zaragoza city council is well known for its support of open data. The city’s mission is to provide open, accessible and useful data to its citizens; for example, all the information about bills is open and can be found on the city website. In this regard, they are not only talking about open data but also about transparency and municipal policies on open data.

Finally, on March 17, the University of Alicante organized a meeting with participants from the Department of Transparency at the Valencia Regional Government, the Open Data Institute Madrid (ODI), the research data opening network Maredata, and ua:emprende, an initiative that promotes the University of Alicante startup ecosystem. The Open Data Meeting 2016 consisted of a series of lectures about the current condition of open data in Spain, and emphasized that public sector information (PSI) reuse represents an opportunity for entrepreneurship as well as an impact in the field of transparency and accountability. The event concluded with the #UAbierta open data entrepreneurship award ceremony.

Bohyun Kim: Three Recent Talks of Mine on UX, Data Visualization, and IT Management

planet code4lib - Fri, 2016-04-08 03:19

I have been swamped at work and pretty quiet here in my blog. But I gave a few talks recently. So I wanted to share those at least.

At the American Library Association Midwinter Meeting back in January, I presented on how to turn the traditional library IT department and its operation, which is usually behind the scenes, into a more patron-facing unit. This program was organized by the LITA Heads of IT Interest Group. In March, I gave a short lightning talk at the 2016 Code4Lib Conference about the data visualization project of library data at my library. I was also invited to speak at the USMAI (University System of Maryland and Affiliated Institutions) UX Unconference and gave a talk about user experience, personas, and the idea of applying library personas to library strategic planning.

Here are those three presentation slides for those interested!

Strategically UX Oriented with Personas from Bohyun Kim

Visualizing Library Data from Bohyun Kim

Turning the IT Dept. Outward from Bohyun Kim
