Planet Code4Lib - http://planet.code4lib.org

Max Planck Digital Library: ProQuest Illustrata databases discontinued

Fri, 2016-04-15 15:20

Last year, the information provider ProQuest decided to discontinue its "Illustrata Technology" and "Illustrata Natural Science" databases. Unfortunately, this marks, at least for now, the end of ProQuest's years-long investment in deep indexing content.

In a corresponding support article, ProQuest states that there "[…] will be no loss of full text and full text + graphics images because of the removal of Deep Indexed content". In addition, they announce plans to "[…] develop an even better way for researchers to discover images, figures, tables, and other relevant visual materials related to their research tasks".

The MPG.ReNa records for ProQuest Illustrata: Technology and ProQuest Illustrata: Natural Science have been marked as "terminating" and will be deactivated soon.

OCLC Dev Network: Manipulating output with FILTER, OPTIONAL and UNION

Fri, 2016-04-15 13:00

Learn how to choose which data a SPARQL query returns using FILTER, OPTIONAL, and UNION.
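
As a rough illustration of how the three constructs combine, here is a sketch using Python's SPARQLWrapper library; the endpoint URL and vocabulary are placeholders, not OCLC's actual service:

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint and vocabulary, purely for illustration.
sparql = SPARQLWrapper("http://example.org/sparql")
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?work ?date WHERE {
      { ?work dct:creator ?name } UNION { ?work dct:contributor ?name }   # match either role
      OPTIONAL { ?work dct:date ?date }                                   # keep works that lack a date
      FILTER regex(str(?name), "Austen", "i")                             # restrict which rows come back
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["work"]["value"], row.get("date", {}).get("value", "(no date)"))

In short: UNION matches either pattern, OPTIONAL keeps a solution even when its pattern fails to match, and FILTER discards solutions that fail the test.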

Harvard Library Innovation Lab: IIPC: Two Track Thursday

Fri, 2016-04-15 00:09

A protester throwing cookies at the parliament.

Here are some things that caught our ear this fine Thursday at the International Internet Preservation Consortium web archiving conference:

  • Tom Storrar at the UK Government Web Archive reports on a user research project: ~20 in person interviews and ~130 WAMMI surveys resulting in 5 character vignettes. “WAMMI” replaces “WASAPI” as our favorite acronym.
    • How do we integrate user research into day-to-day development? We’ll be chewing more on that one.
  • Jefferson Bailey shares the Internet Archive’s learnings, ups and downs, with Archive-It Research Services. Projects from the last year include .GOV (100TB of .gov data in a Hadoop cluster donated by Altiscale), the L3S Alexandria Project, and something we didn’t catch with Ian Milligan at Archives.ca.
  • What the WAT? We hear a lot about WATs this year. Common Crawl has a good explainer.
  • Ditte Laursen sets out to answer a big research question: “What does Danish web look like?” What is the shape of .dk? Eld Zierau reports that in a comparison of the Royal Danish Library’s .dk collection with the Internet Archive’s collection of Danish-language sites, only something like 10% were in both.
  • Hugo Huurdeman asks an important question: what exactly is a website? Is it a host, a domain, or a set of pages that share the same CSS? To visualize change in whatever that is, he uses ssdeep, a fuzzy hashing mechanism for page comparison (see the sketch after this list).
  • Let’s just pause to say how inspiring this all is. It’s at about this point in the day that we started totally rethinking a project we’ve been working on for months.
  • Justin Littman shares the Social Feed Manager, his happenin’ stack to harvest tweets and such.
  • We learned that TWARC is either twerking for WARCs or a Twitter-harvesting Python package — we’re not entirely sure. Either way it’s our new new favorite acronym. Sorry, WAMMI.
  • Nick Ruest and Ian Milligan give a very cool talk about sifting through hashtagged content on Twitter. Did you know that researchers have only 7-9 days to grab tweets under a hashtag before Twitter makes the full stream available only for a fee? (We did not know that.)
  • We were also impressed by Canada’s huge amount of political social media engagement. Even though Canada isn’t a huge country [Ian’s words, not ours], 55,000 Tweets were generated in one day with the #elxn42 tag.
  • Fernando Melo of Arquivo.pt pointed out that the struggle is real with live-web leaks in his research comparing OpenWayback and pywb. Fernando says in his tests OpenWayback was faster but pywb has higher-quality playbacks (more successes, fewer leaks). Both tools are expected to improve soon. We say it’s time for something like arewefastyet.com to make this a proper competition.
  • Nicola Bingham is self-deprecating about the British Library’s extensive QA efforts: “This talk title isn’t quite right because it implies that we have Quality Assurance Practices in the Post Legal Deposit Environment.” They use the Web Curator Tool QA Module, but are having to go beyond that for domain-scale archiving.
  • We’re also curious about this paper: Current Quality Assurance Practices in Web Archiving.
  • Todd Stoffer demos NC State’s QA tool. A clever blend of tools like Google Forms, Trello, and IFTTT to let student employees provide archive feedback during downtime. Here are Todd’s [snazzy HTML/JS] slides.
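
A quick aside on the ssdeep approach mentioned above: here is a minimal sketch using the python-ssdeep bindings. The file names are invented; a real pipeline would presumably compare page bodies pulled from successive WARC captures.

import ssdeep  # python-ssdeep bindings around the ssdeep fuzzy-hashing library

# Two captures of the "same" page taken a year apart (hypothetical files).
old_digest = ssdeep.hash(open("page-2015.html", "rb").read())
new_digest = ssdeep.hash(open("page-2016.html", "rb").read())

# compare() returns 0 (no similarity) through 100 (effectively identical),
# so a low score flags a page that changed substantially between crawls.
print(ssdeep.compare(old_digest, new_digest))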

 
TL;DR: lots of exciting things happening in the archiving world. Also exciting: the Icelandic political landscape. On the way to dinner, the team happened upon a relatively small protest right outside of the parliament. There was pot clanging, oil barrel banging, and an interesting use of an active smoke alarm machine as a noise maker. We were also handed “red cards” to wave at the government.
 

http://librarylab.law.harvard.edu/blog/wp-content/uploads/2016/04/Slack-for-iOS-Upload-1.mp4

Now we’re off to look for the northern lights!

Cynthia Ng: UX Libraries Meetup: Notes on Google Analytics Talk and Lightning Talks

Thu, 2016-04-14 22:39
The UX Libraries Vancouver Group had a presentation on Google Analytics and a few lightning talks. Google Analytics, by Jonathan Kift: statistics are constantly gathered in libraries, and web analytics is only one of them; you also want to answer specific questions and be able to tell stories. Accounts = general sources of data, under which you have … Continue reading UX Libraries Meetup: Notes on Google Analytics Talk and Lightning Talks

LITA: LITA ALA Annual Precon: Digital Privacy

Thu, 2016-04-14 16:47

Don’t miss these amazing speakers at this important LITA preconference to the ALA Annual 2016 conference in Orlando FL.

Digital Privacy and Security: Keeping You And Your Library Safe and Secure In A Post-Snowden World
Friday June 24, 2016, 1:00 – 4:00 pm
Presenters: Blake Carver, LYRASIS and Jessamyn West, Library Technologist at Open Library

Register for ALA Annual and Discover Ticketed Events

Learn strategies for making you, your librarians, and your patrons more secure and private in a world of ubiquitous digital surveillance and criminal hacking. We’ll teach tools that keep your data safe inside the library and out: how to secure your library network environment, website, and public PCs, as well as tools and tips you can teach to patrons in computer classes and one-on-one tech sessions. We’ll tackle security myths, passwords, tracking, malware, and more, covering a range of tools from basic to advanced, making this session ideal for any library staff.

Jessamyn West

Jessamyn West is a librarian and technologist living in rural Vermont. She studies and writes about the digital divide and solves technology problems for schools and libraries. Jessamyn has been speaking on the intersection of libraries, technology and politics since 2003. Check out her long running professional blog Librarian.net.

Jessamyn has given presentations, workshops, keynotes and all-day sessions on technology and library topics across North America and Australia. She has been speaking and writing on the intersection of libraries and technology for over a decade. A few of her favorite topics include: Copyright and fair use; Free culture and creative commons; and the Digital divide. She is the author of Without a Net: Librarians Bridging the Digital Divide, and has written the Practical Technology column for Computers in Libraries magazine since 2008.

See more information about Jessamyn at: http://jessamyn.info

Blake Carver

Blake Carver is the guy behind LISNews, LISWire & LISHost. Blake was one of the first librarian bloggers (he created LISNews in 1999) and is a member of Library Journal’s first Movers & Shakers cohort. He has worked as a web librarian, a college instructor, and a programmer at a startup. He is currently the Senior Systems Administrator for LYRASIS Technology Services where he manages the servers and infrastructure that support their products and services.

Blake has presented widely at professional conferences talking about open source systems, Drupal, WordPress and IT Security For Libraries.

See more information about Blake at: http://eblake.com/

More LITA Preconferences at ALA Annual
Friday June 24, 2016, 1:00 – 4:00 pm

  • Islandora for Managers: Open Source Digital Repository Training
  • Technology Tools and Transforming Librarianship

Registration Information

Register for the 2016 ALA Annual Conference in Orlando FL

Discover Ticketed Events

Questions or Comments?

For all other questions or comments related to the preconference, contact LITA at (312) 280-4269 or Mark Beatty, mbeatty@ala.org.

FOSS4Lib Recent Releases: Open Monograph Press - 1.2

Thu, 2016-04-14 16:00
Package: Open Monograph Press
Release Date: Wednesday, April 13, 2016

Last updated April 14, 2016. Created by David Nind on April 14, 2016.

See the release notes for details and a demo.

Galen Charlton: Changing LCSH and living dangerously

Thu, 2016-04-14 13:25

On 22 March 2016, the Library of Congress announced [pdf] that the subject heading Illegal aliens will be cancelled and replaced with Noncitizens and Unauthorized immigration. This decision came after a couple years of lobbying by folks from Dartmouth College (and others) and a resolution [pdf] passed by the American Library Association.

Among librarians, responses to this development seemed to range from “it’s about time” to “gee, I wish my library would pay for authority control” to the Annoyed Librarian’s “let’s see how many MORE clicks my dismissiveness can send Library Journal’s way!” to Alaskan librarians thinking “they got this change made in just two years!?! Getting Denali and Alaska Natives through took decades!”.

Business as usual, in other words. Librarians know the importance of names; that folks will care enough to advocate for changes to LCSH comes as no surprise.

The change also got some attention outside of libraryland: some approval by lefty activist bloggers, a few head-patting “look at what these cute librarians are up to” pieces in mainstream media, and some complaints about “political correctness” from the likes of Breitbart.

And now, U.S. Representative Diane Black has tossed her hat in the ring by announcing that she has drafted (update: and now introduced) a bill to require that

The Librarian of Congress shall retain the headings ‘‘Aliens’’ and ‘‘Illegal Aliens’’, as well as related headings, in the Library of Congress Subject Headings in the same manner as the headings were in effect during 2015.

There’s of course a really big substantive reason to oppose this move by Black: “illegal aliens” is in fact pejorative. To quote Elie Wiesel: “no human being is illegal.” Names matter; names have power: anybody intentionally choosing to refer to another person as illegal is on shaky ground indeed if they wish to not be thought racist.

There are also reasons to simply roll one’s eyes and move on: this bill stands little chance of passing Congress on its own, let alone being signed into law. As electoral catnip to Black’s voters and those of like-minded Republicans, it’s repugnant, but still just a drop in the ocean of topics for reactionary chain letters and radio shows.

Still, there is value in opposing hateful legislation, even if it has little chance of actually being enacted. There are of course plenty of process reasons to oppose the bill:

  • There are just possibly a few matters that a member of the House Budget Committee could better spend her time on. For example, libraries in her district in Tennessee would benefit from increased IMLS support, to pick an example not-so-randomly.
  • More broadly, Congress as a whole has much better things to do than to micro-manage the Policy and Standards Division of the Library of Congress.
  • If Congress wishes to change the names of things, there are over 31,000 post offices to work with. They might also consider changing the names of military bases named after generals who fought against the U.S.
  • Professionals of any stripe in civil service are owed a degree of deference in their professional judgments by legislators. That includes librarians.
  • Few, if any, members of Congress are trained librarians or ontologists or have any particular qualifications to design or maintain controlled vocabularies.

However, there is one objection that will not stand: “Congress has no business whatsoever advocating or demanding changes to LCSH.”

If cataloging is not neutral, if the act of choosing names has meaning… it has political meaning.

And if the names in LCSH are important enough for a member of Congress to draft a bill about — even if Black is just grandstanding — they are important enough to defend.

If cataloging is not neutral, then negative reactions must be expected — and responded to.


Updated 2016-04-14: Add link to H.R. 4926.

Ed Summers: WARC Work

Thu, 2016-04-14 04:00

WARC is often thought of as a useful preservation format for websites and Web content, but it can also be a useful tool in your toolbox for Web maintenance work.

At work we are in the process of migrating a custom site developed over 10 years ago to give it a new home on the Web. The content has proven useful and popular enough over time that it was worth the investment to upgrade and modernize the site.

You can see in the Internet Archive that the Early Americas Digital Archive has been online at least since 2003. You can also see that it hasn’t changed at all since then. It may not seem like it, but that’s a long time for a dynamic site to be available. It speaks to the care and attention of a lot of MITH staff over the years that it’s still running, and that it is even possible to conceive of migrating it to a new location using a content management system that didn’t even exist when the website was born.

As a result of the move the URLs for the authors and documents in the archive will be changing significantly, and there are lots of links to the archive on the Web. Some of these links can even be found in books, so it’s not just a matter of Google updating their indexes when they encounter a permanent redirect. Nevertheless, we do want to create permanent redirects from the old location to the new location so these links don’t break.

If you are the creator of a website, itemizing the types of URLs that need to change may not be a hard thing to do. But for me, arriving on the scene a decade later, it was non-trivial to digest the custom PHP code, look at the database, and the filesystem and determine the full set of URLs that might need to be redirected.

So instead of code spelunking I decided to crawl the website, and then look at the URLs that are present in the crawled data. There are lots of ways to do this, but it occurred to me that one way would be to use wget to crawl the website and generate a WARC file that I could then analyze.

The first step is to crawl the site. wget is a venerable tool with tons of command line options. Thanks to the work of Jason Scott and Archive Team a --warc-file command line option was added a few years ago that serializes the results of the crawl as a single WARC file.

In my case I also wanted to create a mirror copy of the website for access purposes. The mirrored content is an easy way to see what the website looked like without needing to load the WARC file into a player of some kind like Wayback…but more on that below.

So here was my wget command:

wget --warc-file eada --mirror --page-requisites --adjust-extension --convert-links --wait 1 --execute robots=off --no-parent http://mith.umd.edu/eada/

The EADA website isn’t huge, but this ran for about an hour because I decided to be nice to the server with a one second pause between requests. When it was done I had a single WARC file that represented the complete results of the crawl: eada.warc.gz.

With this in hand I could then use Anand Chitipothu and Noufal Ibrahim’s warc Python module to read in the WARC file looking for HTTP responses for HTML pages. The program simply emits the URLs as it goes, and thus builds a complete set of webpage URLs for the EADA website.

import warc
from StringIO import StringIO
from httplib import HTTPResponse

class FakeSocket():
    def __init__(self, response_str):
        self._file = StringIO(response_str)
    def makefile(self, *args, **kwargs):
        return self._file

for record in warc.open("eada.warc.gz"):
    if record.type == "response":
        resp = HTTPResponse(FakeSocket(record.payload.read()))
        resp.begin()
        if resp.getheader("content-type") == "text/html":
            print record['WARC-Target-URI']

As you can probably see the hokiest part of this snippet is parsing the HTTP response embedded in the WARC data. Python’s httplib wanted the HTTP response to look like a socket connection instead of a string. If you know of a more elegant way of going from a HTTP response string to a HTTP Response object I’d love to hear from you.

I sorted the output and came up with a nice list of URLs for the website. Here is a brief snippet:

http://mith.umd.edu/eada/gateway/winslow.php
http://mith.umd.edu/eada/gateway/winthrop.php
http://mith.umd.edu/eada/gateway/witchcraft.php
http://mith.umd.edu/eada/gateway/wood.php
http://mith.umd.edu/eada/gateway/woolman.php
http://mith.umd.edu/eada/gateway/yeardley.php
http://mith.umd.edu/eada/guesteditors.php
http://mith.umd.edu/eada/html/display.php?docs=acrelius_founding.xml&action=show
http://mith.umd.edu/eada/html/display.php?docs=alsop_character.xml&action=show
http://mith.umd.edu/eada/html/display.php?docs=arabic.xml&action=show
http://mith.umd.edu/eada/html/display.php?docs=ashbridge_account.xml&action=show
http://mith.umd.edu/eada/html/display.php?docs=banneker_letter.xml&action=show
http://mith.umd.edu/eada/html/display.php?docs=barlow_anarchiad.xml&action=show
http://mith.umd.edu/eada/html/display.php?docs=barlow_conspiracy.xml&action=show
http://mith.umd.edu/eada/html/display.php?docs=barlow_vision.xml&action=show
http://mith.umd.edu/eada/html/display.php?docs=barlowe_voyage.xml&action=show

The URLs with docs= in them are particularly important because they identify documents in the archive, and seem to form the majority of inbound links. So there still remains work to map the old URLs to the new ones, but now we at least know what they are.
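
As a sketch of that mapping step, the extracted list could be turned into a redirect table; the new URL pattern below is invented purely to show the shape of the work, not the site's real destination scheme:

import csv
from urllib.parse import urlparse, parse_qs

def new_url(old_url):
    # Hypothetical rule: display.php?docs=foo.xml becomes /eada/item/foo/ on the new site.
    docs = parse_qs(urlparse(old_url).query).get("docs")
    if docs:
        slug = docs[0].replace(".xml", "")
        return "http://mith.umd.edu/eada/item/%s/" % slug
    return None

with open("redirects.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for line in open("urls.txt"):       # the sorted URL list extracted from the WARC
        old = line.strip()
        new = new_url(old)
        if new:
            writer.writerow([old, new])

The resulting CSV could then feed whatever redirect mechanism the new site uses.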

I mentioned earlier that the mirror copy is an easy way to view the crawled content without needing to start a Wayback or equivalent web archive player. But one other useful thing you can do on your workstation is download Ilya Kreymer’s WebArchivePlayer for Mac or Windows and start it up; it asks you to select a WARC file, which it then lets you view in your browser.

In case you don’t believe me, here’s a demo of this little bit of magic:

As with other web preservation work I’ve been doing at MITH, I then took the WARC file, the mirrored content, the server side code and database export and put them in a bag, which I then copied up to MITH’s S3 storage. Will the WARC file stand the test of time? I’m not sure. But the WARC file was useful to me here today. So there’s reason to hope.

William Denton: Rosie or Nunslaughter

Thu, 2016-04-14 00:10

I’ve decided to get rid of my CDs, so I’m ripping them all (to FLAC) with Rhythmbox. It can talk to MusicBrainz to get metadata: album title, album artist, song titles, genre, etc. Sometimes that doesn’t work, and then it’s nice to use the feature of EasyTAG (which I use to edit metadata) where it can look up the information on FreeDB based on the raw information about track lengths and such. Almost always, that works. Sometimes, it doesn’t. One time, it presented me with a strange choice:

Hmm … how to tell?

I dig some death metal, but in this case, it was definitely Rosie and Der Bingle.

Harvard Library Innovation Lab: LIL at IIPC: The Story So Far

Wed, 2016-04-13 23:10

We’re halfway through the International Internet Preservation Consortium’s annual web archiving conference. Here are just a few notes from our time so far:

Auto-captioned photo of Jack, Genève, and Matt — thanks CaptionBot!

April 12

  • Andy Jackson kicks the conference off with “Have I accidentally committed international journalism?” — he has contributed to the open source software that was used to review the Panama Papers.
  • Andrea Goethals describes the desire for smaller modules in the web archive tool chain, one of her conclusions from Harvard Library’s Environmental Scan of Web Archiving. This was the first of many calls throughout the day for more nimble tools.
  • Stephen Abrams shares the California Digital Library’s success story with Archive-It. “Archive-It is good at what it does, no need for us to replicate that service.”
  • John Erik Halse encourages folks to contribute code and documentation. Don’t be intimidated and just dive in.
  • There seems to be consensus that Heritrix is a tool that everyone needs but no one is in charge of — that’s tough for contributors. A few calls for the Internet Archive to ride in and save the day.
  • We’re not naming names, but a number of organizations have had their IT departments, or IT contractors, seek to run virus scanners that would edit the contents of an archive after preservation. (Hint: it’s not easy to archive malware, but “just delete it” isn’t the answer.)
  • Some kind member of IIPC reminds us of the amazing Malware Museum hosted by the Internet Archive.
  • David Rosenthal notes that Iceland has been called the ”Switzerland of bits”. After being in Reykjavik for only a few days, we sort of agree!
  • Jefferson Bailey of the Internet Archive echoed concerns about looming web entropy: there is significant growth in web archiving, but a concentration of storage for archives.
  • Nicholas Taylor of the Stanford Digital Library is responsible for the most wonderful acronym of all time, WASAPI (“Web Archiving Systems API”).
  • The Memento Protocol remains the greatest thing since sliced bread. (Here we refer to the web discovery standard, not the Jason Bourne movie.)
  • We chat with Michael Nelson about his projects at ODU, from the Mink browser plugin to the icanhazmemento Twitter bot.

April 13

  • Hjálmar Gíslason points out that 500 hours of video are uploaded to YouTube each minute. It would take 90,000 employees working full time to watch it all. Conclusion: Google needs to hire some people and get on this.
  • Hjálmar also mentions Tim Berners-Lee’s 5-Star Open Data standard. Nice goal to work toward for Free the Law!
  • Vint Cerf on Digital Vellum: the Catholic Church has lasted for an awfully long time, and breweries tend to stick around a long time. How could we design a digital archiving institution that could last that long?
  • (Perma’s suggestion: how about a TLD for URLs that never change? We were going to suggest .cool, because cool URLs don’t change. But that seems to be taken.)
  • Ilya Kreymer shows off the first webpage ever in the first browser ever, running in a simulated NeXT Computer, courtesy of oldweb.today.
  • Dragan Espenschied says Rhizome views the web as “performative media” while showing Jan Robert Leegte’s [untitled]scrollbars piece through different browsers in oldweb.today. Sometimes the OS is the artwork.
  • Matthew S. Weber and Ian Milligan have been running web archive hackathons to connect researchers to computer programmers. Researchers need this: “It would be dishonest to do a history of the 90s without using web archives.” Cue <marquee> tags here.
  • Brewster Kahle pitches the future of national digital collections, using as a model the fictional (but oh-so-cool) National Library of Atlantis. Shows off clever ways to browse a nation’s tv news, books, music, video games, and so much more.
  • Brewster encourages folks to recognize that there is no “The Web” anymore: collections will differ based on context and provenance of the curator or crawler. (What is archiving “The Web” if each of us has a different set of sites that are blocked, allowed, or custom-generated for us?)
  • Brewster voices the need for broad, high level visualizations in web archives. He highlights existing work and thinks we can push it further.
  • And oh by the way, he also shows off Wayback Explorer over at Archive Labs — graph major and minor changes in websites over time.
  • Bonus: We’re fortunate enough to grab some whale sushi (or vegan alternatives) with David Rosenthal, Ilya Kreymer, and Dragan Espenschied.

Looking forward to the next couple of days …

Zotero: Indiana University Survey of Zotero Users

Wed, 2016-04-13 23:08

As part of a grant funded by the Alfred P. Sloan Foundation to analyze altmetrics and expand the Zotero API, our research partners at Indiana University are studying the readership of reference sources across a range of platforms. Cassidy Sugimoto and a team of researchers at IU have developed an anonymous, voluntary survey that seeks to analyze the bibliometrics of Zotero data. The survey includes questions regarding user behavior, discoverability, networking, the research process, open access, open source software, scholarly communication, and user privacy. It is a relatively short survey and your input is greatly appreciated. We will post a follow-up to the Zotero blog that analyzes the results of the survey. Follow this link to take the survey.

LibUX: 036 – Penelope Singer

Wed, 2016-04-13 21:08

I — you know, Michael! — talk shop with the user interface designer Penelope Singer. We chat about cross-platform and cross-media brand, material design and web animation, cats, and anticipatory design.

As usual, you can download the MP3 directly – or now on SoundCloud, too.

Segments
  • 0:55 – What does Penelope do as a user interface designer?
  • 3:00 – “The brand is nothing more than … what the user says you are.”
  • 7:16 – Cats
  • 9:00 – “People’s paradigms are always related to physical things”
  • 12:38 – Animations that communicate state change
  • 15:20 – Resistance to brand and style guidelines
  • 19:47 – Anticipatory design 1 and concerns around privacy

You can subscribe to LibUX on Stitcher, iTunes, SoundCloud or plug our feed right into your podcatcher of choice. Help us out and say something nice. You can find every podcast right here on www.libux.co.

  1. Yes, again.

The post 036 – Penelope Singer appeared first on LibUX.

LITA: Jobs in Information Technology: April 13, 2016

Wed, 2016-04-13 18:58

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Qualcomm, Inc., Content Integration Librarian, San Diego, CA

Multnomah County Library, Front End Drupal Web Developer, Portland, OR

City of Virginia Beach Library Department, Librarian I/Web Specialist #7509, Virginia Beach, VA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Equinox Software: As You Wish: No Programmer Required

Wed, 2016-04-13 18:08

Every year, Library Journal investigates the Library Systems Landscape.  In this series of articles, libraries and vendors alike are polled and interviewed in order to come up with a cohesive glimpse into Libraryland’s inner workings.  We greatly appreciate the hard work Matt Enis puts in each year to give a well-rounded perspective on both proprietary and open source solutions for libraries.

Equinox President Mike Rylander was interviewed for the Open Invitation portion of this series.  But what really caught our attention here at Equinox was a separate portion:  Wish List.  In this article, librarians were surveyed on their current ILS and the things they would like to see in their own libraries.  

We were most interested in this quote (emphasis added):

“Others expressed concerns about whether on-staff expertise would be needed to operate an open source ILS—a perception that development houses such as ByWater, Equinox, and LibLime have been trying to battle. One respondent who had served on an ILS search team noted that despite being “extremely unhappy” with one commercial ILS, the library ultimately moved to another proprietary solution “due to insufficient funds for open source support staff.” Another wrote that an “open source solution has great appeal due to the customization possibilities. The cost and maintenance factors also play into this. But you have to have the internal capacity to support open source, and with budget reductions, we just haven’t been able to consider an open source option.”

While the prevalence of this misunderstanding has decreased in recent years, it’s very unfortunate to see it repeated by libraries again and again. The only winners when those beliefs persist are the proprietary vendors selling ILSs that make their users “extremely unhappy[.]”

Equinox would like to take this opportunity to say that “open source support staff” is NOT necessary to use open source solutions in ANY library. Equinox (as well as other open source support vendors) offers the full gamut of services. We can migrate, host, install, maintain, and develop new features, at a cost lower than what you would expect from a proprietary ILS vendor. There is absolutely no need for extra library staff. There is absolutely no need to have a programmer on staff. We can do the programming, tech support, and training for you.

You can have all the benefits an open source product brings, truly open APIs and code, product portability, and flexibility along with the security of having a team of experts supporting you.  You get real choice and a real voice, and you don’t have to hire any new staff to get them.

We don’t want you to be extremely unhappy with your current ILS.  We want you to be over the moon happy with your ILS solution because it has the functionality and flexibility you expect and deserve.  We can help you with this–no computer science degree required.

David Rosenthal: The Architecture of Emulation on the Web

Wed, 2016-04-13 17:00
In the program, my talk at the IIPC's 2016 General Assembly in Reykjavík was entitled Emulation & Virtualization as Preservation Strategies. But after a meeting called by the Mellon Foundation to review my report, I changed the title to The Architecture of Emulation on the Web. Below the fold is an edited text of the talk, with an explanation for the change and links to the sources.
It's a pleasure to be here and I'm grateful to the organizers for inviting me to talk today. As usual, you don't need to take notes or ask for the slides; the text of the talk with links to the sources will go up on my blog shortly.

Thanks to funding from the Mellon Foundation I spent last summer on behalf of the Mellon and Sloan Foundations, and IMLS researching and writing a report entitled Emulation & Virtualization as Preservation Strategies. Jeff Rothenberg's 1995 Ensuring the Longevity of Digital Documents identified emulation and migration as the two possible techniques and came down strongly in favor of emulation. Despite this, migration has been overwhelmingly favored until recently. What has changed is that emulation frameworks have been developed that present emulations as a normal part of the Web.

Last month there was a follow-up meeting at the Mellon Foundation. In preparing for it, I realized that there was an important point that the report identified but didn't really explain properly. I'm going to try to give a better explanation today, because it is about how emulations of preserved software appear on the web, and thus how they can be become part of the Web that we collect, preserve and disseminate. I'll start by describing how the three emulation frameworks I studied appear on the Web, then illustrating the point with an analogy, and suggesting how it might be addressed.

When I gave a talk about the report at CNI I included live demos. It was a disaster; Olive was the only framework that worked via hotel WiFi. I have pre-recorded the demos using Kazam and a Chromium browser on my Ubuntu 14.04 system.
Theresa Duncan's CD-ROMs

From 1995 to 1997 Theresa Duncan produced three seminal feminist CD-ROM games: Chop Suey, Smarty and Zero Zero. Rhizome, a project hosted by the New Museum in New York, has put emulations of them on the Web. You can visit http://archive.rhizome.org/theresa-duncan-cdroms/, click any of the "Play" buttons, and have an experience very close to that of playing the CD on MacOS 7.5. This has proved popular: for several days after their initial release they were being invoked on average every 3 minutes.
What Is Going On?

What happened when I clicked Smarty's Play button?
  • The browser connects to a session manager in Amazon's cloud, which notices that this is a new session.
  • Normally it would authenticate the user, but because this CD-ROM emulation is open access it doesn't need to.
  • It assigns one of its pool of running Amazon instances to run the session's emulator. Each instance can run a limited number of emulators. If no instance is available when the request comes in it can take up to 90 seconds to start another.
  • It starts the emulation on the assigned instance, supplying metadata telling the emulator what to run.
  • The emulator starts. After a short delay the user sees the Mac boot sequence, and then the CD-ROM starts running.
  • At intervals, the emulator sends the session manager a keep-alive signal. Emulators that haven't sent one in 30 seconds are presumed dead, and their resources are reclaimed to avoid paying the cloud provider for unused resources.
bwFLA

Rhizome, and others such as Yale, the DNB and ZKM Karlsruhe, use technology from the bwFLA team at the University of Freiburg to provide Emulation As A Service (EAAS). Their GPLv3-licensed framework runs in "the cloud" to provide comprehensive management and access facilities wrapped around a number of emulators. It can also run as a bootable USB image or via Docker. bwFLA encapsulates each emulator so that the framework sees three standard interfaces:
    • Data I/O, connecting the emulator to data sources such as disk images, user files, an emulated network containing other emulators, and the Internet.
    • Interactive Access, connecting the emulator to the user using standard HTML5 facilities.
    • Control, providing a Web Services interface that bwFLA's resource management can use to control the emulator.
    The communication between the emulator and the user takes place via standard HTTP on port 80; there is no need for a user to install software, or browser plugins, and no need to use ports other than 80. Both of these are important for systems targeted at use by the general public.

bwFLA's preserved system images are stored as a stack of overlays in QEMU's "qcow2" format. Each overlay on top of the base system image represents a set of writes to the underlying image. For example, the base system image might be the result of an initial install of Windows 95, and the next overlay up might be the result of installing Word Perfect into the base system. Or the next overlay up might be the result of redaction. Each overlay contains only those disk blocks that differ from the stack of overlays below it. The stack of overlays is exposed to the emulator as if it were a normal file system via FUSE.
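
To make the overlay idea concrete, here is a minimal sketch of creating one such overlay with qemu-img; the file names are hypothetical, and this is the generic qcow2 mechanism rather than bwFLA's actual tooling:

import subprocess

# The overlay records only the blocks that differ from the read-only backing
# image beneath it, e.g. the changes made by installing Word Perfect on top of
# a fresh Windows 95 install.
subprocess.check_call([
    "qemu-img", "create", "-f", "qcow2",
    "-o", "backing_file=win95-base.qcow2,backing_fmt=qcow2",
    "wordperfect-install.qcow2",
])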

    The technical metadata that encapsulates the system disk image is described in a paper presented to the iPres conference in November 2015, using the example of emulating CD-ROMs. Broadly, it falls into two parts, describing the software and hardware environments needed by the CD-ROM in XML. The XML refers to the software image components via the Handle system, providing a location-independent link to access them.
TurboTax

I can visit https://olivearchive.org/launch/11/ and get 1997's TurboTax running on Windows 3.1. The pane in the browser window has top and bottom menu bars, and between them is the familiar Windows 3.1 user interface.
What Is Going On?

The top and bottom menu bars come from a program called VMNetX that is running on my system. Chromium invoked it via a MIME-type binding, and VMNetX then created a suitable environment in which it could invoke the emulator that is running Windows 3.1, and TurboTax. The menu bars include buttons to power off the emulated system, control its settings, grab the screen, and control the assignment of the keyboard and mouse to the emulated system.

    The interesting question is "where is the Windows 3.1 system disk with TurboTax installed on it?"
Olive

The answer is that the "system disk" is actually a file on a remote Apache Web server. The emulator's disk accesses are being demand-paged over the Internet using standard HTTP range queries to the file's URL.

    This system is Olive, developed at Carnegie Mellon University by a team under my friend Prof. Mahadev Satyanarayanan, and released under GPLv2. VMNetX uses a sophisticated two-level caching scheme to provide good emulated performance even over slow Internet connections. A "pristine cache" contains copies of unmodified disk blocks from the "system disk". When a program writes to disk, the data is captured in a "modified cache". When the program reads a disk block, it is delivered from the modified cache, the pristine cache or the Web server, in that order. One reason this works well is that successive emulations of the same preserved system image are very similar, so pre-fetching blocks into the pristine cache is effective in producing YouTube-like performance over 4G cellular networks.
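
To make the demand-paging idea concrete, here is a minimal sketch, not Olive's actual code, of fetching disk blocks with HTTP range queries and consulting the two caches in the order just described; the image URL and block size are invented:

import requests

IMAGE_URL = "https://example.org/images/win31-turbotax.img"  # hypothetical system-image URL
BLOCK_SIZE = 64 * 1024                                        # made-up block size

pristine_cache = {}   # unmodified blocks fetched from the server
modified_cache = {}   # blocks the emulated system has written

def fetch_block(n):
    # Demand-page one block of the remote "system disk" with an HTTP range query.
    start = n * BLOCK_SIZE
    headers = {"Range": "bytes=%d-%d" % (start, start + BLOCK_SIZE - 1)}
    resp = requests.get(IMAGE_URL, headers=headers)
    resp.raise_for_status()
    return resp.content

def read_block(n):
    # Reads are served from the modified cache, then the pristine cache, then the web server.
    if n in modified_cache:
        return modified_cache[n]
    if n not in pristine_cache:
        pristine_cache[n] = fetch_block(n)
    return pristine_cache[n]

def write_block(n, data):
    # Writes never touch the remote image; they land in the modified cache.
    modified_cache[n] = data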
VisiCalc

You can visit https://archive.org/details/VisiCalc_1979_SoftwareArts and run Dan Bricklin and Bob Frankston's VisiCalc from 1979 on an emulated Apple ][. It was the world's first spreadsheet. Some of the key-bindings are strange to users conditioned by decades of Excel, but once you've found the original VisiCalc reference card, it is perfectly usable.
What Is Going On?

The Apple ][ emulator isn't running in the cloud, as bwFLA's does, nor is it running as a process on my machine, as Olive's does. Instead, it is running inside my browser. The emulators have been compiled into JavaScript, using emscripten. When I clicked on the link to the emulation, metadata describing the emulation, including the emulator to use, was downloaded into my browser, which then downloaded the JavaScript for the emulator and the system image for the Apple ][ with VisiCalc installed.
Emularity

This is the framework underlying the Internet Archive's software library, which currently holds nearly 36,000 items, including more than 7,300 for MS-DOS, 3,600 for Apple, 2,900 console games and 600 arcade games. Some can be downloaded, but most can only be streamed.

The oldest is an emulation of a PDP-1 with a DEC 30 display running the Space War game from 1962, more than half a century ago. As I can testify, having played this and similar games on Cambridge University’s PDP-7 with a DEC 340 display seven years later, this emulation works well.

    The quality of the others is mixed. Resources for QA and fixing problems are limited; with a collection this size problems are to be expected. Jason Scott crowd-sources most of the QA; his method is to see if the software boots up and if so, put it up and wait to see whether visitors who remember it post comments identifying problems, or whether the copyright owner objects. The most common problem is the sound.

    It might be thought that the performance of running the emulator locally by adding another layer of emulation (the JavaScript virtual machine) would be inadequate, but this is not the case for two reasons. First, the user’s computer is vastly more powerful than an Apple ][ and, second, the performance of the JavaScript engine in a browser is critical to its success, so large resources are expended on optimizing it.

    The movement supported by major browser vendors to replace the JavaScript virtual machine with a byte-code virtual machine called WebAssembly has borne fruit. Last month four major browsers announced initial support, all running the same game, a port of Unity's Angry Bots. This should greatly reduce the pressure for multi-core and parallelism support in JavaScript, which was always likely to be a kludge. Improved performance for in-browser emulation is also likely to make in-browser emulation more competitive with techniques that need software installation and/or cloud infrastructure, reducing the barrier to entry.
The PDF Analogy

Let's make an analogy between emulation and something that everyone would agree is a Web format: PDF. Browsers lack built-in support for rendering PDF. They used to depend on external PDF renderers, such as Adobe Reader, via a MIME-type binding. Now they download pdf.js and render the PDF internally, even though it's a format for which they have no built-in support. The Webby, HTML5 way to provide access to formats that don't qualify for built-in support is to download a JavaScript renderer. We don't preserve PDFs by wrapping them in a specific PDF renderer; we preserve them as PDF plus a MimeType. At access time the browser chooses an appropriate renderer, which used to be Adobe Reader and is now pdf.js.
Landing Pages

There's another interesting thing about PDFs on the web. In many cases the links to them don't actually get you to the PDF. The canonical, location-independent link to the LOCKSS paper in ACM ToCS is http://dx.doi.org/10.1145/1047915.1047917, which currently redirects to http://dl.acm.org/citation.cfm?doid=1047915.1047917, a so-called "landing page": not the paper itself but a page about the paper, on which, if you look carefully, you can find a link to the PDF.

    The fact that it is very difficult for a crawler to find this link makes it hard for archives to collect and preserve scholarly papers. Herbert Van de Sompel and Michael Nelson's Signposting proposal addresses this problem, as to some extent do W3C activities called Packaging on the Web and Portable Web Publications for the Open Web Platform.

Like PDFs, preserved system images (the disk image for a system to be emulated and the metadata describing the hardware it was intended for) are formats that don't qualify for built-in support. The Webby way to provide access to them is to download a JavaScript emulator, as Emularity does. So is the problem of preserving system images solved?
Problem Solved? NO!

No, it isn't. We have a problem that is analogous to, but much worse than, the landing page problem. The analogy would be that, instead of a link on the landing page leading to the PDF, embedded in the page was a link to a rendering service. The metadata indicating that the actual resource was a PDF, and the URI giving its location, would be completely invisible to the user's browser or a Web crawler. At best all that could be collected and preserved would be a screenshot.

All three frameworks I have shown have this problem. The underlying emulation service, the analogy of the PDF rendering service, can access the system image and the necessary metadata, but nothing else can. Humans can read a screenshot of a PDF document; a screenshot of an emulation is useless. Wrapping a system image in an emulation like this makes it accessible in the present, not preservable for the future.

    If we are using emulation as a preservation strategy, shouldn't we be doing it in a way that is itself able to be preserved?
A MimeType for Emulations?

What we need is a MimeType definition that allows browsers to follow a link to a preserved system image and construct an appropriate emulation for it in whatever way suits them. This would allow Web archives to collect preserved system images and later provide access to them.

    The linked-to object that the browser obtains needs to describe the hardware that should be emulated. Part of that description must be the contents of the disks attached to the system. So we need two MimeTypes:
    • A metadata MimeType, say Emulation/MachineSpec, that describes the architecture and configuration of the hardware, which links to one or more resources of:
    • A disk image MimeType, say DiskImage/qcow2, with the contents of each of the disks.
Emulation/MachineSpec is pretty much what the hardware part of bwFLA's internal metadata format does, though from a preservation point of view there are some details that aren't ideal. For example, using the Handle system is like using a URL shortener or a DOI: it works well until the service dies. When it does, as for example last year when doi.org's domain registration expired, all the identifiers become useless.

    I suggest DiskImage/qcow2 because QEMU's qcow2 format is a de facto standard for representing the bits of a preserved system's disk image.
And binding to "emul.js"

Then, just as with pdf.js, the browser needs a binding to a suitable "emul.js" which knows, in this browser's environment, how to instantiate a suitable emulator for the specified machine configuration and link it to the disk images. This would solve both problems:
    • The emulated system image would not be wrapped in a specific emulator; the browser would be free to choose appropriate, up-to-date emulation technology.
    • The emulated system image and the necessary metadata would be discoverable and preservable because there would be explicit links to them.
    The details need work but the basic point remains. Unless there are MimeTypes for disk images and system descriptions, emulations cannot be first-class Web objects that can be collected, preserved and later disseminated.

    SearchHub: Better Feature Engineering with Spark, Solr, and Lucene Analyzers

    Wed, 2016-04-13 15:32

This blog post is about new features in the Lucidworks spark-solr open source toolkit. For an introduction to the spark-solr project, see Solr as an Apache Spark SQL DataSource.

    Performing text analysis in Spark

    The Lucidworks spark-solr open source toolkit now contains tools to break down full text into words a.k.a. tokens using Lucene’s text analysis framework. Lucene text analysis is used under the covers by Solr when you index documents, to enable search, faceting, sorting, etc. But text analysis external to Solr can drive processes that won’t directly populate search indexes, like building machine learning models. In addition, extra-Solr analysis can allow expensive text analysis processes to be scaled separately from Solr’s document indexing process.

    Lucene text analysis, via LuceneTextAnalyzer

    The Lucene text analysis framework, a Java API, can be used directly in code you run on Spark, but the process of building an analysis pipeline and using it to extract tokens can be fairly complex. The spark-solr LuceneTextAnalyzer class aims to simplify access to this API via a streamlined interface. All of the analyze*() methods produce only text tokens – that is, none of the metadata associated with tokens (so-called “attributes”) produced by the Lucene analysis framework is output: token position increment and length, beginning and ending character offset, token type, etc. If these are important for your use case, see the “Extra-Solr Text Analysis” section below.

    LuceneTextAnalyzer uses a stripped-down JSON schema with two sections: the analyzers section configures one or more named analysis pipelines; and the fields section maps field names to analyzers. We chose to define a schema separately from Solr’s schema because many of Solr’s schema features aren’t applicable outside of a search context, e.g.: separate indexing and query analysis; query-to-document similarity; non-text fields; indexed/stored/doc values specification; etc.

    Lucene text analysis consists of three sequential phases: character filtering – whole-text modification; tokenization, in which the resulting text is split into tokens; and token filtering – modification/addition/removal of the produced tokens.

    Here’s the skeleton of a schema with two defined analysis pipelines:

    { "analyzers": [{ "name": "...", "charFilters": [{ "type": "...", ...}, ... ], "tokenizer": { "type": "...", ... }, "filters": [{ "type": "...", ... } ... ] }] }, { "name": "...", "charFilters": [{ "type": "...", ...}, ... ], "tokenizer": { "type": "...", ... }, "filters": [{ "type": "...", ... }, ... ] }] } ], "fields": [{"name": "...", "analyzer": "..."}, { "regex": ".+", "analyzer": "..." }, ... ] }

    In each JSON object in the analyzers array, there may be:

    • zero or more character filters, configured via an optional charFilters array of JSON objects;
    • exactly one tokenizer, configured via the required tokenizer JSON object; and
    • zero or more token filters, configured by an optional filters array of JSON objects.

    Classes implementing each one of these three kinds of analysis components are referred to via the required type key in these components’ configuration objects, the value for which is the SPI name for the class, which is simply the case-insensitive class’s simple name with the -CharFilterFactory, -TokenizerFactory, or -(Token)FilterFactory suffix removed. See the javadocs for Lucene’s CharFilterFactory, TokenizerFactory and TokenFilterFactory classes for a list of subclasses, the javadocs for which include a description of the configuration parameters that may be specified as key/value pairs in the analysis component’s configuration JSON objects in the schema.

Below is a Scala snippet to display counts for the top 10 most frequent words extracted from spark-solr’s top-level README.adoc file, using LuceneTextAnalyzer configured with an analyzer consisting of StandardTokenizer (which implements the word break rules from Unicode’s UAX#29 standard) and LowerCaseFilter, a filter to downcase the extracted tokens. If you would like to play along at home: clone the spark-solr source code from Github; change directory to the root of the project; build the project (via mvn -DskipTests package); start the Spark shell (via $SPARK_HOME/bin/spark-shell --jars target/spark-solr-2.1.0-SNAPSHOT-shaded.jar); type :paste into the shell; and finally paste the code below into the shell after it prints // Entering paste mode (ctrl-D to finish):

import com.lucidworks.spark.analysis.LuceneTextAnalyzer
val schema = """{ "analyzers": [{ "name": "StdTokLower",
                |                 "tokenizer": { "type": "standard" },
                |                 "filters": [{ "type": "lowercase" }] }],
                |  "fields": [{ "regex": ".+", "analyzer": "StdTokLower" }] }
                """.stripMargin
val analyzer = new LuceneTextAnalyzer(schema)
val file = sc.textFile("README.adoc")
val counts = file.flatMap(line => analyzer.analyze("anything", line))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
                 .sortBy(_._2, false) // descending sort by count
println(counts.take(10).map(t => s"${t._1}(${t._2})").mkString(", "))

    The top 10 token(count) tuples will be printed out:

    the(158), to(103), solr(86), spark(77), a(72), in(44), you(44), of(40), for(35), from(34)

    In the schema above, all field names are mapped to the StdTokLower analyzer via the "regex": ".+" mapping in the fields section – that’s why the call to analyzer.analyze() uses "anything" as the field name.

    The results include lots of prepositions (“to”, “in”, “of”, “for”, “from”) and articles (“the” and “a”) – it would be nice to exclude those from our top 10 list. Lucene includes a token filter named StopFilter that removes words that match a blacklist, and it includes a default set of English stopwords that includes several prepositions and articles. Let’s add another analyzer to our schema that builds on our original analyzer by adding StopFilter:

import com.lucidworks.spark.analysis.LuceneTextAnalyzer
val schema = """{ "analyzers": [{ "name": "StdTokLower",
                |                 "tokenizer": { "type": "standard" },
                |                 "filters": [{ "type": "lowercase" }] },
                |               { "name": "StdTokLowerStop",
                |                 "tokenizer": { "type": "standard" },
                |                 "filters": [{ "type": "lowercase" },
                |                             { "type": "stop" }] }],
                |  "fields": [{ "name": "all_tokens", "analyzer": "StdTokLower" },
                |             { "name": "no_stopwords", "analyzer": "StdTokLowerStop" } ]}
                """.stripMargin
val analyzer = new LuceneTextAnalyzer(schema)
val file = sc.textFile("README.adoc")
val counts = file.flatMap(line => analyzer.analyze("no_stopwords", line))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
                 .sortBy(_._2, false)
println(counts.take(10).map(t => s"${t._1}(${t._2})").mkString(", "))

    In the schema above, instead of mapping all fields to the original analyzer, only the all_tokens field will be mapped to the StdTokLower analyzer, and the no_stopwords field will be mapped to our new StdTokLowerStop analyzer.

    spark-shell will print:

    solr(86), spark(77), you(44), from(34), source(32), query(25), option(25), collection(24), data(20), can(19)

    As you can see, the list above contains more important tokens from the file.

    For more details about the schema, see the annotated example in the LuceneTextAnalyzer scaladocs.

    LuceneTextAnalyzer has several other analysis methods: analyzeMV() to perform analysis on multi-valued input; and analyze(MV)Java() convenience methods that accept and emit Java-friendly datastructures. There is an overloaded set of these methods that take in a map keyed on field name, with text values to be analyzed – these methods return a map from field names to output token sequences.

    Extracting text features in spark.ml pipelines

    The spark.ml machine learning library includes a limited number of transformers that enable simple text analysis, but none support more than one input column, and none support multi-valued input columns.

    The spark-solr project includes LuceneTextAnalyzerTransformer, which uses LuceneTextAnalyzer and its schema format, described above, to extract tokens from one or more DataFrame text columns, where each input column’s analysis configuration is specified by the schema.

    If you don’t supply a schema (via e.g. the setAnalysisSchema() method), LuceneTextAnalyzerTransformer uses the default schema, below, which analyzes all fields in the same way: StandardTokenizer followed by LowerCaseFilter:

    { "analyzers": [{ "name": "StdTok_LowerCase", "tokenizer": { "type": "standard" }, "filters": [{ "type": "lowercase" }] }], "fields": [{ "regex": ".+", "analyzer": "StdTok_LowerCase" }] }

    LuceneTextAnalyzerTransformer puts all tokens extracted from all input columns into a single output column. If you want to keep the vocabulary from each column distinct from other columns’, you can prefix the tokens with the input column from which they came, e.g. word from column1 becomes column1=word – this option is turned off by default.

    You can see LuceneTextAnalyzerTransformer in action in the spark-solr MLPipelineScala example, which shows how to use LuceneTextAnalyzerTransformer to extract text features to build a classification model to predict the newsgroup an article was posted to, based on the article’s text. If you wish to run this example, which expects the 20 newsgroups data to be indexed into a Solr cloud collection, follow the instructions in the scaladoc of the NewsgroupsIndexer example, then follow the instructions in the scaladoc of the MLPipelineScala example.

    The MLPipelineScala example builds a Naive Bayes classifier by performing K-fold cross validation with hyper-parameter search over, among several other params’ values, whether or not to prefix tokens with the column from which they were extracted, and 2 different analysis schemas:

val WhitespaceTokSchema =
  """{ "analyzers": [{ "name": "ws_tok", "tokenizer": { "type": "whitespace" } }],
    |  "fields": [{ "regex": ".+", "analyzer": "ws_tok" }] }""".stripMargin
val StdTokLowerSchema =
  """{ "analyzers": [{ "name": "std_tok_lower", "tokenizer": { "type": "standard" },
    |                  "filters": [{ "type": "lowercase" }] }],
    |  "fields": [{ "regex": ".+", "analyzer": "std_tok_lower" }] }""".stripMargin
[...]
val analyzer = new LuceneTextAnalyzerTransformer().setInputCols(contentFields).setOutputCol(WordsCol)
[...]
val paramGridBuilder = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 5000))
  .addGrid(analyzer.analysisSchema, Array(WhitespaceTokSchema, StdTokLowerSchema))
  .addGrid(analyzer.prefixTokensWithInputCol)

    When I run MLPipelineScala, the following log output says that the std_tok_lower analyzer outperformed the ws_tok analyzer, and not prepending the input column onto tokens worked better:

    2016-04-08 18:17:38,106 [main] INFO CrossValidator - Best set of parameters:
    {
      LuceneAnalyzer_9dc1a9c71e1f-analysisSchema: { "analyzers": [{ "name": "std_tok_lower",
        "tokenizer": { "type": "standard" }, "filters": [{ "type": "lowercase" }] }],
        "fields": [{ "regex": ".+", "analyzer": "std_tok_lower" }] },
      hashingTF_f24bc3f814bc-numFeatures: 5000,
      LuceneAnalyzer_9dc1a9c71e1f-prefixTokensWithInputCol: false,
      nb_1a5d9df2b638-smoothing: 0.5
    }

    Extra-Solr Text Analysis

    Solr’s PreAnalyzedField field type enables the results of text analysis performed outside of Solr to be passed in and indexed/stored as if the analysis had been performed in Solr.

    As of this writing, the spark-solr project depends on Solr 5.4.1, but prior to Solr 5.5.0, querying against fields of type PreAnalyzedField was not fully supported – see Solr JIRA issue SOLR-4619 for more information.

    There is a branch on the spark-solr project, not yet committed to master or released, that adds the ability to produce JSON that can be parsed, then indexed and optionally stored, by Solr’s PreAnalyzedField.

    Below is a Scala snippet to produce pre-analyzed JSON for a small piece of text using LuceneTextAnalyzer configured with an analyzer consisting of StandardTokenizer+LowerCaseFilter. If you would like to try this at home:

    1. Clone the spark-solr source code from GitHub.
    2. Change directory to the root of the project.
    3. Check out the branch (via git checkout SPAR-14-LuceneTextAnalyzer-PreAnalyzedField-JSON).
    4. Build the project (via mvn -DskipTests package).
    5. Start the Spark shell (via $SPARK_HOME/bin/spark-shell --jars target/spark-solr-2.1.0-SNAPSHOT-shaded.jar).
    6. Type :paste into the shell.
    7. Paste the code below into the shell after it prints // Entering paste mode (ctrl-D to finish):

    import com.lucidworks.spark.analysis.LuceneTextAnalyzer
    val schema = """{ "analyzers": [{ "name": "StdTokLower",
                   |    "tokenizer": { "type": "standard" },
                   |    "filters": [{ "type": "lowercase" }] }],
                   |  "fields": [{ "regex": ".+", "analyzer": "StdTokLower" }] }
                   |""".stripMargin
    val analyzer = new LuceneTextAnalyzer(schema)
    val text = "Ignorance extends Bliss."
    val fieldName = "myfield"
    println(analyzer.toPreAnalyzedJson(fieldName, text, stored = true))

    The following will be output (whitespace added):

    {"v":"1","str":"Ignorance extends Bliss.","tokens":[ {"t":"ignorance","s":0,"e":9,"i":1}, {"t":"extends","s":10,"e":17,"i":1}, {"t":"bliss","s":18,"e":23,"i":1}]}

    If we make the value of the stored option false, then the str key, with the original text as its value, will not be included in the output JSON.
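
    For example, the expected output with stored = false is shown below as a comment; it is derived from the example above simply by dropping the str key:

    println(analyzer.toPreAnalyzedJson(fieldName, text, stored = false))
    // {"v":"1","tokens":[
    //   {"t":"ignorance","s":0,"e":9,"i":1},
    //   {"t":"extends","s":10,"e":17,"i":1},
    //   {"t":"bliss","s":18,"e":23,"i":1}]}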

    Summary

    LuceneTextAnalyzer simplifies Lucene text analysis, and enables use of Solr’s PreAnalyzedField. LuceneTextAnalyzerTransformer allows for better text feature extraction by leveraging Lucene text analysis.

    The post Better Feature Engineering with Spark, Solr, and Lucene Analyzers appeared first on Lucidworks.com.

    District Dispatch: Archived webinar on libertarians and copyright now available

    Wed, 2016-04-13 13:43

    Check out the latest CopyTalk webinar.

    An archived copy of the CopyTalk webinar “What do Libertarians think about copyright law?” is now available. Originally webcast on April 7, 2016 by the Office for Information Technology Policy’s Copyright Education Subcommittee, the webinar featured presenters Zach Graves, director of digital strategy and a senior fellow working at the intersection of policy and technology at R Street, and Dr. Wayne Brough, chief economist and vice president for research at FreedomWorks. They described how thinkers from across the libertarian spectrum view copyright law. Is copyright an example of big government? Does the law focus too much on content companies and their interests? Is it in the interests of freedom and individual choice? FreedomWorks and R Street (along with ALA) are founding members of Re:Create, a new DC copyright coalition that supports balanced copyright, creativity, understandable copyright law, and the freedom to tinker and innovate!

    Plan ahead! One-hour CopyTalk webinars occur on the first Thursday of every month at 11 am Pacific/2 pm Eastern time. It’s free! On May 5, university copyright education programs will be on tap!

    The post Archived webinar on libertarians and copyright now available appeared first on District Dispatch.

    LITA: Creating a Technology Needs Pyramid

    Wed, 2016-04-13 12:00

    “Technology Training in Libraries” by Sarah Houghton has become my bible. It was published as part of LITA’s Tech Set series back in 2010 and acts as a no-nonsense guide to technology training for librarians. Before I started my current position, implementing a technology training model seemed easy enough, but I’ve found that there are many layers, including (but certainly not limited to) things like curriculum development, scheduling, learning styles, generational differences, staff buy-in, and assessment. It’s a prickly pear and one of the biggest challenges I’ve faced as a professional librarian.

    After several months of training attempts, I took a step back after finding inspiration in the bible. In her chapter on planning, Houghton discusses the idea of developing a Technology Needs Pyramid, modeled on Maslow’s Hierarchy of Needs and originally proposed by Aaron Schmidt on the Walking Paper blog. Necessary skills and competencies make up the base, and more idealistic areas of interest are at the top. Most of my research has pointed me towards creating a list of competencies, but the pyramid was much more appealing to a visual thinker like me.

    In order to construct a pyramid for the Reference Services department, we held a brainstorming session where I asked my co-workers what they feel they need to know to work at the reference desk, followed by what they want to learn. At Houghton’s suggestion, I also bribed them with treats. The results were a mix of topics (things like data visualization and digital mapping) paired with specific software (Outlook, Excel, Photoshop).

    Once we had a list I created four levels for our pyramid. “Need to Know” is at the bottom and “Want to Learn” is at the top, with a mix of both in between. I hope that this pyramid will act as a guideline for our department, but more than anything it will act as a guide for me going forward. I’ve already printed it and pinned it to my bulletin board as a friendly daily reminder of what my department needs and what they’re curious about. While I’d like to recommend the Technology Needs Pyramid to everyone, I don’t have the results just yet! I look forward to sharing our progress as we rework our technology plan. In the meantime I can tell you that whether it’s a list, graphic, or narrative; collecting (and demanding) feedback from your colleagues is vital. It’s not always easy, but it’s definitely worth the cost of a dozen donuts.

    Harvard Library Innovation Lab: LIL at IIPC: Noticing Reykjavik

    Tue, 2016-04-12 23:33

    Matt, Jack, and Anastasia are in Reykjavik this week, along with Genève Campbell of the Berkman Center for Internet and Society, for the annual meeting of the International Internet Preservation Consortium. We’ll have lots of details from IIPC coming soon, but for this first post we wanted to share some of the things we’re noticing here in Reykjavik. 

    [Genève] Nothing in Reykjavik seems to be empty space. There is always room for something different, new, creative, or odd to fill voids. This is the parking garage of the Harpa concert hall. Traditional fluorescent lamps are interspersed with similar ones in bright colors.

    [Jack] I love how many ways there are to design something as simple as a bathroom. Here are some details I noticed in our guest house:


    Clockwise from top left: shower set into floor; sweet TP stash as design element; soap on a spike and exposed hot/cold pipes; toilet tank built into wall.

    [Matt] Walking around the city is colorful and delightful. Spotting an engaging piece of street art is a regular occurrence. A wonderful, regular occurrence.

    [Anastasia] After returning from Iceland for the first time a year ago, I found myself missing something I don’t normally give much thought to: Icelandic money is some of the loveliest currency I have ever seen.

    The banknotes are quite complex artistically, and yet every denomination abides by thoughtful design principles. Each banknote’s side shows either a culturally-significant figure or scene. The denomination is displayed prominently, the typography is ornate but consistent. The colors, beautiful.

    But what trumps the aesthetics is the banknotes’ dimensions. Icelandic paper money is sized according to amount: the 500Kr note is smaller than the 1000Kr note, which in turn is outsized by the 5000Kr note. This is incredibly important — it allows visually impaired people to move about more freely in the world.

    In comparison, our money looks silly and our treasury department negligent, as it is impossible to differentiate the values by touch alone. And, confoundingly, there don’t seem to be movements to amend this either: in 2015 the department made “strides” by announcing it would start providing money readers, little machines that read value to people who filled out what I’m sure is not a fun amount of paperwork, instead of coming up with a simple design solution.

    The coins are a different story. When I first arrived the clunky coins were a happy surprise — they’re delightfully weighty (maybe even a little too bulky for normally non-cash-carrying types), adorned with beautifully thoughtful designs. On one side of each of the coins (gold or silver), the denomination stands out in large type along with local sea creatures: a big Lumpfish, three small Capelin fish, a dolphin, a Shore crab.

    On the reverse side, the four great guardians of Iceland gaze intensely. They are the dragon (Dreki), the griffin (Gammur), the bull (Griðungur), and the giant (Bergrisi), each of which, according to the Saga of King Olaf Tryggvason, in turn protected Iceland from invasion by Denmark. On the back of the 1 Krona, only the giant stands, commanding.

    And that’s it. No superfluous information. No humans, either, only mythology and fish.

    Returning home is a good thing, but sometimes it also means re-entering a world where money is just sad green rectangles (and oddly sized coins) full of earthly men.

    Villanova Library Technology Blog: CFP: Libraries and Archives in the Anthropocene: A Colloquium at NYU

    Tue, 2016-04-12 21:54

    Call for Proposals

    Libraries and Archives in the Anthropocene: A Colloquium
    May 13-14, 2017
    New York University

    As stewards of a culture’s collective knowledge, libraries and archives are facing the realities of cataclysmic environmental change with a dawning awareness of its unique implications for their missions and activities. Some professionals in these fields are focusing new energies on the need for environmentally sustainable practices in their institutions. Some are prioritizing the role of libraries and archives in supporting climate change communication and influencing government policy and public awareness. Others foresee an inevitable unraveling of systems and ponder the role of libraries and archives in a world much different from the one we take for granted. Climate disruption, peak oil, toxic waste, deforestation, soil salinity and agricultural crisis, depletion of groundwater and other natural resources, loss of biodiversity, mass migration, sea level rise, and extreme weather events are all problems that indirectly threaten to overwhelm civilization’s knowledge infrastructures, and present information institutions with unprecedented challenges.

    This colloquium will serve as a space to explore these challenges and establish directions for future efforts and investigations. We invite proposals from academics, librarians, archivists, activists, and others.

    Some suggested topics and questions:

    • How can information institutions operate more sustainably?
    • How can information institutions better serve the needs of policy discussions and public awareness in the area of climate change and other threats to the environment?
    • How can information institutions support skillsets and technologies that are relevant following systemic unraveling?
    • What will information work look like without the infrastructures we take for granted?
    • How does information literacy instruction intersect with ecoliteracy?
    • How can information professionals support radical environmental activism?
    • What are the implications of climate change for disaster preparedness?
    • What role do information workers have in addressing issues of environmental justice?
    • What are the implications of climate change for preservation practices?
    • Should we question the wisdom of preserving access to the technological cultural legacy that has led to the crisis?
    • Is there a new responsibility to document, as a mode of bearing witness, the historical event of society’s confrontation with the systemic threat of climate change, peak oil, and other environmental problems?
    • Given the ideological foundations of libraries and archives in Enlightenment thought, and given that Enlightenment civilization may be leading to its own environmental endpoint, are these ideological foundations called into question? And with what consequences?

    Formats:
    Lightning talk (5 minutes)
    Paper (20 minutes)

    Proposals are due August 1, 2016.
    Notifications of acceptance will be sent by September 16, 2016.
    Submit your proposal here: http://goo.gl/forms/rz7uN1mBNM

    Planning committee:

     


