
planet code4lib

Planet Code4Lib - http://planet.code4lib.org

Islandora: New to Islandora: The Technical Advisory Group

Wed, 2017-09-27 13:49

Along with our recent shift to calling the Islandora Roadmap Committee the Islandora Coordinating Committee comes another addition to our governing structure: the Islandora Technical Advisory Group (TAG). This new body, capped at eight members so it stays agile, will work with the Technical Lead (Danny Lamb) to make recommendations about the development of Islandora, while considering the very diverse needs of our community. Members of the Technical Advisory Group can be drawn from any member institution in the Islandora Foundation, and are elected by and report to the Board of Directors.

Activities of the TAG include:

  • Remaining abreast of trends in community development and their implications for the larger code base.
  • Researching and advising on architectural decisions.

  • Advising on development and deprecation timelines for major versions.

  • Recommending priorities for community sprints.

  • At the request of the Committers, providing guidance about specific development issues.

  • Liaising with the Islandora community to clarify optimal development goals and general needs.

  • Providing guidance and support to the Technical Lead for making decisions and setting priorities.

  • Presenting recommendations to the Board of Directors and/or Roadmap regarding things such as:

    • Licensing issues

    • Spending of money (such as to hire contractors or pay for development tools)

    • Suggesting new nominees at lower levels of membership for the TAG

    • Acceptance of new modules into the code base

  • Organizing meetings and communicating regularly with stakeholders.

Meetings of the TAG will be limited to members, but agendas will be published ahead of time with a request for suggested items from the community, and notes will be published after each meeting.

The initial members of the Islandora Technical Advisory Group, as elected by the Board of Directors, are:

  • Jared Whiklo (University of Manitoba)
  • Bryan Brown (Florida State University)

  • William Panting (discoverygarden, Inc.)

  • Jonathan Green (LYRASIS)

  • Derek Merleaux (Born-Digital)

  • Rosie Le Faive (University of Prince Edward Island)

District Dispatch: After 20 years, House hearing focuses on depository libraries

Wed, 2017-09-27 13:30

On September 26, Congress’ Committee on House Administration held a hearing to discuss the Federal Depository Library Program (FDLP) – the first such hearing in 20 years.

The hearing was part of the committee’s initiative to examine Title 44 of the U.S. Code, which is the basis for the FDLP and the Government Publishing Office (GPO). While much of the law has not been substantially changed since 1962, today’s meeting is further evidence of growing momentum in Congress to develop legislation that will bolster the FDLP and help libraries connect Americans to their government.

The committee heard today from four librarians, testifying as individual experts rather than for their institutions, about their ideas for strengthening the program to improve the public’s access to government information. In addition, Laurie Hall, GPO’s acting Superintendent of Documents (and a librarian!), testified about the office’s oversight of the program.

Appearing before the committee were:

  • Mike Furlough, executive director of HathiTrust Digital Library, whose members include 128 Federal Depository Libraries
  • Celina McDonald, government documents & criminology librarian at the University of Maryland, the regional depository library for Maryland, Delaware, and the District of Columbia
  • Stephen Parks, State Librarian of the Mississippi State Law Library, which is a selective depository
  • Beth Williams, library director at Stanford Law School, a selective depository

Testimony highlighted the enduring value of the FDLP in ensuring that Americans can access the documents of their government, not just today but in the future. The witnesses also discussed several ideas for facilitating collaboration between GPO and libraries, preserving publications over the long term and improving digital access to government publications.

Committee chairman Rep. Gregg Harper (R-MS3) described the hearing as an opportunity “to see how we can make something that we like, better.” ALA extends our thanks to Chairman Harper and the committee members for their interest in modernizing Title 44 and their thoughtful questions today.

The post After 20 years, House hearing focuses on depository libraries appeared first on District Dispatch.

Terry Reese: MarcEdit 7: Continued Task Refinements

Wed, 2017-09-27 05:26

Now that MarcEdit 7 is available for alpha testers, I’ve been getting back some feedback on the new task processing.  Some of this feedback relates to a couple of errors showing up in tasks that request user interaction…other feedback is related to the tasks themselves and continued performance improvements.

In this implementation, one of the areas that I am really focusing on is performance.  To that end, I changed the way that tasks are processed.  Previously, task processing looked very much like this:

A user would initiate a task via the GUI or command-line, and once the task was processed, the program would then, via the GUI, open a hidden window that would populate each of the “task” windows and then “click” the process button.  Essentially, it was working much like a program that “sends” keystrokes to a window, but in a method that was a bit more automated.

This process had some pros and cons. On the plus side, Tasks was something added to MarcEdit 6.x, so this approach allowed me to easily add the task processing functionality without tearing the program apart. That was a major win, as tasks then were just a simple matter of processing the commands and filling a hidden form for the user. On the con side, the task processing had a number of hidden performance penalties. While tasks automated processing (which allowed for improved workflows), each task processed the file separately, and after each process, the file would be reloaded into the MarcEditor. Say you had a file that took 10 seconds to load and a task list with 6 tasks. The file loading alone would cost you a minute. Now, consider if that same file had to be processed by a task list with 60 different task elements – that would be 10 minutes dedicated just to file loading and unloading, and that doesn’t count the time to actually process the data.

This was a problem, so with MarcEdit 7, I took the opportunity to actually tear down the way that tasks work.  This meant divorcing the application from the task process and creating a broker that could evaluate tasks being passed to it, and manage the various aspects of task processing.  This has led to the development of a process model that looks more like this now:

Once a task is initiated and it has been parsed, the task operations are passed to a broker. The broker then looks at the task elements and the file to be processed, and then negotiates those actions directly with the program libraries. This removes any file loading penalties, and allows me to manage memory and temporary file use in a much more granular way. It also immediately speeds up the process. Take that file that takes 10 seconds to load and 60 tasks to complete. Immediately, you improve processing time by 10 minutes. But the question still arises, could I do more?

And the answer to this question is actually yes. The broker has the ability to process tasks in a number of different ways. One of these is handling each task process one by one at a file level; the other is handling all tasks at once, but at a record level. You might think that record-level processing would always be faster, but it’s not. Consider the task list with 60 tasks. Some of these elements may only apply to a small subset of records. In the by-file process, I can quickly shortcut processing of records that are out of scope; in a record-by-record approach, I actually have to evaluate every record. So, in testing, I found that when records are smaller than a certain size and the number of task actions to process is below a certain number (regardless of file size), it is almost always better to process the data by file. Where this changes is when you have a larger task list. How large? I’m still trying to figure that out. But as an example, I had a real-world file sent to me that has over 950 task actions to process on a file ~350 MB (344,000 records) in size. While the by-file process is significantly faster than the MarcEdit 6.x method (each process incurred a 17 second file load penalty), this still takes a lot of time to process because you are doing 950+ actions and complete file reads. While this type of processing might not be particularly common (I do believe this is getting into outlier territory), it can help to illustrate what I’m trying to teach the broker to do. I ran this file using the three different processing methodologies, and here are the results:

  1. MarcEdit 6.3.x: 962 Task Actions, completing in ~7 hours
  2. By File: 962 Task Actions, completing in 3 hours, 12 minutes
  3. By Record: 962 Task Actions, completing in 2 hours and 20 minutes

 

So, that’s still a really, really long time, but take a closer look at the file and the changes made, and you can start to see why this process takes so much time. Looking at the results file, more than 10 million changes have been processed against the 340,000+ records. Also, consider the number of iterations that must take place. The average record has approximately 20 fields. Since each task needs to act upon the results of the task before it, it’s impossible to have tasks process at the same time – rather, tasks must happen in succession. This means that each task must process the entire record, as the results of a task may require an action based on data changed anywhere in the record. So for one record, the program needs to run 962 operations, which means looping through 19,240 fields (assuming no fields are added or deleted). Extrapolate that number for 340,000 records, and the program needs to evaluate 6,541,600,000 fields, or over 6 billion field evaluations, which works out to roughly 49,557,575 field evaluations per minute.
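As a quick sanity check on that arithmetic, here is a back-of-the-envelope restatement of the figures quoted above (roughly 20 fields per record, 962 task actions, 340,000 records); it is just the numbers from this paragraph in runnable form, not anything MarcEdit itself executes:

    # Back-of-the-envelope check of the field evaluation counts quoted above.
    fields_per_record = 20      # average fields per record
    task_actions = 962          # actions in the task list
    records = 340_000           # records in the test file

    evaluations_per_record = fields_per_record * task_actions   # 19,240
    total_evaluations = evaluations_per_record * records        # 6,541,600,000

    print(f"{evaluations_per_record:,} field evaluations per record")
    print(f"{total_evaluations:,} field evaluations across the file")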

Ideally, I’d love to see the processing time for this task/file pair to be down around 1 hour and 30 minutes. That would cut the current MarcEdit 7 processing time in half, and be almost 5 hours and 30 minutes faster than the current MarcEdit 6.3.x processing. Can I get the processing down to that number? I’m not sure. There are still optimizations to be had – loops that can be optimized, buffering, etc. – but I think the biggest potential speed gains may be available by adding some pre-processing to a task process to do a cursory evaluation of a recordset if a set of find criteria is present. This wouldn’t affect every task, but potentially could improve selective processing of Edit Indicator, Edit Field, Edit Subfield, and Add/Delete field functions. This is likely the next area that I’ll be evaluating.

Of course, the other question to solve is exactly where the tipping point is at which By File processing becomes less efficient than By Record processing. My guess is that the characteristic that will matter most in this decision will be the number of task actions needing to be processed. Splitting this file, for example, into a file of 1,000 records and running this task by record versus by file, we see the following:

  1. By File processing, 962 Task Actions, completed in: 0.69 minutes
  2. By Record Processing, 962 Task Actions, completed in: 0.36 minutes

 

The processing times are relatively close, but By Record processing is twice as fast as By File processing. If we reduce the number of tasks to under 20, there is a dramatic switch in the processing time and By File processing becomes the clear winner.
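To make the broker idea concrete, here is a minimal sketch (in Python, purely for illustration) of how a broker might choose between the two strategies based on the size of the task list. The 20-action threshold comes from the testing described above, but the TaskAction interface and function names are assumptions, not MarcEdit's actual code:

    # Hypothetical sketch of a task broker choosing a processing strategy.
    TASK_ACTION_THRESHOLD = 20  # rough crossover point suggested by the tests above

    def run_task_list(records, task_actions):
        """Apply every task action to every record using the cheaper strategy."""
        if len(task_actions) <= TASK_ACTION_THRESHOLD:
            return process_by_file(records, task_actions)
        return process_by_record(records, task_actions)

    def process_by_file(records, task_actions):
        # One pass over the whole file per task action; records a task does not
        # touch can be skipped cheaply without evaluating every field.
        for action in task_actions:
            records = [action.apply(r) if action.in_scope(r) else r for r in records]
        return records

    def process_by_record(records, task_actions):
        # One pass over the file; every task action runs against each record in
        # sequence, since each action may depend on the previous one's result.
        updated = []
        for record in records:
            for action in task_actions:
                record = action.apply(record)
            updated.append(record)
        return updated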

Obviously, there is some additional work to be done here, and more testing to do to understand what characteristics and which processing style will lead to the greatest processing gains, but from this testing, I came away with a couple of pieces of information. First, the MarcEdit 7 process, regardless of method used, is way faster than MarcEdit 6.3.x. Second, both the MarcEdit 7 process and the MarcEdit 6.3.x process suffered from a flaw related to temp file management. You can’t see it unless you work with files this large and with this many tasks, but the program cleans up temporary files only after all processing is complete. Normally, in a single operation environment, that happens right away. Since each task represents a single operation, ~962 temporary files at ~350 MB per file were created as part of both processes. That’s 336,700 MB, or roughly 336 GB, of temporary data! When you close the program, that data is all cleared, but again, ouch. As I say, normally you’d never see this kind of problem, but in this kind of edge case, it shows up clearly. This has led me to implement periodic temp file cleanup so that no more than 10 temporary files are stored at any given time. While that still means that in the case of this test file, up to ~3.5 GB of temporary data could be stored, the size of that temp cache would never grow larger. This seems to be a big win, and something I would never have seen without working with this kind of data file and use case.
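For what it's worth, the periodic cleanup described above only takes a few lines; this sketch is my own illustration (not MarcEdit's implementation) and simply keeps the ten most recently modified files in a temp directory:

    import os

    MAX_TEMP_FILES = 10  # retention limit described above

    def prune_temp_files(temp_dir):
        """Delete all but the newest MAX_TEMP_FILES files in temp_dir."""
        paths = [os.path.join(temp_dir, name) for name in os.listdir(temp_dir)]
        paths = [p for p in paths if os.path.isfile(p)]
        paths.sort(key=os.path.getmtime, reverse=True)  # newest first
        for stale in paths[MAX_TEMP_FILES:]:
            os.remove(stale)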

Finally, let’s say that after all this work, I’m able to hit the best case benchmarks (1 hr. 30 min.) and a user still feels that this is too long. What more could be done? Honestly, I’ve been thinking about that…but really, very little. There will be a performance ceiling given how MarcEdit has to process task data. So for those users, if this kind of processing time isn’t acceptable, I believe only a custom-built solution would provide better performance – but even with a custom build, I doubt you’d see significant gains as long as tasks must be processed in sequence.

Anyway – this is maybe a deeper dive into how tasks work in MarcEdit 6.3.x and how they will work in MarcEdit 7 than anyone was really looking for – but this particular set of files and use case represented an interesting opportunity to really test the various methods and provide benchmarks that easily demonstrate the impact of the current task process changes.

If you have questions, feel free to let me know.

–tr

DuraSpace News: PASIG 2017 Roundup

Wed, 2017-09-27 00:00

Contributed by Erin Tripp, Business Development Manager for DuraSpace

This was my second time attending PASIG and it was a wonderful experience. It featured stories of successes and failures, detailed explanations of existing guidelines, and expressions of support that we can gain the skills and knowledge to meet our digital preservation goals. If you couldn’t be there, we have you covered.

DuraSpace News: Hot Topics at 4Science: ORCiD API v.2 and New Projects in Four Continents

Wed, 2017-09-27 00:00

From Susanna Mornati, 4Science: 4Science released DSpace-CRIS 5.8, including support for ORCiD API v.2.

DuraSpace News: Benjamin Gross joins Clarivate Analytics to Support VIVO Implementations

Wed, 2017-09-27 00:00

From Ann Beynon, Clarivate Analytics: We are pleased to announce that Benjamin Gross has joined the Clarivate Analytics team to support VIVO implementations, as well as other data integration projects. Benjamin is well-known in the VIVO community as an active contributor to the VIVO project, Google groups, and conferences. Benjamin was instrumental in supporting VIVO when at UNAVCO.

David Rosenthal: Sustaining Open Resources

Tue, 2017-09-26 15:00
Cambridge University Office of Scholarly Communication's Unlocking Research blog has an interesting trilogy of posts looking at the issue of how open access research resources can be sustained for the long term. Below the fold I summarize each of their arguments and make some overall observations.
Lauren Cadwallader

From the researcher's perspective, Dr. Cadwallader uses the example of the Virtual Fly Brain, a domain-specific repository for the connections of neurons in Drosophila brains. It was established by UK researchers 8 years ago and is now used by about 10 labs in the UK and about 200 worldwide. It was awarded a 3-year Research Council grant, which was not renewed. The Wellcome Trust awarded a further 3 year grant, ending this month. As of June:

it is uncertain whether or not they will fund it in the future. ... On the one hand funders like the Wellcome Trust, Research Councils UK and National Institutes of Health (NIH) are encouraging researchers to use domain specific repositories for data sharing. Yet on the other, they are acknowledging that the current approaches for these resources are not necessarily sustainable.

Clearly, this is a global resource not a UK one, but there is no global institution funding research in Drosophila brains. There is a free rider problem; each individual national or charitable funder depends on the resource but would rather not pay for it, and there is no penalty for avoiding paying until it is too late and the resource has gone.
David Carr

From the perspective of the Open Research team at the Wellcome Trust, Carr notes that:
Rather than ask for a data management plan, applicants are now asked to provide an outputs management plan setting out how they will maximise the value of their research outputs more broadly.

Wellcome commits to meet the costs of these plans as an integral part of the grant, and provides guidance on the costs that funding applicants should consider. We recognise, however, that many research outputs will continue to have value long after the funding period comes to an end. We must accept that preserving and making these outputs available into the future carries an ongoing cost.

Wellcome has been addressing these on-going costs by providing:

significant grant funding to repositories, databases and other community resources. As of July 2016, Wellcome had active grants totalling £80 million to support major data resources. We have also invested many millions more in major cohort and longitudinal studies, such as UK Biobank and ALSPAC. We provide such support through our Biomedical Resource and Technology Development scheme, and have provided additional major awards over the years to support key resources, such as PDB-Europe, Ensembl and the Open Microscopy Environment.

However, these are still grants with end-dates such as faced the Virtual Fly Brain:

While our funding for these resources is not open-ended and subject to review, we have been conscious for some time that the reliance of key community resources on grant funding (typically of three to five years’ duration) can create significant challenges, hindering their ability to plan for the long-term and retain staff.

Clearly funders have difficulty committing funds for the long term. And if their short-term funding is successful, they are faced with a "too big to fail" problem. The repository says "pay up now or the entire field of research gets it". Not where a funder wants to end up. Nor is the necessary brinkmanship conducive to "their ability to plan for the long-term and retain staff".

An international workshop of data resources and major funders in the life sciences:
resulted in a call for action (reported in Nature) to coordinate efforts to ensure long-term sustainability of key resources, whilst supporting resources in providing access at no charge to users. The group proposed an international mechanism to prioritise core data resources of global importance, building on the work undertaken by ELIXIR to define criteria for such resources. It was proposed national funders could potentially then contribute a set proportion of their overall funding (with initial proposals suggesting around 1.5 to 2 per cent) to support these core data resources.

A voluntary "tax" of this kind may be the least bad approach to funding global resources.
Dave Gerrard

From the perspective of a Technical Specialist Fellow from the Polonsky-Foundation-funded Digital Preservation at Oxford and Cambridge project, Gerrard argues that there are two different audiences for open resources. I agree with him about the impracticality of the OAIS concept of Designated Community:
The concept of Designated Communities is one that, in my opinion, the OAIS Reference Model never adequately gets to grips with. For instance, the OAIS Model suggests including explanatory information in specialist repositories to make the content understandable to the general community.

Long term access within this definition thus implies designing repositories for Designated Communities consisting of what my co-Polonsky-Fellow Lee Pretlove describes as: “all of humanity, plus robots”. The deluge of additional information that would need to be added to support this totally general resource would render it unusable; to aim at everybody is effectively aiming at nobody. And, crucially, “nobody” is precisely who is most likely to fund a “specialist repository for everyone”, too.

Gerrard argues that the two audiences need:
two quite different types of repository. There’s the ‘ultra-specialised’ Open Research repository for the Designated Community of researchers in the related domain, and then there’s the more general institutional ‘special collection’ repository containing materials that provide context to the science, ... Sitting somewhere between the two are publications – the specialist repository might host early drafts and work in progress, while the institutional repository contains finished, published work. And the institutional repository might also collect enough data to support these publications.

Gerrard is correct to point out that:
a scientist needs access to her ‘personal papers’ while she’s still working, so, in the old days (i.e. more than 25 years ago) the archive couldn’t take these while she was still active, and would often have to wait for the professor to retire, or even die, before such items could be donated. However, now everything is digital, the prof can both keep her “papers” locally and deposit them at the same time. The library special collection doesn’t need to wait for the professor to die to get their hands on the context of her work. Or indeed, wait for her to become a professor.

This works in an ideal world because:
A further outcome of being able to donate digitally is that scientists become more responsible for managing their personal digital materials well, so that it’s easier to donate them as they go along.

But in the real world this effort to "keep their ongoing work neat and tidy" is frequently viewed as a distraction from the urgent task of publishing not perishing. The researcher bears the cost of depositing her materials, the benefits accrue to other researchers in the future. Not a powerful motivation.

Gerrard argues that his model clarifies the funding issues:
Funding specialist Open Research repositories should be the responsibility of funders in that domain, but they shouldn’t have to worry about long-term access to those resources. As long as the science is active enough that it’s getting funded, then a proportion of that funding should go to the repositories that science needs to support it.

Whereas:
university / institutional repositories need to find quite separate funding for their archivists to start building relationships with those same scientists, and working with them to both collect the context surrounding their science as they go along, and prepare for the time when the specialist repository needs to be mothballed. With such contextual materials in place, there don’t seem to be too many insurmountable technical reasons why, when it’s acknowledged that the “switch from one Designated Community to another” has reached the requisite tipping point, the university / institutional repository couldn’t archive the whole of the specialist research repository, describe it sensibly using the contextual material they have collected from the relevant scientists as they’ve gone along, and then store it cheaply.

This sounds plausible but both halves ignore problems:
  • The value of the resource will outlast many grants, where the funders are constrained to award short-term grants. A voluntary "tax" on these grants would diversify the repository's income, but voluntary "taxes" are subject to the free-rider problem. To assure staff recruiting and minimize churn, the repository needs reserves, so the tax needs to exceed the running cost, reinforcing the free-rider's incentives.
  • These open research repositories are a global resource. Once the "tipping point" happens, which of the many university or institutional repositories gets to bear the cost of ingesting and preserving the global resource? All the others get to free-ride. Or does Gerrard envisage disaggregating the domain repository so that each researcher's contributions end up in their institution's repository? If so, how are contributions handled from (a) collaborations between labs, and (b) a researcher's career that spans multiple institutions? Or does he envisage the researcher depositing everything into both the domain and the institutional repository? The researcher's motivation is to deposit into the domain repository. The additional work to deposit into the institutional repository is just make-work to benefit the institution, to which these days most researchers have little loyalty. The whole value of domain repositories is the way they aggregate the outputs of all researchers in a field. Isn't it important to preserve that value for the long term?

LITA: LITA Fall 2017 Online Learning Line Up

Tue, 2017-09-26 14:00

Don’t miss out on the excellent online offerings put together by the LITA Education committee for this fall.

Check out all the offerings at the

LITA Online Learning page.

Select from these titles and find registration details and links on each of the sessions pages.

Building Services Around Reproducibility & Open Scholarship
with presenter: Vicky Steeves
A blended web course with weekly webinars, offered: October 18, 2017 – November 8, 2017

Taking Altmetrics to the Next Level in Your Library’s Systems and Services
with presenter: Lily Troia
Webinar offered: October 19, 2017

Diversity and Inclusion in Library Makerspace
with presenters: Sharona Ginsberg and Lauren Di Monte
Webinar offered: October 24, 2017

Digital Life Decoded: A user-centered approach to cyber-security and privacy
with presenters: Hannah Rainey, Sonoe Nakasone and Will Cross
Webinar offered: November 7, 2017

Introduction to Schema.org and JSON-LD
with presenter: Jacob Shelby
Webinar offered: November 15, 2017

Sign up for any and all of these great sessions today.

Questions or Comments?

Contact LITA at (312) 280-4269 or Mark Beatty, mbeatty@ala.org.

 

DuraSpace News: The 2.5% Commitment

Tue, 2017-09-26 00:00

This blog post was contributed by David W. Lewis, Dean of the IUPUI University Library, dlewis@iupui.edu. This work is licensed under a Creative Commons Attribution 4.0 International license.

The Commitment: Every academic library should commit to contribute 2.5% of its total budget to support the common infrastructure needed to create the open scholarly commons.

Academic Libraries and the Open Scholarly Commons

DuraSpace News: Register: Samvera Connect 2017

Tue, 2017-09-26 00:00

November 6 - 9, 2017 join us in Evanston, Illinois for our annual, worldwide gathering of the Samvera Community.  Samvera Connect 2017 is a chance for Samvera Community participants to gather in one place at one time, with an emphasis on synchronizing efforts, technical development, plans, and community links. 

HangingTogether: Making a Book Open Access at HathiTrust.org

Mon, 2017-09-25 22:12

About thirteen years ago I compiled six years’ worth of my monthly “Digital Library” columns published by Library Journal, and edited, updated, and collected them into chapters that Library Journal published as Managing the Digital Library under its Reed Press imprint. Unfortunately, Reed Press imploded not long after the book was published and I was left with only the advance for my effort. The book was first sold to another publisher, then it was eventually passed on to the American Library Association. ALA sent me a royalty check in 2016 for the princely sum of $23.50.

Recently, I happened to come across the book at HathiTrust.org and it occurred to me that I could make it open access. Since I am the copyright holder, I could request that they open it up.

As it turns out, they anticipated this type of request and had created a form for requesting a change in the permissions. All I had to do was fill out the form and mail it in, and within about a week it was open to everyone under a Creative Commons CC-BY license. It couldn’t have been simpler. Thank you HathiTrust!

Raymond Yee: Some of what I missed from the Cmd-D Automation Conference

Mon, 2017-09-25 22:00

The CMD-D|Masters of Automation one-day conference in early August would have been right up my alley:

It’ll be a full day of exploring the current state of automation technology on both Apple platforms, sharing ideas and concepts, and showing what’s possible—all with the goal of inspiring and furthering development of your own automation projects.

Fortunately, those of us who missed it can still get a meaty summary of the meeting by listening to the podcast segment Upgrade #154: Masters of Automation – Relay FM. I've been keen on automation for a long time now and was delighted to hear the panelists express their own enthusiasm for customizing their Macs, iPhones, or iPads to make repetitive tasks much easier and less time-consuming.

Noteworthy take-aways from the podcast include:

  • Something that I hear and believe but have yet to experience in person: non-programmers can make use of automation through applications such as Automator — for macOS — and Workflow for iOS. Also mentioned often as tools that are accessible to non-geeks: Hazel and Alfred – Productivity App for Mac OS X.
  • Automation can make the lives of computer users easier but it's not immediately obvious to many people exactly how.
  • To make a lot of headway in automating your workflow, you need a problem that you are motivated to solve.
  • Many people use AppleScript by borrowing from others, just like how many learn HTML and CSS from copying, pasting, and adapting source on the web.
  • Once you get a taste for automation, you will seek out applications that are scriptable and avoid those that are not. My question is how to make it easier for developers to make their applications scriptable without incurring onerous development or maintenance costs?
  • E-book production is an interesting use case for automation.
  • People have built businesses around scripting Photoshop [is there really a large enough market?]
  • OmniGroup's automation model is well worth studying and using.

I hope there will be a conference next year to continue fostering this community of automation enthusiasts and professionals.

Dan Cohen: Introducing the What’s New Podcast

Mon, 2017-09-25 17:53

My new podcast, What’s New, has launched, and I’m truly excited about the opportunity to explore new ideas and discoveries on the show. What’s New will cover a wide range of topics, from the humanities, social sciences, natural sciences, and technology, and it is intended for anyone who wants to learn new things. I hope that you’ll subscribe today on iTunes, Google Play, or SoundCloud.

I hugely enjoyed doing the Digital Campus podcast that ran from 2007-2015, and so I’m thrilled to return to this medium. Unlike Digital Campus, which took the format of a roundtable with several colleagues from George Mason University, on What’s New I’ll be speaking largely one-on-one with experts, at Northeastern University and well beyond, to understand how their research is changing our understanding of the world, and might improve the human condition. In a half-hour podcast you’ll come away with a better sense of cutting-edge scientific and medical discoveries, the latest in public policy and social movements, and the newest insights of literature and history.

I know that the world seems impossibly complex and troubling right now, but one of the themes of What’s New is that while we’re all paying closer attention to the loud drumbeat of social media, there are people in many disciplines making quieter advances, innovations, and creative works that may enlighten and help us in the near future. So if you’re looking for a podcast with a little bit of optimism to go along with the frank discussion of the difficulties we undoubtedly face, What’s New is for you.

District Dispatch: Next CopyTalk – Understanding Rights Reversion

Mon, 2017-09-25 17:16

Photo credit: trophygeek

Join copyright attorney Brianna Schofield, Executive Director, and Erika Wilson, Communications and Operations Manager, of the Authors Alliance on October 5th to learn about rights reversion. Authors who have a rights reversion provision in their contractual agreements with publishers can regain their copyrights. By doing so, authors can bring their out-of-print books back in print, deposit them in open access repositories, or otherwise make their works available to the public. This CopyTalk will cover why, when, and how to pursue a reversion of rights, review practical tools and resources, and show how librarians can educate authors about their options to regain rights from publishers and make their works newly available.

Tune in and see if rights reversion is right for you!

Mark your calendars for October 5th, 11am Pacific/ 2pm Eastern for “Rights reversion: restoring knowledge and culture, one book at a time.” Go to http://ala.adobeconnect.com/copytalk/ and sign in as a guest. You’re in.

CopyTalks are FREE and brought to you by OITP’s copyright education subcommittee. Archived webinars can be found here.

The post Next CopyTalk – Understanding Rights Reversion appeared first on District Dispatch.

Terry Reese: MarcEdit 7 Downloads Page is live

Mon, 2017-09-25 05:22

I’ve made the MarcEdit 7 Alpha/Beta downloads page live.  Please remember, this is not finished software.  Things will break – but if you are interested in testing and providing feedback as I work towards the Production Release in Nov. 2017, please see: http://marcedit.reeset.net/marcedit-7-alphabeta-downloads-page.

–tr

District Dispatch: 2017 Patterson Copyright Award Winner: Jonathan Band

Fri, 2017-09-22 18:06

Jonathan Band, recipient of the 2017 Patterson Award

We are pleased to announce that Jonathan Band is the 2017 recipient of the L. Ray Patterson Copyright Award. The award recognizes contributions of an individual or group that pursues and supports the Constitutional purpose of the U.S. copyright law, fair use and the public domain.

ALA President James Neal had this to say:

“Jonathan Band has guided the library community over two decades through the challenges of the copyright legal and legislative battles,” Neal said. “His deep understanding of our community and the needs of our users, in combination with his remarkable knowledge and supportive style, has raised our understanding of copyright and our commitment to balanced interpretations and applications of the law. The 2017 L. Ray Patterson Copyright Award appropriately celebrates Jonathan’s leadership, counsel and dedication.”

Band, a copyright attorney at policybandwidth and adjunct professor at Georgetown Law School, has represented libraries and technology associations before Congress, the Executive Branch and the Judicial Branch and has written public comments, testimony, amicus briefs, and countless statements supporting balanced copyright. Band’s amicus brief on behalf of the Library Copyright Alliance was quoted in the landmark Kirtsaeng v. Wiley case, where the U.S. Supreme Court ruled that the first sale doctrine applied to books printed abroad, enabling libraries to buy and lend books manufactured overseas. He also represented libraries throughout the Authors Guild v. Google litigation, whose ruling advanced the concept of transformative fair use. Band’s Google Book Settlement/Litigation flow chart, developed to explain the complexity of the case to the public, is widely cited and used worldwide.

The scope of Band’s work extends internationally as well. He has argued for balanced provisions in international trade agreements, including the Trans-Pacific Partnership, and treaties that protect users’ rights and the open internet. He represented U.S. libraries at the World Intellectual Property Organization, which after several years of negotiation adopted the Marrakesh Treaty, mandating enactment of copyright exceptions permitting the making of accessible format copies for print disabled people to help address the worldwide book famine.

(Left to right) Former ALA Washington Office Executive Director Emily Sheketoff, Jonathan Band, Brandon Butler and Mary Rasenberger.

Mr. Band has written extensively on intellectual property and electronic commerce matters, including several books and over 100 articles. He has been quoted as an authority on intellectual property and internet matters in numerous publications, including The New York Times, The Washington Post, USA Today and Forbes, and has been interviewed on National Public Radio, MSNBC and CNN.

The Patterson Award will be presented to Band by ALA President Jim Neal at a reception in Washington, D.C., in October. Several members of the D.C.-based technology policy community also will provide comments on Band’s influential career in advocating for balanced copyright policy.

The post 2017 Patterson Copyright Award Winner: Jonathan Band appeared first on District Dispatch.

District Dispatch: Libraries Ready to Code adds capacity

Fri, 2017-09-22 14:00

Guest post by Linda Braun, CE coordinator for the Young Adult Library Services Association. Linda has been involved with ALA’s Libraries Ready to Code initiative since its start, serving as project researcher and now assisting with the administration of the Phase III grant program.

The Libraries Ready to Code (RtC) team is growing by one! Dr. Mega Subramaniam, Associate Professor, College of Information Studies, University of Maryland, will serve as Ready to Code Fellow during 2017-2018. Dr. Subramaniam began her involvement with RtC as an advisory committee member and currently serves as co-principal investigator for RtC Phase II. She will contribute overall guidance based on her professional expertise as well as her role as an ALA member.

We are also happy to announce the RtC Selection Committee, who will contribute their expertise (and time!) to select a cohort of school and public libraries as part of the next phase of our multi-year initiative. The committee includes representatives of the Association for Library Service to Children (ALSC), the American Association of School Librarians (AASL), the Young Adult Library Services Association (YALSA), and the Office for Information Technology Policy (OITP). We are extremely pleased to have such a strong collaboration across different ALA units. Committee members are:

  • Michelle Cooper, White Oak Middle School Media Center, White Oak, TX (AASL)
  • Dr. Colette Drouillard, Valdosta State University, Valdosta, GA (ALSC)
  • Dr. Aaron Elkins, Texas Woman’s University, Denton, TX (AASL)
  • Shilo Halfen (Chair), Chicago Public Library, Chicago, IL (ALSC)
  • Christopher Harris, Genesee Valley Educational Partnership, Le Roy, NY (OITP)
  • Kelly Marie Hincks, Detroit Country Day School, Bloomfield Hills, MI (AASL)
  • Peter Kirschmann, The Clubhouse Network, Boston, MA (YALSA)
  • Dr. Rachel Magee, University of Illinois, Urbana Champaign, Champaign, IL (YALSA)
  • Carrie Sanders, Maryland State Library, Baltimore, MD (YALSA)
  • Conni Strittmatter, Harford County Public Library, Belcamp, MD (ALSC)

The committee is reviewing over 300 applications from across the country to design and implement learning activities that foster youth acquisition of computational thinking and/or computer science (CS) skills. Awards up to $25,000 will be made to as many as 50 libraries. Awardees will form a cohort that will provide feedback for the development of a toolkit of resources and implementation strategies for libraries to use when integrating computational thinking and CS into their activities with and for youth. The resulting toolkit will be made widely available so any library can use it at no cost. The program is sponsored by Google as part of its ongoing commitment to ensure library staff are prepared to provide rich coding/CS programs for youth.

This project is Phase III of the RtC ALA-Google collaboration. The work began with an environmental scan of youth-focused library coding activities. “Ready to Code: Connecting Youth to CS Opportunity Through Libraries,” published as a result of that work, highlights what libraries and library staff need in order to provide high-quality youth-centered computational thinking and computer science activities. Phase II of the project provides faculty at LIS programs across the United States with the opportunity to redesign a syllabus in order to integrate computational thinking and computer science into teaching and learning.

Learn more about the Libraries Ready to Code initiative and access an advocacy video, the Ready to Code report, and infographics on the project website.

The post Libraries Ready to Code adds capacity appeared first on District Dispatch.

Islandora: DuraSpace Migration/ Upgrade Survey: Call for Participation

Fri, 2017-09-22 14:00

From Erin Tripp, Business Development Manager at Duraspace:

I’m collecting anecdotes from people who have undertaken a migration or a major upgrade in the recent past. I hope to collect stories about how the project went, what resources were used or developed during the process, and whether it turned into an opportunity to update skills, re-engage stakeholders, normalize data, re-envision the service, etc.

The data will be used by DuraSpace and affiliate open source communities to develop resources that will fill gaps identified by participants. It will also be used in presentations, blog posts, or other communications that will highlight what we can learn from each other to make migration and upgrade projects a more positive experience in the future.

The data collection will be done through mediated surveys (interview-style) with me (in person, on the phone, or via Skype). Please express your interest in participating by emailing me at etripp@duraspace.org. Or, if you prefer, you can also fill out the survey online by yourself.  The survey will close on Tuesday, October 17, 2017.

Please note: interviewee names are collected for administrative purposes only and will not appear in any published work unless the permission of the interviewee has been obtained in writing.

Here are the survey questions (* denotes a mandatory response):

  • Name
  • Role*
  • Institution
  • What repository software(s) have you migrated from?*
  • What repository software(s) have you migrated to?*
  • Is/are your repository(ies) customized?*
  • If yes, can you tell us how?
  • When did you last undertake a major migration/update?
  • Did customization impact your migration/upgrade?*
  • If yes, tell us how.
  • What were the most significant challenges of the migration/ upgrade process?*
  • Can you elaborate the challenges faced?
  • What were the most significant benefits of the migration/ upgrade process?*
  • Can you elaborate on the benefits?
  • Were there tools or resources that helped you during the process?
  • What element(s) of the project surprised you?
  • What do you wish you knew when you started the project? *
  • What advice would you offer to others who are planning a migration/ upgrade?
  • Is there anything else you’d like to add?

 

FOSS4Lib Recent Releases: Koha - 17.05.04

Fri, 2017-09-22 12:53

Last updated September 22, 2017. Created by Peter Murray on September 22, 2017.

Package: Koha
Release Date: Friday, September 22, 2017

Lucidworks: Why Facets are Even More Fascinating than you Might Have Thought

Fri, 2017-09-22 08:41

I just got back from an another incredible Lucene/Solr Revolution, this year in Sin City (aka Las Vegas) Nevada. The problem is that there were so many good talks, that I now can’t wait for the video tape to be put up on U-Tube, because I routinely had to make Hobbes choices about which one to see. I was also fortunate to be among those presenting, so my own attempt at cramming well over an hour’s worth of material into 40 minutes will be available for your amusement and hopefully edification as well. In the words of one of my favorite TV comedians from my childhood, Maxwell Smart, I “Missed It by that much”. I ran 4 minutes, 41 seconds over the 40 minutes allotted to be exact.  I know this because I was running a stopwatch on my cell phone to keep me from doing just that. I had done far worse in my science career, cramming my entire Ph.D thesis into a 15 minute slide talk at a Neurosciences convention in Cincinnati – but I was young and foolish then. I should be older and wiser now. You would think.

But it was in that week in Vegas that I reached this synthesis that I’m describing here – and since then have refined even a bit more, which is also why I am writing this blog post.  When I conceived of the talk about a year ago, the idea was to do a sort of review of some interesting things that I had done and blogged about concerning facets. At the time, there must have been a “theme” somewhere in my head – because I remember having been excited about it, but by the time I got around to submitting the abstract four months later and finally putting the slide deck together nearly a year later, I couldn’t remember exactly what that was. I knew that I hadn’t wanted to do a “I did this cool thing, then I did this other cool thing, etc.” about stuff that I had mostly already blogged about, because that would have been a waste of everyone’s time. Fortunately the lens of pressure to get “something” interesting to say after my normal lengthy period of procrastination, plus the inspiration from being at Revolution and the previous days answers to “So Ted, what is your talk going to be about?” led to the light-bulb moment, just in the nick-of-time, that was an even better synthesis than I had had the year before (pretty sure, but again don’t remember, so maybe not – we’ll never know).

My talk was about some interesting things I had done with facets that go beyond traditional usages such as faceted navigation and dashboards. I started with these to get the talk revved up. I also threw in some stuff about the history of facet technologies both to show my age and vast search experience and to compare the terms different vendors used for faceting. At the time, I thought that this was merely interesting from a semantic standpoint, and it also contained an attempt at humor which I’ll get to later. But with my new post-talk improved synthesis, this facet vocabulary comparison is in fact even more interesting, so I am now really glad that I started it off this way (more on this later). I was then planning to launch into my Monty Python “And Now for Something Completely Different” mad scientist section. I also wanted to talk about search and language, which is one of my more predictable soapbox issues. This led up to a live performance of some personal favorite tracks from my quartet of Query Autofilter blogs (1,2,3,4), featuring a new and improved implementation of QAF as a Fusion Query Pipeline Stage (coming soon to Lucidworks Labs) and some new semantic insights gleaned from my recent eCommerce work for a large home products retailer. I also showed an improved version of the “Who’s In The Who” demo that I had attempted 2 years prior in Austin, based on cleaner, slicker query patterns (formerly Verb Patterns). I used a screenshot for Vegas to avoid the ever-present demo gods which had bitten me 2 years earlier. I was not worried about the demo per se with my newly improved and more robust implementation, just boring networking issues and login timeouts and such in Fusion – I needed to be as nimble as I could be. But as I worked on the deck in the week leading up to Revolution, nothing was gelin’ yet.

The Epiphany

I felt that the two most interesting things that I had done with facets were the dynamic boosting typeahead trick from what I like to call my “Jimi Hendrix Blog” and the newer stuff on Keyword Clustering in which I used facets to do some Word-2-Vec’ish things. But as I was preparing to explain these slides – I realized that in both cases, I was doing exactly the same thing at an abstract level!! I had always been talking about “context” as being important – remembering a slide from one of my webinars in which the word CONTEXT was the only word on the slide in bold italic 72 Pt font – a slide that my boss Grant Ingersoll would surely have liked (he had teased me about my well known tendency for extemporizing at lunch before my talk) – I mean, who could talk for more than 2 minutes about one word? As one of my other favorite TV comics from the 60’s and 70’s, Bob Newhart would say – “That … ah … that … would be me”. (but actually not in this case – I timed it – but I’m certainly capable of it) Also, I had always thought of facets as displaying some kind of global result-set context that the UI displayed.

I had also started the talk with a discussion about facets and metadata as being equivalent, but what I realized is that my “type the letter ‘J’ into typeahead, get back alphabetical stuff starting with ‘J’, then search for “Paul McCartney”, then type ‘J’ again and get back ‘John Lennon’ stuff on top” and my heretically mad scientist-esque “facet on all the tokens in a big text field, compute some funky ratios of the returned 50,000 facet values for the ‘positive’ and ‘negative’ queries for each term and VOILA get back some cool Keyword Clusters” examples were based ON THE SAME PRINCIPLE!!! You guessed it: “context”!!!

So, what do we actually mean by “context”?

Context is a word we search guys like to bandy around as if to say, “search is hard, because the answer that you get is dependent on context” – in other words it is often a hand-waving, i.e. B.S. term for “it’s very complicated”. But seriously, what is context? At the risk of getting too abstractly geeky – I would say that ‘context’ is some place or location within some kind of space. Googling the word got me this definition:

“the circumstances that form the setting for an event, statement, or idea, and in terms of which it can be fully understood and assessed.”

Let me zoom in on “setting for an event” as being roughly equivalent to my original more abstract-mathematical PhD-ie (pronounced “fuddy”) “space” notion. In other words, there are different types of context – personal, interpersonal/social/cultural, temporal, personal-temporal (aka personal history), geospatial, subject/categorical – and you can think of them as some kind of “space” in which a “context” is somewhere within that larger space – i.e. some “subspace” as the math and Star Trek geeks would say (remember the “subspace continuum” Trek fans?) – I love this geeky stuff of course, but I hope that it actually helps ‘splain stuff too … The last part, “in terms of which it can be fully understood and assessed”, is also key and resonates nicely with the Theorem that I am about to unfold.

In my initial discussion on facets as being equivalent to metadata, the totality of all of the facet fields and their values in a Solr collection constitutes some sort of global “meta-informational space”. This led to the recollection/realization that this was why Verity called this stuff “Parametric Search” and led Endeca to call these facet things “Dimensions”. We are dealing with what Math/ML geeks would call an “N-Dimensional hyperspace” in which some dimensions are temporal, some numerical and some textual (whew!). Don’t try to get your head around this – again, just think of it as a “space” in which “context” represents some location or area within that space. Facets then represent vectors or pointers into this “meta-informational” subspace of a collection based on the current query and the collected facet values of the result set. You may want to stop now, get something to drink, watch some TV, take a nap, come back and read this paragraph a few more times before moving on. Or not. But to simplify this a bit (what me? – I usually obfuscate) – lets call a set of facets and their values returned from a query as the “meta-informational context” for that query. So that is what facets do, in a kinda-sorta geeky descriptive way. Works for me and hopefully for you too. In any case, we need to move on.

So, getting back to our example – throw in a query or two and for each, get this facet response which we are now calling the result set’s “meta-informational context” and take another look at the previous examples. In the first case, we were searching for “Paul McCartney” – storing this entity’s meta-informational context and then sending it back to the search engine as a boost query and getting back “John Lennon” related stuff. In the second case, we were searching for each term in the collection, getting back the meta-informational context for that term and then comparing that term’s context with that of all of the other terms that the two facet queries return and computing a ratio, in which related terms have more contextual overlap for the positive than the negative query – so that two terms with similar contexts have high ratios and those with little or no contextual overlap would have low ratio values hovering around 1.0.
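For the curious, the facet plumbing behind the keyword-clustering trick looks roughly like the Python sketch below, which uses Solr's standard facet parameters. The collection name, the tokenized text field, the exact form of the 'positive' and 'negative' queries, and the +1-smoothed ratio are all assumptions made for illustration; the post does not spell out the real implementation:

    import requests

    SOLR = "http://localhost:8983/solr/my_collection/select"  # hypothetical collection
    TEXT_FIELD = "body_tokens"  # hypothetical field holding the tokenized text

    def facet_context(query, limit=50000):
        """Return {term: count} facet counts over the result set of `query`."""
        params = {
            "q": query, "rows": 0, "wt": "json",
            "facet": "true", "facet.field": TEXT_FIELD, "facet.limit": limit,
        }
        resp = requests.get(SOLR, params=params).json()
        flat = resp["facet_counts"]["facet_fields"][TEXT_FIELD]  # [term, count, ...]
        return dict(zip(flat[::2], flat[1::2]))

    def context_ratios(term):
        """Compare each co-occurring term's in-context vs. out-of-context counts."""
        positive = facet_context(f'{TEXT_FIELD}:"{term}"')       # docs containing the term
        negative = facet_context(f'*:* -{TEXT_FIELD}:"{term}"')  # docs not containing it
        return {t: count / (negative.get(t, 0) + 1)              # +1 avoids divide-by-zero
                for t, count in positive.items() if t != term}

    # Terms with the highest ratios share the most meta-informational context.
    related = sorted(context_ratios("stratocaster").items(), key=lambda kv: -kv[1])[:20]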

Paul McCartney and John Lennon are very similar entities in my Music Ontology and two words that are keywords in the same subject area also have very similar contexts in a subject-based “space” – so these two seemingly different tricks appear to be doing the same thing – finding similar things based on the similarity of their meta-informational contexts – courtesy of facets! Ohhhh Kaaaaay … Cool! – I think we’re on to something here!!

The Facet Theorem

So, to boil all of this down to an elevator-speech, single-takeaway slide, I started to think of it as a Theorem in Mathematics – a set of simple, hopefully self-evident assumptions or lemmas that, when combined, give a cool and hopefully surprising result. So here goes.

Lemma 1: Similar things tend to occur in similar contexts

Nice. Kinda obvious and intuitive, and I added the “tend to” part to cover any hopefully rare contrary edge cases, but as this is a statistical thing we are building, that’s OK. Also, I want to start slow with something that seems self-evident to us, like “the shortest distance between two points is a straight line” from Euclidean Geometry.

Lemma 2: Facets are a tool for exploring meta-informational contexts

OK, that is what we have just gone through space and time warp explanations to get to, so let’s put that in as our second axiom.

In laying out a Theorem we now go to the “it therefore follows that”:

Theorem: Facets can be used to find similar things.

Bingo, we have our Theorem and we already have some data points – we used Paul McCartney’s meta-informational context to find John Lennon, and we used facets to find related keywords that are all related to the same subject area (part 2 document clustering blog is coming soon, promise). So it seems to be workable. We may not have a “proof” yet, but we can use this initial evidence to keep digging for one. So lets keep looking for more examples and in particular for examples that don’t seem to fit this model. I will if you will.

Getting to The Why

So this seems to be a good explanation for why all of the crazy but disparate-seeming stuff that I have been doing with facets works. To me, that’s pretty significant, because we all know that when you can explain “why” something is happening in your code, you’ve essentially got it nailed down, conceptually speaking. It also gets us to a point where we can start to see other use cases that will further test the Facet Theorem (remember, a Theorem is not a Proof – but it’s how you need to start to get to one). When I think of some more of them, I’ll let you know. Or maybe some optimizations to my iterative, hard-to-parallelize method.

Facets and UI – Navigation and Visualization

Returning to the synonyms search vendors used for facets – Fast ESP first called these things ‘Navigators’ which Microsoft cleverly renamed to ‘Refiners’. That makes perfect sense for my synthesis – you navigate through some space to get to your goal, or you refine that subspace which represents your search goal – in this case, a set of results. Clean, elegant, it works, I’ll take it. The “goal” though is your final metadata set which may represent some weird docs if your precision sucks – so the space is broken up like a bunch of isolated bubbles. Mathematicians have a word for this – disjointed space. We call it sucky precision. I’ll try to keep these overly technical terms to a minimum from now on, sorry.

As to building way cool interactive dashboards, that is facet magic as well, where you can have lots of cool eye candy in the form of pie charts, bar charts, time-series histograms, scatter plots, tag clouds and the super way cool facet heat maps. One of the very clear advantages of Solr here is that all facet values are computed at query time and are computed wicked fast. Not only that, you can facet on anything, even stuff you didn’t think of when you designed your collection schema, through the magic of facet and function queries and ValueSource extensions. Endeca could do some of this too, but Solr is much better suited for this type of wizardry. This is “surfin’ the meta-informational universe” that is your Solr collection. “Universe” is apt here because you can put literally trillions of docs in Solr, and it also looks like the committers are realizing Trey Grainger’s vision of autoscaling Solr to this order of magnitude, thus saving many intrepid DevOps guys and gals their nights and weekends! (Great talk as usual by our own Shalin Mangar on this one. Definitely a must-see on the Memorex versions of our talks if you didn’t see his excellent presentation live.) Surfin’ the Solr meta-verse rocks, baby!

Facets? Facets? We don’t need no stinkin’ Facets!

To round out my discussion of what my good friend the Search Curmudgeon calls the “Vengines” and their terms for facets, I ended that slide with an obvious reference to everyone’s favorite tag line from the John Huston/Humphrey Bogart classic The Treasure of the Sierra Madre, with the original subject noun replaced with “Facet”. As we all should know by now, Google uses Larry’s page ranking algorithm, also known as Larry Page’s ranking algorithm – to wit, PageRank – which is a crowd-sourcing algorithm that works very well with hyperlinked web pages but is totally useless for anything else. Google’s web search relevance ranking is so good (and continues to improve) that most of the time you just work from the first page, so you don’t need no stinkin’ facets to drill in – you are most often already there, and what’s the difference between one or two page clicks vs one or two navigator clicks?

I threw in Autonomy here because they also touted their relevance as being auto-magical (that’s why their name starts with ‘Auto’) and, to be fair, it definitely is the best feature of that search engine (the configuration layer is tragic). This marketing was especially true before Autonomy acquired Verity, who did have facets, after which it was much more muddled/wishy-washy. One of the first things they did was to create the Fake News that was Verity K2 V7, in which they announced that the APIs would be “pin-for-pin compatible” with K2 V6 but that the core engine would now be IDOL. I now suspect that this hoax was never really possible anyway (nobody could get it to work) because IDOL could not support navigation, aka facet requests – ’cause it didn’t have them anywhere in the index!! Maybe if they had had Yonik … And speaking of relevance, like the now historical Google Search Appliance “Toaster”, relevance that is autonomous as well as locked down within an intellectual property protection safe is hard to tune/customize. Given that what is relevant is highly contextual, this makes closed systems such as Autonomy and GSA unattractive compared to Solr/Lucene.

But it is interesting that the two engines that consider relevance to be their best feature, eschew facets as unnecessary – and they certainly have a point – facets should not be used as a band-aid for poor relevance in my opinion. If you need facets to find what you are looking for, why search in the first place? Just browse.  Yes Virginia, user queries are often vague to begin with and faceted navigation provides an excellent way to refine the search, but sacrificing too much precision for recall will lead to unhappy users.  This is especially true for mobile apps where screen real estate issues preclude extensive use of facets. Just show me what I want to see, please! So sometimes we don’t want no stinkin’ facets but when we do, they can be awesome.

Finale – reprise of The Theorem

So I want to leave you with the take-home message of this rambling, yet hopefully enlightening blog post, by repeating the Facet Theorem I derived here: Facets can be used to find similar things. And the similarity “glue” is one of any good search geek’s favorite words: context. One obvious example that we have always known before, just as Dorothy instinctively knew how to get home from Oz, is in faceted navigation itself – all of the documents that are arrived at by facet queries must share the metadata values that we clicked on – so they must therefore have overlapping meta-informational contexts along our facet clicks’ navigational axes! The more facet clicks we make, the smaller the “space” of remaining document context becomes, and the greater the documents’ similarity! We can now add this to our set of use cases that support the Theorem, along with the new ones I have begun to explore, such as text mining, dynamic typeahead boosting and typeahead security trimming. Along these lines, a dashboard is just a way cooler visualization of this meta-informational context for the current query + facet query(ies) within the global collection meta-verse, with charts and histograms for numeric and date range data and tag clouds for text.
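And to tie the faceted-navigation case back to actual Solr requests: each facet click is just another fq filter, so every document that survives shares all of the clicked metadata values. A minimal illustration (the collection and field names here are hypothetical, not from any real schema):

    import requests

    SOLR = "http://localhost:8983/solr/music/select"  # hypothetical collection

    # Each facet "click" becomes a filter query; the surviving documents all share
    # the clicked values, so their meta-informational contexts overlap more and
    # more with every click.
    params = {
        "q": "guitar", "rows": 10, "wt": "json",
        "facet": "true",
        "facet.field": ["genre_s", "era_s"],      # hypothetical facet fields
        "fq": ["genre_s:rock", "era_s:1960s"],    # two facet clicks so far
    }
    docs = requests.get(SOLR, params=params).json()["response"]["docs"]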

So to conclude, facets are fascinating, don’t you agree? And the possibilities for their use go well beyond navigation and visualization. Now to get the document clustering blog out there – darn day job!!!

The post Why Facets are Even More Fascinating than you Might Have Thought appeared first on Lucidworks.
