Planet Code4Lib - http://planet.code4lib.org

Islandora: Islandora Altmetrics is now Islandora Badges

Mon, 2017-07-17 14:50
Thanks to the efforts of newly minted Islandora 7.x Committer Brandon Weigel, from the British Columbia Electronic Library Network, the Islandora Altmetrics module has received a major overhaul, moving beyond support for Altmetrics to become a more generalized tool for adding various metrics. With this change in function comes a name change: Islandora Badges. Available badges include:
  • Altmetrics: display social media interactions
  • Scopus: Citation counts via the Scopus database
  • Web of Science: Citation counts via Web of Science
  • oaDOI: Provides a link to a fulltext document for objects without a PDF datastream, via the oadoi.org API
The switch from Altmetrics to Badges will take place with the Islandora 7.x-1.10 release (or now, if you're running on HEAD and want the new update).  Making this change took six months of testing and committing, with a total of 161 commits. Special thanks to Jared Whikloj, Jordan Dukart, Diego Pino,  Dan Aitken, and Nick Ruest for their part in helping the process along, and to Will Panting and Donald Moses, who created the original Islandora Altmetrics during the 2015 Islandoracon Hackfest.

Islandora: Islandora Foundation Annual General Meeting - Friday, July 14th

Mon, 2017-07-17 12:40
The Islandora Foundation held its Annual General Meeting on Friday. The agenda and meeting minutes are here. We had a relatively brief AGM, as much of the business of maintaining and improving Islandora is handled by community groups such as our Committers, Interest Groups, Roadmap Committee, and Board of Directors, and in discussions on our listserv, but some highlights from Friday's meeting include:
  • The announcement of the re-election of Mark Jordan as the Chair of Islandora Foundation Board of Directors.
  • A Treasurer's report showing that the Islandora Foundation's financial status is stable, owing to the support of our members and revenue from our events.
  • Reports highlighting the work of the Board of Directors and Roadmap Committee in 2016/2017
  • Updates on the status of Islandora CLAW and the Fedora Specification, and how the two are related.
Thank you to everyone who attended.

Terry Reese: MarcEdit 7 visual styles: High Contrast

Fri, 2017-07-14 15:50

An interesting request made while reviewing the wireframes was whether MarcEdit 7 could support a kind of high contrast, or “Dark”, theme mode.  An example of this would be Office:

Some people find this interface easier on the eyes, especially if you are working on a screen all day. 

Since MarcEdit utilizes its own GUI engine to handle font sizing, scaling, and styling – this seems like a pretty easy request.  So, I did some experimentation.  Here’s MarcEdit 7 using the conventional UI:

And here it is under the “high contrast” theme:

Since theming falls into general accessibility options, I’ve put this in the language section of the options:

However, I should point out that in MarcEdit 7, I will be changing this layout to include a dedicated settings area for Accessibility options, and this theme setting will likely move into that area.

I’m not sure this is an option that I’d personally use, as the “Dark” theme or High Contrast isn’t my cup of tea, but with the new GUI engine added to MarcEdit 7 after the removal of XP support, turning this option on really took only about 5 minutes.

Questions, comments?

–tr

D-Lib: RARD: The Related-Article Recommendation Dataset

Fri, 2017-07-14 11:42
Article by Joeran Beel, Trinity College Dublin, Department of Computer Science, ADAPT Centre, Ireland; Zeljko Carevic and Johann Schaible, GESIS - Leibniz Institute for the Social Sciences, Germany; Gabor Neusch, Corvinus University of Budapest, Department of Information Systems, Hungary

D-Lib: Massive Newspaper Migration - Moving 22 Million Records from CONTENTdm to Solphal

Fri, 2017-07-14 11:42
Article by Alan Witkowski, Anna Neatrour, Jeremy Myntti and Brian McBride, J. Willard Marriott Library, University of Utah

D-Lib: The Best Tool for the Job: Revisiting an Open Source Library Project

Fri, 2017-07-14 11:42
Article by David J. Williams and Kushal Hada, Queens College Libraries, CUNY, Queens, New York

D-Lib: Ensuring and Improving Information Quality for Earth Science Data and Products

Fri, 2017-07-14 11:42
Article by Hampapuram Ramapriyan, Science Systems and Applications, Inc. and NASA Goddard Space Flight Center; Ge Peng, Cooperative Institute for Climate and Satellites-North Carolina, North Carolina State University and NOAA's National Centers for Environmental Information; David Moroni, Jet Propulsion Laboratory, California Institute of Technology; Chung-Lin Shie, NASA Goddard Space Flight Center and University of Maryland, Baltimore County

D-Lib: Trends in Digital Preservation Capacity and Practice: Results from the 2nd Bi-annual National Digital Stewardship Alliance Storage Survey

Fri, 2017-07-14 11:42
Article by Michelle Gallinger, Gallinger Consulting; Jefferson Bailey, Internet Archive; Karen Cariani, WGBH Media Library and Archives; Trevor Owens, Institute of Museum and Library Services; Micah Altman, MIT Libraries

D-Lib: The End of an Era

Fri, 2017-07-14 11:42
Editorial by Laurence Lannom, CNRI

D-Lib: Explorations of a Very-large-screen Digital Library Interface

Fri, 2017-07-14 11:42
Article by Alex Dolski, Independent Consultant; Cory Lampert and Kee Choi, University of Nevada, Las Vegas Libraries

District Dispatch: Library funding bill passes Labor HHS

Thu, 2017-07-13 23:08

In response to today’s House subcommittee vote, ALA President Jim Neal sent ALA members the following update:


Colleagues,

I am pleased to report that, this evening, the House Appropriations subcommittee that deals with library funding (Labor, Health & Human Services, Education and Related Agencies) voted to recommend level funding in FY2018 for the Institute of Museum and Library Services (IMLS, $231 million), likely including $183 million for the Library Services and Technology Act, as well as $27 million for the Innovative Approaches to Literacy program.

Four months ago, President Trump announced that he wanted to eliminate IMLS and federal funding for libraries. Since then, all of us have been communicating with our members of Congress about the value of libraries. This evening’s Subcommittee vote, one important step in the lengthy congressional appropriations process, shows that our elected officials are listening to us and recognize libraries’ importance in the communities they represent. We are grateful to the leaders of the Subcommittee, Chairman Tom Cole (R-OK-4) and Ranking Member Rosa DeLauro (D-CT-3), and all Subcommittee members, for their support.

We have not saved FY18 federal library funding yet. Hurdles can arise at each stage of the appropriations process, which will continue into the fall. But the fact that federal library funding was not cut at this particular stage shows what can be accomplished when ALA members work together. We expect the full House Appropriations Committee to vote on the subcommittee bills as early as next Wednesday, July 19. I will send an update as soon as we have the results of the full committee’s actions.

In the meantime, I encourage you to stay informed and stay involved. Libraries and the millions of people we serve are in a better position today because of your advocacy.

Thank you,

Jim Neal

The post Library funding bill passes Labor HHS appeared first on District Dispatch.

Jonathan Rochkind: on hooking into sufia/hyrax after file has been uploaded

Thu, 2017-07-13 15:53

 

Our app (not yet publicly accessible) is still running on sufia 7.3. (A digital repository framework based on Rails, also known in other versions or other drawings of lines as hydra, samvera, and hyrax).

I had a need to hook into the point after a file has been added to fedora, to do some post-processing at that point.

(Specifically, we are trying to run a riiif instance on another server, without a shared file system (shared FS are expensive and/or tricky on AWS). So, the riiif server needs to copy the original image asset down from fedora. Since our original images are uncompressed TIFFs that average around 100MB, this is somewhat slow, and we want to have the riiif server “pre-load” at least the originals, if not the derivatives it will create. So after a new image is uploaded, we want to ‘ping’ the riiif server with an info request, causing it to download the original, so it’s there waiting for conversion requests, and at least it won’t have to do that. But it can’t pull down the file until it’s in fedora, so we need to wait until after fedora has it to ping. phew.)

Here are the cooperating objects in Sufia 7.3 that lead to actual ingest in Fedora. As far as I can tell. Much thanks to @jcoyne for giving me some pointers as to where to look to start figuring this out.

Keep in mind that I believe “actor” is just hydra/samvera’s name for a service object involved in handling ‘update a persisted thing’. Don’t get it confused with the concurrency notion of an ‘actor’, it’s just an ordinary fairly simple ruby object (although it can and often does queue up an ActiveJob for further processing).

The sufia default actor stack at ActorFactory includes the Sufia::CreateWithFilesActor.

 

  • AttachFilesToWork job does some stuff, but then calls out to a CurationConcerns::Actors::FileSetActor#create_content. (We are using curation_concerns 1.7.7 with sufia 7.3.) — At least if it was a direct file upload (which I think is what this means). If the file was a `CarrierWave::Storage::Fog::File` (not totally sure in what circumstances it would be), it instead kicks off an ImportUrlJob. But we’ll ignore that for now; I think the FileSetActor is the one my codepath is following.
  • We are using hydra-works 0.16.0. AddFileToFileSet I believe actually finishes things off synchronously without calling out to anything else related to ‘get this thing into fedora’. Although I don’t really totally understand what the code does, honestly.
    • It does call out to Hydra::PCDM::AddTypeToFile, which is confusingly defined in a file called add_type.rb, not add_type_to_file.rb. (I’m curious how that doesn’t break things terribly, but didn’t look into it).

 

So in summary, we have six fairly cooperating objects involved in following the code path of “how does a file actually get added to fedora”.  They go across 3-4 different gems (sufia, curation_concerns, hydra-works, and maybe hydra-pcdm, although that one might not be relevant here). Some of the classes involved inherit from, mix-in, or have references to classes from other gems. The path involves at least two (sometimes more in some paths?) bg jobs — a bg job that queues up another bg job (and maybe more).

That’s just trying to follow the path involved in “get this uploaded file into fedora”, some  of those cooperating objects also call out to other cooperating objects (and maybe queue bg jobs?) to do other things, involving a half dozenish additional cooperating objects and maybe one or two more gem dependencies, but I didn’t trace those, this was enough!

I’m not certain how much this changed in hyrax (1.0 or 2.0), at the very least there’d be one fewer gem dependency involved (since Sufia and CurationConcerns were combined into Hyrax). But I kind of ran out of steam for compare and contrast here, although it would be good to prepare for the future with whatever I do.

Oh yeah, what was I trying to do again?

Hook into the point “after the thing has been successfully ingested in fedora” and put some custom code there.

So… I guess…  that would be hooking into the ::IngestFileJob (located in CurationConcerns), and doing something after it’s completed. It might be nice to use the ActiveJob#after_perform hook for this.  I actually hadn’t known about that callback and haven’t used it before — we’d need to get at least the file_set arg passed into it, which the docs say you can maybe get from the passed-in job.arguments?  That’s a weird way to do things in ruby (why don’t ActiveJob instances just hold their arguments as ordinary state? I dunno), but okay! Or, of course, we could just monkey-patch override-and-call-super on perform to get a hook.
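
For what it’s worth, here is a minimal sketch of what that after_perform approach might look like. It is untested; the assumption that the FileSet is the first job argument is mine (check the IngestFileJob#perform signature in your curation_concerns version), and RiiifPreloadJob is a hypothetical app-local job standing in for “ping the riiif server with an info request”:

    # config/initializers/ingest_file_job_hook.rb -- sketch only, untested
    Rails.application.config.to_prepare do
      IngestFileJob.class_eval do
        # ActiveJob callback: runs after perform returns without raising,
        # i.e. (on this code path) after the file content is in fedora.
        after_perform do |job|
          # Assumption: the FileSet is the first argument passed to #perform.
          file_set = job.arguments.first
          # Hypothetical job that issues the IIIF info request so the riiif
          # server pulls down the original ahead of any derivative requests.
          RiiifPreloadJob.perform_later(file_set.id) if file_set.respond_to?(:id)
        end
      end
    end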

Or we could maybe hook into Hydra::Works::AddFileToFileSet instead, I think that does the actual work. There’s no callbacks there, so that’d just be monkey-patch-and-call-super on #call, I guess.
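
And here is an equally hedged sketch of that second option, using a prepended module rather than a raw redefine so super still runs the original. Whether call is a class method or an instance method, and what its arguments are, depends on the hydra-works version, so treat those details as assumptions to verify; RiiifPreloadJob is the same hypothetical job as above:

    # config/initializers/add_file_to_file_set_hook.rb -- sketch only, untested
    module AddFileToFileSetPostIngestHook
      def call(*args, &block)
        result = super
        file_set = args.first # assumption: the FileSet is the first positional argument
        RiiifPreloadJob.perform_later(file_set.id) if file_set.respond_to?(:id)
        result
      end
    end

    # If .call is a class method (as it appears to be in hydra-works 0.16.0),
    # prepend onto the singleton class; if it turns out to be an instance
    # method, prepend onto the class itself instead.
    Hydra::Works::AddFileToFileSet.singleton_class.prepend(AddFileToFileSetPostIngestHook)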

This definitely seems a little bit risky, for a couple different reasons.

  • There’s at least one place where a potentially different path is followed, if you’re uploading a file that ends up as a CarrierWave::Storage::Fog::File instead of a CarrierWave::SanitizedFile.  Maybe there are more I missed? So configuration or behavior changes in the app might cause my hook to be ignored, at least in some cases.

 

  • Forward-compatibility seems unreliable. Will this complicated graph of cooperating instances get refactored?  Has it already in future versions of Hyrax? If it gets refactored, will it mean the object I hook into no longer exists (not even with a different namespace/name), or exists but isn’t called in the same way?  In some of those failure modes, it might be an entirely silent failure where no error is ever raised, my code I’m trying to insert just never gets called. Which is sad. (Sure, one could try to write a spec for this to warn you…  think about how you’d do that. I still am.)  Between IngestFileJob and AddFileToFileSet, is one ‘safer’ to hook into than the other? Hard to say. If I did research in hyrax master branch, it might give me some clues.

I guess I’ll probably still do one of these things, or find another way around it. (A colleague suggested there might be an entirely different place to hook into instead, not the ‘actor stack’, but maybe in other code around the controller’s update action).

What are the lessons?

I don’t mean to cast any aspersions on the people who put in a lot of work, very well-intentioned work, conscientious work, to get hydra/samvera/sufia/hyrax where it is, being used by lots of institutions. I don’t mean to say that I could or would have done differently if I had been there when this code was written — I don’t know that I could or would have.

And, unfortunately, I’m not saying I have much idea of what changes to make to this architecture now, in the present environment, with regards to backwards compat, with regards to the fact that I’m still on code one or two major versions (and a name change) behind current development (which makes the local benefit from any work I put into careful upstream PR’s a lot more delayed, for a lot more work; I’m not alone here, there’s a lot of dispersion in what versions of these shared dependencies people are using, which adds a lot of cost to our shared development).  I don’t really! My brain is pretty tired after investigating what it’s already doing. Trying to make a shared architecture which is easily customizable like this is hard, no ways around it.  (ActiveSupport::Callbacks are trying to do something pretty analogous to the ‘actor stack’, and are one of the most maligned parts of Rails).

But I don’t think that should stop us from some evaluation.  Going forward making architecture that works well for us is aided immensely by understanding what has worked out how in what we’ve done before.

If the point of the “Actor stack” was to make it easy/easier to customize code in a safe/reliable way (meaning reasonably forward-compatible)–and I believe it was–I’m not sure it can be considered a success. We gotta start with acknowledging that.

Is it better than what it replaced?  I’m not sure, I wasn’t there for what it replaced. It’s probably harder to modify in the shared codebase going forward than the presumably simpler thing it replaced though… I can say I’d personally much rather have just one or two methods, or one ActiveJob, that I just hackily monkey-patch to do what I want, which, if it breaks in a future version, will break in a simple way, or at least take less time and brain to figure out what’s going on. That wouldn’t be a great architecture, but I’d prefer it to what’s there now, I think.  Of course, it’s a pendulum, and the grass is always greener; if I had that, I’d probably be wanting something cleaner, and maybe arrive at something like the ‘actor stack’ — but now we’re all here with what we’ve got, so we can at least consider that this may have gone in some unuseful directions.

What are those unuseful directions?  I think, not just in the actor stack, but in many parts of hydra, there’s an ethos that breaking things into many very small single-purpose classes/instances is the way to go, then wiring them all together.  Ideally with lots of dependency injection so you can switch them in and out.  This reminds me of what people often satirize and demonize in stereotypical maligned Java community architecture, and there’s a reason it’s satirized and demonized. It doesn’t… quite work out.

To pull this off well — especially in a shared library/gem codebase, which I think has different considerations than a local bespoke codebase, mainly that API stability is more important because you can’t just search-and-replace all consumers in one codebase when an API changes — you’ve got to have fairly stable APIs, which are also consistent, easily comprehensible, and semantically reasonable.  So you can replace or modify one part, and have some confidence you know what it’s doing, when it will be called, and that it will keep doing this for at least a few months of future versions. To have fairly stable and comfortable APIs, you need to actually design them carefully, and think about developer use cases. How are developers intended to intervene in here to customize? And you’ve got to document those. And those use cases also give you something to evaluate later — did it work for those use cases?

It’s just not borne out by experience that if you make everything into as small single-purpose classes as possible and throw them all together, you’ll get an architecture which is infinitely customizable. You’ve got to think about the big picture. Simplicity matters, but simplicity of the architecture may be more important than simplicity of the individual classes. Simplicity of the API is definitely more important than simplicity of internal non-public implementation. 

When in doubt, if you’re not sure you’ve got a solid, stable, comfortable API, fewer cooperating classes with clearly defined interfaces may be preferable to more classes that each have only a few lines. In this regard, rubocop-based development may steer us wrong: too much attention to the micro, not enough to the forest.

To do this, you’ve got to be careful, and intentional, and think things through, and consider developer use cases, and maybe go slower and support fewer use cases.  Or you wind up with an architecture that not only does not easily support customization, but is very hard to change or improve. Cause there are so many interrelated coupled cooperating parts, and changing any of them requires changes to lots of them, and breaks lots of dependent code in local apps in the process. You can actually make forwards-compatible-safe code harder, not easier.

And this gets even worse when the cooperating objects in a data flow are spread across multiple gem dependencies, as they often are in the hydra/samvera stack. If a change in one requires a change in another, now you’ve got dependency compatibility nightmares to deal with too. Making it even harder (rather than easier, as was the original goal) for existing users to upgrade to new versions of dependencies, as well as harder to maintain all these dependencies.  It’s a nice idea, small dependencies which can work together — but again, it only works if they have very stable and comfortable APIs.  Which again requires care and consideration of developer use cases. (Just as the Java community gives us a familiar cautionary lesson about over-architecture, I think the Javascript community gives us a familiar cautionary lesson about ‘dependency hell’. The path to code hell is often paved with good intentions).

The ‘actor stack’ is not the only place in hydra/samvera that suffers from some of these challenges, as I think most developers in the stack know.  It’s been suggested to me that one reason there’s been a lack of careful, considered, intentional architecture in the stack is because of pressure from institutions and managers to get things done, why are you spending so much time without new features?  (I know from personal experience this pressure, despite the best intentions, can be even stronger when working as a project-based contractor, and much of the stack was written by those in that circumstance).

If that’s true, that may be something that has to change. Either a change to those pressures — or resisting them by not doing rearchitectures under those conditions. If you don’t have time to do it carefully, it may be better not to commit the architectural change and new API at all.  Hack in what you need in your local app with monkey-patches or other local code instead. Counter-intuitively, this may not actually increase your maintenance burden or decrease your forward-compatibility!  Because the wrong architecture or the wrong abstractions can be much more costly than a simple hack, especially when put in a shared codebase. Once a few people have hacked it locally and seen how well it works for their use cases, you have a lot more evidence to abstract the right architecture from.

But it’s still hard!  Making a shared codebase that does powerful things, that works out of the box for basic use cases but is still customizable for common use cases, is hard. It’s not just us. I worked last year with spree/solidus, which has an analogous architectural position to hydra/samvera, also based on Rails, but in ecommerce instead of digital repositories. And it suffers from many of the same sorts of problems, even leading to the spree/solidus fork, where the solidus team thought they could do better… and they have… maybe… a little.  Heck, the challenges and setbacks of Rails itself can be considered similarly.

Taking account of this challenge may mean scaling back our aspirations a bit, and going slower.   It may not be realistic to think you can be all things to all people. It may not be realistic to think you can make something that can be customized safely by experienced developers and by non-developers just writing config files (that last one is a lot harder).  Every use case a participant or would-be participant has may not be able to be officially or comfortably supported by the codebase. Use cases and goals have to be identified, lines have to drawn. Which means there has to be a decision-making process for who and how they are drawn, re-drawn, and adjudicated too, whether that’s a single “benevolent dictator” person or institution like many open source projects have (for good or ill), or something else. (And it’s still hard to do that, it’s just that there’s no way around it).

And finally, a particularly touchy evaluation for the hydra/samvera project: the hydra project is 5-7 years old, long enough to evaluate some basic premises. I’m talking about the twin closely related requirements which have been more or less assumed by the community for most of the project’s history:

1) That the stack has to be based on fedora/fcrepo, and

2) that the stack has to be based on native RDF/linked data, or even coupled to RDF/linked data at all.

I believe these were uncontroversial assumptions rather than entirely conscious decisions (edit 13 July, this may not be accurate and is a controversial thing to suggest among some who were around then. See also @barmintor’s response.), but I think it’s time to look back and wonder how well they’ve served us, and I’m not sure it’s well.  A flexible powerful out-of-the-box-app shared codebase is hard no matter what, and the RDF/fedora assumptions/requirements have made it a lot harder, with a lot more uncharted territory to traverse, best practices to be invented with little experience to go on, more challenging abstractions, less mature/reliable/performant components to work with.

I think a lot of the challenges and breakdowns of the stack are attributable to those basic requirements — I’m again really not blaming a lack of skill or competence of the developers (and certainly not to lack of good intentions!). Looking at the ‘actor stack’ in particular, it would need to do much simpler things if it was an ordinary ActiveRecord app with paperclip (or better yet shrine), it would be able to lean harder on mature shared-upstream paperclip/shrine to do common file handling operations, it would have a lot less code in it, and less code is always easier to architect and improve than more code. And meanwhile, the actually realized business/institutional/user benefits of these commitments — now after several+ years of work put into it — are still unclear to me.  If this is true or becomes consensual, and an evaluation of the fedora/rdf commitments and foundation does not look kindly upon them… where does that leave us, with what options?


Filed under: General

FOSS4Lib Upcoming Events: VuFind Summit 2017

Thu, 2017-07-13 14:50
Date: Monday, October 9, 2017 - 08:00 to Tuesday, October 10, 2017 - 17:00
Supports: VuFind

Last updated July 13, 2017. Created by Peter Murray on July 13, 2017.

VuFind Summit 2017 conference page

Open Knowledge Foundation: FutureTDM symposium: sharing project findings, policy guidelines and practitioner recommendations

Thu, 2017-07-13 14:16

The FutureTDM project, in which Open Knowledge International participates, actively engages with stakeholders in the EU such as researchers, developers, publishers and SMEs to help improve the uptake of text and data mining (TDM) in Europe (read more). Last month, we held our FutureTDM Symposium at the International Data Science Conference 2017 in Salzburg, Austria. With the project drawing to a close, we shared the project findings and our first expert driven policy recommendations and practitioner guidelines. This blog report has been adapted from the original version on the FutureTDM blog.

The FutureTDM track at the International Data Science Conference 2017 started with a speech by Bernhard Jäger from SYNYO, who gave a brief introduction to the project and explained the purpose of the Symposium – bringing together policy makers and stakeholder groups to share with them FutureTDM’s findings on how to increase TDM uptake.

Introduction to the FutureTDM project from FutureTDM

This was followed by a keynote speech on the Economic Potential of Data Analytics by Jan Strycharz from Fundacja Projekt Polska, a FutureTDM project partner. It was estimated that automated (big) data and analytics – if developed properly – will bring over 200 billion euro to the European GDP by 2020. This means that algorithms (not to say robots) will then be responsible for 1.9% of the European GDP. You can read more on the TDM impact on the economy in our report Trend analysis, future applications and economics of TDM.

Dealing with the legal bumps

The plenary session with keynote speeches was followed by the panel Data Analytics and the Legal Landscape: Intellectual Property and Data Protection. As an introduction to this legal session, Freyja van den Boom from Open Knowledge International presented our findings on the legal barriers to TDM uptake, which mainly relate to the type of content and the applicable regime (IP or data protection). Having gathered evidence from the TDM community, FutureTDM has identified three types of barriers – uncertainty, fragmentation and restrictiveness – and developed guidelines and recommendations on how to overcome them. We have summarised this in our awareness sheet Legal Barriers and Recommendations.

This was followed by statements from the panelists. Prodromos Tsiavos (Onassis Cultural Centre / IP Advisor) stressed the fact that with the recent changes in the European framework, the law faces significant issues and balancing industrial interests is becoming necessary. He added that in order to initiate uptake by industry, a different approach is certainly needed, because the industry will continue with license arrangements.

Duncan Campbell (John Wiley & Sons, Inc.) concentrated on copyright and IP issues. How do we deal with all the knowledge created? What influence does copyright have? He spoke about the EU Commission proposal and the UK TDM exception – how do we make an exception work?

Marie Timmermann (Science Europe) also focused on the TDM exception and its positive and negative sides. On the positive side, she noted that the TDM exception has moved from being optional to mandatory and cannot be overridden. On the negative side, she stated that the exception is very limited in scope: startups and SMEs do not fall under it. Thus, Europe risks losing promising researchers to other parts of the world.

This statement was also supported by Romy Sigl (AustrianStartups). She confirmed that anybody can create a startup today, but if startups are not supported by legislation, they move to another country where more potential is foreseen.

The right to read is the right to mine

The next panel was devoted to an overview of FutureTDM case studies: Startups to Multinationals. Freyja van den Boom (OKI) gave an overview of the highlights of the stakeholder consultations, which cover different areas and stakeholder groups within the TDM domain. Peter Murray-Rust (ContentMine) presented a researcher’s view and stressed that the right to read is the right to mine, but we have no legal certainty about what a researcher is and is not allowed to do.

Petr Knoth from CORE added that he believed we needed data infrastructure to support TDM. Data scientists are very busy cleaning the data and have little time to do the real mining. He added that the infrastructure should not be operated by the publishers, but they should provide support.

Donat Agosti from PLAZI focused on how you can make the data accessible so that everybody can use it. He mentioned the case of the PLAZI repository, TreatmentBank, which is open, extracts each article, and creates citable data. Once you have the data you can disseminate it.

Kim Nilsson from PIVIGO spoke about support for academics – they have already worked with 70 companies and provided TDM support for 400 PhD academics. She mentioned how important data analytics, and the ability to see all the connections and correlations, is for the medical sector, for example. She stressed that data analytics is also extremely important for startups – gaining access is critical for them.

Data science is the new IT

The next panel was devoted to Universities, TDM and the need for strategic thinking on educating researchers. FutureTDM project officer Kiera McNeice (British Library) gave an overview of the skills and education barriers to TDM. She stressed that many people say they need quite a lot of knowledge to use TDM, and that there is a skills gap between academia and industry. The barriers to entry are also still high, because using TDM tools often requires programming knowledge.

We have put together a series of guidelines to help stakeholders overcome the barriers we have identified. Our policy guidelines include encouraging universities to support TDM through both their research and education arms, for example by helping university senior management understand the needs of researchers around TDM and the potential benefits of supporting it. You can read more in our Baseline report of policies and barriers of TDM in Europe, or walk through them via our Knowledge Base.

Kim Nilsson from PIVIGO stressed that the main challenge is software skills. The fact is that if you can do TDM you have fantastic options: startups, healthcare, charity. Their task is to offer proper career advice, help people understand what kinds of skills are valued, and assist them in building on them.

Claire Sewell (Cambridge University Library) elaborated on skills from the perspective of an academic librarian: what is important is a basic understanding of copyright law and keeping up with technical and data skills. “We want to make sure that if a researcher comes into the library we are able to help him,” she concluded.

Jonas Holm from Stockholm University Library highlighted the fact that very little strategic thinking is going on in the TDM area. “We have struggled to find much strategic thinking in the TDM area. Who is strategically looking at improving the uptake at the universities? We couldn’t find much around Europe,” he said.

Stefan Kasberger (ContentMine) stressed that the social part of the education is also important – meaning inclusion and diversity.

Infrastructure for Technology Implementation

The last session was dedicated to technologies and infrastructures supporting Text and Data Analytics: challenges and solutions. FutureTDM Project Officer Maria Eskevich (Radboud University) delivered a presentation on the TDM landscape with respect to infrastructure for technical implementation.

Stelios Piperidis from OpenMinTed stressed the need for an infrastructure. “Following more on what we have discussed, it looks that TDM infrastructure has to respond to 3 key questions: How can I get hold on the data that I need? How can I find the tool to mine the data? How can I deploy the work carried out?”

Mihai Lupu from Data Market Austria brought up the issue of data formats: for example, there is a lot of data in CSV files that people don’t know how to deal with.

Maria Gavrilidou (clarin:el) highlighted the fact that not only are the formats a problem, but so are identifying the source of the data and putting lawful procedures in place with respect to that data. Metadata is also problematic because it very often does not exist.

Nelson Silva (know-centre) focused on using proper tools for mining the data. Very often there is no particular tool that meets your needs, and you have to either develop one or search for open source tools. Another challenge is the quality of the data: how much can you rely on the data, and how do you visualise it? And finally, how can you be sure that people will take away the right message?

Roadmap

The closing session was conducted by Kiera McNeice (British Library), who presented A Roadmap to promoting greater uptake of Data Analytics in Europe.  Finally, we also had a Demo Session with flash presentations by:

  • Stefan Kasberger (ContentMine),
  • Donat Agosti (PLAZI),
  • Petr Knoth (CORE),
  • John Thompson-Ralf Klinkenberg (Rapidminer),
  • Maria Gavrilidou (clarin:el),
  • Alessio Palmero Aprosio (ALCIDE)

You can find all FutureTDM reports in our Knowledge Library, or visit our Knowledge Base: a structured collection of resources on Text and Data Mining (TDM) that has been gathered throughout the FutureTDM project.

 

OCLC Dev Network: DEVCONNECT 2017: Dashboards and Artificial Intelligence in Libraries

Thu, 2017-07-13 13:00

Wayne State University Libraries has been working on several projects aimed at better understanding library material usage using OCLC products and APIs. 

Terry Reese: MarcEdit 7: MARC Tools Wireframe

Thu, 2017-07-13 06:03

The changes aren’t big – they are really designed to make the form a little more compact and add common topics to the screen.  The big changes are related to integrations.  In MarcEdit 6.x, when you run across an error, you have to open the validator, pick the correct validation option, etc.  This won’t be the case any longer.  When the tool determines that the problem may be related to the record structure – it will just offer you the option to check for errors in your file…no opening the validator, no picking options.  This should make it easier to get immediate feedback regarding any structural processing errors that the tool may run up against.

MARC Tools Window Wireframe #1:

The second wireframe collapses the list into an autocomplete/autosuggest option, moves data around and demonstrates some of the potential integration options.  I like this one as well – though I’m not sure if having the items in a dropdown list with autocomplete would be more difficult to use than the current dropdown list.  I also used this as an opportunity to get rid of the Input File and Output File labels.  I’m not sure these are always necessary, and I honestly hate seeing them.  But I know that iconography maybe isn’t the best way to express meaning.  I think attaching tooltips to each button and textbox might allow me to finally let these labels go.

MARC Tools Wireframe #2:

*Update*

Based on feedback, it sounds like the labels are still desired.  So here is wireframe #3 with a slight modification to allow for labels in the window.

MARC Tools Wireframe #3:

–tr

District Dispatch: Breaking #SaveIMLS news from ALA Pres. Jim Neal

Wed, 2017-07-12 23:25

Today, ALA President Jim Neal sent this #SaveIMLS campaign update to the ALA membership regarding tomorrow’s vote on part of the House Appropriations Committee bill:

Colleagues,

I’m pleased, but with important cautions, to tell you that all of our collective work to Fight for Libraries! is poised to pay off dramatically. Key parts of the House Appropriations Subcommittee bill that is scheduled to be voted on tomorrow afternoon at 4:30 EDT were released late this afternoon. The bill does NOT cut last year’s funding to the Institute of Museum and Library Services. Once final, that would mean no cuts to LSTA in this critical first vote stage! (We’ll know about Innovative Approaches to Literacy tomorrow.)

Now the cautions… While unlikely, an amendment could be offered to the bill that changes the IMLS appropriation. In addition, after tomorrow’s vote, there will be at least two further procedural opportunities for the bill to be amended. After the House acts, of course, the Senate will take its turn, though probably not for some months.

As the phrase goes, therefore, this is definitely NOT over until it’s over. We will report immediately on the results of tomorrow’s Subcommittee vote. I hope very much that the next thing for us to do will be to thank our House supporters.

Until early evening tomorrow in Washington, please join me in crossing your fingers.

Sincerely
Jim Neal
ALA President

Further updates will be sent out as we get them. In the meantime, if you are a constituent of a Representative working on the Labor HHS subcommittee, give them a call!

The post Breaking #SaveIMLS news from ALA Pres. Jim Neal appeared first on District Dispatch.

David Rosenthal: Is Decentralized Storage Sustainable?

Wed, 2017-07-12 19:49
There are many reasons to dislike centralized storage services. They include business risk, as we see in le petit musée des projets Google abandonnés, monoculture vulnerability and rent extraction. There is thus naturally a lot of enthusiasm for decentralized storage systems, such as MaidSafe, DAT and IPFS. In 2013 I wrote about one of their advantages in Moving vs. Copying. Among the enthusiasts is Lambert Heller. Since I posted Blockchain as the Infrastructure for Science, Heller and I have been talking past each other. Heller is talking technology; I have some problems with the technology but they aren't that important. My main problem is an economic one that applies to decentralized storage irrespective of the details of the technology.

Below the fold is an attempt to clarify my argument. It is a re-statement of part of the argument in my 2014 post Economies of Scale in Peer-to-Peer Networks, specifically in the context of decentralized storage networks.

To make my argument I use a model of decentralized storage that abstracts away the details of the technology. The goal is a network with a large number of peers each providing storage services. This network is:
  • decentralized in the sense that no single entity, or small group of entities, controls the network (the peers are independently owned and operated), and
  • sustainable, in that the peers do not lose financially by providing storage services to the network.
I argue that this network is economically unstable and will, over time, become centralized. This argument is based on work from the 80s by the economist W. Brian Arthur [1].

Let us start by supposing that such a decentralized storage network has, by magic, been created:
  • It consists of a large number of peers, initially all providing the same amount of storage resource to the network.
  • Users submit data to be stored to the network, not to individual peers. The network uses erasure coding to divide the data into shards and peers store shards.
  • Each peer incurs costs to supply this resource, in the form of hardware, bandwidth, power, cooling, space and staff time.
  • The network has no central organization which could contract with the peers to supply their resource. Instead, it rewards the peers in proportion to the resource they supply by a token, such as a crypto-currency, that the peers can convert into cash to cover their costs.
  • The users of the network rent space in the network by buying tokens for cash on an exchange, setting a market price at which peers can sell their tokens for cash. This market price sets the $/TB/month rent that users must pay, and that peers receive as income. It also ensures that users do not know which peers store their data.
Although the income each peer receives per unit of storage is the same, as set by the market, their costs differ. One might be in Silicon Valley, where space, power and staff time are expensive. Another might be in China, where all these inputs are cheap. So providing resources to the network is more profitable in China than in Silicon Valley.

Suppose the demand for storage is increasing. That demand will preferentially be supplied from China, where the capital invested in adding capacity can earn a greater reward. Thus peers in China will add capacity faster than those in Silicon Valley and will enjoy not merely a lower cost base because of location, but also a lower cost base from economies of scale. This will increase the cost differential driving the peers to China, and create a feedback process.

Competition among the peers and decreasing hardware costs will drive down the  $/TB/month rent to levels that are uneconomic for Silicon Valley peers, concentrating the storage resource in China (as we see with Bitcoin miners).

Let's assume that all the peers in China share the same low cost base. But some will have responded to the increase in demand before others. They will have better economies of scale than the laggards, so they will in turn grow at the laggards' expense. Growth may be by increasing the capacity of existing peers, or by adding peers controlled by the entity with the economies of scale.

The result of this process is a network in which the aggregate storage resource is overwhelmingly controlled by a small number of entities, controlling large numbers of large peers in China. These are the ones which started with a cost base advantage and moved quickly to respond to demand. The network is no longer decentralized, and will suffer from the problems of centralized storage outlined above.
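
To make the feedback concrete, here is a toy simulation (the numbers are arbitrary and of my own choosing; only the qualitative behaviour matters) in which each period's new demand is captured in proportion to each peer's current profit, and modest economies of scale accrue to whoever grows:

    # Toy model: capacity concentrates where margins are highest.
    rent = 10.0   # $/TB/month the market currently pays
    peers = [
      { name: "Silicon Valley",      cost: 9.0, capacity: 100.0 },
      { name: "China (laggard)",     cost: 4.0, capacity: 100.0 },
      { name: "China (early mover)", cost: 3.0, capacity: 100.0 },
    ]

    20.times do
      margins = peers.map { |p| [rent - p[:cost], 0].max * p[:capacity] }
      total   = margins.sum
      break if total.zero?
      peers.each_with_index do |p, i|
        share = margins[i] / total            # share of new demand captured
        p[:capacity] += 60.0 * share          # 60 units of new demand per period
        p[:cost]     *= (1.0 - 0.05 * share)  # economies of scale accrue to whoever grows
      end
      rent *= 0.95                            # competition and cheaper hardware push rent down
    end

    total_capacity = peers.sum { |p| p[:capacity] }
    peers.each { |p| puts format("%-22s %5.1f%% of capacity", p[:name], 100 * p[:capacity] / total_capacity) }

After a couple of dozen periods essentially all of the added capacity has gone to the peers that started with the cost advantage and moved first; the early mover's share keeps compounding.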

This should not be a surprise. We see the same winner-take-all behavior in most technology markets. We see this behavior in the Bitcoin network.

I believe it is up to the enthusiasts to explain why this model does not apply to their favorite decentralized storage technology, and thus why it won't become centralized. Or, alternatively, why they aren't worried that their decentralized storage network isn't actually decentralized after all.

References:

  1. Arthur, W. Brian. Competing technologies and lock-in by historical small events: the dynamics of allocation under increasing returns. Center for Economic Policy Research, Stanford University, 1985. in Arthur, W. Brian. Increasing Returns and Path Dependence in the Economy, Michigan University Press, 1994.

Jodi Schneider: QOTD: Working out scientific insights on paper, Lavoisier case study

Wed, 2017-07-12 19:04

…language does do much of our thinking for us, even in the sciences, and rather than being an unfortunate contamination, its influence has been productive historically, helping individual thinkers generate concepts and theories that can then be put to the test. The case made here for the constitutive power of figures [of speech] per se supports the general point made by F.L. Holmes in a lecture addressed to the History of Science Society in 1987. A distinguished historian of medicine and chemistry, Holmes based his study of Antoine Lavoisier on the French chemist’s laboratory notebooks. He later examined drafts of Lavoisier’s published papers and discovered that Lavoisier wrote many versions of his papers and in the course of careful revisions gradually worked out the positions he eventually made public (Holmes, 221). Holmes, whose goal as a historian is to reconstruct the careful pathways and fine structure of scientific insights, concluded from his study of Lavoisier’s drafts

We cannot always tell whether a thought that led him to modify a passage, recast an argument, or develop an alternative interpretation occurred while he was still engaged in writing what he subsequently altered, or immediately afterward, or after some interval during which he occupied himself with something else; but the timing is, I believe, less significant than the fact that the new developments were consequences of the effort to express ideas and marshall supporting information on paper (225).

– page xi of Rhetorical Figures in Science by Jeanne Fahnestock, Oxford University Press, 1999.

She is quoting Frederich L. Holmes. 1987. Scientific writing and scientific discovery. Isis 78:220-235. DOI:10.1086/354391

As Moore summarizes,

Lavoisier wrote at least six drafts of the paper over a period of at least six months. However, his theory of respiration did not appear until the fifth draft. Clearly, Lavoisier’s writing helped him refine and understand his ideas.

Moore, Randy. Language—A Force that Shapes Science. Journal of College Science Teaching 28.6 (1999): 366. http://www.jstor.org/stable/42990615 (which I quoted in a review I wrote recently)

Fahnestock adds:
“…Holmes’s general point [is that] there are subtle interactions ‘between writing, thought, and operations in creative scientific activity’ (226).”
