
Mark E. Phillips: Metadata Quality Interfaces: Facet Dashboard

planet code4lib - Thu, 2017-08-31 14:12

This is the second post in a series that discusses the new metadata interfaces we have been developing for the UNT Libraries’ Digital Collections metadata editing environment. The previous post was related to the item views that we have created.

This post discusses our facet dashboard in a bit of depth.  Let’s get started.

Facet Dashboard

A little bit of background is in order so that you can better understand the data that we are working with in our metadata system.  The UNT Libraries uses a locally-extended Dublin Core metadata element set. In addition to locally extending the elements to include things like collection, partner, degree, citation, note, and meta fields, we also qualify many of the fields. A qualifier usually specifies what type of value is represented.  So a subject could be a Keyword, or an LCSH value. A Creator could be an author, or a photographer.  Many of the fields can have one qualifier per value.

When we index records in our Solr instance we store strings for each of these elements, and for each of the elements plus qualifiers, so we have fields we can facet on.  This results in facet fields for creator as well as, specifically, creator_author or creator_photographer.  For fields where we expect a qualifier, we also capture when there isn't one, in a field like creator_none.  This results in many hundreds of fields in our Solr index, but we do this for a good reason: to be able to get at the data in ways that are helpful for metadata maintainers.
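
To make this concrete, here is a minimal Python sketch of the indexing idea (the field names and qualifier values are illustrative, not our actual Solr schema):

# Illustrative only: one record's creator entries fan out into facetable
# Solr string fields (creator, creator_<qualifier>, and creator_none).
record = {
    "creator": [
        {"qualifier": "author", "value": "Smith, Jane"},
        {"qualifier": None, "value": "D & H Photo"},   # no qualifier present
    ]
}

solr_doc = {}
for entry in record["creator"]:
    # element-level facet field
    solr_doc.setdefault("creator_facet", []).append(entry["value"])
    # element-plus-qualifier facet field, or the _none field when unqualified
    qualifier = entry["qualifier"] or "none"
    solr_doc.setdefault("creator_" + qualifier + "_facet", []).append(entry["value"])

# solr_doc now has creator_facet, creator_author_facet, and creator_none_facet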

The first view we created around this data was our facet dashboard.  The image below shows what you get when you go to this view.

Default Facet Dashboard

On the left side of the screen you are presented with facets that you can make use of to limit and refine the information you are interested in viewing.  I’m currently looking at all of the records from all partners and all collections.  This is a bit over 1.8 million records.

The next step is to decide which field you are interested in seeing the facet values for.  In this case I am choosing the Creator field.

Selecting a field to view facet values

After you make a selection you are presented with a paginated view of all of the creator values in the dataset (289,440 unique values in this case). These are sorted alphabetically so the first values are the ones that generally start with punctuation.

In addition to the string value you are presented the number of records in the system that have that given value.

All Creator Values

Because there can be many many pages of results sometimes it is helpful to jump directly to a subset of the records.  This can be accomplished with a “Begins With” dropdown in the left menu.  I’m choosing to look at only facets that start with the letter D.

Limit to a specific letter

After making a selection you are presented with the facets that start with the letter D instead of the whole set.  This makes it a bit easier to target just the values you are looking for.

Creator Values Starting with D
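
Under the hood, this kind of "Begins With" limit maps naturally onto Solr's facet.prefix parameter. Here is a hedged Python sketch of the sort of query involved (the host, core name, and field name are made up for illustration):

import requests

# Illustrative only: host, core name, and field name are assumptions.
params = {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.field": "creator_facet",
    "facet.prefix": "D",       # the "Begins With" limit
    "facet.sort": "index",     # alphabetical, rather than by count
    "facet.limit": 100,
    "facet.offset": 0,         # advance this for pagination
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/metadata/select", params=params).json()
flat = resp["facet_counts"]["facet_fields"]["creator_facet"]
# Solr returns a flat [value, count, value, count, ...] list; pair them up.
values_with_counts = list(zip(flat[::2], flat[1::2]))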

Sometimes when you are looking at the facet values you are trying to identify values that fall next to each other but that might differ only a little bit. One of the things that can make this a bit easier is having a button that can highlight just the whitespace in the strings themselves.

Highlight Whitespace Button

Once you click this button you see that the whitespace is now highlighted in green.  This highlighting, in combination with a monospace font, makes it easier to see when values differ only in the amount of whitespace.

Highlighted Whitespace
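
The underlying idea is simple enough to sketch. In the real interface it is done client-side by wrapping whitespace runs in highlighted spans, but a rough Python illustration of why it helps:

import re

def show_whitespace(value):
    # Replace each whitespace character with a visible middle dot so that
    # values differing only in spacing stop looking identical.
    return re.sub(r"\s", "\u00b7", value)

print(show_whitespace("D & H Photo"))    # D·&·H·Photo
print(show_whitespace("D &  H Photo"))   # D·&··H·Photo  (extra space now obvious)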

Once you have identified a value that you want to change the next thing to do is just click on the link for that facet value.

Identified Value to Correct

You are taken to a new tab in your browser that has just the records that have the selected value.  In this case there was just one record with “D & H Photo” that we wanted to edit.

Record with Identified Value

We have a convenient highlighting of visited rows on the facet dashboard so you know which values you have clicked on.

Highlighted Reminder of Selected Value

In addition to just seeing all of the values for the creator field you can also limit your view to a specific qualifier by selecting the qualifier dropdown when it is available.

Select an Optional Qualifier

You can also look at items that don’t have a given value, for example Creator values that don’t have a name type designated.  This is identified with a qualifier value of none-type.

Creator Values Without a Designated Type

You get just the 900+ values in the system that don’t have a name type designated.

All of this can be performed on any of the elements or any of the qualified elements of the metadata records.

While this is a useful first step in getting metadata editors directly to both the values of fields and their counts in the form of facets, it can be improved upon.  This view still requires users to scan a long long list of items to try and identify values that should be collapsed because they are just different ways of expressing the same thing with differences in spacing or punctuation. It is only possible to identify these values if they are located near each other alphabetically.  This can be a problem if you have a field like a name field that can have inverted or non-inverted strings for names.  So there is room for improvement of these interfaces for our users.

Our next interface to talk about is our Count Dashboard.  But that will be in another post.

If you have questions or comments about this post,  please let me know via Twitter.


Ed Summers: Analyzing Retweets

planet code4lib - Thu, 2017-08-31 04:00

Yesterday I got into conversation with Ben Nimmo and Brian Krebs, who were the subjects of an intense botnet attack on Twitter. They were gaining large numbers of followers in a short period of time, and a selection of their tweets were being artificially boosted by up to 80,000 retweets. You can read Brian's detailed writeup here.

At first it seemed completely counter-intuitive to me that someone would direct their botnet (which in all likelihood they are paying for) to boost the followers and messages of people they disagree with. But I wasn’t thinking deviously enough. As Brian points out in his post, Twitter appear to have stepped up suspending botnet accounts and the beneficiaries of the botnet traffic. So boosting a user you don’t like could get them suspended. In addition, as Ben wrote about a few days ago, it is also an intimidation tactic that disrupts the target’s use of Twitter.

Our specific conversation was about how to analyze the retweets, since there are tens of thousands of them and Twitter's statuses/retweets API endpoint is limited to fetching the last 100 retweets. However, it is possible to use the search/tweets endpoint to search for the retweets using the text of the tweet, as long as the retweets were sent in the last 7 days, which is the furthest back Twitter allows you to search. So there is a brief window in which you can fetch the retweets.

If you, like Ben and Brian, find yourself needing to collect retweets, I thought I would document the process a little bit here. The basic approach should work with different Twitter clients if you prefer to work in another language—I used twarc because I'm familiar with it, and it handles rate limiting easily. I also worked from the command line to explain the process at a higher level. You could certainly write a small program to do this.
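
For example, here is a rough sketch of such a program using twarc as a Python library rather than from the command line (this assumes twarc 1.x; the search text and tweet id are the same ones used in the command-line walkthrough below):

from twarc import Twarc

# Fill these in with the keys for your own app from apps.twitter.com.
t = Twarc(consumer_key="...", consumer_secret="...",
          access_token="...", access_token_secret="...")

subject_id = "902545914304319491"   # the tweet whose retweets we want
retweets = []
for tweet in t.search("Bring on the bots and sock puppet accounts Amazing how a tweet about Putin"):
    # Keep only genuine retweets of the subject tweet, dropping false positives.
    if tweet.get("retweeted_status", {}).get("id_str") == subject_id:
        retweets.append(tweet)

print(len(retweets), "retweets collected")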

So, I wanted to get the retweets for a tweet from Brian that generated a great deal of rapid retweet traffic that appeared to him to be bot driven:

Bring on the bots and sock puppet accounts. Amazing how a tweet about Putin always engenders defensive responses about Trump.

— briankrebs ((???)) August 29, 2017

First you’ll need to install twarc for interacting with the Twitter API from the command line. If you don’t have Python yet you’ll need to go get that first.

pip install twarc

Now you are ready to tell twarc about your Twitter API keys. Go over to apps.twitter.com, create an app and note the keys down so you can tell twarc about them with:

twarc configure

With twarc and your twitter keys in hand you are ready to collect the tweets using twarc’s search command. To run a search you need a query. In this case we’re going to use some identifying text from the tweet in question. The results are line-oriented-json, where every line is a complete JSON document for a tweet. The JSON is exactly what is returned from the Twitter API for a tweet.

twarc search 'Bring on the bots and sock puppet accounts Amazing how a tweet about Putin' > briankrebs.jsonl

This command could run for a while depending on how many retweets there are to get.  Twitter only allow you to get 17,000 every 15 minutes; twarc will handle waiting until it can go get more.  You can see a file twarc.log which contains information about what it is doing.

Once that finishes you probably want to be absolutely sure the file only includes retweets of that specific tweet. It’s possible that your search generated some false positives if the words happened to be used in tweets that were not retweets of your subject tweet. One handy way of doing this is to use jq to filter them using the tweet id of the original tweet:

jq -cr 'select(.retweeted_status.id_str == "902545914304319491")' briankrebs.jsonl > briankrebs-filtered.jsonl

Now that you have the JSON for the retweets you can do analysis of the users by creating a CSV file of information about them. For example I was interested in looking at the followers, friends and tweets counts, as well as when the account was created and the user's preferred language. All of this user profile information can be found in the JSON you get for each tweet, or in this case, retweet. For the full details check out the Tweet Field Guide from Twitter. jq is also pretty good at extracting bits of the JSON and writing it as CSV:

jq -r '[.user.screen_name, .user.followers_count, .user.friends_count, .user.statuses_count, .user.created_at, .user.lang] | @csv' briankrebs-filtered.jsonl > briankrebs.csv

Here is the file I generated. You should be able to open that in your spreadsheet software of choice and look for patterns.
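
If you'd rather poke at the data programmatically than in a spreadsheet, a small pandas sketch works too (note the CSV produced by the jq command above has no header row, so the column names below are supplied by hand):

import pandas as pd

cols = ["screen_name", "followers_count", "friends_count",
        "statuses_count", "created_at", "lang"]
df = pd.read_csv("briankrebs.csv", header=None, names=cols)

# Twitter's created_at format, e.g. "Tue Aug 29 14:12:00 +0000 2017"
df["created_at"] = pd.to_datetime(df["created_at"], format="%a %b %d %H:%M:%S %z %Y")

# A couple of patterns worth eyeballing: bursts of account creation dates and
# heavily skewed language settings are both worth a closer look.
print(df["created_at"].dt.date.value_counts().head(10))
print(df["lang"].value_counts().head(10))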

One other thing I was interested in doing was seeing what connections there might be between the retweeters of Brian and the retweeters of this tweet by Ben, which he thought got artificially boosted as well:

Hey, (???) and (???), have you seen what happens when you tweet about a bot attack on (???)? pic.twitter.com/GDQwBSTRRY

— Ben Nimmo ((???)) August 29, 2017

So I went through the exact same process to generate a file of user information for the retweeters of that tweet. With that file in hand I just needed to see which users were present in both. One nice little trick for doing the join is to use csvkit's csvsql:

csvsql --query "SELECT briankrebs.* FROM briankrebs, benimmo WHERE briankrebs.screen_name = benimmo.screen_name" briankrebs.csv benimmo.csv > briankrebs_benimmo.csv

You can see the resulting file here.

I realize this was a bit of an esoteric post, but I wanted to write up the process in case you find yourself wanting to analyze retweets. One thing I don't know the answer to is why the number of retweets returned isn't exactly the same as the number of retweets displayed on the tweet. One explanation is that the search index is imperfect, and there is some hidden limitation apart from the ~7 day window that it will return results in. Another, more likely explanation is that some of the retweets were from accounts that have since been suspended or deleted, but the retweet count has not been adjusted to account for that. I guess only Twitter know the answer to that one.

Jonathan Rochkind: full-res pan-and-zoom JS viewer on a sufia/hyrax app

planet code4lib - Thu, 2017-08-31 02:13

Our Digital Collections web app  is written using the samvera/hydra stack, and is currently based on sufia 7.3.

The repository currently has around 10,000 TIFF scanned page and photographic images. They are stored (for better or worse) as TIFFs with no compression, and can be 100MB and up, each (typically around 7500 × 4900 pixels). It's a core use case for us that viewers be able to pan and zoom on the full-res images in the browser. OpenSeadragon is the premier open source solution to this, although some samvera/hydra stack apps use other JS projects that wrap OpenSeadragon with more UI/UX, like UniversalViewer.   All of our software is deployed on AWS EC2 instances.

OpenSeadragon works by loading 'tiles': sub-regions of the source image, at the appropriate zoom level, for what's in the viewport. In the samvera/hydra community it seems to be popular to use an image server that complies with the IIIF Image API as a tile source, but OpenSeadragon (OSD) can work with a variety of types of tile sources, with an easy plug-in architecture for adding your own too.

Our team ended up spending 4 or 5 weeks investigating various options, finding various levels of suitability, before arriving at a solution that was best for us. I’m not sure if our needs and environment are different than most others in the community; if others are having better success with some solutions than we did; if others are using other solutions not investigated by us. At any rate, I share our experiences hoping to give others a head-start. It can sometimes be difficult for me/us to figure out what use cases, options, or components in the samvera/hyrax stack are mature, production-battle-tested, and/or commonly used, and which are at more of the proof-of-concept stage.

As tile sources for OpenSeadragon, we investigated, experimented with, and developed at least basic proofs of concept for: riiif, imgix.com, cantaloupe, and pre-generated “Deep Zoom Image” tiles to be stored on AWS S3. We found riiif unsuitable; imgix worked but was going to have some fairly high latency for users; cantaloupe worked pretty fine, but may take fairly expensive server resources to handle heavy load; the pre-generated DZI images actually ended up being what we chose to put into production, with excellent performance and maintainability for minimal cost.

Details on each:

Riiif

A colleague and I learned about riiif at Advanced Hydra Camp in May in Minneapolis. riiif is a Rails engine that lets you add an IIIF server to a new or existing Rails app. It was easy to set up a simple proof of concept at the workshop, and easy to incorporate authorization and access controls for your images. I left with the impression that riiif was more-or-less a community standard, and was commonly used in production.

So we started out our work assuming we would be using riiif, and got to work on implementing it.

Knowing that tile generation would likely be CPU-intensive, disk-IO-intensive, and RAM-intensive, we decided at the outset that the actual riiif IIIF image server would live on a different server than our main Rails app, so it could be scaled independently and wouldn't have resource contention with the main app.  I included the riiif stuff in the same Rails app and repo, but used some Rails routing definition tricks so that our main app server(s) would refuse to serve riiif IIIF routes, and the "image server" would refuse to serve anything but IIIF routes.

Since the riiif image server was obviously also not on the same server as our fedora repo, and shared disk can be expensive and/or unreliable in AWS-land, riiif would be using its HTTPFileResolver to fetch the originals from fedora. Since our originals are big and we figured this would be slow, we planned to give it enough disk space to cache all of them. We additionally worked up code to 'ping' the riiif server with an 'info' request for all of our images, forcing it to download them to its local cache on bootstrapped startup or future image uploads, thereby "pre-loading" them.

Later, in experiments with other tools, I think we saw that downloading even huge files from a fedora on one AWS EC2 to another EC2 on same account is actually pretty quick (AWS internal network is awfully fast), and this may have been a premature optimization. However, it did allow us to do some performance testing knowing that time to download originals from fedora was not a factor.

riiif worked fine in development, and even worked okay with only one user using in a deployed staging app. (Sometimes you had to wait 2-3 seconds for viewer tiles, which is not ideal, but not as disastrous as it got…).

But when we did a test with 3 or 4 (manual, human) users simultaneously opening viewers (on the same or different objects), things got very rough. People waiting 10+ seconds for tiles, or even timing out on OpenSeadragon’s default 30 second timeout.

We spent a lot of time trying to diagnose and trying different things. Among other things, we tried having riiif use GraphicsMagick instead of ImageMagick. When testing image operations individually, GM did perform around 50% faster than IM, and I recommend using it instead of IM wherever appropriate. We also tried increasing the size of our EC2 instance. We tried an m4.large, then a c4.xlarge, and then also keeping our data on a RAID-arrayed EBS trying to increase disk access speeds.   But things were still disastrous under multi-user simultaneous use.

Originally, we were using riiif additionally for generating thumbnails for our ‘show’ pages and search results pages. I had the impression from a slack conversation this was a common choice, and it certainly is convenient if you already have an image server to use it this way. (Also makes it super easy to generate multiple resolutions for responsive srcset attribute). At one point in trying to get riiif working better, we turned off riiif for thumbs, using riiif only on the viewer, to significantly reduce load on the image server. But still, no dice.

After much investigation, we saw that CPU use would often go to 99-100%, and disk IO levels were also through the roof during concurrent use tests. (RAM was actually okay.)  Also, when doing a manual command-line imagemagick conversion on the server at the same time as concurrent use testing, operations that would only take a few seconds on an otherwise unloaded server were seen to sometimes take 30+ seconds.  We gave up on riiif before diagnosing exactly what was going on (this kind of lower-level OS/infrastructure profiling and diagnosis is kinda hard!), but my guess is saturated disk IO.  If you look at what riiif does, this seems plausible. Riiif will do a separate shell-out to an imagemagick or graphicsmagick command line operation for every image request, if the derivative requested is not yet cached.

If you open up an OpenSeadragon viewer, OSD can start out by sending requests for a dozen+ different tiles. Each one, with a riiif-backed tile source, will result in an independent shell-out to imagemagick/graphicsmagick command line tool, with one of our 100MB+ source image TIFFs as input argument. With several people concurrently using the viewer, this could be several dozens of imagemagick shellouts, each trying to use a 100MB+ file on disk as input.  You could easily imagine this filling up even a fairly fat disk IO pipeline, and/or all of these processes fighting for access to various OS concurrency locks involved in reading from the file system, etc. But this is just hypothesis supported by what we observed, we didn’t totally nail down a diagnosis.
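 
As a rough back-of-the-envelope illustration of why that hypothesis is plausible: a dozen tile requests from each of four concurrent viewers is on the order of 48 near-simultaneous shell-outs, each taking a 100MB+ TIFF as input, so something like 4-5GB of source reads (plus derivative writes) all competing for the same disk at roughly the same time.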

At some point, after spending a week+ on trying to solve this, and with deadlines looming, we decided to explore the other tile source alternatives we’ll discuss, even without being entirely sure what was going on with riiif.

It may be that other people have more success with riiif. Perhaps they are using smaller original sources; or running on AWS EC2 may have exacerbated things for us; or we just had bad luck for some as of yet undiscovered reason.

But we got curious how many people were using riiif in production, and what their image corpus looked like. We asked on samvera-tech listserv, and got only a handful of answers, and none of them were using riiif! I have an in-progress side-project I’m working on that gives some dependency-use statistics for public github repos holding hydra/samvera apps — of 20 public repos I had listed, 3 of them had riiif as a dependency, but I’m not sure how many of those are in production.   I did happen to talk to one other developer on samvera slack who confirmed they were having similar problems to ours. Still interested in hearing from anyone that is using riiif successfully in production, and if so what your source images are like, and how many concurrent viewers it can handle.

Cantaloupe

Wanting to try alternatives to riiif, the obvious choice was another IIIF server. There aren’t actually a whole lot of mature, reliable, open source IIIF server options, I think IIIF as a technology hasn’t caught on much outside of library/cultural heritage digital repositories. We knew from the samvera-tech listserv question that Loris (written in python) was a popular choice in the community, but Loris is currently looking for a new maintainer, which gave us pause.

We eventually decided on Cantaloupe, written in Java, as the best bet to try first. While it being in Java gave us a bit of concern, as nobody on the team is much of a Java dev, even without being Java experts we could tell looking at the source code that it was impressively clean and readable. The docs are good, and revealed attention to performance issues. The Github repo has all the signs of being an active and well-maintained project.

Cantaloupe also had a bit of performance trouble when we tried using it for show-page thumbnails (we have a thumb for every page in a work on our 'show' page, as in default sufia/hyrax, and our works can have 200+ pages). So we decided fairly early on that we'd just keep using a pre-generated derivative for thumbs, and stick to our priority use case, the viewer, in our tests of all our alternatives from here on out.

And then, Cantaloupe did pretty well, so long as it had a powerful and RAM-ful enough EC2.

Cantaloupe lets you configure caching separately for originals and derivatives, but even with no caching turned on, it was somehow doing noticeably better than riiif. Max 1-2 second wait times even with 3-4 simultaneous viewers. I’m not really sure how it pulled off doing better, but it did!

Our largest image is a whopping 1G TIFF. When we asked Cantaloupe to generate derivatives for it, it unfortunately crashed with a Java OOM, and was then unresponsive until it was manually restarted. After we gave the server and Cantaloupe more RAM, it handled that fine too (although our EC2 was getting more expensive). We hadn't quite figured out how to approach defining how many simultaneous viewers we needed to support, and how much EC2 was necessary to do that, before moving on to other alternatives.

We started out running cantaloupe on an EC2 c4.xlarge (4 core Xeon E5 and 7.5 GB RAM), and ended up with a m4.2xlarge (8 core and 32 GB RAM) which could handle our 1G image, and generally seemed to handle load better with lower latency.  We used the JAI image processor for Cantaloupe. (It seemed to perform better than the Java2D processor; Cantaloupe also supports ImageMagick or GraphicsMagick as a processor, but we didn’t get to trying those out much).

Auth

If you’re not using riiif, and have images meant to be only available to certain/all logged-in users, you need to think about auth. With any external image server, you could do auth by proxying all access through your rails app, which would check auth in the usual way. But I worry about taking up web worker processes/threads in the main app with dozens of image requests. It would be better to keep the servers entirely separate.

There also might be a way to have apache/nginx proxying directly, rather than through the rails app, which would make me more comfortable, but you’d have to figure out how to use a custom auth plugin for apache or nginx.

Cantaloupe also has the very nice option of writing your own custom auth in ruby (even though the server is Java; thanks JRuby!), so we could actually check the existing Rails session (just make sure the cantaloupe server knows your Rails secret key, and figure out the right classes to call to decrypt and load data from the session cookie), and then Fedora/Solr to check auth in the usual samvera way.

Any live checking of auth before delivering an image tile is of course going to increase image response latency.

These were the options we thought of, but we didn’t get to any of them before ultimately deciding to choose pre-generated tile images.

However, Cantaloupe was definitely our second choice — and first choice if we really were to need a full IIIF server — it for sure looked like it could have worked well, although at potentially somewhat expensive AWS charges.

Imgix.com

Imgix.com is a commercial cloud-hosted image server.  I had a good opinion of it from using it for thumbnail-generation on ecommerce projects while working at Friends of the Web last year.  Imgix pricing is pretty affordable.

Imgix does not conform to IIIF API, but it does do pretty much all the image operations that IIIF can do, plus more. Definitely everything we needed for an OpenSeadragon tile source.

I figured, let’s give it a try, get out of the library/cultural-heritage silo, and use a popular, successful, well-regarded commercial service operating in the general web app space.

OpenSeadragon can not use imgix.com as a tile source out of the box, but OSD makes it pretty easy to write your own tile source. In a day I had a working proof of concept for an OSD imgix.com tile source, and in a couple more had filed off all the rough edges.

It totally worked. But. It was slow.  Imgix.com was willing to take our 100MB TIFF sources, but it was clear this was not really the use case it was designed for.  It was definitely slow downloading our original sources from our web app–the difference, I guess, between downloading directly from fedora on the same AWS subnet, and downloading via our Rails app from who knows where. (I did have to make a bugfix/improvement to samvera code to make sure HTTP headers were delivered quicker for a streaming download, to avoid timing out imgix. Once that was done, no more imgix timeout problems).  We tried pinging it to “pre-load” all originals as we had been doing with riiif — but as a cloud service, and one not expecting originals to be so huge, we had no control over when imgix purged originals from cache, and we found it did sometimes purge not-recently-accessed originals fairly quickly.

Also imgix has a (not really unreasonable) 512MB max for original images; our one 1G TIFF was not gonna be possible (and of course, that’s the one you really want a pan-and-zoom viewer for, it’s huge!).

On the plus side:

  • with imgix, we don’t need to worry about the infrastructure at all, it’s outsourced. We don’t need to plan some redundancy for max uptime or scaling for heavy use, they’re already doing it.
  • The response times are unlikely to change under heavy use, it’s already a cloud-scale service designed for heavy use.
  • Can handle the extra load of using it for thumbs too, just as well as it can for viewer tiles.
  • Our cost estimates had it looking cheaper (by 50%+) than hosting our own Cantaloupe on an EC2.
  • Once originals and derivatives being accessed (say tiles for a given viewer) were cached, it was lightning fast, reliably just 10s of ms for a tile image. But again, you have no control over how long these stay in cache before being purged.

For non-public images, imgix offers signed-and-expiring URLs.  The downside of these is you kind of lose HTTP cacheability of your images. And imgix.com doesn’t provide any way to auth imgix to get your originals, so if they’re not public you would have to put in some filters recognizing imgix.com IP addresses (which are subject to change, although they’re good at giving you advance notice), and let them in to private images.

But ultimately the latency was just too high. For images where the originals were cached but not the derivatives, it could take 1-4 seconds to get our tile derivatives; if the originals were not cached, it could take 10 or higher.  (One option would be trying to give it a much smaller lzw or zip compressed TIFF as a source, instead of our uncompressed original originals, cutting down transfer time for fetching originals. But I think this would be probably unlikely to improve latency sufficiently, and we moved on to pre-generated DZI. We would need to give imgix a lossless full-res original of some kind, cause full-res zoom is the whole goal here!)

I think imgix is potentially a workable last resort (and I still love it for creating thumbs for more reasonably sized sources), but it wasn’t as good an option as other alternatives explored for this use case, a tile source for enormous TIFFs.

Pre-Generated Deep Zoom Tiles

Eventually we came back to an earlier idea we originally considered, but then assumed would be too expensive and abandoned/forgot about.  When I realized Cantaloupe was recommending pyramidal TIFFs, which require some preprocessing/prerendering anyway, why not go all the way and preprocess/prerender every tile, and store them somewhere (say, cheap S3)?  OpenSeadragon has a number of tile sources it supports that are or can be pre-generated images, including the first format it ever supported, Deep Zoom Image (file suffix .dzi).   (I had earlier done a side-project using OpenSeadragon and Deep Zoom tiles to put the awesome Beehive Collective Mesoamerica Resiste poster online.)

But then we learned about riiif and it looked cool and we got on that tip, and kind of assumed pre-generating so many images would be unworkable. It took us a while to get back to exploring pre-generated Deep Zoom tiles.  But it actually works great (of course we had learned a lot of domain knowledge about manipulating giant images and tile sources at this point!).

We use vips (rather than imagemagick or graphicsmagick) to generate all the tiles. vips is really optimized for speed, CPU and RAM usage, and if you’re creating all the tiles at once vips can read the source image in once and slice it up into tiles.  We do this as a background job, that we have running on a different server than the Rails app; the built-in sufia bg jobs still run on our single rails app server. (In sufia 7.3, out of the box they can’t run on a different server without a shared file system; I think this may have been improved in as-yet-unreleased-hyrax-master).

We hook into the sufia/hyrax actor stack to trigger the bg job on file upload. A small EC2 (t2.medium, 4 GB RAM, 2 core CPU) with five resque workers can handle the dzi creation much faster than the existing actor stack can trigger it when doing a bulk ingest (the existing actor stack is slow, and the dzi creation can't happen until the file is actually in fedora, so that the job can retrieve it from fedora to make it into dzi tiles; so DZIs can't be generated any faster than sufia can do the ingests no matter what).  The files are uploaded to an S3 bucket.
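
Our job code is Ruby, but the core of it is small; here is a minimal Python/pyvips sketch of the same idea (the bucket name and paths are made up, and the real job also walks and uploads the generated tile directory):

import pyvips
import boto3

def create_and_upload_dzi(source_path, file_id, bucket="example-dzi-bucket"):
    # vips reads the source once and slices it into DeepZoom tiles; this writes
    # /tmp/<file_id>.dzi plus a /tmp/<file_id>_files/ directory of tiles.
    image = pyvips.Image.new_from_file(source_path, access="sequential")
    image.dzsave("/tmp/" + file_id, suffix=".jpg")

    # Upload the .dzi descriptor; in practice you would also walk
    # <file_id>_files/ and upload every tile (ideally concurrently).
    s3 = boto3.client("s3")
    s3.upload_file("/tmp/" + file_id + ".dzi", bucket, file_id + ".dzi")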

We also provide a rake task to create the .dzi files for all Files in our fedora repo, for initial bootstrapping or if corruption ever needs to be fixed, etc. For our 8000-file staging server, running the creation task on our EC2 t2.medium, it takes around 7 hours to create and upload them all to S3 (I use some multi-threaded concurrency in the uploading), and results in ~3.2 million S3 keys taking up approx 32GB.

Believe it or not, this is actually the cheapest option, taking account of S3 storage and our bg jobs EC2 instance for dzi creation (that we’ll probably try to move more bg jobs to in the future). Cheaper than imgix, cheaper than our own Cantaloupe on an EC2 big enough to handle it.

If you have 800K or 8 million images instead of 8000, it'll get more complicated/expensive. But S3 is so cheap, and a spot-priced temporary fleet of EC2s to do a mass dzi-creation ingest is affordable enough that you might be surprised. Alas, fedora makes it a lot less straightforward to parallelize ingest than if it were a more conventional stack, but there's probably some way to do it. Unless/until fedora itself becomes your bottleneck. There are costs to our stack.

It handles our 1GB original source just fine (it takes about 30 seconds to create all tiles for that one). It's also definitely fast for the end-user. Once the tiles are pre-generated, it's just a request from S3, which I'm seeing take 40-80ms in Chrome Dev Tools. Under a lot of load (I'm guessing 100+ users concurrently using the viewer), or to reduce latency beyond that 40-80ms, the standard approach would be to put a CDN in front of S3, probably either Amazon's own CloudFront, or CloudFlare. This should be simple and affordable. But this is already reliably faster than any of our other options, and can handle way more concurrent load without a CDN compared to our other options, so we aren't worrying about it for now.  When/if we want to add a CDN, it oughta only be a couple clicks and a hostname change to implement.

And of course, there’s very little server maintenance to deal with, once the files are generated, they’re just static assets on S3, there’s nothing to “crash” really. (Well, except S3 itself, which happens very occasionally. If you wanted to be very safe, you’d mirror your S3 bucket to another AWS region or another cloud provider). Just one pretty standard and easily-scalable bg-job-running server for creating the DZIs on image upload.

We’re still punting on auth for now. Which talking on slack channel, seems to be a common choice with auth and IIIF image servers. One dev told me they just didn’t allow non-public images to be viewed in the image viewer (or via their image server) at all, which I guess works if your non-public images are all really just in-progress-in-workflow only viewable to staff.  As is true for us here. Another dev told me they just don’t worry about it, no links will be generated to non-public images, but they’ll be there via the image server without auth — which again works if your non-public images aren’t actually sensitive or legally un-shareable, they’re just in-process-not-quite-ready. Which is also true for us, for now anyway.  (I would not ever count on “nobody knows the URL”-based-security for actual sensitive or legally un-shareable content, for anything where it actually matters if someone happens to come across it. For our current and foreseeable future content, it doesn’t really. knock on wood. It does make me nervous!).

There are some auth options with the S3 approach, read about them as well as some internal overview documentation of what we’ve done in our code here, or see our PR  for initial implementation of the pre-generated-DZI-on-S3 approach for our complete solution.  Pre-generated DZI on S3 is indeed the approach we are going with.

IIIF vs Not

Some readers may have noticed that two of the alternatives we examined are not IIIF servers, and the one we ended up with — pre-generated DZI tiles — is not a dynamic image server at all. You may be reacting with shocked questions: You can do that? But what are you missing? Can you still use UniversalViewer? What about those other IIIF things?

Well, the first thing we miss is truly dynamic image generation. We instead need to pre-generate all the image derivatives we need upon image upload, including the DZI tiles. If we wanted a feature like, say, letting the user enter a number of pixels and delivering a JPG scaled to that user-specified width, a dynamic image server would be a lot more convenient. But I only thought of that feature when brainstorming for things that would be hard without a dynamic image server; it's not something we are likely to prioritize. For thumbs and downloads at various preset sizes, pre-generating should work just fine with regards to performance and cost, especially with a bg job running on its own jobs server and derivatives stored on S3 (neither happens out of the box in sufia, but both may in latest unreleased-master hyrax).

So, UniversalViewer. UniversalViewer uses OpenSeadragon to do the actual pan-and-zoom viewer.  Mirador  seems to as well. I think OpenSeadragon is pretty much the only viable open source JS pan-and-zoom viewer, which is fine, because OSD is pretty great.  UV, I believe, just wraps OSD in some additional UI/UX, with some additional features like table of contents viewing, downloads, etc.

We decided, even when we were still planning on using riiif, to not use UniversalViewer but instead develop directly with OpenSeadragon. Some of the fancier UV features we didn’t really need right now, and it was unclear if it would take more time to customize UV UX to our needs, or just develop a relatively light-weight UI of our own on top of OSD.

As these things do, our UI took slightly longer to develop than estimated, but it was still relatively quick, and we're pretty happy with what we've got.  It is still subject to change as we continue to user-test — but developing our own gives us a lot more control of the UI/UX to respond to such testing.  That control later turned out to be useful in non-visual ways too: in our DZI implementation, we put something in our front-end that, if the dzi file is not available on S3, automatically degrades to a smaller not-very-zoomable image with an apology/warning message.  I'm not sure if I would have wanted to try and hack that into UV.

So using OpenSeadragon directly, we don’t need to give it an IIIF Image API URL, we can give it anything it handles (or you write a plugin for), and it works just fine. No code changes necessary except giving it a URL pointing to a different thing. No problem, everything just worked, it required no extra work in our front-end to use DZI instead of IIIF. (We did do some extra work to add some feature toggles so we could switch between various back-ends easily). No problem at all, the format of your tile source, so long as OSD can handle it, is a very loosely coupled dependency.

But what if you want to use UV or Mirador? (And we might in the future ourselves, if we need features they provide and we discover they are non-trivial to develop in our homegrown UI).  They take IIIF as input, right?

To be clear, we need to distinguish between the IIIF Image API (the one where a server provides image derivatives on demand), and the IIIF Manifest spec. The Manifest spec, part of the IIIF Presentation API, defines a JSON-ld file that “represents a single object and any intellectual work or works embodied within that object…  includes the descriptive, rights and linking information for the object… embeds the sequence(s) of canvases that should be rendered to the user.”

It’s the IIIF Manifest that is input to UV or Mirador. Normally these tools would extract one or more IIIF Image API URLs out of the Manifest, and just hand them to OpenSeadragon. Do they do anything else with an IIIF Image API url except hand it to OSD? I’m not sure, but I don’t think so. So if they just handed any other URI that OSD can handle to OSD, it should work fine? I think so.

An IIIF Manifest doesn’t actually need to include an IIIF Image API url.  “If a IIIF Image API service is available for the image, then a link to the service’s base URI should be included.” If. And an IIIF Manifest can include any other sort of image resource, from any external service,  identified by a uri in @context field.  So you can include a link to the .dzi file in the IIIF Manifest now, completely legally, the same IIIF Manifest you’d do otherwise just with a .dzi link instead of an IIIF Image API link — you’d just have to choose a @context URI to identify it as a DZI. Perhaps `https://msdn.microsoft.com/en-us/library/cc645077(VS.95).aspx`, although that might not be the most reliable URI identifier. But, really, we could be just as standards-compliant as ever and put a DZI URL in the IIIF Manifest instead of an IIIF Image API URL.

Of course, as with all linked data work, standards-compliant doesn’t actually make it interoperable. We need mutually-recognizable shared vocabulary. UV or Mirador would have to recognize the image resource URL supplied in the Manifest as being a DZI, or at any rate at least as something that can be passed to OSD. As far as I know UV or Mirador won’t do this now. It should probably be pretty trivial to get them to, though, perhaps by supporting configuration for “recognize this @context uri as being something you can pass to OSD.”  If we in the future have need for UV/Mirador (or IIIF Manifests), I’d look into getting them to do that, but we don’t right now.

What about these other tools that take IIIF Manifests and aggregate images from different sites?  Probably the same deal, they just gotta recognize an agreed-upon identifier meaning “DZI format”, and know they can pass such to OpenSeadragon.

I’m not sure if any such tools currently exist used for real work or even real recreation, rather than as more of a demo. I’ll always choose to achieve greatness rather than mediocrity for our current actual real prioritized use cases, above engineering for a hoped-for-but-uncertain future. Of course, when you can do both without increasing expense or sacrificing quality for your present use cases, that’s great too, and it’s always good to keep an eye out for those opportunities.

But I’m feeling pretty good about our DZI choice at the moment. It just works so well, cheaply, with minimal expected ongoing maintenance, compared to other options — and works better for end-users too, with reliable nearly instantaneous delivery of tiles even under heavy load. Now, if you have a lot more images than us, the cost-benefit calculus may end up different. Especially because a dynamic image server scales (gets more expensive) with number of concurrent users/requests more or less regardless of number of images, while the pre-gen DZI solution gets more expensive with more images more or less regardless of concurrent request level. If you have a whole lot of images (say two orders of magnitude bigger than our 10K), your app typically gets pretty low use (and you don’t care about it supporting lots of concurrent use), and maybe additionally if your original source images aren’t nearly as large as ours, pre-gen DZI might not be such a sweet spot. However, you might be surprised, pre-generated DZI is in the end just such a simple solution, and S3 storage is pretty affordable.


Jonathan Rochkind: Gem dependency use among Sufia/Hyrax apps

planet code4lib - Wed, 2017-08-30 20:32

I have a little side project, written out of curiosity about sufia/hyrax apps, that uses the GitHub API (and a little bit of the rubygems API) to analyze which gem dependencies and versions (from among a list of 'interesting' ones) are being used in a list of open GitHub repos with `Gemfile.lock`s. I think it could turn into a useful tool for any ruby open source community using common dependencies, to see what the community is up to.
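
The tool itself is Ruby, but the core trick is easy to sketch in a few lines of Python: pull a repo's Gemfile.lock from raw.githubusercontent.com and read off the pinned versions of the gems you care about (the repo and branch here are just examples):

import re
import requests

def gem_versions(repo, gems, branch="master"):
    url = "https://raw.githubusercontent.com/%s/%s/Gemfile.lock" % (repo, branch)
    lockfile = requests.get(url).text
    versions = {}
    # Top-level entries in the specs section look like "    rails (4.2.9)".
    for name, version in re.findall(r"^    (\S+) \(([^)]+)\)$", lockfile, re.M):
        if name in gems:
            versions[name] = version
    return versions

print(gem_versions("pulibrary/plum", {"rails", "sufia", "blacklight"}))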

It's far from done; it just generates an ASCII report, and is missing many features I'd like. There are things I'm curious about that it doesn't report on yet, like the history of dependency use and how often people upgrade a given dependency. And I'd like an interactive HTML interface that lets you slice and dice the data a bit (of people using a given gem, how many are also using another gem, etc.).  And then maybe set it up so it's on the public web and regularly updates itself.

But it’s been a couple of months since I’ve worked on it, and I thought just the current snapshot in limited ASCII report format was useful enough that I should share a report.

The report, intentionally, for now, does not tell you which repos are using which dependencies, it just gives aggregate descriptive statistics. (Although you could of course manually find that out from their open Gemfile.locks). I wanted to avoid seeming to ‘call out’ anyone for using old versions or whatever. Although it would be useful to know, so you can, say, get in touch with people using the same things or same versions as you, I wanted to get some community feedback first.  Thoughts on if it should?

I got the list of repos from various public lists of sufia or hyrax repos. Some things on the lists didn’t actually have open github repos at that address anymore — or had an open repo, but without a Gemfile.lock! Can only analyze with a Gemfile.lock in the repo. But I don’t really know which of these repos are in production, and which might be not yet, no longer, or never were.  If you have a repo you’d like me to add or remove from the list, let me know! Also any other things you might want the report to include or questions you might want to let it help you answer. Or additional ‘interesting’ gems you’d like included in the report?

I do think it’s pretty cool that the combination of machine-readable Gemfile.lock and the GitHub API lets us do some pretty cool stuff here! If I get around to writing an interactive HTML interface, I’m thinking of trying to do it all in static file Javascript. That would require rewriting some of the analysis tools I’ve already written in ruby, in JS, but might be a good project to experiment with, say, vue.js. I don’t have much fancy new-gen JS experience, and this is a nice isolated thing for trying it out.

I am not sure what to read into these results. They aren’t necessarily good or bad, they just are a statement of what things are, which I think is interesting and useful in itself, and helps us plan and coordinate. I do think it’s worth recognizing that when developers in the community are on old major versions of shared dependencies, it increases the cost for them to contribute back upstream, makes it harder to do as part of “scratching their own itch”, and probably decreases such contributions.  I also found it interesting how many repos use unreleased straight-from-github versions of some dependencies (17 of 28 do at least once), as well as the handful of gems that are fairly widely used in production but still don’t have a 1.0 release.

And here’s the ugly ascii report!

38 total input URLs, 28 with fetchable Gemfile.lock

total apps analyzed: 28
  with dependencies on non-release (git or path) gem versions: 17
    with git checkouts: 16
    with local path deps: 1

Date of report: 2017-08-30 15:11:20 -0400

Repos analyzed:
  https://github.com/psu-stewardship/scholarsphere
  https://github.com/psu-stewardship/archivesphere
  https://github.com/VTUL/data-repo
  https://github.com/gwu-libraries/gw-sufia
  https://github.com/gwu-libraries/scholarspace
  https://github.com/duke-libraries/course-assets
  https://github.com/ualbertalib/HydraNorth
  https://github.com/ualbertalib/Hydranorth2
  https://github.com/aic-collections/aicdams-lakeshore
  https://github.com/osulp/Scholars-Archive
  https://github.com/durham-university/collections
  https://github.com/OregonShakespeareFestival/osf_digital_archives
  https://github.com/cul/ac3_sufia
  https://github.com/ihrnexuslab/research-repo
  https://github.com/galterlibrary/digital-repository
  https://github.com/chemheritage/chf-sufia
  https://github.com/vecnet/vecnet-dl
  https://github.com/vecnet/dl-discovery
  https://github.com/osulibraries/dc
  https://github.com/uclibs/scholar_uc
  https://github.com/uvalib/Libra2
  https://github.com/pulibrary/plum
  https://github.com/curationexperts/laevigata
  https://github.com/csuscholarworks/bravado
  https://github.com/UVicLibrary/Vault
  https://github.com/mlibrary/heliotrope
  https://github.com/ucsdlib/horton
  https://github.com/pulibrary/figgy

Gems analyzed: rails hyrax sufia curation_concerns qa hydra-editor hydra-head hydra-core hydra-works hydra-derivatives hydra-file_characterization hydra-pcdm hydra-role-management hydra-batch-edit browse-everything solrizer blacklight-access_controls hydra-access-controls blacklight blacklight-gallery blacklight_range_limit blacklight_advanced_search active-fedora active_fedora-noid active-triples ldp linkeddata riiif iiif_manifest pul_uv_rails mirador_rails osullivan bixby orcid

rails:
  apps without dependency: 0
  apps with dependency: 28 (100%)
  git checkouts: 0
  local path dep: 0
  3.x (3.0.0 released 2010-08-29): 1 (4%)
    3.2.x (3.2.0 released 2012-01-20): 1 (4%)
  4.x (4.0.0 released 2013-06-25): 16 (57%)
    4.0.x (4.0.0 released 2013-06-25): 2 (7%)
    4.1.x (4.1.0 released 2014-04-08): 1 (4%)
    4.2.x (4.2.0 released 2014-12-20): 13 (46%)
  5.x (5.0.0 released 2016-06-30): 11 (39%)
    5.0.x (5.0.0 released 2016-06-30): 8 (29%)
    5.1.x (5.1.0 released 2017-04-27): 3 (11%)
  Latest release: 5.1.4.rc1 (2017-08-24)

hyrax:
  apps without dependency: 20 (71%)
  apps with dependency: 8 (29%)
  git checkouts: 4 (50%)
  local path dep: 0
  1.x (1.0.1 released 2017-05-24): 4 (50%)
    1.0.x (1.0.1 released 2017-05-24): 4 (50%)
  2.x ( released unreleased): 4 (50%)
    2.0.x ( released unreleased): 4 (50%)
  Latest release: 1.0.4 (2017-08-22)

sufia:
  apps without dependency: 10 (36%)
  apps with dependency: 18 (64%)
  git checkouts: 8 (44%)
  local path dep: 0
  0.x (0.0.1.pre1 released 2012-11-15): 1 (6%)
    0.1.x (0.1.0 released 2013-02-04): 1 (6%)
  3.x (3.0.0 released 2013-07-22): 1 (6%)
    3.7.x (3.7.0 released 2014-02-07): 1 (6%)
  4.x (4.0.0 released 2014-08-21): 2 (11%)
    4.1.x (4.1.0 released 2014-10-31): 1 (6%)
    4.2.x (4.2.0 released 2014-11-25): 1 (6%)
  5.x (5.0.0 released 2015-06-06): 1 (6%)
    5.0.x (5.0.0 released 2015-06-06): 1 (6%)
  6.x (6.0.0 released 2015-03-27): 6 (33%)
    6.0.x (6.0.0 released 2015-03-27): 2 (11%)
    6.2.x (6.2.0 released 2015-07-09): 1 (6%)
    6.3.x (6.3.0 released 2015-08-12): 1 (6%)
    6.6.x (6.6.0 released 2016-01-28): 2 (11%)
  7.x (7.0.0 released 2016-08-01): 7 (39%)
    7.0.x (7.0.0 released 2016-08-01): 1 (6%)
    7.1.x (7.1.0 released 2016-08-11): 1 (6%)
    7.2.x (7.2.0 released 2016-10-01): 4 (22%)
    7.3.x (7.3.0 released 2017-03-21): 1 (6%)
  Latest release: 7.3.1 (2017-04-26)

curation_concerns:
  apps without dependency: 21 (75%)
  apps with dependency: 7 (25%)
  git checkouts: 1 (14%)
  local path dep: 1 (14%)
  1.x (1.0.0 released 2016-06-22): 7 (100%)
    1.3.x (1.3.0 released 2016-08-03): 2 (29%)
    1.6.x (1.6.0 released 2016-09-14): 3 (43%)
    1.7.x (1.7.0 released 2016-12-09): 2 (29%)
  Latest release: 2.0.0 (2017-04-20)

qa:
  apps without dependency: 11 (39%)
  apps with dependency: 17 (61%)
  git checkouts: 0
  local path dep: 0
  0.x (0.0.1 released 2013-10-04): 9 (53%)
    0.3.x (0.3.0 released 2014-06-20): 1 (6%)
    0.8.x (0.8.0 released 2016-07-07): 1 (6%)
    0.10.x (0.10.0 released 2016-08-16): 3 (18%)
    0.11.x (0.11.0 released 2017-01-04): 4 (24%)
  1.x (1.0.0 released 2017-03-22): 8 (47%)
    1.2.x (1.2.0 released 2017-06-23): 8 (47%)
  Latest release: 1.2.0 (2017-06-23)

hydra-editor:
  apps without dependency: 3 (11%)
  apps with dependency: 25 (89%)
  git checkouts: 2 (8%)
  local path dep: 0
  0.x (0.0.1 released 2013-06-13): 3 (12%)
    0.5.x (0.5.0 released 2014-08-27): 3 (12%)
  1.x (1.0.0 released 2015-01-30): 6 (24%)
    1.0.x (1.0.0 released 2015-01-30): 4 (16%)
    1.2.x (1.2.0 released 2016-01-21): 2 (8%)
  2.x (2.0.0 released 2016-04-28): 1 (4%)
    2.0.x (2.0.0 released 2016-04-28): 1 (4%)
  3.x (3.1.0 released 2016-08-09): 15 (60%)
    3.1.x (3.1.0 released 2016-08-09): 6 (24%)
    3.3.x (3.3.1 released 2017-05-04): 9 (36%)
  Latest release: 3.3.2 (2017-05-23)

hydra-head:
  apps without dependency: 1 (4%)
  apps with dependency: 27 (96%)
  git checkouts: 0
  local path dep: 0
  5.x (5.0.0 released 2012-12-11): 1 (4%)
    5.4.x (5.4.0 released 2013-02-06): 1 (4%)
  6.x (6.0.0 released 2013-03-28): 1 (4%)
    6.5.x (6.5.0 released 2014-02-18): 1 (4%)
  7.x (7.0.0 released 2014-03-31): 3 (11%)
    7.2.x (7.2.0 released 2014-07-18): 3 (11%)
  9.x (9.0.1 released 2015-01-30): 6 (22%)
    9.1.x (9.1.0 released 2015-03-06): 2 (7%)
    9.2.x (9.2.0 released 2015-07-08): 2 (7%)
    9.5.x (9.5.0 released 2015-11-11): 2 (7%)
  10.x (10.0.0 released 2016-06-08): 16 (59%)
    10.0.x (10.0.0 released 2016-06-08): 1 (4%)
    10.3.x (10.3.0 released 2016-09-02): 3 (11%)
    10.4.x (10.4.0 released 2017-01-25): 4 (15%)
    10.5.x (10.5.0 released 2017-06-09): 8 (30%)
  Latest release: 10.5.0 (2017-06-09)

hydra-core:
  apps without dependency: 1 (4%)
  apps with dependency: 27 (96%)
  git checkouts: 0
  local path dep: 0
  5.x (5.0.0 released 2012-12-11): 1 (4%)
    5.4.x (5.4.0 released 2013-02-06): 1 (4%)
  6.x (6.0.0 released 2013-03-28): 1 (4%)
    6.5.x (6.5.0 released 2014-02-18): 1 (4%)
  7.x (7.0.0 released 2014-03-31): 3 (11%)
    7.2.x (7.2.0 released 2014-07-18): 3 (11%)
  9.x (9.0.0 released 2015-01-30): 6 (22%)
    9.1.x (9.1.0 released 2015-03-06): 2 (7%)
    9.2.x (9.2.0 released 2015-07-08): 2 (7%)
    9.5.x (9.5.0 released 2015-11-11): 2 (7%)
  10.x (10.0.0 released 2016-06-08): 16 (59%)
    10.0.x (10.0.0 released 2016-06-08): 1 (4%)
    10.3.x (10.3.0 released 2016-09-02): 3 (11%)
    10.4.x (10.4.0 released 2017-01-25): 4 (15%)
    10.5.x (10.5.0 released 2017-06-09): 8 (30%)
  Latest release: 10.5.0 (2017-06-09)

hydra-works:
  apps without dependency: 13 (46%)
  apps with dependency: 15 (54%)
  git checkouts: 1 (7%)
  local path dep: 0
  0.x (0.0.1 released 2015-06-05): 15 (100%)
    0.12.x (0.12.0 released 2016-05-24): 1 (7%)
    0.14.x (0.14.0 released 2016-09-06): 2 (13%)
    0.15.x (0.15.0 released 2016-11-30): 2 (13%)
    0.16.x (0.16.0 released 2017-03-02): 10 (67%)
  Latest release: 0.16.0 (2017-03-02)

hydra-derivatives:
  apps without dependency: 2 (7%)
  apps with dependency: 26 (93%)
  git checkouts: 0
  local path dep: 0
  0.x (0.0.1 released 2013-07-23): 4 (15%)
    0.0.x (0.0.1 released 2013-07-23): 1 (4%)
    0.1.x (0.1.0 released 2014-05-10): 3 (12%)
  1.x (1.0.0 released 2015-01-30): 6 (23%)
    1.0.x (1.0.0 released 2015-01-30): 1 (4%)
    1.1.x (1.1.0 released 2015-03-27): 3 (12%)
    1.2.x (1.2.0 released 2016-05-18): 2 (8%)
  3.x (3.0.0 released 2015-10-07): 16 (62%)
    3.1.x (3.1.0 released 2016-05-10): 3 (12%)
    3.2.x (3.2.0 released 2016-11-17): 7 (27%)
    3.3.x (3.3.0 released 2017-06-15): 6 (23%)
  Latest release: 3.3.2 (2017-08-17)

hydra-file_characterization:
  apps without dependency: 3 (11%)
  apps with dependency: 25 (89%)
  git checkouts: 0
  local path dep: 0
  0.x (0.0.1 released 2013-09-17): 25 (100%)
    0.3.x (0.3.0 released 2013-10-24): 25 (100%)
  Latest release: 0.3.3 (2015-10-15)

hydra-pcdm:
  apps without dependency: 13 (46%)
  apps with dependency: 15 (54%)
  git checkouts: 0
  local path dep: 0
  0.x (0.0.1 released 2015-06-05): 15 (100%)
    0.8.x (0.8.0 released 2016-05-12): 1 (7%)
    0.9.x (0.9.0 released 2016-08-31): 14 (93%)
  Latest release: 0.9.0 (2016-08-31)

hydra-role-management:
  apps without dependency: 17 (61%)
  apps with dependency: 11 (39%)
  git checkouts: 0
  local path dep: 0
  0.x (0.0.1 released 2013-04-18): 11 (100%)
    0.2.x (0.2.0 released 2014-06-25): 11 (100%)
  Latest release: 0.2.2 (2015-08-14)

hydra-batch-edit:
  apps without dependency: 10 (36%)
  apps with dependency: 18 (64%)
  git checkouts: 0
  local path dep: 0
  0.x (0.0.1 released 2012-06-15): 1 (6%)
    0.1.x (0.1.0 released 2012-12-21): 1 (6%)
  1.x (1.0.0 released 2013-05-10): 10 (56%)
    1.1.x (1.1.0 released 2013-10-01): 10 (56%)
  2.x (2.0.2 released 2016-04-20): 7 (39%)
    2.0.x (2.0.2 released 2016-04-20): 1 (6%)
    2.1.x (2.1.0 released 2016-08-17): 6 (33%)
  Latest release: 2.1.0 (2016-08-17)

browse-everything:
  apps without dependency: 3 (11%)
  apps with dependency: 25 (89%)
  git checkouts: 3 (12%)
  local path dep: 0
  0.x (0.1.0 released 2013-09-24): 25 (100%)
    0.6.x (0.6.0 released 2014-07-31): 1 (4%)
    0.7.x (0.7.0 released 2014-12-10): 1 (4%)
    0.8.x (0.8.0 released 2015-02-27): 5 (20%)
    0.10.x (0.10.0 released 2016-04-04): 5 (20%)
    0.11.x (0.11.0 released 2016-12-31): 1 (4%)
    0.12.x (0.12.0 released 2017-03-01): 2 (8%)
    0.13.x (0.13.0 released 2017-04-30): 2 (8%)
    0.14.x (0.14.0 released 2017-07-07): 8 (32%)
  Latest release: 0.14.0 (2017-07-07)

solrizer:
  apps without dependency: 1 (4%)
  apps with dependency: 27 (96%)
  git checkouts: 1 (4%)
  local path dep: 0
  2.x (2.0.0 released 2012-11-30): 1 (4%)
    2.1.x (2.1.0 released 2013-01-18): 1 (4%)
  3.x (3.0.0 released 2013-03-28): 25 (93%)
    3.1.x (3.1.0 released 2013-05-03): 1 (4%)
    3.3.x (3.3.0 released 2014-07-17): 7 (26%)
    3.4.x (3.4.0 released 2016-03-14): 17 (63%)
  4.x (4.0.0 released 2017-01-26): 1 (4%)
    4.0.x (4.0.0 released 2017-01-26): 1 (4%)
  Latest release: 4.0.0 (2017-01-26)

blacklight-access_controls:
  apps without dependency: 12 (43%)
  apps with dependency: 16 (57%)
  git checkouts: 0
  local path dep: 0
  0.x (0.1.0 released 2015-12-01): 16 (100%)
    0.5.x (0.5.0 released 2016-06-08): 1 (6%)
    0.6.x (0.6.0 released 2016-09-01): 15 (94%)
  Latest release: 0.6.2 (2017-03-28)

hydra-access-controls:
  apps without dependency: 1 (4%)
  apps with dependency: 27 (96%)
  git checkouts: 0
  local path dep: 0
  5.x (5.0.0 released 2012-12-11): 1 (4%)
    5.4.x (5.4.0 released 2013-02-06): 1 (4%)
  6.x (6.0.0 released 2013-03-28): 1 (4%)
    6.5.x (6.5.0 released 2014-02-18): 1 (4%)
  7.x (7.0.0 released 2014-03-31): 3 (11%)
    7.2.x (7.2.0 released 2014-07-18): 3 (11%)
  9.x (9.0.0 released 2015-01-30): 6 (22%)
    9.1.x (9.1.0 released 2015-03-06): 2 (7%)
    9.2.x (9.2.0 released 2015-07-08): 2 (7%)
    9.5.x (9.5.0 released 2015-11-11): 2 (7%)
  10.x (10.0.0 released 2016-06-08): 16 (59%)
    10.0.x (10.0.0 released 2016-06-08): 1 (4%)
    10.3.x (10.3.0 released 2016-09-02): 3 (11%)
    10.4.x (10.4.0 released 2017-01-25): 4 (15%)
    10.5.x (10.5.0 released 2017-06-09): 8 (30%)
  Latest release: 10.5.0 (2017-06-09)

blacklight:
  apps without dependency: 0
  apps with dependency: 28 (100%)
  git checkouts: 0
  local path dep: 0
  4.x (4.0.0 released 2012-11-30): 2 (7%)
    4.0.x (4.0.0 released 2012-11-30): 1 (4%)
    4.7.x (4.7.0 released 2014-02-05): 1 (4%)
  5.x (5.0.0 released 2014-02-05): 10 (36%)
    5.5.x (5.5.0 released 2014-07-07): 2 (7%)
    5.9.x (5.9.0 released 2015-01-30): 1 (4%)
    5.11.x (5.11.0 released 2015-03-17): 1 (4%)
    5.12.x (5.12.0 released 2015-03-24): 1 (4%)
    5.13.x (5.13.0 released 2015-04-10): 1 (4%)
    5.14.x (5.14.0 released 2015-07-02): 2 (7%)
    5.18.x (5.18.0 released 2016-01-21): 2 (7%)
  6.x (6.0.0 released 2016-01-21): 16 (57%)
    6.3.x (6.3.0 released 2016-07-01): 1 (4%)
    6.7.x (6.7.0 released 2016-09-27): 5 (18%)
    6.10.x (6.10.0 released 2017-05-17): 6 (21%)
    6.11.x (6.11.0 released 2017-08-10): 4 (14%)
  Latest release: 6.11.0 (2017-08-10)

blacklight-gallery:
  apps without dependency: 4 (14%)
  apps with dependency: 24 (86%)
  git checkouts: 0
  local path dep: 0
  0.x (0.0.1 released 2014-02-05): 24 (100%)
    0.1.x (0.1.0 released 2014-09-05): 2 (8%)
    0.3.x (0.3.0 released 2015-03-18): 2 (8%)
    0.4.x (0.4.0 released 2015-04-10): 5 (21%)
    0.6.x (0.6.0 released 2016-07-07): 4 (17%)
    0.7.x (0.7.0 released 2017-01-24): 1 (4%)
    0.8.x (0.8.0 released 2017-02-07): 10 (42%)
  Latest release: 0.8.0 (2017-02-07)

blacklight_range_limit:
  apps without dependency: 24 (86%)
  apps with dependency: 4 (14%)
  git checkouts: 0
  local path dep: 0
  5.x (5.0.0 released 2014-02-11): 1 (25%)
    5.0.x (5.0.0 released 2014-02-11): 1 (25%)
  6.x (6.0.0 released 2016-01-26): 3 (75%)
    6.0.x (6.0.0 released 2016-01-26): 1 (25%)
    6.1.x (6.1.0 released 2017-02-17): 2 (50%)
  Latest release: 6.2.0 (2017-08-29)

blacklight_advanced_search:
  apps without dependency: 11 (39%)
  apps with dependency: 17 (61%)
  git checkouts: 0
  local path dep: 0
  2.x (2.0.0 released 2012-11-30): 2 (12%)
    2.1.x (2.1.0 released 2013-07-22): 2 (12%)
  5.x (5.0.0 released 2014-03-18): 9 (53%)
    5.1.x (5.1.0 released 2014-06-05): 7 (41%)
    5.2.x (5.2.0 released 2015-10-12): 2 (12%)
  6.x (6.0.0 released 2016-01-22): 6 (35%)
    6.0.x (6.0.0 released 2016-01-22): 1 (6%)
    6.1.x (6.1.0 released 2016-09-28): 2 (12%)
    6.2.x (6.2.0 released 2016-12-13): 3 (18%)
  Latest release: 6.3.1 (2017-06-15)

active-fedora:
  apps without dependency: 1 (4%)
  apps with dependency: 27 (96%)
  git checkouts: 1 (4%)
  local path dep: 0
  5.x (5.0.0 released 2012-11-30): 1 (4%)
    5.6.x (5.6.0 released 2013-02-02): 1 (4%)
  6.x (6.0.0 released 2013-03-28): 1 (4%)
    6.7.x (6.7.0 released 2013-10-29): 1 (4%)
  7.x (7.0.0 released 2014-03-31): 3 (11%)
    7.1.x (7.1.0 released 2014-07-18): 3 (11%)
  9.x (9.0.0 released 2015-01-30): 6 (22%)
    9.0.x (9.0.0 released 2015-01-30): 1 (4%)
    9.1.x (9.1.0 released 2015-04-16): 1 (4%)
    9.4.x (9.4.0 released 2015-09-03): 1 (4%)
    9.7.x (9.7.0 released 2015-11-30): 2 (7%)
    9.8.x (9.8.0 released 2016-02-05): 1 (4%)
  10.x (10.0.0 released 2016-06-08): 3 (11%)
    10.0.x (10.0.0 released 2016-06-08): 1 (4%)
    10.3.x (10.3.0 released 2016-11-21): 2 (7%)
  11.x (11.0.0 released 2016-09-13): 13 (48%)
    11.1.x (11.1.0 released 2017-01-13): 2 (7%)
    11.2.x (11.2.0 released 2017-05-18): 4 (15%)
    11.3.x (11.3.0 released 2017-06-13): 3 (11%)
    11.4.x (11.4.0 released 2017-06-28): 4 (15%)
  Latest release: 11.4.0 (2017-06-28)

active_fedora-noid:
  apps without dependency: 9 (32%)
  apps with dependency:
19 (68%) git checkouts: 0 local path dep: 0 0.x (0.0.1 released 2015-02-14): 1 (5%) 0.3.x (0.3.0 released 2015-07-14): 1 (5%) 1.x (1.0.1 released 2015-08-06): 3 (16%) 1.0.x (1.0.1 released 2015-08-06): 1 (5%) 1.1.x (1.1.0 released 2016-05-10): 2 (11%) 2.x (2.0.0 released 2016-11-29): 15 (79%) 2.0.x (2.0.0 released 2016-11-29): 8 (42%) 2.2.x (2.2.0 released 2017-05-25): 7 (37%) Latest release: 2.2.0 (2017-05-25) active-triples: apps without dependency: 3 (11%) apps with dependency: 25 (89%) git checkouts: 0 local path dep: 0 0.x (0.0.1 released 2014-04-29): 25 (100%) 0.2.x (0.2.0 released 2014-07-01): 3 (12%) 0.6.x (0.6.0 released 2015-01-14): 2 (8%) 0.7.x (0.7.0 released 2015-05-14): 7 (28%) 0.11.x (0.11.0 released 2016-08-25): 13 (52%) Latest release: 0.11.0 (2016-08-25) ldp: apps without dependency: 6 (21%) apps with dependency: 22 (79%) git checkouts: 0 local path dep: 0 0.x (0.0.1 released 2013-07-31): 22 (100%) 0.2.x (0.2.0 released 2014-12-11): 1 (5%) 0.3.x (0.3.0 released 2015-04-03): 1 (5%) 0.4.x (0.4.0 released 2015-09-18): 4 (18%) 0.5.x (0.5.0 released 2016-03-08): 3 (14%) 0.6.x (0.6.0 released 2016-08-11): 6 (27%) 0.7.x (0.7.0 released 2017-06-12): 7 (32%) Latest release: 0.7.0 (2017-06-12) linkeddata: apps without dependency: 10 (36%) apps with dependency: 18 (64%) git checkouts: 0 local path dep: 0 1.x (1.0.0 released 2013-01-22): 12 (67%) 1.1.x (1.1.0 released 2013-12-06): 7 (39%) 1.99.x (1.99.0 released 2015-10-31): 5 (28%) 2.x (2.0.0 released 2016-04-11): 6 (33%) 2.2.x (2.2.0 released 2017-01-23): 6 (33%) Latest release: 2.2.3 (2017-08-27) riiif: apps without dependency: 21 (75%) apps with dependency: 7 (25%) git checkouts: 0 local path dep: 0 0.x (0.0.1 released 2013-11-14): 2 (29%) 0.2.x (0.2.0 released 2015-11-10): 2 (29%) 1.x (1.0.0 released 2017-02-01): 5 (71%) 1.4.x (1.4.0 released 2017-04-11): 3 (43%) 1.5.x (1.5.0 released 2017-07-20): 2 (29%) Latest release: 1.5.1 (2017-08-01) iiif_manifest: apps without dependency: 24 (86%) apps with dependency: 4 (14%) git checkouts: 1 (25%) local path dep: 0 0.x (0.1.0 released 2016-05-13): 4 (100%) 0.1.x (0.1.0 released 2016-05-13): 2 (50%) 0.2.x (0.2.0 released 2017-05-03): 2 (50%) Latest release: 0.2.0 (2017-05-03) pul_uv_rails: apps without dependency: 26 (93%) apps with dependency: 2 (7%) git checkouts: 2 (100%) local path dep: 0 2.x ( released unreleased): 2 (100%) 2.0.x ( released unreleased): 2 (100%) No rubygems releases mirador_rails: apps without dependency: 28 (100%) apps with dependency: 0 git checkouts: 0 local path dep: 0 Latest release: 0.6.0 (2017-08-02) osullivan: apps without dependency: 27 (96%) apps with dependency: 1 (4%) git checkouts: 0 local path dep: 0 0.x (0.0.2 released 2015-01-16): 1 (100%) 0.0.x (0.0.2 released 2015-01-16): 1 (100%) Latest release: 0.0.3 (2015-01-21) bixby: apps without dependency: 26 (93%) apps with dependency: 2 (7%) git checkouts: 0 local path dep: 0 0.x (0.1.0 released 2017-03-30): 2 (100%) 0.2.x (0.2.0 released 2017-03-30): 2 (100%) Latest release: 0.2.2 (2017-08-07) orcid: apps without dependency: 27 (96%) apps with dependency: 1 (4%) git checkouts: 1 (100%) local path dep: 0 0.x (0.0.1.pre released 2014-02-21): 1 (100%) 0.9.x (0.9.0 released 2014-10-27): 1 (100%) Latest release: 0.9.1 (2014-12-09)

LITA: Jobs in Information Technology: August 30, 2017

planet code4lib - Wed, 2017-08-30 17:32

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Library of the University of California, Davis, Health Sciences Librarian, Davis, CA

Penn State University Libraries, Reference and Technology Librarian, University Park, PA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

District Dispatch: ALA tells FCC: the library community needs you to protect net neutrality

planet code4lib - Wed, 2017-08-30 14:26

Today, the American Library Association (ALA) told federal regulators that rolling back strong, enforceable net neutrality rules that keep the internet open would hurt libraries and the communities they serve. In comments to the Federal Communications Commission (FCC), ALA reiterated the fact that 120,000 libraries depend on the open internet to carry out their missions and ensure the protection of freedom of speech, educational achievement and economic growth.

Today’s comment deadline was another stop in a longer fight. In 2015, the Obama FCC adopted strong net neutrality rules that prohibit internet service providers like AT&T, Comcast and Verizon from blocking, censoring or discriminating against any online content. The rules were subsequently upheld by a federal court. In May 2017, the new Chairman of the FCC announced a plan to do away with the rules, a move which greatly concerns us, along with thousands of businesses and startups, consumer advocacy organizations and millions of consumers. We filed initial comments and, today, had the opportunity to respond to arguments raised by other commenters and raise additional issues.

ALA has been on the front lines of the net neutrality battle with the FCC, Congress and the federal courts for more than a decade, working in coalition with other library and higher education organizations as well as broader coalitions of net neutrality advocates. In addition to ALA’s comments, thousands of librarians and library staff from across the country filed comments on their own or via the ALA’s Action Alert as part of a coordinated “Internet Day of Action” on July 12. In fact, more than 1,640 alerts had been sent through the action center as of the morning of July 13, and there were more than 140,000 impressions via Twitter and nearly 85,000 via Facebook for ALA and I Love Libraries social channels.

The comments we filed today relied on the voices and stories from individual library professionals, libraries, systems and state library agencies and associations to tell regulators just how damaging efforts to roll back net neutrality could be.

So, what’s next? The FCC will proceed to draft a formal rule and vote on those rules, likely this fall. It is highly probable that some members of Congress will seek to broker a legislative solution. For ALA’s part, we will continue to advocate for strong, enforceable net neutrality protections and educate policymakers about the concerns of libraries and other institutions. We thank the ALA community for their engagement and will continue to keep you updated about opportunities to take action to support net neutrality.

The post ALA tells FCC: the library community needs you to protect net neutrality appeared first on District Dispatch.

Mark E. Phillips: Updating Metadata Interfaces: Item Views

planet code4lib - Wed, 2017-08-30 13:00

As we get started with a new school year it is good to look back on all of the work that we accomplished over the summer.

There are a few reasons that we are interested in improving our metadata entry systems. First, as we continue to add records and approach our 2 millionth item in the system, it is clear that effective management of metadata is important. We are also seeing an increase in the resources being allocated to editing metadata in our digital library systems. We are at a point where there are more people using the backend systems for non-MARC metadata than we have working with the catalog and MARC-based metadata. Because our metadata workers are spending more and more time in this system, it is important that we try to make things better so that they can complete their tasks more easily. This improves quality, reduces costs, and preserves our workers' sanity.

This blog post is just a quick summary of some of the changes that we have made around item records in the UNT Libraries Digital Collection’s Edit System.

Dashboard View

Not much has really changed with the edit dashboard other than making room for the views that I'm going to talk about later in this post. Historically, when an editor clicked on either the title or the thumbnail of a record, it would take them to the edit view for that record. In fact, that was pretty much the only public view of an item's record: the edit view.

While editing or creating a record is the primary activity you want to do in this system, there are many times when you want to look at the records in different ways.  This previously wasn’t possible without having to do a little URL hacking.

Now when you click on the thumbnail, title, or summary button, you are taken to the summary page that I will talk about next. If you click the Edit button, you are taken to the edit view.

Edit Dashboard

Record Summary View

We wanted to add a new landing page for an item in our editing system so that a user could just view a record instead of always having to jump into the edit window. There are a number of reasons for this. First, the edit view isn't the easiest place to see what is going on in a metadata record; it is designed for editing fields and does not give you a succinct view of the record. Second, it actually locks the record for a period of time when it is open. So even if you just open it and leave it alone, it will be locked in the system for about half an hour. This doesn't cause too many issues, but it isn't ideal. Finally, having only an edit view resulted in a high number of "edits" to records that really weren't edits at all; users were just hitting publish in order to clear out the record and close it. Because we version all of our metadata changes, this just adds versions that don't really represent much in the way of change between records.

We decided that we should introduce a summary view for each record.  This would allow us to provide an overview of the state of an item record as well as providing a succinct metadata view and finally a logical place to put additional links to other important record views.

The image below is the top portion of a summary view for a metadata record in the system. I will go through some of the details below.

Record Summary

The top portion of the summary view gives a large image that represents the item.  This is usually the first page of the publication, the front of a photograph or map (we scan the backs of everything), or a thumbnail view of a moving image item.  Next to that you will see a large title to easily identify the title of the item.

Below that is a quick link to edit the record in the edit view. If the item is visible to the public, you can quickly jump to the live view of the record with the “View in the Digital Library” link. Next we have a “View Item” link that takes you to a viewer that allows you to page through the object even if it isn’t online yet; this item view is used during the creation of metadata records. Finally, there is a “View History” link that takes you to an overview of the history of the item, showing when things changed and who changed them.

Below that are some quick indicators showing whether the item is public, whether it is currently unlocked and able to be edited by the user, whether the metadata has a completeness score of 1.0 (a minimally viable record), and finally whether all of the dates in the item are valid Extended Date Time Format (EDTF) dates.

This is followed by the number of unique editors that have edited the record and the username of the last editor of the record. Finally, the date the item was last edited and the date it was added to the system are shown.
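
As a side note on the date indicator mentioned above, here is a minimal sketch of what that kind of validity test looks like. It is simplified for illustration (the function name and the regular expression are mine); it only handles plain level-0 EDTF values like YYYY, YYYY-MM, and YYYY-MM-DD, while the real validation covers much more of the specification.

import re

# Simplified check for EDTF level-0 dates: YYYY, YYYY-MM, or YYYY-MM-DD.
# Full EDTF also allows intervals, seasons, and uncertain/approximate
# qualifiers, which this sketch ignores.
SIMPLE_EDTF = re.compile(r"^\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?$")

def dates_are_valid(record_dates):
    """Return True only if every date string in the record passes the check."""
    return all(SIMPLE_EDTF.match(d) for d in record_dates)

print(dates_are_valid(["1948", "1948-07", "1948-07-15"]))  # True
print(dates_are_valid(["1948", "July 1948"]))              # False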

Record Interactions

Since we do keep the version history of all of the changes to a metadata record over time we wanted to give an idea of the lifecycle of the record.  A record can go back and forth from a state of hidden to public and sometimes back to hidden.  We decided a simple timeline would be a good way to better visualize these different states of records over time.

Record Timeline

The final part of the summary view is the succinct metadata display.  This is helpful to get a quick overview of a record.  It is in a layout that is consistent across fields and records of different types.  It will all print to about a page if you need to print it out in paper format (something that you really need to be able to do from time to time).

Succinct Record Display

History View

We have had a history view for each item for a number of years but until this summer it was only available if you knew to add /history/ to the end of a URL in the edit system.  When we added the summary page we now had a logical place to place a link to this page.

The only modification we’ve done for the page is a little bit of coloring for when a change results in a size difference in the record.  Blue for growth and orange for a reduction in size. There are a few more changes that I would like us to make to the history view which we will probably work on in the fall.  The main thing I want to add is more information about what changed in the records,  which fields for instance.  That’s very helpful in trying to track down oddities in records.

Record History
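
Just to spell out the coloring rule in code, here is a trivial sketch; the function name, and the assumption that "size" means the length of the stored record, are mine for illustration only.

def history_color(old_size, new_size):
    """Pick the highlight color for a history entry based on the size change."""
    if new_size > old_size:
        return "blue"    # the record grew
    if new_size < old_size:
        return "orange"  # the record shrank
    return None          # no size change, no highlight

print(history_color(2048, 2310))  # blue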

Title Case Helper

This last set of features is pretty small but is actually a pretty big help when needed. We work with quite a bit of harvested data and metadata from different systems that we add to our digital collections. When we get these datasets, they sometimes have different conventions about when to capitalize words and when not to. We have a collection from the federal government that has all of the titles and names in uppercase. Locally we tend to recommend a fairly standard title case for things like titles, and names also tend to follow this pattern.

We added some javascript helpers to identify when a title, creator, contributor, or publisher value is in upper case and to present the user with a warning message. We actually flag instances where more than 50% of the letters in the string are capitals. The warning doesn’t keep a person from saving the record; it just gives them a visual clue that there is something they might want to change.

Title Case Warnings
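
Here is a minimal sketch of that check, written in Python for readability; the actual helpers run as JavaScript in the edit form, and the exact threshold handling there may differ a little from this illustration.

def looks_like_all_caps(value, threshold=0.5):
    """Flag a value when more than half of its letters are upper case."""
    letters = [c for c in value if c.isalpha()]
    if not letters:
        return False
    upper = sum(1 for c in letters if c.isupper())
    return upper / len(letters) > threshold

print(looks_like_all_caps("ANNUAL REPORT OF THE COMPTROLLER"))  # True
print(looks_like_all_caps("Annual Report of the Comptroller"))  # False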

After the warning we wanted to make it easier for users to convert a string from upper case to title case. If you haven’t tried to do this recently it is actually pretty time consuming and generally results in you just having to retype the value instead of changing it letter by letter. We decided that a button that could automatically convert the value into title case would save quite a bit of time.  The image below shows where this TC button is located for the title, creator, contributor, and publisher fields.

Creator Title Case Detail

Once you click the button it will change the main value to something that resembles title case.  It has some logic to deal with short words that are generally not capitalized like: and, of, or, the.

Corrected Title Case Detail

This saves quite a bit of time but isn’t perfect. If you have abbreviations in the string, they will be lost, so you sometimes have to edit things after hitting the TC button. Even so, it helps with a pretty fiddly task.
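
For the curious, here is a minimal sketch of that kind of conversion, again in Python rather than the JavaScript used in the form. The list of lowercase words is an assumption for illustration, and the second example shows how an abbreviation gets mangled, which is the caveat mentioned above.

# Short words that are generally left lowercase unless they start the string.
SMALL_WORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "or", "the", "to"}

def to_title_case(value):
    """Naive title-casing: capitalize each word except known small words."""
    words = value.lower().split()
    result = []
    for i, word in enumerate(words):
        if i > 0 and word in SMALL_WORDS:
            result.append(word)
        else:
            result.append(word.capitalize())
    return " ".join(result)

print(to_title_case("REPORT OF THE SECRETARY OF STATE"))  # Report of the Secretary of State
print(to_title_case("U.S. GEOLOGICAL SURVEY"))            # U.s. Geological Survey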

That covers many of the changes that we made this summer to the item and record views in our system. There are a few more interfaces that we added to our edit system that I will try to cover in the next week or so.

If you have questions or comments about this post,  please let me know via Twitter.

Terry Reese: MarcEdit 7: Super charging Task Processing

planet code4lib - Wed, 2017-08-30 03:21

One of the components getting a significant overhaul in MarcEdit 7 is how the application processes tasks. This work started in MarcEdit 6.3.x, when I introduced a new --experimental bit when processing tasks from the command-line. This bit shifted task processing from within the MarcEdit application to running directly against the libraries where the underlying functions for each task live. The process was marked as experimental, in part, because task processing has always been tied to the MarcEdit GUI. Essentially, this is how a task works in MarcEdit:

Essentially, when running a task, MarcEdit opens and closes the corresponding edit windows and processes the entire file, on each edit.  So, if there are 30 steps in a task, the program will read the entire file, 30 times.  This is wildly inefficient, but also represents the easiest way that tasks could be added into MarcEdit 6 based on the limitations within the current structure of the program.

In the console program, I started to experiment with accessing the underlying libraries directly – but still, maintained the structure where each task item represented a new pass through the program.  So, while the UI components were no longer being interacted with (improving performance), the program was still doing a lot of file reading and writing.

In MarcEdit 7, I re-architected how the application interacts with the underlying editing libraries and, as part of that, included the ability to process tasks at that more abstract level. The benefit of this is that now all tasks on a record can be completed in one pass. So, using the example of a 30-item task – rather than needing to open and close a file 30 times, the process now opens the file once and then processes all defined task operations on each record. The tool can do this because all task processing has been pulled out of the MarcEdit application and pushed into a task broker. This new library accepts from MarcEdit the file to process and the defined task (and associated tasks), and then facilitates task processing at a record, rather than file, level. I then modified the underlying library functions, which was actually really straightforward given how streams work in .NET.

Within MarcEdit, all data is generally read and written using the StreamReader/StreamWriter classes, unless I specifically need to access data at the binary level. In those cases, I'd use a MemoryStream. The benefit of using the StreamReader/StreamWriter classes, however, is that they are instances of the abstract TextReader and TextWriter classes. .NET also has a StringReader class that allows C# to read a string like a stream – it too is an instance of the TextReader class (and StringWriter, likewise, of TextWriter). This means that I've been able to make the following changes to the functions and re-use all the existing code while still providing processing at both a file and a record level:

string function(string sSource, string sDest, bool isFile = true) {

    // When processing a file, sDest is the output path; when processing a
    // single record, output is cleared and used as the in-memory buffer.
    System.Text.StringBuilder output = new System.Text.StringBuilder(sDest);

    System.IO.TextReader reader = null;
    System.IO.TextWriter writer = null;

    if (isFile) {
        // File-level processing: read from and write to paths on disk.
        reader = new System.IO.StreamReader(sSource);
        writer = new System.IO.StreamWriter(output.ToString(), false);
    } else {
        // Record-level processing: treat the passed string itself as the stream.
        output.Clear();
        reader = new System.IO.StringReader(sSource);
        writer = new System.IO.StringWriter(output);
    }

    //…Do Stuff

    return output.ToString();
}

As a TextReader/TextWriter, I now have access to the necessary functions needed to process both data streams like a file.  This means that I can now handle file or record level processing using the same code – as long as both data sources are in the mnemonic format.  Pretty cool.

What does this mean for users?  It means that in MarcEdit 7, tasks will be supercharged.  In testing, I’m seeing tasks that used to take 1, 2, or 3 minutes to complete now run in a matter of seconds.  So, while there are a lot of really interesting changes planned for MarcEdit 7, this enhancement feels like the one that might have the biggest impact for users, as it will represent significant time savings when you consider processing time over the course of a month or year.

Questions, let me know.

–tr

Access Conference: Register for Hackfest Workshops *Deadline Fri Sept 8*

planet code4lib - Tue, 2017-08-29 22:27

Planning on attending the Access 2017 hackfest on Wed. Sept 27th?

The hackfest is included in your conference registration fee and we hope you will be able to make it. For those new to hackfests, we understand that jumping in can be a bit intimidating (particularly if you aren’t a programmer), so we are offering a few options that will allow everyone to learn something new and use these skills to hack on a real project.

We will still have the traditional Access hackfest where attendees can form a team and work on a problem or project together, but we are also offering the following four workshops. Most of the workshops have limited space, so we are asking you to rank the hackfest options in order of preference using this online registration form by midnight, Fri Sept. 8th

We will do our best to accommodate everyone in their first choice, but if a section fills up we will take people on a first-response, first-served basis. Please take a minute to fill out our short form and let us know if you are planning to be there and what your preference is. We will let everyone who has registered know which workshops they have been placed in by Mon, Sept 11th.

Workshops:

DIY/Open Source File Scanner (Kinograph) – Curt Campbell & Donald Johnson, Provincial Archives of Saskatchewan (PAS)
Half day – pm only

The Kinograph is a prototype Maker/Open Source film scanner. It is an inexpensive hardware and software platform for the digitization of film. In this workshop, we will use the PAS Kinograph to introduce the concepts and issues of film scanning. Then we will experiment with the control and processing software for demonstration and discussion. The goal for the afternoon is to scan a film or two and come away with a ready-to-use video.  PAS will be bringing some films, but if you have a short 16mm film you’d like to scan, please bring it.

  • Expertise: Familiarity and understanding of digital imaging for still and moving images
  • Required: A laptop is recommended but not required
  • Programming experience: Experience with media/image processing tools and software development
  • Maximum registrants: 15

FOLIO – Andrew Nagy, EBSCO
Full day 

FOLIO is a library services platform – infrastructure that allows cooperating library apps to share data. This hackfest session is a hands-on introduction to FOLIO for developers. In this tutorial you will work with your own FOLIO setup through a series of exercises designed to demonstrate how to install an app on the platform and use the data sources and design elements the platform provides.

  • Expertise: Basic Mac/Windows/Linux administration
  • Required: Laptop with curriculum prerequisites (to be emailed to attendees ahead of time)
  • Programming experience: Familiarity with RESTful APIs recommended.
  • Maximum registrants: unlimited

Getting Started with Drupal 8 – John Yobb, U Sask Library
Full day

Getting started with Drupal 8 will give participants the opportunity to develop a website by installing and configuring modules. Preconfigured Drupal environments will be available to all participants and we will start the day by diving right into developing a simple module to gain an understanding of how the Drupal 8 module system works. Once their module is completed, participants will have time to build and theme a website of their choosing (Blog, Image hosting, Intranet) by downloading and configuring Drupal modules.

  • Expertise: A familiarity with the Linux command line is helpful but not required.  A basic command line reference will be available.
  • Required: Laptop with Putty or Mac terminal
  • Programming experience: None
  • Maximum registrants: 15

Raspberry Spy – Darryl Friesen, U Sask Library
Full day

Raspberry Pi devices are inexpensive mini-computers that can be used with multiple sensors and add-ons to build all manner of projects. In this session, we will explore some nefarious and non-nefarious uses for the Raspberry Pi and its external sensors. The morning will consist of an informal presentation and discussion where we will examine how the camera, PIR sensor, and beam break sensor can be used to create a gate counter, traffic monitor, or automated In-Out board. In the afternoon, participants will have the opportunity to go hands-on with the Raspberry Pi devices and sensors to create their own projects. All materials will be supplied.

  • Expertise: A familiarity with the Linux command line is helpful but not required.
  • Required: Laptop with Putty or Mac terminal
  • Programming experience: Experience with Python or similar languages is helpful but not required
  • Maximum registrants: 12

David Rosenthal: Don't own cryptocurrencies

planet code4lib - Tue, 2017-08-29 15:00
A year ago I ended a post entitled The 120K BTC Heist:
So in practice blockchains are decentralized (not), anonymous (not and not), immutable (not), secure (not), fast (not) and cheap (not). What's (not) to like?

Below the fold, I update the answer to the question with news you can use if you're a cryptocurrency owner.

Many Americans evidently believe that cryptocurrencies are anonymous enough to use bitcoin to evade taxes:
The IRS has claimed that only 802 people declared bitcoin losses or profits in 2015; clearly fewer than the actual number of people trading the cryptocurrency—especially as more investors dip into the world of cryptocurrencies, and the value of bitcoin punches past the $4,000 mark. Maybe lots of bitcoin traders didn't realize the government expects to collect tax on their digital earnings, or perhaps some thought they'd be able to get away with stockpiling bitcoin thanks to the perception that the cryptocurrency is largely anonymous.

Perhaps they should reconsider:
[the IRS] has purchased specialist software to track those using bitcoin, according to a contract obtained by The Daily Beast.

Especially, as Zeljka Zorz reports at Helpnetsecurity, if they used their bitcoin to buy something:
More and more shopping Web sites accept cryptocurrencies as a method of payment, but users should be aware that these transactions can be used to deanonymize them – even if they are using blockchain anonymity techniques such as CoinJoin.

Independent researcher Dillon Reisman and Steven Goldfeder, Harry Kalodner and Arvind Narayanan from Princeton University have demonstrated that third-party online tracking provides enough information to identify a transaction on the blockchain, link it to the user’s cookie and, ultimately, to the user’s real identity.

The paper is here. But owning bitcoins is a problem even if you don't use them to buy anything [my emphasis]:
First the hacker grabbed access to my friend’s Facebook Messenger and contacted everyone on his list that was interested in cryptocurrency, including me. ... Once it was clear that I had some bitcoin somewhere the hackers decided I was their next target.

Once you're a target the bad guys have two techniques for grabbing bitcoin from savvy owners who have enabled two-factor authentication (2FA) on their accounts using SMS, which is by far the most common 2FA technique. The first is SIM hijacking:
a hacker swapped his or her own SIM card with mine, presumably by calling T-Mobile. This, in turn, shut off network services to my phone and, moments later, allowed the hacker to change most of my Gmail passwords, my Facebook password, and text on my behalf. All of the two-factor notifications went, by default, to my phone number so I received none of them and in about two minutes I was locked out of my digital life.

This has become a routine occurrence, as Nathaniel Popper reports in Identity Thieves Hijack Cellphone Accounts to Go After Virtual Currency:
“My iPad restarted, my phone restarted and my computer restarted, and that’s when I got the cold sweat and was like, ‘O.K., this is really serious,’” said Chris Burniske, a virtual currency investor who lost control of his phone number late last year.

A wide array of people have complained about being successfully targeted by this sort of attack, including a Black Lives Matter activist and the chief technologist of the Federal Trade Commission. The commission’s own data shows that the number of so-called phone hijackings has been rising. In January 2013, there were 1,038 such incidents reported; by January 2016, that number had increased to 2,658.

But a particularly concentrated wave of attacks has hit those with the most obviously valuable online accounts: virtual currency fanatics like Mr. Burniske.

Within minutes of getting control of Mr. Burniske’s phone, his attackers had changed the password on his virtual currency wallet and drained the contents — some $150,000 at today’s values.

...

“Everybody I know in the cryptocurrency space has gotten their phone number stolen,” said Joby Weeks, a Bitcoin entrepreneur.

Mr. Weeks lost his phone number and about a million dollars’ worth of virtual currency late last year, despite having asked his mobile phone provider for additional security after his wife and parents lost control of their phone numbers.

The attackers appear to be focusing on anyone who talks on social media about owning virtual currencies or anyone who is known to invest in virtual currency companies, such as venture capitalists. And virtual currency transactions are designed to be irreversible.

The problem is that the security of your account depends on the ability of your cellphone carrier's front-line support to resist social engineering, a notoriously weak defense:
Adam Pokornicky, a managing partner at Cryptochain Capital, asked Verizon to put extra security measures on his account after he learned that an attacker had called in 13 times trying to move his number to a new phone.

But just a day later, he said, the attacker persuaded a different Verizon agent to change Mr. Pokornicky’s number without requiring the new PIN.

The second technique is abusing the SS7 signalling protocol:
A known security hole in the networking protocol used by cellphone providers around the world played a key role in a recent string of attacks that drained bank customer accounts, according to a report published Wednesday.


The unidentified attackers exploited weaknesses in Signalling System No. 7, a telephony signaling language that more than 800 telecommunications companies around the world use to ensure their networks interoperate. SS7, as the protocol is known, makes it possible for a person in one country to send text messages to someone in another country. It also allows phone calls to go uninterrupted when the caller is traveling on a train.

The same functionality can be used to eavesdrop on conversations, track geographic whereabouts, or intercept text messages. Security researchers demonstrated this dark side of SS7 last year when they stalked US Representative Ted Lieu using nothing more than his 10-digit cell phone number and access to an SS7 network.

In January, thieves exploited SS7 weaknesses to bypass two-factor authentication banks used to prevent unauthorized withdrawals from online accounts, the German-based newspaper Süddeutsche Zeitung reported. Specifically, the attackers used SS7 to redirect the text messages the banks used to send one-time passwords. Instead of being delivered to the phones of designated account holders, the text messages were diverted to numbers controlled by the attackers. The attackers then used the mTANs—short for "mobile transaction authentication numbers"—to transfer money out of the accounts.

Because the vulnerability is a basic feature of SS7 implementations, there is nothing you can do to defend against the SS7 attack except not using phones for 2FA.

So, if you own bitcoin:
  • Don't use them to buy anything.
  • Don't, especially, use them to do anything illegal.
  • Don't let anyone know that you own them.
  • Don't write anything on-line sounding even mildly enthusiastic about cryptocurrencies.
  • Don't use phone-based 2FA on any of your accounts.
  • Do report any gains and losses to the tax authorities in your country.
Have fun!




Terry Reese: MarcEdit 7 release schedule planning

planet code4lib - Tue, 2017-08-29 13:10

I’m going to put this here to help folks that need to work with IT depts when putting new software on their machines.  At this point, with the new features, the updates related to the .NET language changes, the filtering of old XP code and the updated performance code, and new installer – this will be the largest update to the application since I ported the codebase from Assembly to C#.  Just looking at this past weekend, I added close to 17,000 lines of code while completing the clustering work, and removed ~3000 lines of code doing optimization work and removing redundant information. 

In total, work on MarcEdit 7 has been ongoing since April 2017 (formally), and informally since Jan. 2017.  However, last night, I hit a milestone of sorts – I set up the new build environment for MarcEdit 7.  In fact, this morning (around 1 am), I created the first version of the new MarcEdit 7 installer that can be installed without administrator permissions.  I’ve heard again and again that the administrator requirements are one of the biggest issues for users in staying up to date.  With MarcEdit 7, the program will provide multiple installation options that should help to alleviate these problems.

Anyway, given the pace of change and my desire to have some folks put this through its paces prior to the formal release, I’ll be making multiple versions of MarcEdit 7 available for testing using the schedule below.  Please note, the Alpha and Beta dates are soft dates (they could move up or down by a few days), but the Release Date is a hard date.  Also note that, unlike previous versions of MarcEdit, MarcEdit 7 will be able to be installed alongside MarcEdit 6, so both versions can be installed on the same machine.  To simplify this process, all test builds of MarcEdit will be released requiring non-administrator access to install, as this will allow me to sandbox the software more easily.

Alpha Testing

Sept. 14, 2017 – this will be the first alpha version of MarcEdit 7.  It won’t be feature complete, but the features included should be finished and working – though I’m expecting to hear from people that some things are broken.  Really, this first version is for those waiting to get their hands on the installer and play with software that likely is a little broken.

Beta Testing:

Oct 2, 2017 – First beta build will be created.  New builds will likely be made available biweekly.

MarcEdit 7 Release Date:

Nov. 25, 2017 – MarcEdit 7.0.x release date.  The release will happen over the U.S. Thanksgiving Holiday. 

This gives users approximately 3 months to ensure that their local systems will be ready for the new update.  Remember, the system requirements are changing.  As of MarcEdit 7, the software will have the following system requirements on Windows (the Mac and Linux versions already have these requirements):

System Requirements:

  1. Operating System
    Windows 7-present (software may work on Windows Vista, but given the low install-base [smaller than Windows XP], Windows 7 will be the lowest version of Windows I’ll be officially testing on and supporting)
  2. .NET Version
    4.6.1+ – Version 4.6.1 is the minimum required version of the .NET platform.  If you have Windows 8-10, you should be fine.  If you have Windows 7, you may have to update your .NET instance (though this will happen automatically if you accept Microsoft’s updates).  If you have questions, you’ll want to contact your IT department.

That’s it.  But this does represent a very significant change for the program.  For years, I’ve been limping Windows XP support along, and MarcEdit 7 does represent a break from that platform.  I’ll be keeping the last version of MarcEdit 6.3.x available for users that run an unsupported operating system and cannot upgrade, though, I won’t be making any more changes to MarcEdit 6.3.x after MarcEdit 7 comes out. 

If you have questions, let me know.

–tr

Peter Murray: Adding Islandora Viewers Capability to Basic Image Solution Pack

planet code4lib - Tue, 2017-08-29 01:15

Putting this here because I didn’t see it mentioned elsewhere and it might be useful for others. Thinking about the history of the Islandora solution packs for different media types, the Basic Image Solution Pack was probably the first one written. Displaying a JPEG image, after all, is — well — pretty basic. I’m working on an Islandora project where I wanted to add a viewer to Basic Image objects, but I found that the solution pack code didn’t use them. Fortunately, Drupal has some nice ways for me to intercede to add that capability!

Step 1: Alter the /admin/islandora/solution_pack_config/basic_image form
The first step is to alter the solution pack admin form to add the Viewers panel. Drupal gives me a nice way to alter forms with hook_form_FORM_ID_alter().

/**
 * Implements hook_form_FORM_ID_alter().
 *
 * Add a viewers panel to the basic image solution pack admin page
 */
function islandora_ia_viewers_form_islandora_basic_image_admin_alter(&$form, &$form_state, $form_id) {
  module_load_include('inc', 'islandora', 'includes/solution_packs');
  $form += islandora_viewers_form('islandora_image_viewers', 'image/jpeg', 'islandora:sp_basic_image');
}

Step 2: Insert ourselves into the theme preprocess flow
The second step is a little trickier, and I’m not entirely sure it is legal. We’re going to set a basic image preprocess hook and in it override the contents of $variables['islandora_content']. We need to do this because that is where the viewer sets its output.

/**
 * Implements hook_preprocess_HOOK(&$variables)
 *
 * Inject ourselves into the islandora_basic_image theme preprocess flow.
 */
function islandora_ia_viewers_preprocess_islandora_basic_image(array &$variables) {
  $islandora_object = $variables['islandora_object'];
  module_load_include('inc', 'islandora', 'includes/solution_packs');
  $params = array();
  $viewer = islandora_get_viewer($params, 'islandora_image_viewers', $islandora_object);
  if ($viewer) {
    $variables['islandora_content'] = $viewer;
  }
}

I have a sneaking suspicion that the hooks are called in alphabetical order, and since islandora_ia_viewers comes after islandora_basic_image it all works out. (We need our function to be called after the Solution Pack’s preprocess function so our 'islandora_content' value is the one that is ultimately passed to the theming function.) Still, it works!

Lucidworks: Evolving the Optimal Relevancy Scoring Model at Dice.com

planet code4lib - Mon, 2017-08-28 18:28

As we count down to the annual Lucene/Solr Revolution conference in Las Vegas next month, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Dice.com’s Simon Hughes’ talk, “Evolving the Optimal Relevancy Scoring Model at Dice.com”.

A popular conference topic in recent years is using machine learned ranking (MLR) to re-rank the top results of a Solr query to improve relevancy. However, such approaches fail to first ensure that they have the optimal query configuration for their search engine, without which the re-ranked results may fail to contain the most relevant items for each query (lowering recall). Solr offers many configuration options to control how documents are ranked and scored in terms of relevancy to a user’s query, including what boosts to assign to each field, and how strongly to boost phrasal matches. It is common for companies to manually tune these parameters to optimize relevancy, but this process is highly subjective and not guaranteed to produce the optimal results. We will show a data-driven approach to relevancy tuning that uses optimization algorithms, such as evolutionary algorithms, to evolve a query configuration that optimizes the relevancy of the results returned using data captured from our query logs. We will also discuss how we experimented with evolving a custom similarity algorithm to out-perform BM25 and tf.idf similarity on our dataset. Finally, we’ll discuss the dangers of positive feedback loops when training machine learned ranking models.

 


 
Join us at Lucene/Solr Revolution 2017, the biggest open source conference dedicated to Apache Lucene/Solr on September 12-15, 2017 in Las Vegas, Nevada. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Evolving the Optimal Relevancy Scoring Model at Dice.com appeared first on Lucidworks.

David Rosenthal: Recent Comments Widget

planet code4lib - Mon, 2017-08-28 15:00
I added a "Recent Comments" widget to the sidebar of my blog. I should have done this a long time ago, sorry! The reason it is needed is that I frequently add comments to old, sometimes very old, posts as a way of tracking developments that don't warrant a whole new post.

For example, my post from last December BITAG on the IoT has accumulated 52 comments, the most recent from August 25th. That's more than one a week! I've been using it as a place to post notes about the evolving security disaster that is the IoT. I need to do a new post about the IoT but it hasn't risen to the top of the stack of draft posts yet.

One thing the widget will show is that not many of you comment on my posts. I'm really very grateful to those who do, so please take the risk of commenting. I moderate comments, so they don't show up immediately. And if I think they're off-topic or unsuitable they won't show up at all. But comments I disagree with are welcome, and can spark a useful exchange. See, for example, the discussion of inflation in the comments on Economic Model of Long-Term Storage, which clarified a point I thought was obvious but clearly wasn't. Thank you, Rick Levine!

Hat tip to Nitin Maheta,  from whose recent comments widget mine was adapted.

District Dispatch: CopyTalk must go on!

planet code4lib - Mon, 2017-08-28 14:00

We just dodged a calamity that would have forced us to cancel CopyTalk!

What was the problem? Our moderator, Patrick Newell, and the person who hits the record button (me) will not be available on the 7th. What’s the solution? We have pinch hitters! Laura Quilter, acclaimed copyright expert and counsel at the University of Massachusetts-Amherst and long-time OITP copyright education subcommittee member, will moderate. The technical genius will be ALA’s own Julianna Kloeppel, eLearning Specialist. The webinar is bound to go off without a technical glitch.

Our speaker is Eric Harbeson, music copyright expert as well as a long-time member of the copyright education subcommittee. He will present the ins and outs of music and copyright (obviously). As you know, dealing with music and copyright questions is more difficult than cataloging a sound recording. And there’s copyright legislation—The CLASSICS Act—to discuss.

Details:

September 7, 2017 at 2 p.m. (Eastern) / 11 a.m. (Pacific)

Go to ala.adobeconnect.com/copytalk and sign in as a guest. You’re in!

This free program is brought to you by OITP’s copyright education subcommittee.

The post CopyTalk must go on! appeared first on District Dispatch.

OCLC Dev Network: Mobile Development Part 1 - Hybrid App Authentication

planet code4lib - Mon, 2017-08-28 14:00

Would you like to build mobile applications against OCLC APIs? We show you how to rapidly develop a hybrid app and walk through the authentication flow. Development is efficient and inexpensive because the open source PhoneGap framework has one code base for both Android and iOS devices.

FOSS4Lib Recent Releases: VuFind - 4.0.1

planet code4lib - Mon, 2017-08-28 13:36
Package: VuFind
Release Date: Monday, August 28, 2017

Last updated August 28, 2017. Created by Demian Katz on August 28, 2017.

Minor bug fix release.

DuraSpace News: Nigeria Launches National Federated DSpace Repository

planet code4lib - Mon, 2017-08-28 00:00

By Michael Guthrie, Director, KnowledgeArc

The Nigerian Research and Education Network Federated Repository (NgREN) has launched a federated repository with assistance from KnowledgeArc. The new DSpace archive brings together the works from 42 different DSpace instances in the country, and provides a hub to showcase and highlight the intellectual output of Nigeria's universities.

DuraSpace News: VIVO Updates for August 27 — Is VIVO FAIR? Plus Camp and 1.10 testing

planet code4lib - Mon, 2017-08-28 00:00

VIVO Camp

Summer may be almost over, but that doesn't mean you've missed a chance to attend camp. VIVO Camp will be held November 9-11 in beautiful Durham, North Carolina.

DuraSpace News: VIVO Updates for Aug 13 — Fall Camp, Community Projects, Adding PubMed links, Implementation, VIVO site list

planet code4lib - Mon, 2017-08-28 00:00

Fall Camp November 9-11, Durham, NC

Planning a VIVO project? Need to learn more about VIVO – sources of data, community engagement, data representation, queries, data management? VIVO Camp is 2 1/2 days of training. Register today! https://goo.gl/ocvznb  Ideas for camp?
