Last updated May 26, 2015. Created by David Nind on May 26, 2015.
Koha 3.20 is the latest major release. It includes 5 new features, 114 enhancements and 407 bug fixes.
For more details see:
- Koha 3.20.0 - http://koha-community.org/koha-3-20-0-released/ (22 May 2015 - major six-monthly release)
Koha's release cycle:
Today I found the following resources and bookmarked them on Delicious.
- Open Hub, the open source network: Discover, Track and Compare Open Source
- Arches: Heritage Inventory & Management System. Arches is an innovative open source software system that incorporates international standards and is built to inventory and help manage all types of immovable cultural heritage. It brings together a growing worldwide community of heritage professionals and IT specialists. Arches is freely available to download, customize, and independently implement.
- Learn about Open Source from Me and Infopeople
- Online Presentations
- CIL2008: Open Source Solutions to Offer Superior Service
Tuesday May 26, 2015.
Today we had a lively half-hour free webinar presentation by Kimberly Bryant and Lake Raymond from Black Girls CODE about their latest efforts and the exciting LITA preconference they will be giving at ALA Annual in San Francisco. Here’s the link to the recording from today’s session:
For more information check out the previous LITA Blog entry:
Did you attend the webinar, or view the recording? Give us your feedback by taking the Evaluation Survey.
Then register for and attend the LITA preconference at ALA Annual. This opportunity follows up on the 2014 LITA President’s Program at ALA Annual, where then-LITA President Cindi Trainor Blyberg welcomed Kimberly Bryant, founder of Black Girls CODE.
The Black Girls CODE vision is to increase the number of women of color in the digital space by empowering girls of color ages 7 to 17 to become innovators in STEM fields, leaders in their communities, and builders of their own futures through exposure to computer science and technology.
“To bring together the records of the past and to house them in buildings where they will be preserved for the use of men and women in the future, a Nation must believe in three things.
It must believe in the past.
It must believe in the future.
It must, above all, believe in the capacity of its own people so to learn from the past that they can gain in judgement in creating their own future.”
– Franklin Roosevelt, at the dedication of his library on June 30, 1941
Earlier this month it was announced that President Barack Obama’s Presidential Library will be built on the South Side of Chicago. It will be our 14th Presidential Library.
The idea originated with FDR, who in his second term, “on the advice of noted historians and scholars, established a public repository to preserve the evidence of the Presidency for future generations.”
Then in 1955, Congress passed the Presidential Libraries Act, establishing a system of privately erected and federally maintained libraries.
Here’s a sampling of images from the Digital Public Library of America related to our presidents and their libraries. Enjoy!
In my copious spare time I have hacked together a thing I’m calling the HathiTrust Research Center Workset Browser, a (fledgling) tool for doing “distant reading” against corpora from the HathiTrust. 
The idea is to: 1) create, refine, or identify a HathiTrust Research Center workset of interest (your corpus), 2) feed the workset’s rsync file to the Browser, 3) have the Browser download, index, and analyze the corpus, and 4) enable the reader to search, browse, and interact with the result of the analysis. With varying success, I have done this with a number of worksets on topics ranging from literature and philosophy to Rome and cookery. The best working examples are the ones from Thoreau and Austen. [2, 3] The others are still buggy.
As a further example, the Browser can/will create reports describing the corpus as a whole. This analysis includes the size of a corpus measured in pages as well as words, date ranges, word frequencies, and selected items of interest based on pre-set “themes”: usage of color words, names of “great” authors, and a set of timeless ideas. This report is based on more fundamental reports such as frequency tables, a “catalog”, and lists of unique words. [5, 6, 7, 8]
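To make the idea of these reports concrete, here is a minimal sketch in Python of a frequency table, a unique-word count, and a color-word "theme" count. It is not the Browser's own code: the function names, the tokenizing rule, and the `COLOR_WORDS` list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical theme list; the Browser's actual list of color words is not shown here.
COLOR_WORDS = {"red", "green", "blue", "white", "black", "yellow", "brown", "gray"}

def tokenize(text):
    """Lowercase the text and return runs of ASCII letters as words."""
    return re.findall(r"[a-z]+", text.lower())

def describe(text):
    """Return total words, unique words, a frequency table, and color-word counts."""
    words = tokenize(text)
    freq = Counter(words)
    colors = {word: count for word, count in freq.items() if word in COLOR_WORDS}
    return {"words": len(words), "unique": len(freq), "frequencies": freq, "colors": colors}

sample = "The green pond and the blue sky; the green woods beyond the pond."
report = describe(sample)
print(report["words"], report["unique"])
print(report["frequencies"].most_common(3))
print(report["colors"])
```

A real corpus would be read from the downloaded page files rather than a string, and a better tokenizer would handle hyphenation and non-ASCII letters.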
The whole thing is written in a combination of shell and Python scripts. It should run on just about any out-of-the-box Linux or Macintosh computer. Take a look at the code. No special libraries needed. (“Famous last words.”) In its current state, it is very Unix-y. Everything is done from the command line. Lots of plain text files and the exploitation of STDIN and STDOUT. Like a Renaissance cartoon, the Browser, in its current state, is only a sketch. Only later will a more full-bodied, Web-based interface be created.
The next steps are numerous and listed in no priority order: putting the whole thing on GitHub, outputting the reports in generic formats so other things can easily read them, improving the terminal-based search interface, implementing a Web-based search interface, writing advanced programs in R that chart and graph analysis, providing a means for comparing & contrasting two or more items from a corpus, indexing the corpus with a (real) indexer such as Solr, writing a “cookbook” describing how to use the Browser to do “kewl” things, making the metadata of corpora available as Linked Data, etc.
Want to give it a try? For a limited period of time, go to the HathiTrust Research Center Portal, create (refine or identify) a collection of personal interest, use the Algorithms tool to export the collection’s rsync file, and send the file to me. I will feed the rsync file to the Browser, and then send you the URL pointing to the results. Let’s see what happens.
Fun with public domain content, text mining, and the definition of librarianship.
Links
1. HTRC Workset Browser – http://bit.ly/workset-browser
2. Thoreau – http://bit.ly/browser-thoreau
3. Austen – http://bit.ly/browser-austen
4. Thoreau report – http://ntrda.me/1LD3xds
5. Thoreau dictionary (frequency list) – http://bit.ly/thoreau-dictionary
6. usage of color words in Thoreau – http://bit.ly/thoreau-colors
7. unique words in the corpus – http://bit.ly/thoreau-unique
8. Thoreau “catalog” – http://bit.ly/thoreau-catalog
9. source code – http://ntrda.me/1Q8pPoI
10. HathiTrust Research Center – https://sharc.hathitrust.org
Economists like to say there are no bad people, just bad incentives. The incentives to publish today are corrupting the scientific literature and the media that covers it. Until those incentives change, we’ll all get fooled again.
Earlier this year I saw Tom Stoppard's play The Hard Problem at the Royal National Theatre, which deals with the same issue. The tragedy is driven by the characters being entranced by the prospect of publishing an attention-grabbing result. Below the fold, more on the problem of bad incentives in science.
Back in April, after a Wellcome Trust symposium on the reproducibility and reliability of biomedical science, Richard Horton, editor of The Lancet, wrote an editorial entitled What is medicine’s 5 sigma? that is well worth a read. His focus is also on incentives for scientists:
In their quest for telling a compelling story, scientists too often sculpt data to fit their preferred theory. Or they retrofit hypotheses to fit their data.
and journal editors:
Our acquiescence to the impact factor fuels an unhealthy competition to win a place in a select few journals. Our love of "significance" pollutes the literature with many a statistical fairy-tale. We reject important confirmations.
and Universities:
in a perpetual struggle for money and talent, endpoints that foster reductive metrics, such as high-impact publication. National assessment procedures, such as the Research Excellence Framework, incentivise bad practices.
Horton points out that:
Part of the problem is that no-one is incentivised to be right. Instead, scientists are incentivised to be productive and innovative.
He concludes:
The good news is that science is beginning to take some of its worst failings very seriously. The bad news is that nobody is ready to take the first step to clean up the system.
Six years ago Marcia Angell, the long-time editor of a competitor to The Lancet, wrote in a review of three books pointing out the corrupt incentives that drug companies provide researchers and Universities:
It is simply no longer possible to believe much of the clinical research that is published, or to rely on the judgment of trusted physicians or authoritative medical guidelines. I take no pleasure in this conclusion, which I reached slowly and reluctantly over my two decades as an editor of The New England Journal of Medicine.
In most fields, little has changed since then. Horton points to an exception:
Following several high-profile errors, the particle physics community now invests great effort into intensive checking and re-checking of data prior to publication. By filtering results through independent working groups, physicists are encouraged to criticise. Good criticism is rewarded. The goal is a reliable result, and the incentives for scientists are aligned around this goal.
Unfortunately, particle physics is an exception. The cost of finding the Higgs Boson was around $13.25B, but no-one stood to make a profit from it. A single particle physics paper can have over 5,000 authors. The resources needed for "intensive checking and re-checking of data prior to publication" are trivial by comparison. In other fields, such checking would consume a significant part of the total resources for the research, and the incentives for all actors are against devoting resources to it.
Fixing these problems of science is a collective action problem; it requires all actors to take actions that are against their immediate interests roughly simultaneously. So nothing happens, and the long-term result is, as Arthur Caplan (of the Division of Medical Ethics at NYU's Langone Medical Center) pointed out, a total loss of science's credibility:
The time for a serious, sustained international effort to halt publication pollution is now. Otherwise scientists and physicians will not have to argue about any issue—no one will believe them anyway.
(See also John Michael Greer.) I am not optimistic, based on the fact that the problem has been obvious for many years, and that this is but one aspect of society's inability to deal with long-term problems.
Metadata quality and assessment is a concept that has been around for decades in the library community. Recently it has been getting more interest as new aggregations of metadata become available in open and freely reusable ways such as the Digital Public Library of America (DPLA) and Europeana. Both of these groups make available their metadata so that others can remix and reuse the data in new ways.
I’ve had an interest in analyzing the metadata in the DPLA for a while and have spent some time working on the subject fields. This post continues along those lines, trying to figure out which metrics we can calculate from the DPLA dataset and use to define “quality”. Ideally we will be able to turn these assessments and notions of quality into concrete recommendations for how to improve metadata records in the originating repositories.
This post will focus on normalization of subject strings, and how those normalizations might be useful as a way of assessing quality of a set of records.
One of the powerful features of OpenRefine is the ability to cluster a set of data and combine these clusters into a single entry. Often times this will significantly reduce the number of values that occur in a dataset in a quick and easy manner.
OpenRefine has a number of different algorithms that can be used for this work, which are documented in their Clustering in Depth documentation. Depending on one's data, one approach may perform better than others for this kind of clustering.
Normalization
Case normalization is probably the easiest kind of normalization to understand. If you have two strings, say “Mark” and “marK”, and you convert each of them to lowercase, you end up with a single value of “mark”. Many more complicated normalizations assume this as a start because it reduces the number of subjects without drastically transforming the original string values.
Case folding is another kind of transformation that is fairly common in the world of libraries. This is the process of taking a string like “José” and converting it to “Jose”. While this can introduce issues if a string is meant to have a diacritic and that diacritic makes the word or phrase different than the one without the diacritic, often times it can help to normalize inconsistently notated versions of the same string.
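Both of these transformations can be sketched with Python's standard library. This is only an illustration of the idea, not the script used later in this post; the function names are made up, and the folding here (decompose, then drop combining marks) is one common way to strip diacritics.

```python
import unicodedata

def strip_diacritics(value):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(value):
    """Strip diacritics, then lowercase."""
    return strip_diacritics(value).lower()

print(normalize("Mark"), normalize("marK"))
print(strip_diacritics("José"))
```

Characters that do not decompose into a base letter plus a mark (for example "ø") pass through unchanged, so a production folding table would need extra mappings.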
Beyond case folding and lowercasing, libraries have been normalizing data for a long time, and there have been efforts in the past to formalize algorithms for normalizing subject strings so that they can be matched. Often referred to as the NACO normalization rules, they are formally the Authority File Comparison Rules. I’ve always found this work to be intriguing and have a preference for the work and simplified algorithm that was developed at OCLC in their NACO Normalization Service. In fact we’ve taken the sample Python implementation there and created a stand-alone repository and project called pynaco on GitHub for the code so that we could add tests and then work to port it to Python 3 in the near future.
Another common type of normalization that is performed on strings in library land is stemming. This is often done within search applications so that if you search one of the phrases run, runs, running you would get documents that contain each of these.
What I’ve been playing around with is whether we could use the reduction in unique terms for a field in a metadata repository as an indicator of quality.
Here is an example.
If we have the following set of subjects:
Musical Instruments
Musical Instruments.
Musical instrument
Musical instruments
Musical instruments,
Musical instruments.
If you applied the simplified NACO normalization from pynaco you would end up with the following strings:
musical instruments
musical instruments
musical instrument
musical instruments
musical instruments
musical instruments
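As a rough illustration of this step, the sketch below counts unique values before and after a crude normalization. The `rough_normalize` function is a stand-in written for this example (lowercase, punctuation to spaces, collapse whitespace); it is not the NACO rules and not pynaco's API, though for these six strings it happens to give the same result as the list above.

```python
import re

subjects = [
    "Musical Instruments",
    "Musical Instruments.",
    "Musical instrument",
    "Musical instruments",
    "Musical instruments,",
    "Musical instruments.",
]

def rough_normalize(value):
    """Crude stand-in, NOT the NACO rules: lowercase, punctuation to spaces, collapse whitespace."""
    lowered = value.lower()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(no_punct.split())

normalized = [rough_normalize(s) for s in subjects]
# Number of unique strings before and after normalization.
print(len(set(subjects)), "->", len(set(normalized)))
```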
If you then applied the Porter stemming algorithm to the new set of subjects you would end up with the following:
music instrument
music instrument
music instrument
music instrument
music instrument
music instrument
So in effect you have normalized the original set of six unique subjects down to one unique subject string with a NACO transformation followed by a normalization with the Porter stemming algorithm.
Experiment
In some past posts here, here, here, and here, I discussed some of the aspects of the subject fields present in Digital Public Library of America dataset. I dusted that dataset off and extracted all of the subjects from the dataset so that I could work with them by themselves.
I ended up with a set of text files, 23,858,236 lines long in total, that held the item identifier and a subject value for each subject of each item in the DPLA dataset. Here is a short snippet of what that looks like.
d8f192def7107b4975cf15e422dc7cf1 Hoggson Brothers
d8f192def7107b4975cf15e422dc7cf1 Bank buildings--United States
d8f192def7107b4975cf15e422dc7cf1 Vaults (Strong rooms)
4aea3f45d6533dc8405a4ef2ff23e324 Public works--Illinois--Chicago
4aea3f45d6533dc8405a4ef2ff23e324 City planning--Illinois--Chicago
4aea3f45d6533dc8405a4ef2ff23e324 Art, Municipal--Illinois--Chicago
63f068904de7d669ad34edb885925931 Building laws--New York (State)--New York
63f068904de7d669ad34edb885925931 Tenement houses--New York (State)--New York
1f9a312ffe872f8419619478cc1f0401 Benedictine nuns--France--Beauvais
Once I had the data in this format I could experiment with different normalizations to see what kind of effect they had on the dataset.
Total vs Unique
The first thing I did was to make the 23,858,236-line text file contain only unique values. I do this with the tried and true method of using Unix sort and uniq.
sort subjects_all.txt | uniq > subjects_uniq.txt
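For anyone who would rather stay in Python, the same de-duplication can be sketched as below. It assumes each line holds an identifier followed by whitespace and then the subject, as in the snippet earlier; the function name and the sample identifiers are hypothetical, and unlike sort and uniq this keeps every unique value in memory.

```python
def unique_subjects(lines):
    """Collect distinct subject strings from 'identifier subject' lines."""
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # Assumption: the identifier is the first whitespace-delimited token.
        parts = line.split(None, 1)
        if len(parts) == 2:
            seen.add(parts[1])
    return sorted(seen)

# Hypothetical identifiers; the subject values are taken from the snippet above.
sample = [
    "item-a Hoggson Brothers\n",
    "item-a Bank buildings--United States\n",
    "item-b Bank buildings--United States\n",
]
print(unique_subjects(sample))
```

Passing an open file object instead of a list works the same way, since the function only iterates over lines.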
After about eight minutes of waiting I ended up with a new text file, subjects_uniq.txt, that contains the unique subject strings in the dataset. There are a total of 1,871,882 unique subject strings in this file.
Case folding
Using a Python script to perform case folding on each of the unique subjects, I was able to see that it causes a reduction in the number of unique subjects.
I started out with 1,871,882 unique subjects and after case folding ended up with 1,867,129 unique subjects. That is a difference of 4,753 or a 0.25% reduction in the number of unique subjects. So nothing huge.
Lowercase
The next normalization tested was lowercasing of the values. I chose to do this on the set of subjects that were already case folded to take advantage of the previous reduction in the dataset.
By converting the subject strings to lowercase I reduced the number of unique case folded subjects from 1,867,129 to 1,849,682, which is a cumulative reduction of 22,200, or 1.2%, from the original 1,871,882 unique subjects.
NACO Normalization
Next we look at the simple NACO normalization from pynaco. I applied this to the unique lower cased subjects from the previous step.
With the NACO normalization, I end up with 1,826,523 unique subject strings from the 1,849,682 that I started with from the lowercased subjects. This is a difference of 45,359 or a 2.4% reduction from the original 1,871,882 unique subjects.
Porter stemming
Moving along, the next thing I looked at for this work was applying the Porter stemming algorithm to the output of the NACO normalized subjects from the previous step. I used the Porter implementation from the Natural Language Toolkit (NLTK) for Python.
With the Porter stemmer applied, I ended up with 1,801,114 unique subject strings from the 1,826,523 that I started with from the NACO normalized subjects. This is a difference of 70,768 or a 3.8% reduction from the original 1,871,882 unique subjects.
Fingerprint
Finally I used a Python port of the fingerprint algorithm that OpenRefine uses for its clustering feature. This will help to normalize strings like “phillips mark” and “mark phillips” into a single value of “mark phillips”. I used the output of the previous Porter stemming step as the input for this normalization.
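The general shape of a fingerprint key can be sketched as follows. This approximates the steps OpenRefine documents for its fingerprint method (trim, lowercase, fold to ASCII, drop punctuation, then sort and de-duplicate the whitespace-separated tokens); it is not the port used for the numbers in this post, and it only removes ASCII punctuation.

```python
import string
import unicodedata

def fingerprint(value):
    """Approximate an OpenRefine-style fingerprint key for a string."""
    value = value.strip().lower()
    # Fold accented characters to their base letters where they decompose.
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    # Remove ASCII punctuation (a simplification of the documented rule).
    value = value.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate and sort the tokens, then rejoin.
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

print(fingerprint("phillips mark"))
print(fingerprint("Mark Phillips."))
```

Because the tokens are sorted, word order stops mattering, which is what lets inverted and direct forms of a name collapse to the same key.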
With the fingerprint algorithm applied, I ended up with 1,766,489 unique fingerprint normalized subject strings. This is a difference of 105,393 or a 5.6% reduction from the original 1,871,882 unique subjects.
Overview
| Normalization | Reduction | Occurrences | Percent Reduction |
| --- | --- | --- | --- |
| Unique | 0 | 1,871,882 | 0% |
| Case Folded | 4,753 | 1,867,129 | 0.3% |
| Lowercase | 22,200 | 1,849,682 | 1.2% |
| NACO | 45,359 | 1,826,523 | 2.4% |
| Porter | 70,768 | 1,801,114 | 3.8% |
| Fingerprint | 105,393 | 1,766,489 | 5.6% |
Conclusion
I think that it might be interesting to apply this analysis to the various Hubs in the whole DPLA dataset to see if there is anything interesting to be seen across the various types of content providers.
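For anyone repeating this per Hub, the reduction figures are simple arithmetic against the starting count of unique subjects. The sketch below recomputes the overview numbers from the counts reported in this post; the variable and function names are just for illustration.

```python
# Unique subject counts after each cumulative step, as reported in this post.
counts = [
    ("Unique", 1871882),
    ("Case Folded", 1867129),
    ("Lowercase", 1849682),
    ("NACO", 1826523),
    ("Porter", 1801114),
    ("Fingerprint", 1766489),
]
baseline = counts[0][1]

def reduction(remaining):
    """Return (absolute reduction, percent reduction) relative to the baseline."""
    diff = baseline - remaining
    return diff, 100.0 * diff / baseline

for name, remaining in counts:
    diff, pct = reduction(remaining)
    print(f"{name:<12} {diff:>8,} {remaining:>10,} {pct:5.1f}%")
```

Swapping in the counts for a single Hub would give a comparable set of percentages for that provider.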
I’m also curious if there are other kinds of normalizations that would be logical to apply to the subjects that I’m blanking on. One that I would probably want to apply at some point is the normalization for LCSH that splits a subject into its parts if it has the double hyphen (--) in the string. I wrote about the effect on the subjects for the DPLA dataset in a previous post.
As always feel free to contact me via Twitter if you have questions or comments.
Last week I had the pleasure of presenting a short talk at the second virtual meeting of the NISO effort to reach a Consensus Framework to Support Patron Privacy in Digital Library and Information Systems. The slides from the presentation are below and on SlideShare, followed by a cleaned-up transcript of my remarks.
It looks like in the agenda that I’m batting in the clean-up role, and my message might be pithily summarized as “Can’t we all get along?” A core tenet of librarianship — perhaps dating back to the 13th and 14th century when this manuscript was illuminated — is to protect the activity trails of patrons from unwarranted and unnecessary disclosure.
This is embedded in the ethos of librarianship. As Todd pointed out in the introduction, the third principle of the American Library Association’s Code of Ethics states: “We protect each library user’s right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.” Librarians have performed this duty across time and technology, and as both have progressed the profession has sought new ways to protect the privacy of patrons.
For instance, there was once a time when books had a pocket in the back that held a card showing who had checked out the book and when it was due. Upon checkout the card was taken out, had the patron’s name embossed or written on it, and was stored in a date-sorted file so that the library knew when it was due and who had it checked out. When the book was returned, the name was scratched through before putting the card in the pocket and the book on the shelf. Sometimes, as a process shortcut, the name was left “in the clear” on the card, and anyone that picked the book off the shelf could look on the card to see who had checked it out.
When libraries automated their circulation management with barcodes and database records, the card in the back of the book and the information it disclosed was no longer necessary. This was hailed as one of the advantages to moving to a computerized circulation system. While doing away with circulation cards eliminated one sort of privacy leakage (patrons being able to see what each other had checked out), it enabled another: systematic collection of patron activity in a searchable database. Many automation systems put in features that automatically removed the link between patron and item after it was checked in. Or, if that information was stored for a period of time, it was password protected so only approved staff could view the information. Some, however, did not, and this became a concern with the passage of the USA PATRIOT Act by the United States Congress.
We are now in an age where patron activity is scattered across web server log files, search histories, and usage analytics of dozens of systems, some of which are under the direct control of the library while others are in the hands of second and third party service providers. Librarians that are trying to do their due diligence in living up to the third principle of the Code of Ethics have a more difficult time accounting for all of the places where patron activity is collected. It has also become more difficult for patrons to make informed choices about what information is collected about their library activity and how it is used.
In the mid-2000s, libraries and content providers had a similar problem: the constant one-off negotiation of license terms was a burden to all parties involved. In order to gain new efficiencies in the process of acquiring and selling licensed content, representatives from the library and publisher communities came together under a NISO umbrella to reach a shared understanding of what the terms of an agreement would be and a registry of organizations that subscribed to those terms. Quoting from the foreword of the 2012 edition: “The Shared Electronic Resource Understanding (SERU) Recommended Practice offers a mechanism that can be used as an alternative to a license agreement. The SERU statement expresses commonly shared understandings of the content provider, the subscribing institution and authorized users; the nature of the content; use of materials and inappropriate uses; privacy and confidentiality; online performance and service provision; and archiving and perpetual access. Widespread adoption of the SERU model for many electronic resource transactions offers substantial benefits both to publishers and libraries by removing the overhead of bilateral license negotiation.”
Today’s web service is filled with social sharing widgets (Facebook, Twitter, and the like), web analytics tools (Google Analytics), and content from advertising syndicates. While these tools provide useful services to the patrons, libraries and service providers, they also become centralized points of data gathering that can aggregate a user’s activity across the web. Does your library catalog page include a Facebook “Like” button? Whether or not the patron clicks on that button, Facebook knows that user has browsed to that web page and can glean details of user behavior from that. Does your service use Google Analytics to understand user behavior and demographics? Google Analytics tracks user behavior across an estimated one half of the sites on the internet. Your user’s activity as a patron of your services is commingled with their activity as a general user.
A “filter bubble” is a phrase coined by Eli Pariser to describe a system that adapts its output based on what it knows about a user: location, past searches, click activity, and other signals. The system is using these signals to deliver what it deems to be more relevant information to the user. In order to do this, the system must gather, store and analyze this information from patrons. However, a patron may not want his or her past search history to affect their search results. Or, even worse, when activity is aggregated from a shared terminal, the results can be wildly skewed.
Simply using a library-subscribed service can transmit patron activity and intention to dozens of parties, and all of it invisible to the user. To uphold that third principle in the ALA Code of Ethics, librarians need to examine the patron activity capturing practices of their information suppliers, and that can be as unwieldy as negotiating bilateral license agreements between each library and supplier. If we start from the premise that libraries, publishers and service providers want to serve the patron’s information needs while respecting their desire to do so privately, what is needed is a shared understanding of how patron activity is captured, used, and discarded. A new gathering of librarians and providers could accomplish for patron activity what they did for electronic licensing terms a decade ago. One could imagine discussions around these topics:
What Information is Collected From the Patron: When is personally identifiable information captured in the process of using the provider’s service? How is activity tagged to a particular patron, both before and after the patron identifies himself or herself? Are search histories stored? Is the patron activity encrypted, both in transit on the network and at rest on the server?
What Activity Can Be Gleaned by Other Parties: If a patron follows a link to another website, how much of the context of the patron’s activity is transferred to the new website? Are search terms included in the URL? Is personally identifiable information in the URL? Does the service provider employ social sharing tools or third party web analytics that can gather information about the patron’s activity? Such activity could include IP address (and therefore rough geolocation), content of the web page, cross-site web cookies, and so forth.
How does patron activity influence service delivery: Is relevancy ranking altered based on the past activity of the user? Can the patron modify the search history to remove unwanted entries or segregate research activities from each other?
What is the disposition of patron activity data: Is patron activity data anonymized and co-mingled with others? How is that information used and to whom is it disclosed? How long does the system keep patron activity data? Under what conditions would a provider release information to third parties?
It is arguably the responsibility of libraries to protect patron activity data from unwarranted collection and distribution. Service providers, too, want clear guidance from libraries so they can efficiently expend their efforts to develop systems that librarians feel comfortable promoting. To have each library and service provider audit this activity for each bilateral relationship would be inefficient and cumbersome. By coming to a shared understanding of how patron activity data is collected, used, and disclosed, libraries and service providers can advance their educational roles and offer tools to patrons to manage the disclosure of their activity.
I’ve been working hard on making a few changes to a couple of the MarcEdit internal components to improve the porting work. To that end, I’ve posted an update that targets improvements to the Deduping and the Merging tools.
- Update: Dedup tool — improves the handling of qualified data in the 020, 022, and 035.
- Update: Merge Records Tool — improves the handling of qualified data in the 020, 022, and 035.
Downloads can be picked up using the automated update tool or by going to: http://marcedit.reeset.net/downloads/
From Claire Knowles, Library Digital Development Manager, University of Edinburgh
Edinburgh, Scotland: We are pleased to announce that Repository Fringe returns this year on the 3rd and 4th of August 2015. The event will be held at the University of Edinburgh and coincides once again with preview week of the Edinburgh Festival Fringe.
Some time ago I promised I'd keep this space up to date on how my return to grad school was doing. Turns out I'm pretty lazy with doing that.
While working on the migration mappings for fcrepo3->fcrepo4 properties, I documented all known RELS-EXT and RELS-INT predicates in the Islandora 7.x-1.x code base. The predicates came from two namespaces: fedora and islandora.
The fedora namespace has a published ontology that we use -- relations-external -- and that can be referenced. However, the islandora namespace did not have any published ontologies associated with it.
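As an illustration of what documenting predicates by namespace looks like in practice, here is a small Python sketch that groups the predicates found in a RELS-EXT datastream. It is not the process I used on the code base. The sample document, the object identifiers, and the function name are made up for this example, and the two namespace URIs are the ones I'd expect to see in Fedora 3 and Islandora 7.x objects, so treat them as assumptions to verify.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace URIs assumed for Fedora 3 / Islandora 7.x RELS-EXT.
NAMESPACES = {
    "info:fedora/fedora-system:def/relations-external#": "fedora",
    "http://islandora.ca/ontology/relsext#": "islandora",
}
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Illustrative RELS-EXT document; the object identifiers are hypothetical.
RELS_EXT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:fedora="info:fedora/fedora-system:def/relations-external#"
    xmlns:islandora="http://islandora.ca/ontology/relsext#">
  <rdf:Description rdf:about="info:fedora/example:2">
    <fedora:isMemberOf rdf:resource="info:fedora/example:1"/>
    <islandora:isPageOf rdf:resource="info:fedora/example:1"/>
    <islandora:isSequenceNumber>1</islandora:isSequenceNumber>
  </rdf:Description>
</rdf:RDF>"""

def predicates_by_namespace(xml_text):
    """Group predicate local names by the namespace prefix they belong to."""
    root = ET.fromstring(xml_text)
    found = defaultdict(set)
    for description in root.findall("{%s}Description" % RDF_NS):
        for child in description:
            # ElementTree tags look like '{namespace-uri}localname'.
            uri, _, local = child.tag[1:].partition("}")
            found[NAMESPACES.get(uri, uri)].add(local)
    return {ns: sorted(names) for ns, names in found.items()}

print(predicates_by_namespace(RELS_EXT))
```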
That said, I have worked over the last couple of weeks with some very helpful folks on drafting an initial version of the Islandora RELS-EXT and RELS-INT ontologies, and the Islandora Roadmap Committee voted that it should be published. The published version of the RELS-EXT ontology can be viewed here, and the published version of the RELS-INT ontology can be viewed here. In addition, the ontologies were drafted in rdfs, and include a handy rdf2html.xsl to quickly create a publishable html version. This is available on GitHub.
What does this all mean?
We have now documented what we have been doing for the last number of years, and we have a referenceable version of our ontologies. In addition, this is extremely helpful for referencing and documenting predicates that will be a part of an fcrepo3-fcrepo4 migration.
The initial versions of each ontology have proposed rdfs comments, ranges, and skos *matches for a number of predicates. However, this is by no means complete, and I would love to see some community input/feedback on rdfs comments, ranges, additional skos *matches, or anything else that you think should be included in the RELS-EXT ontology.
How to provide feedback?
I'd like to have everything handled through 'issues' on the GitHub repo. If you're comfortable with forking and creating pull requests, by all means do so. If you're more comfortable with replying here, that works as well. All contributions are welcome! The key thing -- for me at least -- is to have community consensus around our understanding of these documented predicates :-)
I have not licensed the repository yet. I had planned on using the Apache 2.0 License as is done with PCDM, but I'd like your thoughts/opinions on proceeding before I make a LICENSE commit.
I hope I have covered it all. But, if you have any questions, don't hesitate to ask.
It is almost a sure bet that certain NSA programs will expire at the end of the month. The next Senate vote is set for May 31st. We will be sure to provide updates as we hear them.
It’s that time of the year again. That time when the international open data community descends on an unsuspecting city for a jam-packed week of camps, meet-ups, hacks and conference events. Next week, open data enthusiasts will be taking over Ottawa and Open Knowledge will be there in full force! As we don’t want to miss an opportunity to meet with anyone, we have put together a list of events that we will be involved in and ways to get in touch. We have also started collecting this information in a spreadsheet!
The School of Data team is arriving early for the second annual School of Data Summer Camp. Every year we strive to bring the entire School of Data community together for three intense days to plan future activities, to learn from each other, to improve our skills and ways of working and to give new School of Data fellows the opportunity to meet their future collaborators! This year’s School of Data Summer Camp will take place at the HUB Ottawa. We’ll have a meet and greet on one of the evenings for School of Data family and friends – so watch this space for details, or follow @SchoolofData on Twitter.
Wednesday is going to be a busy day as we will be spread out across three events – CKANCon, organised by the CKAN association, the Opening Parliaments Fringe Event and the Open Data Con Research Symposium, where we will be presenting new work on measuring and assessing open data initiatives and on “participatory data infrastructures”.
At the International Open Data Conference, Open Knowledge team members are co-organising or presenting at the following sessions:
- Data & Public Money – Thursday May 28th, 11:00 – 12:15
- Data & Fiscal Transparency – Thursday May 28th, 13:30 – 15:30
- Open Data Advocacy Clinic – Friday 29th, 10:30 – 12:30
- Capacity Building for All – Friday 29th, 10:30 – 12:00
- Measuring Open Data Impacts – Friday 29th, 13:30 – 15:00
- Public Interest Innovation with Open Data – Friday 29th, 13:30 – 15:00
- School of Data Day – Friday 29th, featuring a drop-in data clinic session in the morning, where you can come to us with your data projects, problems and questions, and we’ll talk them through with you, then a short data expedition in the afternoon, 13:30 – 15:00
As you can probably see, the week is going to be a busy one and we are aware that it will be difficult to schedule meetings with everyone! To accommodate, the Open Knowledge team and the entire School of Data family are organising informal drinks at The Brig Pub from 7:30 PM Thursday evening! We would love for you to come say hello in person, or you can always find Pavel (Open Knowledge’s new CEO!!!!), Zara, Milena, Jonathan, Mor, Sander, Katelyn, School of Data and of course Open Knowledge on Twitter!
Safe travels and we will see you in Ottawa!
Via Gary Price’s announcement on InfoDocket comes word of a cost-benefit analysis for the wholesale adoption of ORCID identifiers by eight institutions in the U.K. The report, Institutional ORCID, Implementation and Cost Benefit Analysis Report [PDF], looks at the perspectives of stakeholders, a summary of findings from the pilot institutions, a preliminary cost-benefit analysis, and a 10-page checklist of consideration points for higher education institutions looking to adopt ORCID identifiers in their information systems. The most interesting bits of the executive summary came from the part discussing the findings from the pilot institutions.
Perhaps surprisingly, technical issues were not the major issue for most pilot institutions. A range of technical solutions to the storage of researchers’ ORCID iDs were utilised during the pilots. … Of the eight pilot institutions, only one chose to bulk create ORCID iDs for their researchers, the others opted for the ‘facilitate’ approach to ORCID registration.
Most pilot institutions found it relatively easy to persuade senior management about the institutional benefits of ORCID but many found it difficult to articulate the benefits to individual researchers. Several commented that staff saw it as ‘another level of bureaucracy’ and it was also noted that concurrent Open Access (OA), REF and ORCID activities can make the message confused, as they overlap. … Clear and effective messages (as short and precise as possible), creating a well-defined brand for ORCID and the targeting of specific audiences and audience segments were identified as being especially important.
One thing I found surprising in the report was the lack of any mention of the usefulness of ORCID identifiers in the linked data universe. The word “linked” appeared six times in the report; five of the six mentions talk about connections between campus systems and ORCID. It would seem that some of the biggest benefits of ORCID iDs will come when they appear as the object of a subject-predicate-object triple in data published and consumed by various systems on the open web. That is, when they become part of linked open data.
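To make the subject-predicate-object idea concrete, here is a minimal sketch in Python of an ORCID iD serving as the object of a triple. It is an illustration under stated assumptions, not something taken from the report: the article URI is hypothetical, the choice of the Dublin Core "creator" predicate is mine, and the ORCID iD is the fictional sample researcher used in ORCID's own documentation. The helper names `orcid_uri` and `to_ntriple` are made up for this example.

```python
# Minimal sketch: an ORCID iD as the object of a subject-predicate-object triple.
# Assumptions: the article URI is hypothetical, and the iD below is the
# fictional sample researcher from ORCID's documentation.


def orcid_uri(orcid_id: str) -> str:
    """Build the HTTPS URI form of an ORCID iD."""
    return f"https://orcid.org/{orcid_id}"


def to_ntriple(subject: str, predicate: str, obj: str) -> str:
    """Serialize three URIs as one N-Triples statement."""
    return f"<{subject}> <{predicate}> <{obj}> ."


ARTICLE = "http://example.org/articles/123"    # hypothetical subject
CREATOR = "http://purl.org/dc/terms/creator"   # Dublin Core "creator" predicate
AUTHOR = orcid_uri("0000-0002-1825-0097")      # sample ORCID iD as the object

triple = to_ntriple(ARTICLE, CREATOR, AUTHOR)
print(triple)
```

Because the object is a resolvable URI rather than a name string, any system that consumes the triple can match the author against other data that uses the same iD.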
A book that a few of our colleagues have been working on for quite some time has now been released: Library Linked Data in the Cloud: OCLC’s Experiments with New Models of Resource Description. You can also preview it on Google Books.
OCLC Research has been working with linked data for years, and we have developed processes for mining our MARC record database into linked and linkable entities. This book reports on a lot of that work, the problems we ran into and some of the solutions we created.
The main sections are:
- Library Standards and the Semantic Web
- Modeling Library Authority Files
- Modeling and Discovering Creative Works
- Entity Identification Through Text Mining
- The Library Linked Data Cloud
There are likely few people who have had as much experience parsing library data into linked data triples as the authors of this book and their OCLC Research colleagues. Therefore, anyone seeking to create or use library linked data would do well to study this book. You can take my word for it.
About Roy Tennant
Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.
[UPDATE: IMLS HAS POSTPONED THIS WEBINAR, AND WILL ANNOUNCE A NEW DATE AND TIME IN THE COMING WEEKS]
Next week, the Institute of Museum and Library Services (IMLS) and U.S. Citizenship and Immigration Services (USCIS) will host a free webinar for public librarians on the topic of immigration and U.S. citizenship. Join in to learn more about what resources are available to assist libraries that provide immigrant and adult education services. The webinar will provide an overview of how libraries can expand these services and even acquire free materials to display.
Date: May 27, 2015
Time: 2:00 – 3:00 p.m. EDT
Click here to register
Participation in previous webinars on this topic is not required. Registration is not required, but the agencies recommend that you check your system for compatibility in advance.
This series was developed as part of a partnership between IMLS and USCIS to ensure that librarians have the necessary tools and knowledge to refer their patrons to accurate and reliable sources of information on immigration-related topics. To find out more about the partnership and the webinar series, visit the Serving New Americans page of the IMLS website or the USCIS website.
The post IMLS announces new immigration webinar for public libraries appeared first on District Dispatch.