
planet code4lib

Planet Code4Lib - http://planet.code4lib.org
Updated: 10 hours 2 min ago

Meredith Farkas: You could learn a lot from us: community college librarians at ACRL

Mon, 2015-04-13 17:29

ACRL was ridiculously amazing this year. I feel energized, affirmed, and hopeful (and completely exhausted and sick since it ended). The programming was so high-quality and relevant that, in most cases, I had at least four options in every time slot on my planner that I wanted to attend. Luckily, ACRL records all the sessions and will be putting them online in the next few weeks; there are so many I want to listen to! It’s really nice to go to a conference when you feel like you’re actually in a position to implement some of the things you’ve heard about.

I have such warm feelings for my amazing colleagues at PCC, but I also have such warm feelings for the community college library community. Everyone has been so welcoming and positive about my move. It feels like marrying into a family, but only in the good ways. I feel so very lucky. I think community college librarianship is the best kept secret in our profession and I’m just happy I got the chance to figure it out.

Secret seems to be the problem though. Two years ago, I heard a good deal of complaining on Twitter that there wasn’t much programming for or by community college librarians at ACRL 2013. This was definitely not the case this year, where there were more sessions by community college librarians than any one person could attend. What was interesting this year was that, in many of those sessions, only community college librarians attended. I went to a great session presenting projects from the Assessment in Action program that were done by community colleges, and the session was attended almost entirely by community college librarians. And yet there was so much any academic librarian interested in assessment would have gotten out of their very realistic and honest (warts and all) descriptions of their assessment projects. We had about 80 people at our talk on what it takes to build a culture of assessment (where we compared our results from community colleges to those from the first study), but only a handful of attendees were not community college librarians. I totally understand that these might have seemed to be only relevant to community college librarians, but I also think there is a tendency to believe that community college libraries have more to learn from university libraries than vice versa. Maybe that’s true when it comes to data management, scholarly communications, and home-grown technologies, but the singular focus on student success makes community colleges a fantastic source of learning about instruction, outreach, and assessment.

Portland Community College looks to Portland State University a great deal (especially in the library) for ideas and to create a consistent experience for students moving from one school to the other. But, in my time at Portland State, most of my colleagues were not interested in the community colleges that provided around 2/3 of their student body. I’ll admit I was guilty of it as well until I was contacted by a wonderful librarian at PCC (who is now my colleague) who saw a presentation Amy Hofer and I gave at the 2013 Oregon Library Association/Washington Library Association Conference and wanted to meet. Amy and I started to meet with fantastic librarians from Portland and Mount Hood Community Colleges to talk about our learning outcomes and instructional practice. During the worst summer of my professional life, they were a ray of sunshine and hope. When I first saw the brilliant work my now colleague, Pam Kessinger, did around curriculum mapping, I became ashamed of the fact that I originally thought we at PSU had more to teach than we had to learn from the community colleges. I was so wrong. Interestingly, Amy and I are now both working for community colleges.

I was so happy to see the presentation at ACRL about how Appalachian State and their local community colleges met to collaborate and discuss learning outcomes development. Similar work was done years ago in Oregon and that work blossomed into a group, ILAGO (Information Literacy Articulation Group of Oregon), that connects community college, university, and K-12 librarians around our shared information literacy and advocacy goals.

There is so much Portland State could learn from Portland Community College about how to build a culture of assessment right. Having served on the Assessment Council at PSU, I can say that there was only lip service paid at the administrative level to the importance of assessment and no real support for assessment was provided in the years I was at the University. You can’t even find learning outcomes for courses (if they exist at all), and while the departments were required to draft program-level outcomes a few years ago, they were not published anywhere on the websites of most departments (I had to ask for a copy from Institutional Research). A couple of departments were doing really great assessment work, but they were the exception rather than the norm.

The College is still on the road to building an assessment culture, but they’re doing it in such a smart way. Every department is required to do assessment, but the faculty in each department are empowered to decide what they want to assess and how to approach it. And they are given support, in the form of faculty mentors associated with the Learning Assessment Council. And the people to whom we have to report our assessment progress each year and who give us feedback on it are our peers on the Learning Assessment Council. The faculty are driving the bus. The departments I’ve seen doing assessment seem to be really focused on doing it to improve their programs for their students. It’s very inspiring. And so nice, as a new librarian learning about her liaison areas, to be able to see the course-level outcomes for every course listed prominently on the College website. I’m not saying every community college is doing amazing assessment work, but, according to our research, they seem to be doing more than BA, MA, and PhD-granting institutions.

Take a look at the results Lisa Hinchliffe and I shared at ACRL from our study of the factors that facilitate and hinder librarians in building an assessment culture, and you’ll see that community college librarians are ahead of the game in terms of assessment practices. [Sorry the formatting got a little messed up in slideshare, but the content is all there.]

Community college libraries have been scrutinized by outside entities for longer and so have had to play the accountability game longer. Their more singular focus on student success and learning encourages a focus on assessment for and about learning. And I’d argue that their long history of being resource-constrained (by and large) has led in many cases to real creativity (I think this is also sometimes helped by having leaner organizations with less bureaucracy). There’s a lot we could learn from the approaches community colleges have taken to engaging in assessment practice.

This is starting to feel like a guilt trip for university librarians, but I think community college librarians share the blame if they’re not sharing the great work they do. When you look at the literature, a lot less publishing is happening in community college libraries. Seeing how much busier I am in my current job with reference, instruction, and library-wide projects than I ever was in previous positions, I totally understand why, but I don’t think we can expect people to be interested in our work if we are not out there sharing it. Lisa and I exhorted our audience to share their assessment work. Whether you publish it in a journal, at a conference, on a listserv, or on a blog, what matters most is that you’re sharing it (though I’d love to see more people publishing in places that provide open access to their work). Librarians of every type could benefit so much from knowing the great work that goes on in community college libraries.

I also think it would be helpful to not use the term “community college library” in the title of articles and presentations, unless something is really only relevant to community college libraries. I think it may make people from other types of academic libraries think it isn’t for them. The work my incredible colleague at PCC, Sara Seely, presented in our ACRL presentation on teaching sources and source evaluation for lifelong information literacy was from a community college context, but was totally relevant to any librarian teaching information literacy. I understand the desire to have community college-specific programming, but I think having the speakers be from a community college is good enough and would expose more people to our great work. So much of what we do and struggle with is not unique to community colleges.

Let’s share the great work we do and break down the barriers between community college librarians and academic librarians in other contexts. There is so much we can learn from each other!

Image credit: 99u

Mark E. Phillips: Extended Date Time Format (EDTF) use in the DPLA: Part 1

Mon, 2015-04-13 13:00

I’ve got a new series of posts that I’ve been wanting to do for a while now, to try to get a better understanding of how the Extended Date Time Format (EDTF) is being used in cultural heritage organizations and, more specifically, whether those date formats are making their way into the Digital Public Library of America (DPLA). Before I get started with the analysis of the over eight million records in the DPLA, I wanted to give a little background on the EDTF format itself and some of the work that has happened in this area in the past.

A Bitter Harvest

One of the things that I remember about my first year as a professional librarian was starting to read some of the work that Roy Tennant and others were doing at the California Digital Library around metadata harvesting. One text that I remember specifically was his “Bitter Harvest: Problems & Suggested Solutions for OAI-PMH Data & Service Providers”, which talked about many of the issues they ran into trying to deal with dates from a variety of providers. This same challenge was also echoed by projects like the National Science Digital Library (NSDL), OAIster, and other metadata aggregations over the years.

One thing that came out of many of these aggregation projects, and something that many of us are still dealing with today, is the fact that “dates are hard”.

Extended Date Time Format

A few years ago an initiative was started to address some of the challenges that we have in representing dates for the types of things that we work with in cultural heritage organizations. This initiative was named the Extended Date Time Format (EDTF) which has a site at the Library of Congress and which resulted in a draft specification for a profile or extension to ISO 8601.

Among other things, the specification documents how to represent date concepts like the following in a machine-readable way.

Commonly Used Dates

Date Feature | Example Item | Format | Example Date
Year | Book with publication year | YYYY | 1902
Month | Monthly journal issue | YYYY-MM | 1893-05
Day | Letter | YYYY-MM-DD | 1924-03-03
Time | Born-digital photo | YYYY-MM-DDTHH:MM:SS | 2003-12-27T11:09:08
Interval | Compiled court documents | YYYY/YYYY | 1887/1889
Season | Seasonal magazine issue | YYYY-SS | 1957-23
Decade | WWII poster | YYYu | 194u
Approximate | Map “circa 1886” | YYYY~ | 1886~

Some Complex Dates

Example Item | Kind of Date | Format | Example Date
Photo taken at some point during an event August 6-9, 1992 | One of a Set | [YYYY..YYYY] | [1992-08-06..1992-08-09]
Hand-carved object, “circa 1870s” | Extended Interval (L1) | YYYY~/YYYY~ | 1870~/1879~
Envelope with a partially-legible postmark | Unspecified | “u” in place of digit(s) | 18uu-08-1u
Map possibly created in 1607 or 1630 | One of a Set, Uncertain | [YYYY, YYYY] | [1607?, 1630?]
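As a quick aside (not part of the original post), the simple formats in the first table can be recognized with a few regular expressions. The sketch below is a rough Python illustration only and comes nowhere near covering the full EDTF grammar; real checking should use a proper EDTF validator like the one mentioned later in this post.

    import re

    # A rough illustration (not the full EDTF grammar) of a few of the
    # patterns listed above: year, month, day, approximate year, decade,
    # and a year interval.
    SIMPLE_EDTF_PATTERNS = [
        r"^\d{4}$",              # 1902
        r"^\d{4}-\d{2}$",        # 1893-05
        r"^\d{4}-\d{2}-\d{2}$",  # 1924-03-03
        r"^\d{4}~$",             # 1886~  (approximate)
        r"^\d{3}u$",             # 194u   (decade, unspecified final digit)
        r"^\d{4}/\d{4}$",        # 1887/1889 (interval)
    ]

    def looks_like_simple_edtf(value):
        """Return True if the value matches one of the simplified patterns."""
        return any(re.match(pattern, value) for pattern in SIMPLE_EDTF_PATTERNS)

    for example in ["1902", "1893-05", "194u", "1886~", "1887/1889", "circa 1886"]:
        print(example, looks_like_simple_edtf(example))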

The UNT Libraries made an effort to adopt the EDTF for its digital collections in 2012 and started the long process of identifying and adjusting dates that did not conform with the EDTF specification (sadly we still aren’t finished).

Hannah Tarver and I authored a paper titled “Lessons Learned in Implementing the Extended Date/Time Format in a Large Digital Library” for the 2013 Dublin Core Metadata Initiative (DCMI) Conference that discussed our findings after analyzing the 460,000 metadata records in the UNT Libraries’ Digital Collections at the time.  As a followup to that presentation Hannah created a wonderful cheatsheet to help people with the EDTF for many of the standard dates we encounter in describing items in our digital libraries.

EDTF use in the DPLA

When the DPLA introduced its Metadata Application Profile (MAP) I noticed that there was mention of the EDTF as one of the ways that dates could be expressed.  In the 3.1 profile it was mentioned in both the dpla:SourceResource.date property syntax schema as well as the edm:TimeSpan class for all of its properties.  In the 4.0 profile this changed a bit: EDTF was removed as a syntax schema for the dpla:SourceResource.date property and from the edm:TimeSpan “Original Source Date”, but kept for the edm:TimeSpan “Begin” and “End” properties.

Because of this mention, and the knowledge that the Portal to Texas History, which is a Service-Hub, is contributing records using the EDTF, I had the following questions in mind when I started the analysis presented in this post and a few that will follow.

  • How many date values in the DPLA are valid EDTF values?
  • How are these valid EDTF values distributed across the Hubs?
  • What feature levels (Level 0, Level 1, and Level 2) are used by various Hubs?
  • What are the most common date format patterns used in the DPLA?

With these questions in mind I started the analysis.

Preparing the Dataset

I used the same dataset that I had created for some previous work, which consisted of the DPLA Bulk Metadata Download from February 2015. This download contained 8,xxx,xxx records that I used for the analysis.

I used the UNT Libraries’ ExtendedDateTimeFormat validation module (available on GitHub) to classify each date present in each record as either valid EDTF or not valid.  Additionally I tested which level of EDTF each value conformed to.  Finally I identified the pattern of each of the date fields by converting all digits in the string to 0, converting all alpha characters to x, and leaving all non-alphanumeric characters unchanged.

This resulted in the following fields being indexed for each date:

Field | Value
date | 2014-04-04
date_valid_edtf | true
date_level0_feature | true
date_level1_feature | false
date_level2_feature | false
date_pattern | 0000-00-00
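For readers who want to see roughly how such fields could be derived, here is a minimal Python sketch of the pattern normalization described above (digits to 0, letters to x, other characters left alone). The commented-out import and is_valid() call stand in for the UNT validation module; those names are assumptions about its API, not a documented interface.

    import re

    # The validity/level checks in the post use the UNT Libraries'
    # ExtendedDateTimeFormat validation module; the import and function
    # name below are assumptions about that module's API, shown only to
    # illustrate where such a call would fit.
    # from edtf_validate.valid_edtf import is_valid

    def date_pattern(value):
        """Collapse a date string to its shape: digits -> 0, letters -> x,
        everything else left as-is (e.g. '2014-04-04' -> '0000-00-00')."""
        pattern = re.sub(r"[0-9]", "0", value)
        pattern = re.sub(r"[A-Za-z]", "x", pattern)
        return pattern

    def index_fields(value):
        """Build a dict shaped like the fields listed above for one date value."""
        return {
            "date": value,
            # "date_valid_edtf": is_valid(value),  # assumed validator call
            "date_pattern": date_pattern(value),
        }

    print(index_fields("2014-04-04"))  # {'date': '2014-04-04', 'date_pattern': '0000-00-00'}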

For those interested in trying some EDTF dates you can check out the EDTF Validation Service offered by the UNT Libraries.

After several hours of indexing these values into Solr,  I was able to start answering some of the questions mentioned above.
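As an illustration only, a Solr facet request of the following shape is one way to produce the kinds of counts reported below. The core URL, and the idea of one Solr document per DPLA record, are assumptions made for this sketch; the field names match the fields listed above.

    import requests

    # Hypothetical Solr core holding one document per DPLA record, with the
    # date fields described above. Faceting on date_valid_edtf gives the
    # valid/not-valid split without fetching any documents (rows=0).
    SOLR_SELECT = "http://localhost:8983/solr/dpla_dates/select"

    params = {
        "q": "date:[* TO *]",   # only records that have a date value
        "rows": 0,
        "wt": "json",
        "facet": "true",
        "facet.field": "date_valid_edtf",
    }

    response = requests.get(SOLR_SELECT, params=params)
    counts = response.json()["facet_counts"]["facet_fields"]["date_valid_edtf"]
    # Solr returns facets as a flat [value, count, value, count, ...] list.
    print(dict(zip(counts[::2], counts[1::2])))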

Date usage in the DPLA

The first thing that I looked at was how many of the records in the DPLA dataset had dates vs the records that were missing dates.  Of the 8,012,390 items in my copy of the DPLA dataset,  6,624,767 (83%) had a date value with 1,387,623 (17%) missing any date information.

I was impressed with the number of records that have dates in the DPLA dataset as a whole and was then curious about how that mapped to the various Hubs.

Hub Name | Items | Items With Date | Items With Date % | Items Missing Date | Items Missing Date %
ARTstor | 56,342 | 49,908 | 88.6% | 6,434 | 11.4%
Biodiversity Heritage Library | 138,288 | 29,000 | 21.0% | 109,288 | 79.0%
David Rumsey | 48,132 | 48,132 | 100.0% | 0 | 0.0%
Digital Commonwealth | 124,804 | 118,672 | 95.1% | 6,132 | 4.9%
Digital Library of Georgia | 259,640 | 236,961 | 91.3% | 22,679 | 8.7%
Harvard Library | 10,568 | 6,957 | 65.8% | 3,611 | 34.2%
HathiTrust | 1,915,159 | 1,881,588 | 98.2% | 33,571 | 1.8%
Internet Archive | 208,953 | 194,454 | 93.1% | 14,499 | 6.9%
J. Paul Getty Trust | 92,681 | 92,494 | 99.8% | 187 | 0.2%
Kentucky Digital Library | 127,755 | 87,061 | 68.1% | 40,694 | 31.9%
Minnesota Digital Library | 40,533 | 39,708 | 98.0% | 825 | 2.0%
Missouri Hub | 41,557 | 34,742 | 83.6% | 6,815 | 16.4%
Mountain West Digital Library | 867,538 | 634,571 | 73.1% | 232,967 | 26.9%
National Archives and Records Administration | 700,952 | 553,348 | 78.9% | 147,604 | 21.1%
North Carolina Digital Heritage Center | 260,709 | 214,134 | 82.1% | 46,575 | 17.9%
Smithsonian Institution | 897,196 | 675,648 | 75.3% | 221,548 | 24.7%
South Carolina Digital Library | 76,001 | 52,328 | 68.9% | 23,673 | 31.1%
The New York Public Library | 1,169,576 | 791,912 | 67.7% | 377,664 | 32.3%
The Portal to Texas History | 477,639 | 424,342 | 88.8% | 53,297 | 11.2%
United States Government Printing Office (GPO) | 148,715 | 148,548 | 99.9% | 167 | 0.1%
University of Illinois at Urbana-Champaign | 18,103 | 14,273 | 78.8% | 3,830 | 21.2%
University of Southern California. Libraries | 301,325 | 269,880 | 89.6% | 31,445 | 10.4%
University of Virginia Library | 30,188 | 26,072 | 86.4% | 4,116 | 13.6%

Presence of Dates by Hub Name

I was surprised by the high percentage of records with dates for many of the Hubs in the DPLA; the only Hub that had more records without dates than with dates was the Biodiversity Heritage Library. There were some Hubs, notably David Rumsey, HathiTrust, the J. Paul Getty Trust, and the Government Printing Office, that have dates for more than 98% of their items in the DPLA. This is most likely because of the kinds of data they are providing, or because dates are required to identify which items can be shared (HathiTrust).

When you look at Content-Hubs vs Service-Hubs you see the following.

Hub Type | Items | Items With Date | Items With Date % | Items Missing Date | Items Missing Date %
Content-Hub | 5,736,178 | 4,782,214 | 83.4% | 953,964 | 16.6%
Service-Hub | 2,276,176 | 1,842,519 | 80.9% | 433,657 | 19.1%

It looks like things are pretty evenly matched between the two types of Hubs when it comes to the presence of dates in their records.

Valid EDTF Dates

I took a look at the 6,624,767 items that had date values present in order to see if their dates were valid based on the EDTF specification.  It turns out that 3,382,825 (51%) of these values are valid and 3,241,811 (49%) are not valid EDTF date strings.

EDTF Valid vs Not Valid

So the split is pretty close.

One of the things that should be mentioned is that there are many common date formats that we use throughout our work that are valid EDTF date strings but may not have been created with the idea of supporting EDTF. For example, 1999 and 2000-04-03 are both valid (and heavily used) date patterns that are normal to run across in our collections. Many of the “valid EDTF” dates in the DPLA fall into this category.

In the next posts I want to take a look at how EDTF dates are distributed across the different Hubs and at some of the EDTF features used by Hubs in the DPLA.

As always feel free to contact me via Twitter if you have questions or comments.

ACRL TechConnect: Educating Your Campus about Predatory Publishers

Mon, 2015-04-13 13:00

The recent publication of Monica Berger and Jill Cirasella’s piece in College and Research Libraries News, “Beyond Beall’s List: Better understanding predatory publishers”, is a reminder that the issue of “predatory publishers” continues to require focus for those working in scholarly communication. Berger and Cirasella have done an exemplary job of laying out some of the issues with Beall’s list, and have called on librarians to be able “to describe the beast, its implications, and its limitations—neither understating nor overstating its size and danger.”

At my institution academic deans have identified “predatory” journals as an area of concern, and I am sure similar conversations are happening at other institutions. Here’s how I’ve “described the beast” at my institution, and models for services we all can provide, whether subject librarian or scholarly communication librarian.

What is a Predatory Publisher? And Why Does the Dean Care?

The concept of predatory publishers became much more widely known in 2013 with a publication of an open access sting by John Bohannon in Science, which I covered in this post. As a recap, Bohannon created a fake but initially believable poor quality scientific article, and submitted it to open access journals. He found that the majority of journals accepted the poor quality paper, 45% of which were included in the Directory of Open Access Journals. At the time of publication in October 2013 the response to this article was explosive in the scholarly communications world. It seems that more than a year later the reaction continues to spread. Late in the fall semester of 2014, library administration asked me to prepare a guide about predatory publishers, due to concern among the deans that unscrupulous publishers might be taking advantage of faculty. This was a topic I’d been educating faculty about on an ad hoc basis for years, but I never realized we needed to address it more systematically. That all has changed, with senior library administration now doing regular presentations about predatory publishers to faculty.

If we are to be advocates of open access, we need to focus on the positive impact that open access has rather than dwell for too long on the bad sides of it. We also need faculty to be clear on their own goals for making their work open access so that they may make more informed choices. Librarians have limited faculty bandwidth on the topic, and so focusing on education about self-archiving articles (otherwise known as green open access) or choosing no-fee (also known as gold) open access journals is a better way to achieve advocacy goals than suggesting faculty choose only a certain set of gold open access journals. Unless we are offering money for paying article fees, we also don’t have much say about where faculty choose to publish. Education about how to choose a journal and a license responsibly is what we should focus on, even if it diverges from certain ideals (see Meredith Farkas on choosing creative commons licenses.)

Understanding the Needs and Preparing the Material

As I mentioned, my library administration asked for a guide that they could use in presentations and share with faculty. In preparing this guide, I worked with our library’s Scholarly Communications committee (of which I am co-chair) to determine the format and content.

We decided that adding this material to our existing Open Access research guide would be the best move, since it was already up and the URL was already widely shared. We have a robust series of Open Access Week events (which I wrote about last fall) and this seemed the ideal place to continue engaging people. That said, we determined that the guide needed an overhaul to make it clearer that open access is an ongoing area of concern, not a once-a-year event. Since faculty are not always immediately thinking of making work open access but of the mechanics of publishing, I preferred to start with the title “Publishing Your Own Work”.

To describe its features a bit more: I wanted to start from the mindset of self-archiving work to make it open access, with a description of our repository and Peter Suber’s useful guide to making one’s own work open access. I then continued with an explanation of article publication fees, since I often get questions along those lines. They are not unique to open access journals, and paying a fee does not buy acceptance for publication, which was a fear that I heard more than once during Open Access Week last year. Only then did I discuss the concept of predatory journals, with the hope that a basic understanding of the process would allay fears. I then present a list of steps for researching a journal. I thought these steps were more common sense than anything, but after conversations with faculty and administration, I realized that my intuition about what type of journal I am dealing with seems obvious only because I have daily practice and experience. For people new to the topic I tried to break the research down into easy steps that help them figure out where a journal sits on the continuum from outright scams to legitimate but new or unusual journals. It was also important to me to emphasize self-archiving as a strategy no matter the journal publication model.

Lastly, while most academic libraries have a model of liaison librarians engaging in scholarly communications activities, the person who spends every day working on these issues is likely to be more versed in emerging trends. So it is important to work with liaisons to help them research journals and to identify quality open access journals in their disciplines. We plan to add this information to the guide in a future version.

Taking it on the Road

We felt that in-person instruction on these matters with faculty was a crucial next step, particularly for people who publish in traditional journals but want to make their work available. Traditional journals’ copyright transfer agreements can be predatory, even if we don’t think about it in those terms. Taking inspiration from the ACRL Scholarly Communications Roadshow I attended a few years ago, I decided to take the curriculum from that program and offer it to faculty and graduate students. We read through three publication agreements as a group, and then discussed how open the publishers were to reuse of material, or whether they mentioned it at all. We then included a section on addenda to contracts for negotiation about additional rights.

The first workshop received modest attendance but included some thoughtful conversations, and we have promised to run it again. Some people may never have read their agreements closely, and never realized that, for instance, sharing an article they wrote with their students might be illegal or at least not specifically allowed by their agreement. That concrete realization is more likely to spur action than more abstract arguments about the benefits of open access.

Escaping the Predator Metaphor

If I could go back, I would get rid of the concept of “predator” attached to open access journals. Let’s call it instead unscrupulous entrants into an emerging business model. That’s not as catchy, but it explains why this has happened. I would argue, personally, that the hybrid gold journals by large publishers are just as predatory, as they capitalize on funding requirements to make articles open access with high fees. They too are trying new business models, and those may not be tenable either. As I said above, choosing a journal with eyes wide open and understanding all the ramifications of different publication models is the only way forward. To suggest that faculty are innocently waiting to be pounced on by predators is to deny their agency and their ability to make choices about their own work. There may be days where that metaphor seems apt, but I think overall this is a damaging mentality to librarians interested in promoting new models of scholarly communication. I hope we can provide better resources and programming to escape this, as well as to help administration to understand how to choose to fund open access initiatives.

In the comments I’d like to hear more suggestions about how to escape the “predator” metaphor, as well as your own techniques for educating faculty on your campus.

Islandora: Fedora 4 Project Update III

Mon, 2015-04-13 12:41

The Fedora 4 upgration is coming into its third month with a big focus on migration. Notes from the last project meeting are available here. Some highlights:

Jared Whiklo, Web Application Developer at University of Manitoba, has joined the project team and is working with Danny and Nick on code tasks. 

Recent work has focused on a couple of areas. The first was collaborating with Mike Durbin (University of Virginia) on fcrepo4-labs/migration-utils, which will support an upgration in this order:

  • traversing the objectStore file system
  • archive export format
  • migrate export format

In order to provide a large set of test fixtures for use with this tool, Nick updated YUDL’s Fedora to 3.8.1-SNAPSHOT.

The second focus was data modelling. Specifically, mapping fcrepo3 object properties to fcrepo4 container properties, mapping fcrepo3 datastream properties to fcrepo4 file properties, mapping RELS-EXT properties, identifying standard audit trail events, and working towards bringing Islandora into compliance with the Portland Common Data Model. This work was shared with the community via a Large Image Solution Pack example object modelled in Fedora 4.

Related to the migration, work has also been done around Audit Service design in Fedora 4. Nick participated in all of the Audit Service design meetings, and led a discussion around PROV and PREMIS ontology usage in the service. A code sprint led by Esme Coles and devoted to the Audit Service began on March 30th. That work is outlined here.

The migration work will most likely continue through April. If you want to attend future meetings and keep an eye on development, please join the Islandora Fedora 4 Interest Group. Your ideas and use cases are also very welcome as issues in Islandora Labs. For anyone planning to attend the Open Repositories conference in Indianapolis this June, the upgration team will be giving a presentation on the project called Islandora and Fedora 4: The Atonement.

Nicole Engard: Bookmarks for April 12, 2015

Sun, 2015-04-12 20:30

Today I found the following resources and bookmarked them on Delicious.

Digest powered by RSS Digest

The post Bookmarks for April 12, 2015 appeared first on What I Learned Today....

Related posts:

  1. Social Networking on your Desktop
  2. Google Search for Macs
  3. Facebook Chat

Eric Lease Morgan: Marrying close and distant reading: A THATCamp project

Sun, 2015-04-12 16:47

The purpose of this page is to explore and demonstrate some of the possibilities of marrying close and distant reading. By combining both of these processes, there is hope that greater comprehension and understanding of a corpus can be gained than by using close or distant reading alone. (This text might also be republished at http://dh.crc.nd.edu/sandbox/thatcamp-2015/ as well as http://nd2015.thatcamp.org/2015/04/07/close-and-distant/.)

To give this exploration a go, two texts are being used to form a corpus: 1) Machiavelli’s The Prince and 2) Emerson’s Representative Men. Both texts were printed and bound into a single book (codex). The book is intended to be read in the traditional manner, and the layout includes extra wide margins allowing the reader to liberally write/draw in the margins. As the glue is drying on the book, the plain text versions of the texts were evaluated using a number of rudimentary text mining techniques, with the results made available here. Both the traditional reading as well as the text mining are aimed towards answering a few questions. How do both Machiavelli and Emerson define a “great” man? What characteristics do “great” men have? What sorts of things have “great” men accomplished?

Comparison

Feature | The Prince | Representative Men
Author | Niccolò di Bernardo dei Machiavelli (1469 – 1527) | Ralph Waldo Emerson (1803 – 1882)
Title | The Prince | Representative Men
Date | 1532 | 1850
Fulltext | plain text, HTML, PDF, TEI/XML | plain text, HTML, PDF, TEI/XML
Length | 31,179 words | 59,600 words
Fog score | 23.1 | 14.6
Flesch score | 33.5 | 52.9
Kincaid score | 19.7 | 11.5
Frequencies | unigrams, bigrams, trigrams, quadgrams, quintgrams | unigrams, bigrams, trigrams, quadgrams, quintgrams
Parts-of-speech | nouns, pronouns, adjectives, verbs, adverbs | nouns, pronouns, adjectives, verbs, adverbs
Search | Search for “man or men” in The Prince | Search for “man or men” in Representative Men

Observations

I consider this project to be a qualified success.

First, I was able to print and bind my book, and while the glue is still drying, I’m confident the final results will be more than usable. The real tests of the bound book are to see if: 1) I actually read it, 2) I annotate it using my personal method, and 3) I am able to identify answers to my research questions, above.


[Images: bookmaking tools; almost done]

Second, the text mining services turned out to be more of a compare & contrast methodology as opposed to a question-answering process. For example, I can see that one book was written hundreds of years before the other. The second book is almost twice as long as the first. Readability score-wise, Machiavelli is almost certainly written for the more educated and Emerson is easier to read. The frequencies and parts-of-speech are enumerative, but not necessarily illustrative. There are a number of ways the frequencies and parts-of-speech could be improved. For example, just about everything could be visualized as histograms or word clouds. The verbs ought to be lemmatized. The frequencies ought to be depicted as ratios relative to the length of each text. Other measures could be created as well. For example, my Great Books Coefficient could be employed.

How do Emerson and Machiavelli define a “great” man? Hmmm… Well, I’m not sure. It is relatively easy to get “definitions” of men in both books (The Prince or Representative Men). And network diagrams illustrating what words are used “in the same breath” as the word man in both works are not very dissimilar:


[Network diagrams: “man” in The Prince; “man” in Representative Men]

I think I’m going to have to read the books to find the answer. Really.

Code

Bunches o’ code was written to produce the reports:

  • concordance.cgi – the simple search engine
  • fathom.pl – used to compute the readability scores
  • file2pos.py – create a parts-of-speech file for later use
  • network.cgi – used to display words used “in the same breath” with a given word
  • ngrams.pl – compute ngrams
  • pos.py – count and tabulate parts-of-speech from a previously created file

You can download this entire project — code and all — from http://dh.crc.nd.edu/sandbox/thatcamp-2015/reports/thatcamp-2015.tar.gz or http://infomotions.com/blog/wp-content/uploads/2015/04/thatcamp-2015.tar.gz.
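As a rough stand-in for what ngrams.pl does (this sketch is mine, not part of the project code), unigram and bigram counting can be done in a few lines of Python. The file name is hypothetical; any of the plain text versions linked above would work.

    import re
    from collections import Counter

    def ngrams(tokens, n):
        """Return a Counter of n-grams (as tuples) from a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    # Hypothetical file name; any plain text version of the corpus would do.
    with open("the-prince.txt", encoding="utf-8") as handle:
        text = handle.read().lower()

    tokens = re.findall(r"[a-z']+", text)     # crude tokenization
    print(ngrams(tokens, 1).most_common(10))  # top unigrams
    print(ngrams(tokens, 2).most_common(10))  # top bigrams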

Patrick Hochstenbach: Cat studies

Sun, 2015-04-12 16:27
Filed under: Doodles Tagged: brushpen, cat, doodle

Terry Reese: Building a better MarcEdit for Mac users

Sun, 2015-04-12 04:40

This all started with a conversation over twitter (https://twitter.com/_whitni/status/583603374320410626) about a week ago.  A discussion about why the current version of MarcEdit is so fragile when being run on a Mac.  The short answer has been that MarcEdit utilizes a cross platform toolset when building the UI which works well on Linux and Windows systems, but tends to be less refined on Mac systems.  I’ve known this for a while, but to really do it right, I’d need to develop a version of MarcEdit that uses native Mac APIs, which would mean building a new version of MarcEdit for the Mac (at least, the UI components).  And I’ve considered it – mapped out a road map – but what’s constantly stopped me has been a lack of interest from the MarcEdit community and a lack of a Mac system.  On the community side, I can count on two hands the number of times I’ve had someone request a version of MarcEdit specifically for a Mac.  And since I’ve been making a Mac App version of MarcEdit available, its use has been fairly low (though this could be due to the struggles noted above).  With an active community of over 20,000, I try to put my time where it will make the most impact, and up until a week ago, better support for Mac systems didn’t seem to be high on the list.  The second reason is I don’t own a Mac.  My technology stack is made up of about a dozen Windows and Linux systems embedded around my house because they play surprisingly well together, whereas Apple’s walled garden just doesn’t thrive within my ecosystem.  So, I’ve been waiting and hoping that the cross-platform toolset would get better and that in time, this problem would eventually go away.

I’m giving that background because apparently I’ve been misreading the MarcEdit community.  As I said, this all started with this conversation on twitter (https://twitter.com/_whitni/status/583603374320410626) – and out of that, two co-conspirators, Whitni Watkins and Francis Kayiwa, set out to see just how much interest there actually was in having a dedicated version of MarcEdit for the Mac.  The two set out to see if they could raise funds to acquire a Mac to do this development and, indirectly, demonstrate that there was actually a much larger slice of the community interested in seeing this work done.  And, so, off they went – and I sat back and watched.  I made a conscious decision that if this was going to happen, it was going to be because the community wanted it and in that, my voice wasn’t necessary.  And after 8 days, it’s done.  In all, 40 individuals contributed to the campaign, but more importantly to me, I heard directly from around 200+ individuals that were hopeful that this project would proceed. 

Development Roadmap

Now the hard work starts.  MarcEdit is a program that has been under constant development since 1999 – so even just rewriting the UI components of the application will be a significant undertaking.  So, I’m breaking up this work into chunks.  I figure it would take approximately 8-12 months to completely port the UI, which is a long time.  Too long… so I’m breaking the development into 3-month “sprints”.  The first sprint will target the 80%, the functionality that would make MarcEdit productive when doing MARC editing.  This means porting the functionality for all the resources found in the MARC Tools and much of the functionality found in the MarcEditor components.  My guess is these two components are the most important functional areas for catalogers – so finishing those would allow the tool to be immediately useful for doing production cataloging and editing.  After that – I’ll be able to evaluate the remainder of the program and begin working on functional parity between all versions of the application. 

But I’ll admit, at this point, the road map is still somewhat cloudy even to me.  See, I’ve written up the following document (http://1drv.ms/1ake4gO) and shared it with Whitni and asked her to work with other Mac users to refine the list and let me know what falls into that 80%.  So, I’ll be interested to see where their list differs from my own.  In the meantime, I’ll be starting work on the port – creating wireframes and spending time over the next week hitting the books and familiarizing myself with Apple’s API docs and the UI best practices (though, I will be trying to keep the program looking very familiar to the current application – best practices be damned).  Coding on the new UI will start in earnest around May 1 – and by August 1, 2015, I hope to have the first version built specifically for a Mac available.  For those interested in following the development process – I’ll be creating a build page on the MarcEdit website (http://marcedit.reeset.net) and will be posting regular builds as new areas of the application are ported so that folks can try them and give feedback. 

So, that’s where this stands at this point.  For those interested in providing feedback, feel free to contact me directly at reeset@gmail.com.  And for those of you that reached out or participated in the campaign to make this happen, my sincere thanks. 

–TR

Open Library Data Additions: Amazon Crawl: part bu

Sun, 2015-04-12 03:20

Part bu of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

John Miedema: The cognitive computing features of Lila. Candidate technologies, limitations, and future plans.

Sat, 2015-04-11 14:47

Cognitive computing extends the range of knowledge tasks that can be performed by computers and humans. In the previous post I summarized the characteristics of a cognitive system. This post maps the characteristics to Lila features, along with candidate technology to deliver them. Limitations and future plans are also listed.

1. Life-world data

Lila Features: Lila operates on unstructured data from multiple sources. Unstructured data includes author notes, digital articles and books. Data is collected from many sources, including smart phone notes, email, web pages, documents, PDFs. Lila operates on rapidly changing data, as is expected when writing a work. Lila’s functions can be re-calculated on demand. Data volume is expected to be the size of an average non-fiction work (about 100,000 words), up to 1000 full length articles, and about 100 full length books.

Candidate Technology: There are existing tools for gathering content from different sources. Evernote, for example, is a candidate technology for a first version of Lila. Lila’s cognitive functions can operate on data exported from Evernote.

Limitations and Future Plans: English only. Digital text only. Text must be text analyzable, i.e., no locked formats. Table content can be analyzed, but no table look-up operations. Image analysis is limited to associated text labels.

2. Natural questions

Lila Features: Lila analyzes author notes, treating them as questions to be asked of other notes and unread articles and books. The following features combine to build meaningful queries on the content.

  • The finite size of the note itself helps capture the author’s meaning.
  • Lila uses author-suggested categories, tags and markup to understand what the author considers important.
  • Lila develops a model of the author’s work, used to better understand the author’s intent.

Candidate Technology: New Lila technology will be built. This technology will be used to create more meaningful structured queries. Structured queries will be performed using existing technology, Apache Solr.

Limitations and Future Plans: Questions are constructed implicitly from author notes, not from a voice or text question box. No direct dialog interface is provided, but see 6 & 7.

3. Reading and understanding

Lila Features: Lila uses natural language processing (NLP) to read author notes and unread content. Language dictionaries provide an understanding of synonyms and parts-of-speech. This knowledge of language is an advance over simple keyword matching. Entity identification is performed automatically using machine learning. Identification includes person names, organizations and locations. Lila can be extended to include custom entity identification models. Lila uses additional input from the author to build a model of the author’s work. This model is used to better understand the author’s meaning when questioning content. See 6 & 7.

Candidate Technology: Existing NLP technologies, e.g., OpenNLP. New Lila technology for the model.

Limitations and Future Plans: English only. Lila does not perform deep parsing of syntax.

4. Analytics

Lila Features: Lila calculates a correlation between author notes, and between author notes and unread content. Lila also calculates a suggested order for notes.

Candidate Technology: The open source R tool can be used for statistical calculations. Language resources such as the MRC psycholinguistic database will be used to create new Lila technology for ordering notes.

Limitations and Future Plans: The calculations for suggesting order are experimental. It is likely that this function will need development over time.

5. Answers are structured and ordered

Lila Features: Lila provides two visualizations:

  • A connections view to visualize correlations between notes and unread content.
  • A suggested order for notes, a visual hierarchy or a table of contents.

Candidate Technology: New Lila technology for the visualizations. Web-based. Lila will use open source add-ins to generate visualizations.

6. Taught rather than just programmed / 7. Learn from human interaction

Lila Features: Lila’s user interface provides the author with a simple and natural way to:

  • Classify content with categories and tags.
  • Inline markup of entities, concepts and relations.

These inputs create the model used to question content and create correlations. The author can manually edit the model with improvements. The connections view will allow the author to “pin” correct relationships and delete incorrect relationships.

Candidate Technology: There are existing technologies for classifying content. Evernote, for example, is a candidate technology for a first version of Lila. Lila’s cognitive functions can operate on data exported from Evernote. New Lila technology for the model.

Limitations and Future Plans: The Evernote interface for collecting and editing notes has limitations. In the future, Lila will need its own interface to allow for advanced functions, e.g., inline markup, sorting of notes without numbered labels. In the future, Lila may use author ordering of notes as a suggestion toward its calculated order.
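To make characteristic 3 a little more concrete, here is a minimal Python sketch of off-the-shelf entity identification using NLTK. The Lila plan names OpenNLP as the candidate technology, so this is only an illustration of the kind of output such a step produces, not part of the Lila design.

    import nltk

    # One-time downloads of the models NLTK's tokenizer, tagger, and chunker need.
    for resource in ("punkt", "averaged_perceptron_tagger",
                     "maxent_ne_chunker", "words"):
        nltk.download(resource, quiet=True)

    def entities(note):
        """Return (entity text, label) pairs such as PERSON, ORGANIZATION, GPE."""
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(note)))
        found = []
        for subtree in tree:
            if hasattr(subtree, "label"):  # named-entity chunks are subtrees
                name = " ".join(token for token, tag in subtree.leaves())
                found.append((name, subtree.label()))
        return found

    note = "Ralph Waldo Emerson lectured in Boston on representative men."
    print(entities(note))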

Roy Tennant: Challenging the Open Source Religious Viewpoint

Sat, 2015-04-11 04:45

I’ve been involved with open source software projects since at least the 1990s. I even saved a Unix application from certain death that I still use today. But that doesn’t mean I’m all rosy-eyed about all open source software projects. They are not all created equal.

To be clear, there are “open source” projects that are neither all that open nor all that successful. 

But let me parse my terms before you get all hot and bothered. “Open” can be as little as dropping the code out on a repository somewhere, which is the level at which many projects currently sit. Or, it could mean that the code is actively managed under an open governance model. Most fall somewhere in between, and a number die a quiet death from neglect. I’ve also seen projects that claim the open source label long before releasing any code. And, as we’ve seen with Kuali, there is no guarantee that open source software will remain that way.

Meanwhile, someone like Terry Reese, who has programmed and maintained the amazing MarcEdit application for many years, is criticized for not open sourcing his software. If he refused to also make it better and add capabilities then maybe there would be reason for concern. But it has been tirelessly maintained and improved. Managing an open source community is not easy. I can certainly understand why Terry may want to simplify his job by vastly reducing the number of variables involved.

All things being equal, open source is better than closed source. But things are rarely equal. And it doesn’t follow that software must be open source to be useful and valued. Nor does it preclude someone such as Terry from choosing to open source the software when he no longer wishes to maintain it. So let’s stop beating up people and projects that wish to control their own code. There should be many options for software development, not just one.

Now go ahead and give me hell, people, because I know you want to.

Picture by J. Albert Bowden II, https://www.flickr.com/photos/jalbertbowdenii/, Creative Commons License CC BY 2.0.

Open Library Data Additions: OL.120301.meta.mrc

Fri, 2015-04-10 23:51

OL.120301.meta.mrc 6003 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.121201.meta.mrc

Fri, 2015-04-10 23:51

OL.121201.meta.mrc 7088 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.121001.meta.mrc

Fri, 2015-04-10 23:51

OL.121001.meta.mrc 5235 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120601.meta.mrc

Fri, 2015-04-10 23:51

OL.120601.meta.mrc 6018 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120501.meta.mrc

Fri, 2015-04-10 23:51

OL.120501.meta.mrc 4685 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120401.meta.mrc

Fri, 2015-04-10 23:51

OL.120401.meta.mrc 3851 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120201.meta.mrc

Fri, 2015-04-10 23:51

OL.120201.meta.mrc 6433 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120101.meta.mrc

Fri, 2015-04-10 23:51

OL.120101.meta.mrc 5284 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.121101.meta.mrc

Fri, 2015-04-10 23:51

OL.121101.meta.mrc 6896 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown
