Over the years, I’ve periodically gotten requests for a much more robust logger in MarcEdit. Currently, when the tool performs a global change, it reports the number of changes made to the user. However, a handful of folks have been wanting much more. Ideally, they’d like to have a log of every change the application makes, which is hard because the program isn’t built that way. I provided the following explanation to the MarcEdit list last week.
The question that has come up most often since I posted notes about the logger concerns granularity. People have asked for the tool to provide additional information about the records and more context around each change, and have wondered whether this will lead to a preview mode. I think other folks wondered why this process has taken so long to develop. Well, it stems from decisions I have made about the development. MarcEdit’s application structure can be summed up by the picture below:
In developing MarcEdit, I have made a number of very deliberate decisions, and one of those is that no one component knows what the other one does. As you can see in this picture, the application parts of MarcEdit don’t actually talk directly to the system components. They are referenced through a messenger, which handles all interactions between the application and the system objects. However, the same is true of communication between the system objects themselves. The editing library, for example, knows nothing about MARC, validation, etc. – it only knows how to parse MarcEdit’s internal file format. Likewise, the MARC library doesn’t know anything about validation, MARC21, or linked data. Those parts live elsewhere. The benefit of this approach is that I can develop each component independent of the other, and avoid breaking changes because all communication runs through the messenger. This gives me a lot of flexibility and helps to enforce MarcEdit’s agnostic view of library data. It’s also how I’ve been able to start including support for linked data components – as far as the tool is concerned, it’s just another format to be messaged.
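To make the messenger idea concrete, here is a minimal Python sketch. It is not MarcEdit's code: the `Messenger` class, the topic names, and the two toy components are all hypothetical, invented only to show two components that never call each other directly.

```python
class Messenger:
    """Routes named messages to whichever component registered for them."""

    def __init__(self):
        self._handlers = {}  # topic name -> callable

    def register(self, topic, handler):
        self._handlers[topic] = handler

    def send(self, topic, payload):
        if topic not in self._handlers:
            raise KeyError(f"no component handles {topic!r}")
        return self._handlers[topic](payload)


# Hypothetical components: neither one imports or calls the other.
def edit_replace(payload):
    # The "editing library" only sees text, not MARC semantics.
    text, old, new = payload
    return text.replace(old, new)


def marc_field_tag(payload):
    # The "MARC library" only knows how to read a tag from a mnemonic line.
    return payload[1:4]


messenger = Messenger()
messenger.register("edit.replace", edit_replace)
messenger.register("marc.tag", marc_field_tag)

line = "=245  10$aAn example title"
changed = messenger.send("edit.replace", (line, "example", "sample"))
tag = messenger.send("marc.tag", changed)
```

Because callers only know topic names, a component can be swapped out or a new format added by registering another handler, without touching the existing ones.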
Of course, the challenge with an approach like this is that most of MarcEdit’s functions don’t have a concept of a record. Most functions, for reasons of performance, process data much like an XML SAX parser. Fields being edited raise events to denote areas of processing, as do errors, which then push the application into a rescue mode. While this approach allows the tool to process data very quickly, and essentially removes size restrictions for data processing, it introduces issues if, for example, I want to expose a log of the underlying changes. Logs exist – I use them in my debugging – but they exist on a component level, and they are not attached to any particular process. I use messaging identifiers to determine what data I want to evaluate, but these logs are not meant to record a processing history; they record component actions. They can be muddled, but they give me exactly what I need when problems arise. The challenge with developing logging for actual users is that they would likely want actions associated with records. So, to do that, I’ve added an event handler in the messaging layer. This handles all interaction with the logging subsystem; it tracks the internal messaging identifier and assembles the data. This means that the logger still doesn’t have a good concept of what a record is, but the messenger does, and can act as a translator.
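A small Python sketch may help illustrate the division of labour described above. Everything in it is an assumption made for illustration: the event names, the `LoggingMessenger` class, and the convention that a blank line separates records are not taken from MarcEdit's source.

```python
from collections import defaultdict


def replace_stream(lines, old, new, emit):
    """Editing component: rewrites lines one at a time and raises events.
    It has no idea where one record ends and the next begins."""
    for line in lines:
        emit("line", text=line)
        updated = line.replace(old, new)
        if updated != line:
            emit("changed", before=line, after=updated)
        yield updated


class LoggingMessenger:
    """Messaging-layer handler: turns raw events into per-record changes."""

    def __init__(self):
        self.record_number = 0
        self._in_record = False
        self.changes = defaultdict(list)  # record number -> [(before, after)]

    def emit(self, event, **data):
        if event == "line":
            if data["text"].strip() == "":
                self._in_record = False      # blank line closes the record
            elif not self._in_record:
                self._in_record = True
                self.record_number += 1      # first line of a new record
        elif event == "changed":
            change = (data["before"], data["after"])
            self.changes[self.record_number].append(change)


records = [
    "=001  rec1",
    "=245  10$aOld title",
    "",
    "=001  rec2",
    "=245  10$aAnother title",
    "",
    "=001  rec3",
    "=245  10$aOld maps",
]
messenger = LoggingMessenger()
output = list(replace_stream(records, "Old", "New", messenger.emit))
```

The editing function stays record-agnostic and streams its input; only the handler in the messaging layer counts records, which is the translator role the messenger plays.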
Anyway – this is how I’ll be providing logging. It will also let me slowly expand the logging beyond the core editing functions if there is interest. It is also how I’ll be able to build services around the log file, to provide parsing and log enhancement for users who want to add record-specific information to a log file that goes beyond the simple record number identifier used to track changes. This would make log files more permanent (if, for example, the log was enhanced with a local identifier). However, MarcEdit’s design and the general lack of a standard control number across all MARC formats get in the way of doing this during processing (in merging, for example, merging on the 001 trips checks of 9 other fields that could all store associated control data). It is my belief that providing ways to enhance the log file after a run, while an extra step, will give me the most flexibility to make greater use of the processing log in the future. It also enables me to continue to keep MARCisms out of the processing library and focus only on handling data edits.
So that’s pretty much the work in a nutshell. So what do you get? Well, once you turn it on, you get lots of stuff and a few new tools. So, let’s walk through them.
Turning on Logging:
Since Logging only captures changes made within the MarcEditor, you find the logging settings in the MarcEditor Preferences Tab:
Once enabled, the tool will generate a new session log in the Log folder each time the Editor starts a new session. With the logs comes log management. From within the MarcEditor or the Main window, you find the following:
From the MarcEditor, you’ll find in Reports:
Both areas provide the same functionality, but the MarcEditor reports entry is scoped to the current session logfile and the current record file loaded into the Editor (if one is loaded). To manage old sessions, use the entry on the Main Window.
Advanced Log Management
Two of the use cases that were laid out for me were the need to enhance logs and the ability to extract only the modified records from a large file. So, I’ve included an Advanced Management tool for just these kinds of queries:
This is an example run from within the MarcEditor.
Anyway – this is a quick write-up. I’ll be recording a couple sessions tomorrow. I’ll also be working to make a new plugin available.
I’ve posted a new update for all versions of MarcEdit, and it’s a large one. It might not look like it from the outside, but it represents close to 3 1/2 months of work. The big change is related to the inclusion of a more detailed change log. Users can turn on logging and see, at a low level, the actual changes made to specific data elements. I’ve also added some additional logging enhancement features to allow users to extract just changed records, or enhance the log files with additional data. For more information, see my next post on the new logging process.
The full change log:
* Enhancement: Z39.50: Sync’ng changes made to support Z39.50 servers that are sending records missing proper encoding guidelines. I’m seeing a lot of these from Voyager…I fixed this in one context in the last update. This should correct it everywhere.
* Enhancement: MARCEngine: 008 processing was included when processing MARCXML records in MARC21 to update the 008, particularly the element to note when a record has been truncated. This is causing problems when working with holdings records in MARCXML – so I’ve added code to further distinguish when this byte change is needed.
* Enhancement: MarcEdit About Page: Added copy markers to simplify capturing of the Build and Version numbers.
* Enhancement: Build New Field: The tool will only create one field per record (regardless of existing field numbers) unless the replace existing value is selected.
* Enhancement: Swap Field Function: new option to limit swap operations if not all defined subfields are present.
* Bug Fix: MARCValidator: Potential duplicates were being flagged when records had blank fields (or empty fields) in the elements being checked.
* Update: MarcEditor: UI responsiveness updates
* New Feature: Logging. Logging has been added to the MarcEditor and supports all global functions currently available via the task manager.
* New Feature: MarcEditor – Log Manager: View and delete log files.
* New Feature: MarcEditor – Log Manager: Advanced Toolset. Ability to enhance logs (add additional marc data) or use the logs to extract just changed records.
You can download the update directly from the website at: http://marcedit.reeset.net/downloads or you can use the automated downloader in the program.
One last note: on the downloads page, I’ve added a directory listing that will provide access to the previous 6.2 builds. I’m doing this partly because some of these changes are so significant that there may be behavior changes that crop up. If something comes up that is preventing your work, uninstall the application, pull the previous version from the archive, and then let me know what isn’t working.
This posting describes a hack of mine, tei2json.pl – a Perl program to summarize the structure of Early English poetry and prose. 
In collaboration with Northwestern University and Washington University, the University of Notre Dame is working on a project whose primary purpose is to correct (“annotate”) the Early English corpus created by the Text Creation Partnership (TCP). My role in the project is to do interesting things with the corpus once it has been corrected. One of those things is the creation of metadata files denoting the structure of each item in the corpus.
Some of my work is really an effort to reverse engineer good work done by the late Sebastian Rahtz. For example, Mr. Rahtz cached a version of the TCP corpus, transformed each item into a number of different formats, and put the whole thing on GitHub.  As a part of this project, he created metadata files enumerating what TEI elements were in each file and what attributes were associated with each element. The result was an HTML display allowing the reader to quickly see how many bibliographies an item may have, what languages may be present, how long the document was measured in page breaks, etc. One of my goals is/was to do something very similar.
The workings of the script are really very simple: 1) configure and denote what elements to count & tabulate, 2) loop through each configuration, 3) keep a running total of the result, 4) convert the result to JSON (a specific data format), and 5) save the result to a file. Here are (temporary) links to a few examples:
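The five steps above can be sketched in a few lines of Python. This is not tei2json.pl itself (which is written in Perl); the element list, the `summarize` function, and the tiny sample document are assumptions chosen for illustration.

```python
import json
import os
import tempfile
import xml.etree.ElementTree as ET
from collections import Counter

# 1) configure which elements to count (an assumed list)
ELEMENTS = ["bibl", "figure", "l", "lg", "note", "p", "q"]


def summarize(tei_xml, elements=ELEMENTS):
    """2) and 3) walk the document and keep a running total per element."""
    root = ET.fromstring(tei_xml)
    counts = Counter()
    for node in root.iter():
        # Strip any XML namespace: "{http://www.tei-c.org/ns/1.0}p" -> "p".
        name = node.tag.rsplit("}", 1)[-1]
        if name in elements:
            counts[name] += 1
    return {name: counts[name] for name in elements}


sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>A paragraph with a <q>quotation</q>.</p>
    <lg><l>First line</l><l>Second line</l></lg>
  </body></text>
</TEI>"""

summary = summarize(sample)

# 4) and 5) convert the result to JSON and save it to a file
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(summary, handle, indent=2)
    with open(path, encoding="utf-8") as handle:
        reloaded = json.load(handle)
```

A flat dictionary of counts keeps the JSON easy for downstream scripts to read, at the cost of ignoring attributes and nesting.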
JSON files are not really very useful in & of themselves; JSON files are designed to be transport mechanisms allowing other applications to read and process them. This is exactly what I did. In fact, I created two different applications: 1) json2table.pl and 2) json2tsv.pl. [2, 3] The former script takes a JSON file and creates an HTML file whose appearance is very similar to Rahtz’s. Using the JSON files (above) the following HTML files have been created through the use of json2table.pl:
The second script (json2tsv.pl) allows the reader to compare & contrast structural elements between items. Json2tsv.pl reads many JSON files and outputs a matrix of values. This matrix is a delimited file suitable for analysis in spreadsheets, database applications, statistical analysis tools (such as R or SPSS), or programming language libraries (such as Python’s numpy or Perl’s PDL). In its present configuration, json2tsv.pl outputs a matrix looking like this:

id      bibl  figure  l     lg   note  p    q
A00002  3     4       4118  490  8     18   3
A00011  3     0       2     0    47    68   6
A00089  0     0       0     0    0     65   0
A00214  0     0       0     0    151   131  0
A00289  0     0       0     0    41    286  0
A00293  0     1       189   38   0     2    0
A00395  2     0       0     0    0     160  2
A00749  0     4       120   18   0     0    2
A00926  0     0       124   12   0     31   7
A00959  0     0       2633  9    0     4    0
A00966  0     0       2656  0    0     17   0
A00967  0     0       2450  0    0     3    0
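The matrix-building step can be sketched in Python as follows. This is an illustration, not a port of json2tsv.pl: the flat `{element: count}` layout of each JSON file and the `to_matrix` function are assumptions, and the two JSON strings stand in for files read from disk (their values are copied from the first two rows of the matrix).

```python
import json


def to_matrix(summaries, columns):
    """Build a tab-delimited matrix from {item id: {element: count}}."""
    rows = ["\t".join(["id"] + columns)]
    for item_id in sorted(summaries):
        counts = summaries[item_id]
        cells = [str(counts.get(column, 0)) for column in columns]
        rows.append("\t".join([item_id] + cells))
    return "\n".join(rows) + "\n"


# Stand-ins for the contents of two JSON files.
raw = {
    "A00002": '{"bibl": 3, "figure": 4, "l": 4118, "lg": 490, "note": 8, "p": 18, "q": 3}',
    "A00011": '{"bibl": 3, "figure": 0, "l": 2, "lg": 0, "note": 47, "p": 68, "q": 6}',
}
summaries = {item_id: json.loads(text) for item_id, text in raw.items()}
tsv = to_matrix(summaries, ["bibl", "figure", "l", "lg", "note", "p", "q"])
```

Missing elements default to zero, so items with different structures still line up in the same columns.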
Given such a file, the reader could then ask & answer questions such as:
- Which item has the greatest number of figures?
- What is the average number of lines per line group?
- Is there a statistical correlation between paragraphs and quotes?
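As a rough illustration, the three questions can be answered with plain Python against the twelve rows of the matrix. The interpretation choices are mine, not the author's: "lines per line group" is computed only over items that actually contain line groups, and the correlation is a hand-rolled Pearson coefficient.

```python
from math import sqrt

DATA = """\
id bibl figure l lg note p q
A00002 3 4 4118 490 8 18 3
A00011 3 0 2 0 47 68 6
A00089 0 0 0 0 0 65 0
A00214 0 0 0 0 151 131 0
A00289 0 0 0 0 41 286 0
A00293 0 1 189 38 0 2 0
A00395 2 0 0 0 0 160 2
A00749 0 4 120 18 0 0 2
A00926 0 0 124 12 0 31 7
A00959 0 0 2633 9 0 4 0
A00966 0 0 2656 0 0 17 0
A00967 0 0 2450 0 0 3 0
"""

header, *lines = DATA.strip().splitlines()
columns = header.split()
rows = []
for line in lines:
    item_id, *values = line.split()
    rows.append({"id": item_id, **dict(zip(columns[1:], map(int, values)))})

# Which item has the greatest number of figures? (ties are all reported)
most = max(row["figure"] for row in rows)
most_figures = [row["id"] for row in rows if row["figure"] == most]

# Average number of lines per line group, over items that have line groups.
grouped = [row for row in rows if row["lg"] > 0]
lines_per_group = sum(r["l"] for r in grouped) / sum(r["lg"] for r in grouped)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Is there a correlation between paragraphs and quotes?
r_p_q = pearson([row["p"] for row in rows], [row["q"] for row in rows])
```

With only twelve items the coefficient says little on its own; a larger sample and a significance test would be needed before drawing conclusions.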
Additional examples of input & output files are temporarily available online. 
My next steps include at least a couple of things. One, I need/want to evaluate whether or not to save my counts & tabulations in a database before (or after) creating the JSON files. The data may prove to be useful there. Two, as a librarian, I want to go beyond qualitative description of narrative texts, and the counting & tabulating of structural elements moves in that direction, but it does not really address the “aboutness”, “meaning”, nor “allusions” found in a corpus. Sure, librarians have applied controlled vocabularies and bits of genre to metadata descriptions, but such things are not quantitative and consequently elude statistical analysis. For example, using sentiment analysis one could measure and calculate the “lovingness”, “war mongering”, “artisticness”, or “philosophic nature” of the texts. One could count & tabulate the number of times family-related terms are used, assign the result a score, and record the score. One could then amass all documents and sort them by how much they discussed family, love, philosophy, etc. Such is on my mind, and more than half-way baked. Wish me luck.
Links
-  tei2json.pl – http://dh.crc.nd.edu/tmp/early-print/tei2json.pl
-  An example of this good work is found at https://github.com/textcreationpartnership/A00002
-  json2table.pl – http://dh.crc.nd.edu/tmp/early-print/json2table.pl
-  json2tsv.pl – http://dh.crc.nd.edu/tmp/early-print/json2tsv.pl
-  more examples – http://dh.crc.nd.edu/tmp/early-print/
This posting describes a little hack of mine, Synonymizer – a Python-based CGI script to create synonym files suitable for use with Solr and other applications.
Human language is ambiguous, and computers are rather stupid. Consequently computers often need to be explicitly told what to do (and how to do it). Solr is a good example. I might tell Solr to find all documents about dogs, and it will dutifully go off and look for things containing d-o-g-s. Solr might think it is smart by looking for d-o-g as well, but such is a heuristic, not necessarily a real understanding of the problem at hand. I might say, “Find all documents about dogs”, but I might really mean, “What is a dog, and can you give me some examples?” In which case, it might be better for Solr to search for documents containing d-o-g, h-o-u-n-d, w-o-l-f, c-a-n-i-n-e, etc.
This is where Solr synonym files come in handy. There are one or two flavors of Solr synonym files, and the one created by my Synonymizer is a simple line-delimited list of concepts, and each line is a comma-separated list of words or phrases. For example, the following is a simple Solr synonym file denoting four concepts (beauty, honor, love, and truth):

beauty, appearance, attractiveness, beaut
honor, abide by, accept, celebrate, celebrity
love, adoration, adore, agape, agape love, amorousness
truth, accuracy, actuality, exactitude
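To show how little structure the format has, here is a short Python sketch that reads such a file and looks up the concept a term belongs to. It is not part of Synonymizer and does not reproduce Solr's own matching; `parse_synonyms` and `expand` are hypothetical helper names.

```python
SYNONYMS = """\
beauty, appearance, attractiveness, beaut
honor, abide by, accept, celebrate, celebrity
love, adoration, adore, agape, agape love, amorousness
truth, accuracy, actuality, exactitude
"""


def parse_synonyms(text):
    """Return one list of terms per non-empty, non-comment line."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        groups.append([term.strip() for term in line.split(",")])
    return groups


def expand(term, groups):
    """Return every term sharing a line with `term`, or just the term."""
    for group in groups:
        if term in group:
            return group
    return [term]


groups = parse_synonyms(SYNONYMS)
love_terms = expand("adore", groups)
```

Note that multi-word entries such as "agape love" survive intact, because the file is split on commas rather than on whitespace.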
Creating a Solr synonym file is not really difficult, but it can be tedious, and the human brain is not always very good at multiplying ideas. This is where computers come in. Computers do tedium very well. And with the help of a thesaurus (like WordNet), multiplying ideas is easier.
Here is how Synonymizer works. First it reads a configured database of previously generated synonyms.† In the beginning, this file is empty but must be readable and writable by the HTTP server. Second, Synonymizer reads the database and offers the reader three options: 1) create a new set of synonyms, 2) edit an existing synonym, or 3) generate a synonym file. If Option #1 is chosen, then input is garnered and looked up in WordNet. The script will then enable the reader to disambiguate the input through the selection of apropos definitions. Upon selection, both WordNet hyponyms and hypernyms will be returned. The reader then has the opportunity to select desired words/phrases as well as enter any of their own design. The result is saved to the database. The process is similar if the reader chooses Option #2. If Option #3 is chosen, then the database is read, reformatted, and output to the screen as a stream of text to be used by Solr or something else that may require similar functionality. Because Option #3 is generated with a single URL, it is possible to programmatically incorporate the synonyms into your Solr indexing pipeline.
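The storage and output side of this workflow can be sketched as follows. This is not synonymizer.py: the tab-delimited layout (concept first, then its words) is an assumed stand-in for the real database format, the function names are hypothetical, and the WordNet lookup and CGI handling are left out entirely.

```python
def load_database(text):
    """Read the delimited 'database': one concept per line, tab-separated."""
    database = {}
    for line in text.splitlines():
        if line.strip():
            concept, *words = line.split("\t")
            database[concept] = words
    return database


def save_database(database):
    """Serialize the database back to tab-delimited text."""
    return "".join(
        "\t".join([concept] + words) + "\n"
        for concept, words in sorted(database.items())
    )


def add_concept(database, concept, chosen):
    """Options 1 and 2: store (or overwrite) the words chosen for a concept."""
    database[concept] = sorted(set(chosen) - {concept})


def to_solr(database):
    """Option 3: reformat the database as comma-separated synonym lines."""
    lines = [
        ", ".join([concept] + words)
        for concept, words in sorted(database.items())
    ]
    return "\n".join(lines) + "\n"


db = load_database("love\tadoration\tadore\n")
add_concept(db, "truth", ["accuracy", "truth", "actuality", "accuracy"])
synonym_file = to_solr(db)
stored = save_database(db)
```

Keeping the store as delimited text means the synonym file is just a reformatting of the same rows, which is why it can be produced on demand from a single URL.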
The Synonymizer is not perfect.‡ First, it only creates one of the two different types of Solr synonym files. Second, while Solr can use the generated synonym file, search results implement phrase searches poorly, and this is a well-known issue. Third, editing existing synonyms does not really take advantage of previously selected items; data-entry is tedious but not as tedious as writing the synonym file by hand. Fourth, the script is not fast, and I blame this on Python and WordNet.
Below are a couple of screenshots from the application. Use and enjoy.
 synonymizer.py – http://dh.crc.nd.edu/sandbox/synonymizer/synonymizer.py
 “Why is Multi-term synonym mapping so hard in Solr?” – http://bit.ly/2iyYZw6
† The “database” is really a simple delimited text file. No database management system required.
‡ Software is never done. If it were, then it would be called “hardware”.
It’s no secret that the research ecosystem has been experiencing rapid change in recent years, driven by complex political, technological, and network influences. One component of this complicated environment is the adoption of research information management (RIM) practices by research institutions, and particularly the increasing involvement of libraries in this development.
Research information management is the aggregation, curation, and utilization of information about research. Research universities, research funders, as well as individual researchers are increasingly looking for aggregated, interconnected research information to better understand the relationships, outputs, and impact of research efforts as well as to increase research visibility.
Efforts to collect and manage research information are not new but have traditionally emphasized the oversight and administration of federal grants. Professional research administrative oversight within universities emerged in the 20th century, rapidly accelerating in the United States following Sputnik and exemplified through the establishment of professional organizations concerned primarily with grants administration & compliance, such as the Society for Research Administrators (SRA) in 1957 and the National Council of University Research Administrators (NCURA) in 1959.
Today research information management efforts seek to aggregate and connect a growing diversity of research outputs that encompass more than grants administration and, significantly for libraries, include the collection of publications information. In addition, both universities and funding agencies have an interest in reliably connecting grants with resulting publications–as well as to researchers and their institutional affiliations.
Not long ago the process for collecting the scholarly publications produced by a campus’s researchers would have been a manual process, possible only through the collection of each scholar’s curriculum vitae. The resources required to collect this information at institutional scale would have been prohibitively expensive, and few institutions made such an effort. Institutions have instead relied upon proxies of research productivity–such as numbers of PhDs awarded or total dollars received in federal research grants–to demonstrate their research strengths. However, recent advances in scholarly communications technology and networked information offer new opportunities for institutions to collect the scholarly outputs of their researchers. Indexes of journal publications like Scopus, PubMed, and Web of Science provide new sources for the discovery and collection of research outputs, particularly for scientific disciplines, and a variety of open source, commercial, and locally-developed platforms now support institutional aggregation of publications metadata. The adoption of globally accepted persistent identifiers (PIDs) like DOIs for digital publications and datasets and ORCID and ISNI identifiers for researchers provides essential resources for reliably disambiguating unique objects and people, and the incorporation of these identifiers into scholarly communications workflows provides growing opportunities for improved metadata quality and interoperability.
Institutions may now aggregate research information from numerous internal and external sources, including information such as:
• Individual researchers and their institutional affiliations
• Publications metadata
• Awards & honors received by a researcher
• Citation counts and other measures of research impact
Depending upon institutional needs, the RIM system may also capture additional internal information about faculty, such as:
• Courses taught
• Students advised
• Committee service
National programs to collect and measure the impact of sponsored research have accelerated the adoption of research information management in some parts of the world, such as through the Research Excellence Framework (REF) in the UK and Excellence in Research for Australia (ERA) in Australia. The effort to collect, quantify, and report on a broad diversity of research outputs has been happening for some time in Europe, where RIM systems are more commonly known as Current Research Information Systems (CRIS), and where efforts like CERIF (the Common European Research Information Format) provide a standard data model for describing and exchanging research entities across institutions.
Here in the US, research information management is emerging as a part of scholarly communications practice in many university libraries, in close collaboration with other campus stakeholders. In the absence of national assessment exercises like REF or ERA, RIM practices are following a different evolution, one with greater emphasis on reputation management for the institution, frequently through the establishment of public research expertise profile portals such as those in place at Texas A&M University and the University of Illinois. Libraries such as those at Duke University are using RIM systems to support open access efforts, and others are implementing systems that convert a decentralized and antiquated paper-based system of faculty activity reporting and annual review into a centralized process with a single cloud-based platform, as we are seeing at University of Arizona and Virginia Tech.
I believe that support for research information management will continue to grow as a new service category for libraries, as Lorcan Dempsey articulated in 2014. Through the OCLC Research Library Partnership and in collaboration with partners from EuroCRIS, I am working with a team of enthusiastic librarians and practitioners from three continents to explore, research, and report on the rapidly evolving RIM landscape, building on previous RLP outputs exploring the library’s contribution to university ranking and researcher reputation.
One working group is dedicated to conducting a survey of research institutions to gauge RIM activity:
• Pablo de Castro, EuroCRIS
• Anna Clements, University of St. Andrews
• Constance Malpas, OCLC Research
• Michele Mennielli, EuroCRIS
• Rachael Samberg, University of California-Berkeley
• Julie Speer, Virginia Tech University
And a second working group is engaged with qualitative inquiry into institutional requirements and activities for RIM adoption:
• Anna Clements, University of St. Andrews
• Carol Feltes, Rockefeller University
• David Groenewegen, Monash University
• Simon Huggard, La Trobe University
• Holly Mercer, University of Tennessee-Knoxville
• Roxanne Missingham, Australian National University
• Malaica Oxnam, University of Arizona
• Annie Rauh, Syracuse University
• John Wright, University of Calgary
Our research efforts are just beginning, and I look forward to sharing more about our findings in the future.
About Rebecca Bryant
Rebecca Bryant is Senior Program Officer at OCLC where she leads research and initiatives related to research information management in research universities.
“Telling Fedora 4 Stories” is an initiative aimed at introducing project leaders and their ideas to one another while providing details about Fedora 4 implementations for the community and beyond.