Planet Code4Lib - http://planet.code4lib.org
I'm teaching a TEI class this weekend, so I've been pondering it a bit. I've come to the conclusion that calling what we do with TEI "text encoding" is misleading. I think what we're really doing is text modeling.
TEI provides an XML vocabulary that lets you produce models of texts that can be used for a variety of purposes. Not a Model of Text, mind you, but models (lowercase) of texts (also lowercase).
TEI has made the (interesting, significant) decision to piggyback its semantics on the structure of XML, which is tree-based. So XML structure implies semantics for a lot of TEI. For example, paragraph text appears inside <p> tags; to mark a personal name, I surround the name with a <persName> tag, and so on. This arrangement is extremely convenient for processing purposes: it is trivial to transform the TEI <p> into an HTML <p>*, for example, or the <persName> into an HTML hyperlink that points to more information about the person. It means, however, that TEI's modeling capabilities are to a large extent XML's own. This approach has opened TEI up to criticism. Buzetti (2002) has argued that its tree structure simply isn't expressive enough to represent the complexities of text, and Schmidt (2010) criticizes TEI for (among other problems) being a bad model of text because it imposes editorial interpretation on the text itself.
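To make that concrete, here is a minimal sketch (the TEI elements are standard; the @ref value and the HTML rendering are illustrative, not from any particular stylesheet):

<!-- TEI source -->
<p>As <persName ref="#thoreau">Thoreau</persName> observed, the mass of men lead lives of quiet desperation.</p>

<!-- one plausible HTML rendering -->
<p>As <a href="#thoreau">Thoreau</a> observed, the mass of men lead lives of quiet desperation.</p>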
The main disagreement I have with Schmidt's argument is the assumption that there is a text independent of the editorial apparatus. Maybe there is sometimes, but I can point at many examples where there is no text, as such, only readings. And a reading is, must be, an interpretive exercise. So I'd argue that TEI is at least honest in that it puts the editorial interventions front and center where they are obvious.
As for the argument that TEI's structure is inadequate to model certain aspects of text, I can only agree. But TEI has proved good enough to do a lot of serious scholarly work. That, and the fact that its choice of structure means it can bring powerful XML tools to bear on the problems it confronts, means that TEI represents a "worse is better" solution.† It works a lot of the time, doesn't claim to be perfect, and incrementally improves. Where TEI isn't adequate to model a text the way you want to use it, you either shouldn't use it or should figure out how to extend it.
One should bear in mind that any digital representation of a text is ipso facto a model. It's impossible to do anything digital without a model (whether you realize it's there or not). Even if you're just transcribing text from a printed page to a text editor you're making editorial decisions, like what character encoding to use, how to represent typographic features in that encoding, how to represent whitespace, and what to do with things you can't easily type (inline figures or symbols without a Unicode representation, for example).
So why argue that TEI is a language for modeling texts, rather than a language for "encoding" texts? The simple answer is that this is a better way of explaining what people use TEI for. TEI provides a lot of tags to choose from. No-one uses them all. Some are arguably incompatible with one another. We tag the things in a text that we care about and want to use. In other words, we build models of the source text, models that reflect what we think is going on structurally, semantically, or linguistically in the text, and/or models that we hope to exploit in some way.
For example, EpiDoc is designed to produce critical editions of inscribed or handwritten ancient texts. It is concerned with producing an edition (a reading) of the source text that records the editor's observations of and ideas about that text. It does not at this point concern itself with marking personal or geographic names in the text. An EpiDoc document is a particular model of the text that focuses on the editor's reading of that text. As a counterexample, I might want to use TEI to produce a graph of the interactions of characters in Hamlet. If I wanted to do that, I would produce a TEI document that marked people and whom they were addressing when they spoke. This would be a completely different model of the text than a critical edition of Hamlet might be. I could even try to do both at the same time, but that might be a mess—models are easier to deal with when they focus on one thing.
This way of understanding TEI makes clear a problem that arises whenever one tries to merge collections of TEI documents: that of compatibility. Just because two documents are marked up in TEI, that does not mean they are interoperable. This is because each document represents the editor's model of that text. Compatibility is certainly achievable if both documents follow the same set of conventions, but we shouldn't expect it any more than we'd expect to be able to merge any two models that follow different ground rules.
Notes
* With the caveat that the semantics of TEI <p> and HTML <p> are different, and there may be problems. TEI's <p> can contain lists, for example, whereas HTML's cannot.
† See http://www.dreamsongs.com/RiseOfWorseIsBetter.html
Yes, I wrote a blog post with endnotes and bibliography. Sue me.
So, a new year is here. Again. I'm getting a bit sick of this straining repetition, but apparently the rest of society thinks it is quite alright. So.
A lot of stuff has happened. We've sold one house, bought and moved into another (and I'm sure I'll write more on this later), and various events have come and gone. I got a new camera for Christmas which I'm excited about (a Panasonic Lumix G2), and I'm reading Bill Bryson's latest, "At Home", which is brilliant as usual. Oh, and Mr. Mister have released their album "Pull" after 20 years (!!), and it is AWESOME!
I'm writing a book. And I'm enjoying it, when I get the time to do it. I'm some 70 pages in, and it's about ... uh, part technology, part human and cosmological evolution, some laser shooting which defies the laws of physics, project management, opinions on the strong need for secularity, on music, and some more parts technology, programming and development, syntax and language, lots about language, and about libraries and culture, and then some. Yeah, so not your average book, but some people are interested, and I'm taking advice on publishing, format and schedule from anyone.
I'm opening ThinkPlot again, an organisation for people who care, in an intelligent fashion, about the well-being of the human race and the world we live in, and who want to promote education, science and rationality amongst the people that live near them. Our patron "saint" is the late great Carl Sagan. I'll definitely talk more about this later.
Work is good. It's intranets all the way, interspersed with UCD, IA, UX, hacking, supervision, PMing, and all other goodies, and it's in the health-care system doing important work. So, yeah. Good stuff, and enjoyable. In fact, one of the things I've noticed is that in the few years since my last stints in the Intranet world, not much has improved in terms of content and document management. The old systems that sucked have been overtaken by systems that also suck, just in different ways. Enterprise systems of various kinds follow suit. There's so much bad software out there, even from people who should know better. So, yes, I've decided to make something funky from scratch in the Intranet space, using REST, Topic Maps and simpler development tools readily available. We'll see where it takes us.
Kids and wife doing fine. Kids winning awards, playing violin brilliantly, and growing up fine. (Crossing fingers!) Things are chugging along. Oh, and we've just been introduced to, and gotten hooked on, Carcassonne, so now you know what we often do in the evenings. The beach is down the road next to the shop and cafe, and the pool in the backyard is a favorite pastime, so do come over. Things are good.
PS. Send more salty liquorice.
There is a code4lib IRC channel for folks who are interested in the convergence of computers and library/information science. The channel is a less formal and more interactive alternative to the code4lib mailing list for the discussion of code, projects, ideas, music, first computers, etc.
del.icio.us: The Code4Lib Journal – A Principled Approach to Online Publication Listings and Scientific Resource Sharing
A Principled Approach to Online Publication Listings and Scientific Resource Sharing
The Max Planck Institute (MPI) for Psycholinguistics has developed a service to manage and present the scholarly output of their researchers. The PubMan database manages publication metadata and full texts of publications published by their scholars. All relevant information regarding a researcher’s work is brought together in this database, including supplementary materials and links to the MPI database for primary research data. The PubMan metadata is harvested into the MPI website CMS (Plone). The system developed for the creation of the publication lists allows the researcher to create a selection of the harvested data in a variety of formats.
by Jacquelijn Ringersma, Karin Kastens, Ulla Tschida and Jos van Berkum
I’m trying to really polish off the edges and provide a slick interface in our Blacklight implementation, to contrast with the very hacky legacy OPAC.
Applied to the display of a patron’s checked-out items with due dates… In a display of due dates, the user has to do some arithmetic to figure out how far away a given date is. What they really care about is: is this tomorrow? In a week? In a month?
So why not display it to them? There’s a nice Rails helper to calculate human-readable time deltas, although I customized it just a bit to handle the fact that some of our due dates have specific times attached (like for reserves; I use a Ruby Time object), and others are just dates with no time, due before close of business (I use a Ruby Date object). The built-in method alone works with Dates, but doesn’t always do the most sensible thing with them for this case.

# Handle both Dates without a time and Times with hour/minute/second
# appropriately.
def relative_due_date(date_or_time)
  if date_or_time.kind_of?(Time)
    distance_of_time_in_words(Time.now, date_or_time)
  elsif date_or_time == Date.today
    "today"
  elsif date_or_time == (Date.today + 1)
    "tomorrow"
  else
    distance_of_time_in_words(Date.today, date_or_time)
  end
end
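In a view, usage might look something like this (item and due_date are hypothetical names, for illustration only):

<%= relative_due_date(item.due_date) %>
<!-- renders "today", "tomorrow", or something like "6 days" -->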
There are other corners that it wasn’t really feasible to polish off, given the functionality and business rules of the underlying ILS. For instance, if an item isn’t renewable, I’d rather not show a renew button at all, instead showing the reason it’s not renewable. But the underlying ILS doesn’t really support that: you have to click ‘renew’ first, and then (maybe, sometimes) get a reason if it wasn’t renewable. So I polished off what I could within reasonable constraints — at least in this interface, if you choose ‘renew all’, it specifically marks which items were renewed and which weren’t, unlike the legacy OPAC, which just let you look at the new due dates without telling you whether they were actually new or the same old ones due to not being renewable. Oh well.
Filed under: General
There are some good spam solutions around these days, e.g., filtering, but spam is a complex problem, and here’s one more simple idea that might help. Blog software can require a person to be approved before leaving a comment. Why not use the same approach with email? A toy sketch of the idea follows.
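As a toy illustration (hypothetical, not a real mail-system API; a real deployment would hook into a milter or delivery agent), the core logic is just an allow-list with a holding queue:

require 'set'

# Hold mail from unapproved senders until the recipient approves them,
# the way blog software holds comments for moderation.
class ApprovalInbox
  def initialize
    @approved = Set.new  # senders the user has approved
    @pending  = []       # [sender, message] pairs awaiting approval
  end

  def receive(sender, message)
    if @approved.include?(sender)
      deliver(sender, message)
    else
      @pending << [sender, message]  # hold, and prompt the user to approve
    end
  end

  def approve(sender)
    @approved << sender
    held, @pending = @pending.partition { |s, _| s == sender }
    held.each { |s, message| deliver(s, message) }
  end

  private

  def deliver(sender, message)
    puts "Delivered: #{sender}: #{message}"
  end
end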
It seems the COinS Generator is not working: at least, when given a DOI it returns the target URL rather than a ContextObject in Spans. Is there no alternative tool? I couldn't find one. If that is the case, is it because COinS is pretty useless and no one bothered to create another generator? If COinS are useful, maybe another instance of the generator, or a different but similar service, would be a good thing.
In principle, I like microformats. Anything that supplies more semantics to information is something I tend to support. COinS seems like a very useful microformat; nothing in RDFa that I know of is a decent substitute. What's happening here?
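For reference, a working generator should emit an empty span whose class is Z3988 and whose title attribute carries an OpenURL ContextObject, roughly like this (the metadata values here are placeholders, not actual generator output):

<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info:ofi/fmt:kev:mtx:journal&amp;rft.genre=article&amp;rft_id=info:doi/10.1000/182"></span>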
Have you seen the cover of the new issue of American Libraries?
Just curious: any news of color e-ink readers from either ALA or CES?
TemaTres 1.2 has been released. TemaTres is an open source vocabulary server, a web application to manage and exploit vocabularies, thesauri, taxonomies and formal representations of knowledge.... It exports to a variety of formats: Skos-Core, Zthes, TopicMap, Dublin Core, MADS, BS8723-5, RSS, SiteMap, txt, SQL.
Light pollution is a major issue that concerns not only astronomers and stargazers; it has serious impacts on the environment and human health. The BMP project is an initiative founded in early 2008 by Francesco Giubbilini and Andrea Giacomelli, two environmental engineers with Tuscan roots, aimed at collecting data on light pollution from non-professionals as a form of environmental awareness-raising.
Ordinary citizens, families, and schools can participate. The project also has a scientific aspect, as it allows the collection of valuable data using a tool called the Sky Quality Meter, which has been on the market for a couple of years now. Measurements can be collected with a user’s own instrument, or by borrowing one from the BMP group, and subsequently uploaded to the BMP web site.
In addition to collecting new measurements, the BMP team takes care of:
The data uploaded to the BMP web site can then be viewed and downloaded freely (data are available under the Open Database License). Other content produced by the BMP team is released under a CC BY-NC-SA license. Furthermore, free and open source geospatial technologies are used for the database and the web mapping engine.
The project has generated considerable interest at the national level, among other things winning an award for innovation in early 2009 and receiving diverse media coverage. The BMP project is interested in:
Or we could save our energy and find untapped sources of content created by our local users and work together to create a single publishing platform and rights-management tool to allow easy creation and access to local content.
That’s the excellent ending of Kathryn Greenhill’s answer to her own question: How do we force publishers to give us ebook content that includes works that our users want and that they find easy to download to their chosen device?
This is such a compelling vision of a way forward for libraries. Not only is it more attainable than forcing publishers to do anything (or even compelling them) but it would result in a much more meaningful public library.
I’m looking forward to the rest of the posts in her series!
GetSatisfaction‘s “How does this make you feel?” intrigues me: why do people answer this? Conventional wisdom says that people don’t classify their posts.
Network diagrams are great ways to illustrate relationships. In such diagrams nodes represent some sort of entity, and lines connecting nodes represent some sort of relationship. Nodes clustered together and sharing many lines denote some kind of similarity. Conversely, nodes whose lines are long and not interconnected represent entities outside the norm or at a distance. Network diagrams are a way of visualizing complex relationships.
Are you familiar with the phrase "in the same breath"? It is usually used to denote the relationship between one or more ideas: "He mentioned both 'love' and 'war' in the same breath." This is exactly one of the things I want to do with texts. Concordances provide this sort of functionality: given a word or phrase, a concordance will find the query in a corpus and display the words on either side of it. Like a KWIC (keyword in context) index, a concordance makes it easier to read how words or phrases are used in relationship with their surrounding words. The use of network diagrams seems like a good way to see — visualize — how words or phrases are used within the context of their surrounding words.
The implementation of the visualization requires the recursive creation of a term matrix. Given a word (or regular expression), find the query in a text (or corpus). Identify and count the d most frequently used words within b number of characters. Repeat this process d times with each co-occurrence. For example, suppose the text is Walden by Henry David Thoreau, the query is "spring", d is 5, and b is 50. The implementation finds all the occurrences of the word "spring", gets the text 50 characters on either side of it, finds the 5 most commonly used words in those characters, and repeats the process for each of those words. The result is the following matrix:

spring   day     morning  first   winter
day      days    night    every   today
morning  spring  say      day     early
first    spring  last     yet     though
winter   summer  pond     like    snow
Thus, the most common co-occurrences for the word "spring" are "day", "morning", "first", and "winter". Each of these co-occurrences is recursively used to find more co-occurrences. In this example, the word "spring" co-occurs with times of day and seasons. These words then co-occur with more times of day and more seasons. Similarities and patterns begin to emerge. Depending on the complexity of a writer's sentence structure, the value of b ("breath") may need to be increased or decreased. As the value of d ("detail") is increased or decreased, so is the number of co-occurrences returned.
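For the curious, here is a minimal sketch of the matrix-building process in Ruby. It is a hypothetical re-implementation for illustration; the actual scripts are written in Perl, and details such as stop-word handling are guesses:

STOPWORDS = %w[the and of a to in is it that i].freeze

# Find the `detail` most frequent words within `breath` characters
# of each occurrence of `query` in `text`.
def cooccurrences(text, query, breath, detail)
  counts = Hash.new(0)
  text.to_enum(:scan, /#{Regexp.escape(query)}/i).each do
    offset = Regexp.last_match.begin(0)
    window = text[[offset - breath, 0].max, breath * 2 + query.length].to_s
    window.downcase.scan(/[a-z']+/) do |word|
      counts[word] += 1 unless word == query.downcase || STOPWORDS.include?(word)
    end
  end
  counts.sort_by { |_, n| -n }.first(detail).map(&:first)
end

# Build the matrix: one row for the query, then one row for each of
# its co-occurrences (a single level of recursion, as described above).
def term_matrix(text, query, breath: 50, detail: 5)
  rows = { query => cooccurrences(text, query, breath, detail) }
  rows[query].each do |term|
    rows[term] ||= cooccurrences(text, term, breath, detail)
  end
  rows
end

# Usage:
#   text = File.read('walden.txt')
#   pp term_matrix(text, 'spring')  # => {"spring"=>["day", "morning", ...], ...}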
“spring” in Walden
It is interesting enough to see the co-occurrences of any given word in a text, but it is even more interesting to compare the co-occurrences between texts. Below are a number of visualizations from Thoreau’s Walden. Notice how the word “walden” frequently co-occurs with the words “pond”, “water”, and “woods”. This makes a lot of sense because Walden Pond is a pond located in the woods. Notice how the word “fish” is associated with “pond”, “fish”, and “fishing”. Pretty smart, huh?
“walden” in Walden
“fish” in Walden
“woodchuck” in Walden
“woods” in Walden
Compare these same words with the co-occurrences in a different work by Thoreau, A Week on the Concord and Merrimack Rivers. Given the same inputs the outputs are significantly different. For example, notice the difference in co-occurrences given the word “woodchuck”.
“walden” in Rivers
“fish” in Rivers
“woodchuck” in Rivers
“woods” in Rivers
Give it a try
Give it a try for yourself. I have written three CGI scripts implementing the things outlined above:
In each implementation you are given the opportunity to input your own queries, define the “size of the breath”, and the “level of detail”. The result is an interactive network diagram visualizing the most frequent co-occurrences of a given term.
The root of the Perl source code is located at http://infomotions.com/sandbox/network-diagrams/.
Implications for librarianship
The visualization of co-occurrences obviously has implications for text mining and the digital humanities, but it also has implications for the field of librarianship.
Given the current environment where data and information abound in digital form, libraries find themselves in an increasingly competitive environment. What are libraries to do? Lest they become marginalized, librarians cannot rest on their “public good” laurels. Merely providing access to information is not good enough; everybody feels as if they have plenty of access to information. What is needed are methods and tools for making better use of the data and information libraries acquire. Implementing text mining and visualization interfaces is one way to accomplish that goal within the context of online library services. Do a search in the “online catalog”. Create a subset of interesting content. Click a button to read the content from a distance. Provide ways to analyze and summarize the content, thus saving the time of the reader.
We librarians have to do something differently. Think like an entrepreneur. Take account of your resources. Examine the environment. Innovate and repeat.
A last-minute change to my plans for ALA Midwinter came on Tuesday when I was sought out to fill in for a speaker who canceled at the ALCTS Digital Preservation Interest Group meeting. Options for outsourcing storage and services for preserving digital content have been a recent interest of mine, so I volunteered to combine two earlier DLTJ blog posts with some new information and present it to the group for feedback. The reaction was great, and here are the promised slide deck, links to further information, and some thoughts from the audience response.
Slide Deck and References
In the presentation there is a Table About Costs that uses a scenario from an earlier DLTJ blog post. The text of the scenario is:
To examine the similarities and differences in costs, let’s use the OhioLINK Satellite Image collection as a prototypical example. It consists of about 2 terabytes (2TB) of high-quality images in TIFF format, with about 7.5GB of data going into the repository each month. In the interest of exploring everything that S3 can do, there is an assumption that approximately 4GB of data will be transferred out of the archive each month; OCLC’s Digital Archive does not have an end-user dissemination component.
The point of this scenario is to show the widest range of costs — from a storage-only solution like Amazon S3 to a soup-to-nuts service like OCLC Digital Archive. A word about the redacted costs: some of the numbers for OCLC’s Digital Archive response (from 2008) came from a confidential quote, so they were removed from the public table. The values that are publicly listed come from Barbara Quint’s article.
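As a back-of-the-envelope illustration of how such monthly costs compose, here is a small Ruby sketch. The per-GB rates are placeholders only, not actual S3 or OCLC Digital Archive prices:

# Scenario figures from above; all rates are hypothetical placeholders.
STORED_GB = 2048.0  # ~2TB already in the repository
INGEST_GB = 7.5     # added per month
EGRESS_GB = 4.0     # transferred out per month

def monthly_cost(storage_rate, transfer_in_rate = 0.0, transfer_out_rate = 0.0)
  STORED_GB * storage_rate +
    INGEST_GB * transfer_in_rate +
    EGRESS_GB * transfer_out_rate
end

# e.g., a storage-only service at a made-up $0.15/GB-month with
# made-up $0.15/GB egress:
puts monthly_cost(0.15, 0.0, 0.15)  # => 307.8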
The articles and blog posts I referenced in the course of the presentation were:
Iglesias, Edward and Wittawat Meesangnil (2010). Using Amazon S3 in Digital Preservation in a mid sized academic library: A case study of CCSU ERIS digital archive system. The Code4Lib Journal, issue 12. Retrieved 5-Jan-2011 from http://journal.code4lib.org/articles/4468
Murray, Peter (2008). Long-term Preservation Storage: OCLC Digital Archive versus Amazon S3. Disruptive Library Technology Jester. Retrieved 5-Jan-2011 from http://dltj.org/article/oclc-digital-archive-vs-amazon-s3/
Murray, Peter (2009). Can We Outsource the Preservation of Digital Bits?. Disruptive Library Technology Jester. Retrieved 5-Jan-2011 from http://dltj.org/article/outsource-digital-bits/
Quint, Barbara (2008). OCLC Introduces High-Priced Digital Archive Service. Information Today. Retrieved 5-Jan-2011 from http://newsbreaks.infotoday.com/nbReader.asp?ArticleId=49018
At the Friday ‘Big Heads’ meeting much of the conversation revolved around Incrementalism vs. Revolution, as have so many conversations, about so many things. Someone quoted David Mamet (I can’t find the quote) saying that what we need is sledgehammers, not chisels, and I thought it was a notion too good to pass up as a jumping-off point to discuss that meeting.
There were a lot of interesting topics discussed at the meetings, but as is my habit I’m going to focus only on the topics of interest to me. As usual there were a number of vendors in the audience, and when a few of the ‘heads’ at the main table voiced the expectation that they would be depending on the vendor community for help as they experienced additional staff reductions and resource constraints in general, the vendors came up to the microphones to respond. A couple of vendors expressed their concern that the library community in general has not been able to articulate what they want from vendors, and this has made it difficult for them to develop business plans. I hear a variation of this line when I stroll the exhibit halls and talk to vendors about what their plans are for RDA implementation. Almost always I hear that they have not heard from their customers about what they want, and they’re waiting for that before making plans. As a result, when I’m presenting to librarians about RDA, I tell them that they should be talking to their vendors, asking when and how they will be implementing, etc., etc. The problem with that approach is that a) most of the time the librarians don’t know what to ask, beyond the when and how; and b) when they get an answer they often don’t know how to interpret it. Maybe I’m slow, but I’m coming to the conclusion that I should stop telling people to talk to their vendors about RDA. I’m not sure it matters.
I went up to the microphone for one of my usual rants, after hearing quite enough of this dancing around. Here’s the reality, as I see it:
1. Libraries are unlikely to agree on what they want (this has been true in the past, and will likely be true in the future)
My rant included all three of those points and more. A little over a year ago, the R2 report on the marketplace for MARC records (upon which I blogged) assumed that there is a marketplace for MARC records that will continue, and that a direct return on investment is possible (or desirable) for creators of data. I said then, and still believe, that such a viewpoint is both unrealistic and in fact destructive to the task of moving forward into a world where data is not the coin of the realm but freely available (this is the basis for linked open data), and where the investment and return on investment revolve around data services, not data sales. After my rant to Big Heads, one of the vendors came up to talk to me and offered some useful nuggets to support my view: a) they provide records, but don’t make much money on them; b) the realm of digital metadata is vastly more complicated than that for physical metadata. It’s a huge challenge for vendors to operate in this world, but clearly the usual answers are no longer working, even as the data revolution is not yet fully upon us. The inevitable conclusion is that vendors who wait for their customers to tell them what they want may not survive the coming revolution. This is no time for chisels.
In this context it’s good to meditate on Henry Ford’s famous statement: “If I had asked people what they wanted, they would have said faster horses.”