A few months ago I was following a conversation on Twitter that got me thinking about how much bit-for-bit duplication there was in our preservation repository and how much space that duplication amounted to.
I let this curiosity sit for a few months and finally pulled the data from the repository in order to get some answers.
Getting the data
Each of the digital objects in our repository has a METS record that conforms to the UNTL-AIP-METS Profile registered with the Library of Congress. Like many other METS profiles, this one makes use of the file section (fileSec), and for each file in a digital object the following pieces of information exist:
Field         Example Value
FileName      ark:/67531/metadc419149
CHECKSUM      bc95eea528fa4f87b77e04271ba5e2d8
CHECKSUMTYPE  MD5
USE           0
MIMETYPE      image/tiff
CREATED       2014-11-17T22:58:37Z
SIZE          60096742
FILENAME      file://data/01_tif/2012.201.B0389.0516.TIF
OWNERID       urn:uuid:295e97ff-0679-4561-a60d-62def4e2e88a
ADMID         amd_00013 amd_00015 amd_00014
ID            file_00005
By extracting this information for each file in each of the digital objects I would be able to get at the initial question I had about duplication at the file level and how much space it accounted for in the repository.
Extracted Data
At the time of writing of this post the Coda Repository that acts as the preservation repository for the UNT Libraries Digital Collections contains 1.3 million digital objects that occupy 285TB of primary data. These 1.3 million digital objects consist of 151 million files that have fixity values in the repository.
The dataset that I extracted has 1,123,228 digital objects because it was extracted a few months ago. Another piece of information that is helpful to know is that the number we report for “files managed by Coda” (the 151 million mentioned above) includes both the primary files ingested into the repository and the metadata files added to the Archival Information Packages as they are ingested. The analysis in this post deals only with the primary data files deposited with the initial SIP and does not include the extra metadata files. This dataset contains information about 60,164,181 files in the repository.
Analyzing the Data
Once I acquired the METS records from the Coda repository I wrote a very simple script to extract information from the file section of the METS records and format that data into a tab-separated dataset that I could use for subsequent analysis work. Because some of the data is duplicated on each row to make processing easier, this resulted in a tab-separated file that is just over 9 GB in size (1.9 GB compressed) and contains 60,164,181 rows, one for each file.
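As a rough illustration of what such an extraction script might look like, here is a minimal Python sketch. It is not the script used for this post. It assumes that each file is described by a standard METS `file` element carrying the attributes listed earlier, and that the file path lives in the `xlink:href` attribute of a child `FLocat` element. The function names `mets_to_rows` and `write_tsv` and the inline sample record are hypothetical.

```python
import csv
import sys
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Attributes copied from each <file> element, in output column order.
FIELDS = ["CHECKSUM", "CHECKSUMTYPE", "USE", "MIMETYPE", "CREATED", "SIZE"]


def mets_to_rows(mets_name, mets_xml):
    """Yield one row (a list of strings) per <file> element in a METS document."""
    root = ET.fromstring(mets_xml)
    for file_el in root.iter(f"{{{METS_NS}}}file"):
        row = [mets_name] + [file_el.get(field, "") for field in FIELDS]
        # Assumption: the path is stored on a child FLocat element.
        flocat = file_el.find(f"{{{METS_NS}}}FLocat")
        href = flocat.get(f"{{{XLINK_NS}}}href", "") if flocat is not None else ""
        row.append(href)
        yield row


def write_tsv(rows, out):
    """Write rows to an open text stream as tab-separated values."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# A tiny made-up METS fragment, just enough to exercise the code.
SAMPLE = """<mets xmlns="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileSec><fileGrp>
    <file ID="file_00005" CHECKSUM="bc95eea528fa4f87b77e04271ba5e2d8" CHECKSUMTYPE="MD5"
          USE="0" MIMETYPE="image/tiff" CREATED="2014-11-17T22:58:37Z" SIZE="60096742">
      <FLocat LOCTYPE="URL" xlink:href="file://data/01_tif/2012.201.B0389.0516.TIF"/>
    </file>
  </fileGrp></fileSec>
</mets>"""

write_tsv(mets_to_rows("metadc419149.aip.mets.xml", SAMPLE), sys.stdout)
```

Repeating the METS file name on every row is what makes the dataset larger than strictly necessary, but it keeps each row self-describing for later processing.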
Here is a representation as a table for a few rows of data.
METS File                  CHECKSUM                          CHECKSUMTYPE  USE  MIMETYPE    CREATION              SIZE      FILENAME
metadc419149.aip.mets.xml  bc95eea528fa4f87b77e04271ba5e2d8  md5           0    image/tiff  2014-11-17T22:58:37Z  60096742  file://data/01_tif/2012.201.B0389.0516.TIF
metadc419149.aip.mets.xml  980a81b95ed4f2cda97a82b1e4228b92  md5           0    text/plain  2014-11-17T22:58:37Z  557       file://data/02_json/2012.201.B0389.0516.json
metadc419544.aip.mets.xml  0fba542ac5c02e1dc2cba9c7cc436221  md5           0    image/tiff  2014-11-17T23:20:57Z  51603206  file://data/01_tif/2012.201.B0391.0539.TIF
metadc419544.aip.mets.xml  0420bff971b151442fa61b4eea9135dd  md5           0    text/plain  2014-11-17T23:20:57Z  372       file://data/02_json/2012.201.B0391.0539.json
metadc419034.aip.mets.xml  df33c7e9d78177340e0661fb05848cc4  md5           0    image/tiff  2014-11-17T23:42:16Z  57983974  file://data/01_tif/2012.201.B0394.0493.TIF
metadc419034.aip.mets.xml  334827a9c32ea591f8633406188c9283  md5           0    text/plain  2014-11-17T23:42:16Z  579       file://data/02_json/2012.201.B0394.0493.json
metadc419479.aip.mets.xml  4c93737d6d8a44188b5cd656d36f1e3d  md5           0    image/tiff  2014-11-17T23:01:15Z  51695974  file://data/01_tif/2012.201.B0389.0678.TIF
metadc419479.aip.mets.xml  bcba5d94f98bf48181e2159b30a0df4f  md5           0    text/plain  2014-11-17T23:01:15Z  486       file://data/02_json/2012.201.B0389.0678.json
metadc419495.aip.mets.xml  e2f4d1d7d4cd851fea817879515b7437  md5           0    image/tiff  2014-11-17T22:30:10Z  55780430  file://data/01_tif/2012.201.B0387.0179.TIF
metadc419495.aip.mets.xml  73f72045269c30ce3f5f73f2b60bf6d5  md5           0    text/plain  2014-11-17T22:30:10Z  499       file://data/02_json/2012.201.B0387.0179.json
My first step was to extract the column that stored the MD5 fixity value, sort that column, and then count the instances of each fixity value in the dataset. The command ends up looking like this:
    cut -f 2 mets_dataset.tsv | sort | uniq -c | sort -nr | head
This worked pretty well and resulted in a list of the MD5 values that occurred the most. This represents the duplication at the file level in the repository.
Count   Fixity Value
72,906  68b329da9893e34099c7d8ad5cb9c940
29,602  d41d8cd98f00b204e9800998ecf8427e
3,363   3c80c3bf89652f466c5339b98856fa9f
2,447   45d36f6fae3461167ddef76ecf304035
2,441   388e2017ac36ad7fd20bc23249de5560
2,237   e1c06d85ae7b8b032bef47e42e4c08f9
2,183   6d5f66a48b5ccac59f35ab3939d539a3
1,905   bb7559712e45fa9872695168ee010043
1,859   81051bcc2cf1bedf378224b0a93e2877
1,706   eeb3211246927547a4f8b50a76b31864
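For anyone who would rather do this counting in Python than in the shell, a minimal sketch follows. It assumes the layout described above, with the checksum in the second tab-separated column. The function name `top_checksums` and the three sample rows are made up for illustration.

```python
import io
from collections import Counter


def top_checksums(lines, n=10):
    """Count values in the second tab-separated column; return the n most common."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1:  # skip blank or malformed lines
            counts[fields[1]] += 1
    return counts.most_common(n)


# Stand-in for an open file handle on the real dataset.
sample = io.StringIO(
    "a.aip.mets.xml\t68b329da9893e34099c7d8ad5cb9c940\tmd5\n"
    "b.aip.mets.xml\t68b329da9893e34099c7d8ad5cb9c940\tmd5\n"
    "c.aip.mets.xml\td41d8cd98f00b204e9800998ecf8427e\tmd5\n"
)
for checksum, count in top_checksums(sample):
    print(count, checksum)
```

One trade-off to keep in mind: this keeps a counter entry for every distinct checksum in memory, which for tens of millions of unique values could need several gigabytes, whereas `sort` can spill to disk.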
There are a few things to note here. First, because of the way that we version items in the repository, there is going to be some duplication. If you are interested in understanding the versioning process we use for our system and the overhead that occurs because of this strategy, you can take a look at the whitepaper we wrote in 2014 on the subject.
Phillips, Mark Edward & Ko, Lauren. Understanding Repository Growth at the University of North Texas: A Case Study. UNT Digital Library. http://digital.library.unt.edu/ark:/67531/metadc306052/. Accessed September 26, 2015.
To get a better idea of the kinds of files that are duplicated in the repository, the following table shows fields for the ten most repeated files.
Count   MD5                               Bytes   Mimetype             Common File Extension
72,906  68b329da9893e34099c7d8ad5cb9c940  1       text/plain           txt
29,602  d41d8cd98f00b204e9800998ecf8427e  0       application/x-empty  txt
3,363   3c80c3bf89652f466c5339b98856fa9f  20      text/plain           txt
2,447   45d36f6fae3461167ddef76ecf304035  195     application/xml      xml
2,441   388e2017ac36ad7fd20bc23249de5560  21      text/plain           txt
2,237   e1c06d85ae7b8b032bef47e42e4c08f9  2       text/plain           txt
2,183   6d5f66a48b5ccac59f35ab3939d539a3  3       text/plain           txt
1,905   bb7559712e45fa9872695168ee010043  61,192  image/jpeg           jpg
1,859   81051bcc2cf1bedf378224b0a93e2877  2       text/plain           txt
1,706   eeb3211246927547a4f8b50a76b31864  200     application/xml      xml
You can see that most of the files that are duplicated are very small in size: 0, 1, 2, and 3 bytes. The largest was a JPEG that was represented 1,905 times in the dataset at 61,192 bytes per copy. The makeup of files for these top examples is txt, xml, and jpg.
Overall we see that for the 60,164,181 rows in the dataset, there are 59,177,155 unique md5 hashes. This means that 98% of the files in the repository are in fact unique. Of the 987,026 rows in the dataset that are duplicates of other fixity values, there are 666,259 unique md5 hashes.
So now we know that there is some duplication in the repository at the file level. Next I wanted to know what kind of effect this has on the storage allocated. I took care of this by taking the 666,259 values that had duplicates and going back to pull the number of bytes for those files. I calculated the storage overhead for each of these fixity values as bytes × (instances − 1), which removes the size of the initial copy and shows only the duplication overhead.
Here is the table for the ten most duplicated files to show that calculation.
Count   MD5                               Bytes per File  Duplicate File Overhead (Bytes)
72,906  68b329da9893e34099c7d8ad5cb9c940  1               72,905
29,602  d41d8cd98f00b204e9800998ecf8427e  0               0
3,363   3c80c3bf89652f466c5339b98856fa9f  20              67,240
2,447   45d36f6fae3461167ddef76ecf304035  195             476,970
2,441   388e2017ac36ad7fd20bc23249de5560  21              51,240
2,237   e1c06d85ae7b8b032bef47e42e4c08f9  2               4,472
2,183   6d5f66a48b5ccac59f35ab3939d539a3  3               6,546
1,905   bb7559712e45fa9872695168ee010043  61,192          116,509,568
1,859   81051bcc2cf1bedf378224b0a93e2877  2               3,716
1,706   eeb3211246927547a4f8b50a76b31864  200             341,000
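The overhead arithmetic in that table can be sketched in a few lines of Python. This is only an illustration of the calculation described above, not the code used to produce the numbers. The function names are hypothetical, and the input is assumed to be (checksum, size in bytes) pairs, one per file.

```python
from collections import Counter


def overhead_bytes(size, instances):
    """Bytes spent on copies beyond the first: size x (instances - 1)."""
    return size * (instances - 1)


def total_overhead(rows):
    """Sum the duplication overhead over an iterable of (checksum, size) pairs."""
    counts = Counter()
    sizes = {}
    for checksum, size in rows:
        counts[checksum] += 1
        sizes[checksum] = size  # identical checksums are assumed to share one size
    return sum(overhead_bytes(sizes[c], n) for c, n in counts.items())


# Three rows taken from the table above.
print(overhead_bytes(1, 72906))     # 72905
print(overhead_bytes(0, 29602))     # 0
print(overhead_bytes(61192, 1905))  # 116509568
```

Note that the empty file contributes nothing no matter how often it repeats, while a single mid-sized JPEG repeated 1,905 times outweighs all the tiny text files combined.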
After taking the overhead for each row of duplicates, I ended up with 2,746,536,537,700 bytes, or 2.75 TB, of overhead because of file duplication in the Coda repository.
Conclusion
I don’t think there is much surprise that there is going to be duplication of files in a repository. The most common file we have that is duplicated is a txt file with just one byte.
What I will do with this information I don’t really know. I think that the overall duplication across digital objects is a feature and not a bug. I like the idea of more redundancy when reasonable. It should be noted that this redundancy is often over files that, from what I can tell, carry very little information (i.e. tiff images of blank pages, or txt files with 0, 1, or 2 bytes of data).
I do know that this kind of data can be helpful when talking with vendors that provide integrated “de-duplication services” in their storage arrays, though that de-duplication is often at a smaller unit than the entire file. It might be interesting to take a stab at seeing what the effect of different de-duplication methodologies and algorithms on a large collection of digital content might be, so if anyone has some interest and algorithms I’d be game to give it a try.
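To make the idea of de-duplication below the file level a little more concrete, here is a minimal Python sketch that estimates how much data would remain if every distinct fixed-size block were stored only once. It is not any vendor's algorithm. The 4096-byte block size, the use of SHA-256, and the function name `block_dedup_stats` are all assumptions for illustration, and real storage arrays may use variable-size chunking instead.

```python
import hashlib


def block_dedup_stats(blobs, block_size=4096):
    """Return (total_bytes, unique_bytes) if each distinct block is stored once."""
    seen = set()
    total = 0
    unique = 0
    for data in blobs:
        for start in range(0, len(data), block_size):
            block = data[start:start + block_size]
            total += len(block)
            digest = hashlib.sha256(block).digest()
            if digest not in seen:
                seen.add(digest)
                unique += len(block)
    return total, unique


# Two made-up "files" that differ as whole files but share blocks.
files = [b"A" * 8192, b"A" * 4096 + b"B" * 4096]
total, unique = block_dedup_stats(files)
print(total, unique)  # 16384 8192
```

In this toy case the two files have different checksums, so file-level comparison finds no duplication at all, yet half of the bytes are repeated blocks.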
That’s all for this post, but I have a feeling I might be dusting off this dataset in the future to take a look at some other information such as filesizes and mimetype information that we have in our repository.
Organizational culture is a very real and very powerful force in every organization. I have worked in a variety of different organizations and each has had its own rituals, norms, values, and assumptions that influenced the way people worked together, shared information, and got things done. Culture is this weird, powerful, unspoken thing that both impacts and is impacted by the people within it. While organizational culture can change over time, it is usually because of major staff turnovers as culture is notoriously difficult to change.
Organizational culture can be positive and healthy or seriously maladaptive, but I think most cultures have a little from column A and a little from column B. Healthy cultures incorporate and adapt to new people and ideas. Maladaptive cultures are notoriously difficult for newcomers to feel welcome in and tend to force them to conform or leave. It’s in organizations with maladaptive cultures where I think the issue of cultural fit can be most problematic.
I know what it feels like to work at a place where you don’t fit. You feel like a second class citizen in just about every interaction. You go from participating in meetings to avoiding speaking at all costs. You feel like your perspective is not taken seriously and the projects you’re involved in are marginalized. There were a few of us at that job to whom it was made painfully clear that we were the odd men out. These were not slackers who did a crappy job, but folks who were passionate about and devoted to their work. Not fitting was torture for my psyche and made me question whether there was something inherently wrong with me.
Based on my experience, you might think I’d be suggesting that people carefully screen their applicants for “fit.” That couldn’t be further from the truth. Screening for cultural fit tends to lead to monocultures that don’t embrace diversity of any kind — racial, gender, perspective, experience, etc. Monocultures are toxic and have difficulty adapting to change. Hiring people in your own image leads to an organization that can’t see clearly beyond its navel. As expressed in the article “Is Cultural Fit a Qualification for Hiring or a Disguise for Bias?” in Knowledge @ Wharton —
Diversity in the workplace has long been valued as a way to introduce new ideas, but researchers have found other reasons for cultivating heterogeneity. Information was processed more carefully in heterogeneous groups than homogenous groups, according to “Is the Pain Worth the Gain? The Advantages and Liabilities of Agreeing With Socially Distinct Newcomers,” by Katherine W. Phillips, Katie A. Liljenquist and Margaret A. Neale, published in Personality and Social Psychology Bulletin. Social awkwardness creates tension, and this is beneficial, the study found. “The mere presence of socially distinct newcomers and the social concerns their presence stimulates among old-timers motivates behavior that can convert affective pains into cognitive gains” — or, in other words, better group problem solving.
So perhaps bringing people in who aren’t such a perfect fit, and maybe even challenge the current structure a bit, is very good for the organization. Any time I have worked with someone who has a very different perspective and lived experience than I have, I have learned so much. I remember when we hired an instructional designer at the PSU Library who came from outside of libraries, I found that it was much more difficult to get on the same page, but the ideas he brought to our work more than compensated for any difficulties I had as a manager. He allowed us to see beyond our myopic librarian view. I think hiring people with different cultural, racial, gender, socioeconomic, etc. backgrounds provides similar benefits to the organization.
Whether it is conscious or unconscious, hiring people who are “like you” is bias, and it tends to result in organizations that are less diverse; not only in terms of perspectives, but in terms of race/gender/religion/etc. When you’re on a hiring committee, how often do you find yourself judging candidates based on qualities you value in a colleague rather than the stated qualifications? It probably happens more than we’d all like to admit.
It’s easy to fall into the trap of considering fit without even thinking about it. I remember when I was on my first hiring committee, once we’d weeded out those candidates who didn’t meet the minimum qualifications, I felt myself basing my evaluation of the rest on whether or not they had the traits I value in a colleague. The person we hired ended up becoming a good friend and while he did a fantastic job in his role, part of me wishes I had put my personal biases aside when making that decision. I may still have championed him, but I would have done it for the right reasons.
One thing I feel strongly that we should hire for is shared values. It is critical that the person one hires doesn’t hold values antithetical to the work of the organization. I don’t care anymore if a candidate seems like they could be a friend, but I do care if they evidence and support the goals and values of my library and community college. Just having the required qualifications isn’t enough; being a community college librarian isn’t for everyone.
Unfortunately, in reading this New York Times article, “Guess Who Doesn’t Fit In at Work”, and from my own experiences, people are judged by much more than shared values, which unintentionally biases people doing hiring against folks who have different lived experiences and interests. This is discrimination, plain and simple. When I was looking for an image to use for this post, I found this blog post about how people doing hiring should look at candidates’ social media profiles to scan for cultural fit. That we should look at what restaurants candidates visit and what things they favorite on Twitter frankly scares the crap out of me. Because in doing that, you’re saying that people with different views or outside-of-work activities are not welcome in your organization.
What we need is to embrace diversity in its many forms and value contributions from everyone, but that is easier said than done. I like the suggestions the New York Times article has regarding hiring for fit without bias:
First, communicate a clear and consistent idea of what the organization’s culture is (and is not) to potential employees. Second, make sure the definition of cultural fit is closely aligned with business goals. Ideally, fit should be based on data-driven analysis of what types of values, traits and behaviors actually predict on-the-job success. Third, create formal procedures like checklists for measuring fit, so that assessment is not left up to the eyes (and extracurriculars) of the beholder.
Finally, consider putting concrete limits on how much fit can sway hiring. Many organizations tell interviewers what to look for but provide little guidance about how to weigh these different qualities. Left to their own devices, interviewers often define merit in their own image.
Clearly, the more structured the process and the less leeway there is for making decisions based on aspects of a candidate’s personality, interests and background, the less likely the bias.
And what of those cultures that may hire for diversity but then treat people with different ideas and experiences like pariahs? Unfortunately, I get the sense that changing culture is nearly impossible without a decent amount of staff turnover. I witnessed a culture shift in my first library job, but it was because my boss had hired over half the staff over a period of about six years and was able to cultivate the right mix of values and diverse characters. I’ve also seen new administrators come into organizations with really strong, entrenched cultures and fail spectacularly at creating any kind of culture change. Fixing the problem of bias in hiring is only half the problem. We also need to embrace diversity in our organizations so that people of color or people with divergent ideas feel valued by the organization.
I feel very lucky that I work at a library that values diversity and diverse perspectives. We have a group of librarians who have different passions, different viewpoints, and very different personalities. Yet I don’t see anyone marginalizing anyone else. I don’t see anyone whose opinions are taken less seriously than anyone else’s. I don’t see people playing favorites or being cliquish. What I see is a diverse group of people who value each other’s opinions and also value consensus-building. We don’t always come to complete agreement, but we accept and respect the way things go. We have a functional adhocracy, where we feel empowered to act and where we alternate taking and sharing leadership roles organically. I feel like everyone is valued for what they bring to the group and everyone brings something very different. Even after one year, it still feels like heaven to me and it’s certainly not because everyone is like me.
We have a long way to go in building diverse libraries, but becoming keenly aware of how our unconscious preferences in hiring and our organizational cultures can help or harm diversity is a good step in the right direction.
Lately I’ve been looking back through the past of the Digital Library Production Service (DLPS) -- in fact, all the way back to the time before DLPS, when we were the Humanities Text Initiative -- to see what, if anything, we’ve learned that will help us as we move forward into a world of Hydra, ArchivesSpace, and collaborative development of repository and digital resource creation tools.
DuraSpace News: Telling DSpace Stories at the International Livestock Research Institute (ILRI) with Alan Orth
“Telling DSpace Stories” is a community-led initiative aimed at introducing project leaders and their ideas to one another while providing details about DSpace implementations for the community and beyond. The following interview includes personal observations that may not represent the opinions and views of the International Livestock Research Institute (ILRI) or the DSpace Project.
I came across this video on a friend’s Facebook feed. I’m a chronic multitasker, but by half a minute in I stopped doing whatever else I was doing and just watched and listened. This is the part that grabbed my heart:
This is my star. I had to wear it on my chest, of course, like all the Jews. It’s big, isn’t it? Especially for a child. That was when I was 8 years old.
Francine Christophe’s voice was also very powerful and moved me. She enunciates each word so clearly. My French isn’t great, but she speaks slowly and clearly enough that I can understand her. The subtitles also confirm that I’m understanding correctly and reinforce what she’s saying.
I noticed that there was something different about the subtitles. The font is clear and elegant and the words are positioned in the blank space beside her face. I can watch her face and her eyes while I read the subtitles. My girlfriend reminded me of something I had said when I was reviewing my Queer ASL lesson at home. In ASL I learned that when fingerspelling you position your hand up by your face, as your face (especially your eyebrows) is part of the language. Even when we speak English our faces communicate so much.
I’ve seen a bunch of these short videos from this film. They are everyday people telling amazing stories about the huge range of experiences that people have on this planet. The people who are filmed are from all over the world, and speaking in various languages. The design decision to shoot people with enough space to put the subtitles beside them is really powerful. For me the way the subtitles are done enhances the feeling of empathy.
A couple of weeks ago I was at a screening event of Microsoft’s Inclusive video at OCAD in Toronto. In the audience were many students of the Inclusive Design program who were in the video. One of the students asked if the video included description of the visuals for blind and visually impaired viewers. The Microsoft team replied that it didn’t and that often audio descriptions were distracting for viewers who didn’t need them. The student asked if there could’ve been a way to weave the audio description into the interviews, perhaps by asking the people who were speaking to describe where they were and what was going on, instead of tacking on the audio description afterwards. I love this idea.
HUMAN is very successful in skillfully including captions that are beautiful, enhance the storytelling, provide access to Deaf and Hard of Hearing people, provide a way for people who know a bit of the language to follow along with the story as told in the storyteller’s mother tongue, and make it easy to translate the film into other languages. I’m going to include this example in the work we’re doing around universal design for learning with the BC Open Textbook project.
I can’t wait to see the whole film of HUMAN. I love the stories that they are telling and the way that they are doing it.
But first, some background. Most of China's internet connects to the rest of the world through what's known in the rest of the world as "the Great Firewall of China". Similar to network firewalls used for most corporate intranets, the Great Firewall is used as a tool to control and monitor internet communications in and out of China. Websites that are deemed politically sensitive are blocked from view inside China. This blocking has been used against obscure and prominent websites alike. The New York Times, Google, Facebook and Twitter have all been blocked by the firewall.
When web content is unencrypted, it can be scanned at the firewall for politically sensitive terms such as "June 4th", a reference to the Tiananmen Square protests, and blocked at the webpage level. China is certainly not the only entity that does this; many school systems in the US do the same sort of thing to filter content that's considered inappropriate for children. Part of my motivation for working on the "Library Digital Privacy Pledge" is that I don't think libraries and publishers who provide online content to them should be complicit in government censorship of any kind.
Last March, however, China's Great Firewall was associated with an offensive attack. To put it more accurately, software co-located with China's Great Firewall turned innocent users of unencrypted websites into attack weapons. The targets of the attack were "GreatFire.org", a website that works to provide Chinese netizens a way to evade the surveillance of the Great Firewall, and GitHub.com, the website that hosts code for hundreds of thousands of programmers, including those supporting GreatFire.org.
Here's how the Great Cannon operated
In August, Bill Marczak and co-workers from Berkeley, Princeton and Citizen Lab presented their findings on the Great Cannon at the 5th USENIX Workshop on Free and Open Communications on the Internet.
Our findings in China add another documented case to at least two other known instances of governments tampering with unencrypted Internet traffic to control information or launch attacks—the other two being the use of QUANTUM by the US NSA and UK’s GCHQ.[reference] In addition, product literature from two companies, FinFisher and Hacking Team, indicate that they sell similar “attack from the Internet” tools to governments around the world [reference]. These latest findings emphasize the urgency of replacing legacy web protocols like HTTP with their cryptographically strong counterparts, such as HTTPS.
It's worth thinking about how libraries and the resources they offer might be exploited by a man-in-the-middle attacker. Science journals might be extremely useful in targeting espionage scripts at military facilities, for example. A saboteur might alter reference technical information used by a chemical or pharmaceutical company with potentially disastrous consequences. It's easy to see why any publisher that wants its information to be perceived as reliable has no choice but to start encrypting its services now.
The unencrypted services of public libraries are attractive targets for other sorts of mischief, ironically because of their users' trust in them and because they have a reputation for protecting privacy. Think about how many users would enter their names, phone numbers, and last four digits of their social security numbers if a library website seemed to ask for it. When a website is unencrypted, it's possible for "man-in-the-middle" attacks to insert content into an unencrypted web page coming from a library or other trusted website. An easy way for an attacker to get into position to execute such an attack is to spoof a wifi network, for example in a cafe or other public space, such as a library. It doesn't help if only a website's login is encrypted if an attacker can easily insert content into the unencrypted parts of the website.
To be clear, we don't know that libraries and the type of digital resources they offer are being targeted for weaponization, espionage or other sorts of mischief. Unfortunately, the internet offers a target-rich environment of unencrypted websites.
I believe that libraries and their suppliers need to move swiftly to take the possibility off the table and help lead the way to a more secure digital environment for us all.
[note: Technically, the Great Cannon executed a "man-on-the-side" variant of a "man-in-the-middle" attack, not unlike the NSA's "QuantumInsert" attack revealed by Edward Snowden.]
After about a month of working with the headings validation tool, I’m ready to start adding a few enhancements to provide some automated headings corrections. The first change to be implemented will be automatic correction of headings where the preferred heading is different from the in-use heading. This will be implemented as an optional element. If this option is selected, the report will continue to note variants as part of the validation report, but when exporting data for further processing, automatically corrected headings will not be included in the record sets for further action.
Additionally, I’ll continue to look at ways to improve the speed of the process. While there are some limits to what I can do since this tool relies on a web service (outside of providing an option for users to download the ~10GB worth of LC data locally), there are a few things I can do to continue to ensure that only new items are queried when resolving links.
These changes will be made available in the next update.
Pull up a chair and set a while: I shall talk of my progress in the doctoral program; my research interests, particularly LGBT leadership; the value of patience and persistence; Pauline Kael; and my thoughts on leadership theory. I include a recipe for cupcakes. Samson, my research assistant, wanted me to add something about bonito flakes, but that’s really his topic.
My comprehensive examinations are two months behind me: two four-hour closed-book exams, as gruesome as it sounds. Studying for these exams was a combination of high-level synthesis of everything I had learned for 28 months and rote memorization of barrels of citations. My brain was not feeling pretty.
I have been re-reading the qualifying paper I submitted earlier this year, once again feeling grateful that I had the patience and persistence to complete and then discard two paper proposals until I found my research beshert, about the antecedents and consequences of sexual identity disclosure for academic library directors. That’s fancy-talk for a paper that asked, why did you come out, and what happened next? The stories participants shared with me were nothing short of wonderful.
As the first major research paper I have ever completed, it is riddled with flaws. At 60–no, now, 52–pages, it is also an unpublishable length, and I am trying to identify what parts to chuck, recycle, or squeeze into smaller dress sizes, and what would not have to be included in a published paper anyway.
But if there is one thing I’ve learned in the last 28 months, it is that it is wise to pursue questions worth pursuing. I twice made the difficult decision to leave two other proposals on the cutting-room floor, deep-sixing many months of effort. But in the end that meant I had a topic I could live with through the long hard slog of data collection, analysis, and writing, a topic that felt so fresh and important that I would mutter to myself whilst working, “I’m in your corner, little one.”
As I look toward my dissertation proposal, I find myself again (probably, but not inevitably) drawn toward LGBT leadership–even more so when people, as occasionally happens, question this direction. A dear colleague of mine questioned the salience of one of the themes that emerged from my study, the (not unique) idea of being “the only one.” Do LGBT leaders really notice when they are the only ones in any group setting, she asked? I replied, do you notice when you’re the only woman in the room? She laughed and said she saw my point.
The legalization of same-gender marriage has also resulted in some hasty conclusions by well-meaning people, such as the straight library colleague from a liberal coastal community who asked me if “anyone was still closeted these days.” The short answer is yes. A 2013 study of over 800 LGBT employees across the United States found that 53 percent of the respondents hide who they are at work.
But to unpack my response requires recalling Pauline Kael’s comment about not knowing anyone who voted for Nixon (a much wiser observation than the mangled quote popularly attributed to her): “I live in a rather special world. I only know one person who voted for Nixon. Where they are I don’t know. They’re outside my ken. But sometimes when I’m in a theater I can feel them.”
In my study, I’m pleased to say, most of the participants came from outside that “rather special world.” I recruited participants through calls to LGBT-focused discussion lists which were then “snowballed” out to people who knew people who knew people, and to quote an ancient meme, “we are everywhere.” The call for participation traveled several fascinating degrees of separation. If only I could have chipped it like a bird and tracked it! As it was, I had 10 strong, eager participants who generated 900 minutes of interview data, and the fact that most were people I didn’t know made my investigation that much better.
After the data collection period for my research had closed, I was occasionally asked, “Do you know so-and-so? You should use that person!” In a couple of cases colleagues complained, “Why didn’t you ask me to participate?” But I designed my study so that participants had to elect to participate during a specific time period, and they did; I had to turn people away.
The same HRC study I cite above shrewdly asked questions of non-LGBT respondents, who revealed their own complicated responses to openly LGBT workers. “In a mark of overall progress in attitudinal shifts, 81% of non-LGBT people report that they feel LGBT people ‘should not have to hide’ who they are at work. However, less than half would feel comfortable hearing an LGBT coworker talk about their social lives, dating or related subject.” I know many of you reading this are “comfortable.” But you’re part of my special world, and I have too much experience outside that “special world” to be surprised by the HRC’s findings.
Well-meaning people have also suggested more than once that I study library leaders who have not disclosed their sexual identity. Aside from the obvious recruitment issues, I’m far more interested in the interrelationship between disclosure and leadership. There is a huge body of literature on concealable differences, but suffice it to say that the act of disclosure is, to quote a favorite article, “a distinct event in leadership that merits attention.” Leaders make decisions all the time; electing to disclose–an action that requires a million smaller decisions throughout life and across life domains–is part of that decision matrix, and inherently an important question.
My own journey into research
If I were to design a comprehensive exam for the road I have been traveling since April, 2013, it would be a single, devilish open-book question to be answered over a weekend: describe your research journey.
Every benchmark in the doctoral program was a threshold moment for my development. Maybe it’s my iconoclast spirit, but I learned that I lose interest when the chain of reasoning for a theory traces back to prosperous white guys interviewing prosperous white guys, cooking up less-than-rigorous theories, and offering prosperous-white-guy advice. “Bring more of yourself to work!” Well, see above for what happens to some LGBT people when they bring more of themselves to work. It’s true that the participants in my study did just that, but it was with an awareness that authenticity has its price as well as its benefits.
The more I poked at some leadership theories, the warier I became. Pat recipes and less-than-rigorous origin stories do not a theory make. (Resonant leadership cupcakes: stir in two cups of self-awareness; practice mindfulness, hope, and compassion; bake until dissonance disappears and renewal is evenly golden.) Too many books on leadership “theory” provide reasonable and generally useful recommendations for how to function as a leader, but are so theoretically flabby that, if they were written by women, they would be labeled self-help books.
(If you feel cheated because you were expecting a real cupcake recipe, here’s one from Cook’s Catalog, complete with obsessive fretting about what makes it a good cupcake.)
I will say that I would often study a mainstream leadership theory and then see it in action at work. I had just finished boning up on Theory X and Theory Y when someone said to me, with an eye-roll no less, “People don’t change.” Verily, the scales fell from my eyes and I revisited moments in my career where a manager’s X-ness or Y-ness had significant implications. (I have also asked myself if “Theory X” managers can change, which is an X-Y test in itself.) But there is a difference between finding a theory useful and pursuing it in research.
I learned even more when I deep-sixed my second proposal, a “close but no cigar” idea that called for examining a well-tested theory using LGBT leader participants. The idea has merit, but the more I dug into the question, the more I realized that the more urgent question was not how well LGBT leaders conform to predicted majority behavior, but instead the very whatness of the leaders themselves, about which we know so little.
It is no surprise that my interest in research methods also evolved toward exploratory models such as grounded theory and narrative inquiry that are designed to elicit meaning from lived experience. Time and again I would read a dissertation where an author was struggling to match experience with predicated theory when the real findings and “truth” were embedded in the stories people told about their lives. To know, to comprehend, to understand, to connect: these stories led me there.
Bolman and Deal’s “frames” approach also helped me diagnose how and why people behave as they do in organizations, even if I occasionally wonder whether there could be another frame, whether two of the frames are really one frame, or even whether “framing” itself is a product of its time.
For that matter, mental models are a useful sorting hat for leadership theorists. Schein and Bolman see the world very differently, and so follows the structure of their advice about organizational excellence. Which brings me back to the question of my own research into LGBT leadership.
In an important discussion about the need for LGBT leadership research, Fassinger, Shullman, and Stevenson get props for (largely) moving the barycenter of LGBT leadership questions from the conceptual framework of being acted upon toward questions about the leaders themselves and their complex, agentic decisions and interactions with others. Their discussion of the role of situation feels like an enduring truth: “in any given situation, no two leaders and followers may be having the same experience, even if obvious organizational or group variables appear constant.”
What I won’t do is adopt their important article on directions for LGBT leadership research as a Simplicity dress pattern for my leadership research agenda. They created a model; well, you see I am cautious about models. Even my own findings are at best a product of people, time, and place, intended to be valid in the way that all enlightenment is valid, but not deterministic.
So on I go, into the last phase of the program. In this post I have talked about donning and discarding theories as if I had all the time in the world, which is not how I felt in this process at all. It was the most agonizing exercise in patience and persistence I’ve ever had, and I questioned myself along the entire path. I relearned key lessons from my MFA in writing: some topics are more important than others; there is always room for improvement; writing is a process riddled with doubt and insecurity; and there is no substitute for sitting one’s behind in a chair and writing, then rewriting, then writing and rewriting some more.
So the flip side of my self-examination is that I have renewed appreciation for the value of selecting a good question and a good method, and pressing on until done. I have no intention of repeating my Goldilocks routine.
Will my dissertation be my best work? Two factors suggest otherwise. First, I have now read countless dissertations where somewhere midway in the text the author expresses regret, however subdued, that he or she realized too late that the dissertation had some glaring flaw that could not be addressed without dismantling the entire inquiry. Second, though I don’t know that I’ve ever heard it expressed this way, from a writer’s point of view the dissertation is a distinct genre. I have become reasonably comfortable with the “short story” equivalent of the dissertation. But three short stories do not a novel make, and rarely do one-offs lead to mastery of a genre.
But I will at least be able to appreciate the problem for what it is: a chance to learn, and to share my knowledge; another life experience in the “press on regardless” sweepstakes; and a path toward a goal: the best dissertation I will ever write.
Today I found the following resources and bookmarked them on Delicious.
- iDoneThis: Reply to an evening email reminder with what you did that day. The next day, get a digest with what everyone on the team got done.
Guest post by Dave Hansen, a Clinical Assistant Professor and Faculty Research Librarian at the University of North Carolina’s School of Law, where he runs the library’s faculty research service.
Wouldn’t libraries and archives like to be able to digitize their collections and make the texts and images available to the world online? Of course they would, but copyright inhibits this for most works created in the last 100 years.
The U.S. Copyright Office recently issued a report and a request for comments on its proposal for a new licensing system intended to overcome copyright obstacles to mass digitization. While the goal is laudable, the Office’s proposal is troubling and vague in key respects.
The overarching problem is that the Office’s proposal doesn’t fully consider how libraries and archives currently go about digitization projects, and so it misidentifies how the law should be improved to allow for better digital access. It’s important that libraries and archives submit comments to help the Office better understand how to make recommendations for improvements.
Below is a summary of the Office’s proposal and five specific reasons why libraries and archives should have reservations about it. I strongly encourage you to read the proposal and Notice of Inquiry closely and form your own judgment about it.
For commenting, a model letter is available here (use this form to fill in basic information), but you should tailor it with details that are important to your institution. Comments are due to the Copyright Office by October 9, 2015. The comment submission page is here.
The Copyright Office’s Licensing Proposal
The Copyright Office’s proposal is that Congress enact a five-year pilot “extended collective licensing” (ECL) system that would allow collecting societies (e.g., the Authors Guild or the Copyright Clearance Center) to grant licenses for mass digitization for nonprofit uses.
Societies could, in theory, already grant mass digitization licenses for works owned by their members. The Office’s proposed ECL system would allow collecting societies to go beyond that, and also grant licenses for all works that are similar to those owned by their members, even if the owners of those similar works are not actually members of the collective themselves. That’s the “extended” part of the license; Congress would, by statute, extend the society’s authority to grant licenses on behalf of both members and non-members alike. Such a system would help to solve one of the most difficult copyright problems libraries and archives face: tracking down rights holders. Digitizers would instead need only to negotiate and purchase a license from the collecting societies, simplifying the rights clearance process.
Why the Copyright Office’s Proposal is Troubling
In the abstract, the Office’s proposal sounds appealing. But for digitizing libraries and archives, the details make it troubling for these five reasons:
First, the proposal doesn’t address the types of works that libraries and archives are working hardest to preserve and make available online—unique collections that include unpublished works such as personal letters or home photographs. Instead of focusing on these works for which copyright clearance is hardest to obtain, the proposal applies to only three narrow categories: 1) published literary works, 2) published embedded pictorial or graphic works, and 3) published photographs.
Second, given the large variety of content types in the collections that libraries and archives want to digitize—particularly special collections that include everything from unpublished personal papers, to out-of-print books, to government works—there is no one collecting society that could ever offer a license for mass digitization of entire collections. If seeking a license, libraries and archives would still be forced to negotiate with a large number of parties. And because the proposed ECL pilot would include only published works, large sections of collections would remain unlicensable anyway.
Third, digitization is an expensive investment. Because the system would be a five-year pilot project, few libraries or archives would be able to pay what it will cost to digitize works (not to mention ECL license fees) if those works have to be taken offline in a few years when the ECL system expires.
Fourth, for an ECL system to truly address the costs of clearing rights, it would need to include licensing orphan works (works whose owners cannot be located) alongside all other works. While the Copyright Office acknowledges in one part of its report that licensing of orphan works doesn’t make sense because it would require payment of fees that would never go to owners, it later specifies an ECL system that would do just that. The Society of American Archivists said it best in its comments to the Copyright Office: “[R]epositories that are seeking to increase access to our cultural heritage generally have no surplus funds. . . . Allocating those funds in advance to a licensing agency that will only rarely disperse them would be wasteful, and requiring such would be irresponsible from a policy standpoint.”
Finally, one of the most unsettling things about the ECL proposal is its threat to the one legal tool that is currently working for mass digitization: fair use. To be clear, fair use doesn’t work for all mass digitization uses. But it likely does address many of the uses that libraries and archives are most concerned with, including nonprofit educational uses of orphan works, and transformative use of special collections materials.
The Office recognized concerns about fair use in its report, and in response proposed a “fair use savings clause” that would state that “nothing in the [ECL] statute is intended to affect the scope of fair use.” Even with an effective savings clause, the existence of the ECL system alone could shrink the fair use right because fewer users might rely on it in favor of more conservative licensing. As legal scholars have observed, fair use is like a muscle: its strength depends in part on how it is used.
Rather than focus its energy on creating a licensing system that can only reach a small segment of library and archive collections, the Office should instead promote the use of legal tools that are working, such as fair use, and work to solve the problems underlying the rights-clearance issues by helping to create better copyright information registries and by studying solutions that would encourage rightsholders to make themselves easier to find for potential users of their works.
Creative Commons (CC) licenses are public copyright licenses. What does this mean? It means they allow for free distribution of work that would otherwise be under copyright, providing open access to users. Creative Commons licensing provides both gratis OA licensing and libre OA licensing (terms coined by Peter Suber). Gratis OA is free to use; libre OA is free to use and free to modify.
How does CC licensing benefit the artist? Well, it gives them more flexibility in deciding what others may do with their work. How does it benefit the user? As a user, you are protected from copyright infringement claims, as long as you follow the CC license conditions.
CC licenses: in a nutshell with examples
BY – attribution | SA – share alike | NC – non-commercial | ND – no derivs
CC0 – creative commons zero license means this work is in the public domain and you can do whatever you want with it. No attribution is required. This is the easiest license to work with. (example of a CC0 license: Unsplash)
BY – This license means that you can do as you wish with the work but only as long as you provide attribution for the original creator. Works with this type of license can be expanded on and used for commercial use, if the user wishes, as long as attribution is given to the original creator. (example of a CC-BY license: Figshare ; data sets at Figshare are CC0; PLOS)
BY-SA – This license is an attribution and share-alike license, meaning that all new works based on the original work must carry the same license. (example of a CC-BY-SA license: Arduino)
BY-NC – This license is another attribution license, but the user does not have to retain the same licensing terms as the original work. The catch: the user must use the work non-commercially. (example of a BY-NC license: Ely Ivy from the Free Music Archive)
BY-ND – This license means the work can be shared, commercially or non-commercially, but without change to the original work and attribution/credit must be given. (example of a BY-ND license: Free Software Foundation)
BY-NC-SA – This license combines the share alike and the non-commercial with an attribution requirement. That is, the work can be used (with attribution/credit) only for non-commercial purposes, and all new works must retain the same BY-NC-SA license. (example of a CC BY-NC-SA: Nursing Clio see footer or MITOpenCourseWare)
BY-NC-ND – This license combines the non-commercial and non-derivative licenses with an attribution requirement. Meaning, you can only use works with this license with attribution/credit for non-commercial use and they cannot be changed from the original work. (example of a BY-NC-ND license: Ted Talk Videos)
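One way to keep these licenses straight is to read each code as a combination of the four abbreviations listed above. The following is a minimal Python sketch of that idea. The function name `describe_license` and the dictionary keys are hypothetical, chosen only for illustration, and the logic restates the plain-language summaries in this post rather than the legal text of the licenses.

```python
# Illustrative sketch: restates the plain-language summaries in this post,
# not the legal text of the licenses. All names here are hypothetical.

CONDITIONS = {"BY", "SA", "NC", "ND"}


def describe_license(code: str) -> dict:
    """Translate a code such as 'BY-NC-SA' into yes/no answers for a user."""
    normalized = code.upper().strip()
    if normalized == "CC0":
        # Public domain: no conditions apply.
        return {
            "attribution_required": False,
            "commercial_use_allowed": True,
            "modifications_allowed": True,
            "same_license_required": False,
        }
    # Accept "BY-NC", "CC-BY-NC", or "CC BY-NC" and drop the "CC" prefix.
    parts = {p for p in normalized.replace(" ", "-").split("-") if p and p != "CC"}
    if not parts <= CONDITIONS or "BY" not in parts:
        raise ValueError(f"unrecognized license code: {code!r}")
    if {"SA", "ND"} <= parts:
        # None of the licenses listed above combines share-alike with no-derivs.
        raise ValueError("SA and ND are not combined in any license listed here")
    return {
        "attribution_required": True,
        "commercial_use_allowed": "NC" not in parts,
        "modifications_allowed": "ND" not in parts,
        "same_license_required": "SA" in parts,
    }


if __name__ == "__main__":
    for code in ("CC0", "BY", "BY-SA", "BY-NC", "BY-ND", "BY-NC-SA", "BY-NC-ND"):
        print(code, describe_license(code))
```

Running the script prints one line per license, which makes it easy to compare, say, BY-NC with BY-NC-SA at a glance. Treat it as a memory aid only; the license deeds on the Creative Commons site remain the authority.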
Washington, DC - Let ACRL’s Scholarly Communication Toolkit help you prepare to lead events on your campus during Open Access Week, October 19-25, 2015. Open Access Week, a global event now entering its eighth year, is an opportunity for the academic and research community to continue to learn about the potential benefits of Open Access, to share what they’ve learned with colleagues, and to help inspire wider participation in helping to make Open Access a new norm in scholarship and research.
From Michele Mennielli, Cineca
As announced in August’s DuraSpace Digest, just a few days after the release of version 5.2.1, Cineca released version 5.3.0 of DSpace-CRIS on August 25, 2015. The new version is aligned with the latest DSpace 5 release and includes a new widget that uses the dynamic properties of CRIS objects to support hierarchical classifications such as ERC Sectors and MSC (Mathematics Subject Classification).
At this stage of the copyright reform effort, the U.S. House Judiciary Committee is meeting with stakeholders for “listening sessions,” which give concerned rights holders or users of content an opportunity to make their case for a copyright fix. To reach a broader audience, the Committee is going on the road to reach individuals and groups around the country, and one would think, to hear a range of opinions from the community. So, on September 22, they went to Nashville, a music mecca, to hold a listening session regarding music copyright reform.
Music, perhaps more than any other form of creative expression, needs to be re-examined. New business models for digital streaming, fair royalty rights, and requests for transparency have all created a need for clarity on who gets paid for what in the music business. We need policy that answers this question in a way that’s fair to everyone. One thing has been agreed on by copyright stakeholders thus far—people should be compensated for their intellectual and creative work. Wonderful.
But lo and behold: the same industry and trade group lobbyists who always get a chance to meet with Congressional representatives and staff in DC turned out to be nearly the only music stakeholder groups invited to speak. What gives?
It looks like the House merely brought the usual suspects, a list of “who do we know (already)?”, to the table. It would have been simple for the Committee to convene a wide gamut of music stakeholders to paint a full picture of the state of the music industry, given the fact that they met in Nashville. Ultimately, however, other key stakeholders (Out of the Box, Sorted Noise, community radio, music educators, librarians, archivists, and consumers) were not heard, and only one (older) version of the state of the music industry (that the Committee already knows about) took center stage.
So, why go to Nashville?
Don’t get me wrong. It is a good thing that the Committee wants to hear from all stakeholders and it is thoughtful to hold listening sessions in geographically diverse locations, but you have to give people you don’t already know an opportunity to speak. That’s the only way to learn about new business models and how best to cultivate music creators of tomorrow—to truly understand how the creativity ecosystem can thrive in the future and then what legislative changes are needed to realize that future.
What does a trade agreement have to do with libraries and copyright? Expert Krista Cox, who has traveled the world promoting better policies for the intellectual property chapter of the Trans-Pacific Partnership Agreement (TPP), will enlighten us at our next CopyTalk webinar.
There is no need to pre-register for this free webinar! Just go to: http://ala.adobeconnect.com/r46u21nc214 on October 1, 2015 at 2 p.m. Eastern/11 a.m. Pacific.
Note that the webinar is limited to 100 seats so watch with colleagues if possible. An archived copy will be available after the webinar.
The Trans-Pacific Partnership Agreement (TPP) is a large regional trade agreement currently being negotiated between twelve countries: Australia, Brunei, Canada, Chile, Japan, Malaysia, Mexico, New Zealand, Peru, Singapore, the United States and Vietnam. The agreement has been negotiated behind closed doors, but due to various leaks of the text it is apparent that the TPP will include a comprehensive chapter on intellectual property, including specific provisions governing copyright and enforcement. In addition to requiring other countries to change their laws, the final agreement could lock in controversial provisions of US law and prevent reform in certain areas.
Krista Cox is Director of Public Policy Initiatives at the Association of Research Libraries (ARL). In this role, she advocates for the policy priorities of the Association and executes strategies to implement these priorities. She monitors legislative trends and participates in ARL’s outreach to the Executive Branch and the US Congress.
Prior to joining ARL, Krista worked as the staff attorney for Knowledge Ecology International (KEI) where she focused on access to knowledge issues as well as TPP. Krista received her JD from the University of Notre Dame and her BA in English from the University of California, Santa Barbara. She is licensed to practice before the Supreme Court of the United States, the Court of Appeals for the Federal Circuit, and the State Bar of California.
In 2015, DPLA piloted a National History Day partnership with National History Day in Missouri, thanks to the initiative of community rep Brent Schondelmeyer. For 2016, DPLA will be partnering with NHDMO and two new state programs: National History Day – California and National History Day in South Carolina. For each program, DPLA designs research guides based on state and national topics related to the contest theme, acts as an official sponsor, and offers a prize for the best project that extensively incorporates DPLA resources.
In this post, NHDMO Coordinator Maggie Mayhan describes the value of DPLA as a resource for NHD student researchers. To learn more about DPLA and National History Day partnerships, please email email@example.com.
Each year more than 3,500 Missouri students take part in National History Day (NHD), a unique opportunity for sixth- through twelfth-grade students to explore the past in a creative, hands-on way. While producing a documentary, exhibit, paper, performance, or website, they become experts on the topic of their choosing.
In following NHD rules, students quickly learn that the primary sources they are required to use in their projects also help them to tell their stories effectively. But where do they start their search for those sources? How can it be manageable and meaningful?
Enter Digital Public Library of America (DPLA). Collecting and curating digital sources from libraries, museums, and archives, the DPLA portal connects students and teachers with the resources that they need. For students who cannot easily visit specialized repositories to work with primary sources, DPLA may even be the connection that enables them to participate in National History Day.
National History Day in Missouri loves how DPLA actively works to fuse history and technology, encouraging students to use modern media to access and share history. Knowing how to use new technologies to find online archives, databases, and other history sources is important for future leaders seeking to explore the past.
Seeing the potential for a meaningful collaboration in which students uncover history through the DPLA collections and put their own stamp on it through National History Day projects, the Digital Public Library of America became a major program sponsor in 2015.
Additionally, DPLA sponsors a special prize at the National History Day in Missouri state contest, awarded to the best documentary or website that extensively incorporates DPLA resources. The 2015 prize winners, Keturah Gadson and Daniela Hinojosa from Pattonville High School in St. Louis, pointed out that DPLA access was important for their award-winning website about civil rights activist Thurgood Marshall:
We found that the sources on the Digital Public Library of America fit amazingly into our research and boosted it where we were lacking… the detail we gained from looking directly at the primary sources was unmatched…DPLA sources completed our research wonderfully.
National History Day in Missouri is excited to continue this partnership in 2016, and we look forward to seeing what resources students will discover as they focus on the 2016 contest theme, Exploration, Encounter, Exchange in History.
There’s still time to register for the 2015 LITA Forum at the early bird rate and save $50
November 12-15, 2015
LITA Forum early bird rates end September 30, 2015
Join us in Minneapolis, Minnesota, at the Hyatt Regency Minneapolis for the 2015 LITA Forum, a three-day education and networking event featuring 2 preconferences, 3 keynote sessions, more than 55 concurrent sessions and 15 poster presentations. This year’s Forum includes content and planning collaboration with LLAMA.
Check out the report from Melissa Johnson. It details her experience as an attendee, a volunteer, and a presenter. This year, she’s on the planning committee and attending. Melissa says what most people don’t know is how action-packed and seriously awesome this year’s LITA Forum is going to be. Register now to receive the LITA and LLAMA members early bird discount:
- LITA and LLAMA member early bird rate: $340
- LITA and LLAMA member regular rate: $390
The LITA Forum is a gathering for technology-minded information professionals, where you can meet with your colleagues involved in new and leading edge technologies in the library and information technology field. Attendees can take advantage of the informal Friday evening reception, networking dinners and other social opportunities to get to know colleagues and speakers and experience the important networking advantages of a smaller conference.
Keynote speakers:
- Mx A. Matienzo, Director of Technology for the Digital Public Library of America
- Carson Block, Carson Block Consulting Inc.
- Lisa Welchman, President of Digital Governance Solutions at ActiveStandards.
Preconferences:
- So You Want to Make a Makerspace: Strategic Leadership to support the Integration of new and disruptive technologies into Libraries: Practical Tips, Tricks, Strategies, and Solutions for bringing making, fabrication and content creation to your library.
- Beyond Web Page Analytics: Using Google tools to assess searcher behavior across web properties.
Comments from past attendees:
“Best conference I’ve been to in terms of practical, usable ideas that I can implement at my library.”
“I get so inspired by the presentations and conversations with colleagues who are dealing with the same sorts of issues that I am.”
“After LITA I return to my institution excited to implement solutions I find here.”
“This is always the most informative conference! It inspires me to develop new programs and plan initiatives.”
See you in Minneapolis.
Each month, the LITA bloggers share selected library tech links, resources, and ideas that resonated with us. Enjoy – and don’t hesitate to tell us what piqued your interest recently in the comments section!
- Amanda Rinehart writes about the emotional side of data services.
- There are four new DPLA service hubs, including my very own Wisconsin!
- Git lit, a new project to parse, version control, and post to GitHub 68,000 texts from the British Library.
- Ted Underwood on digital humanities and institutions.
I’m mixing things up this month and have been reading a lot on…
- Engaging students in the classroom via FREE tools compiled by Jonathan Wylie, who is a Technology Consultant at Grant Wood AEA.
- The Data Information Literacy project funded by an IMLS grant and supported by institutions such as Purdue University and the University of Minnesota.
- The Chronicle of Higher Education’s report on the credentialing craze for recognizing learning that is not degree-based.
Hopefully this isn’t all stuff you’ve all seen already:
- Given that I’ve talked about it before, here’s how to start a 3D printing program at your library.
- Here’s a quick article on libraries circulating WiFi hotspots.
- I’m hoping to start doing this at my library: Public libraries embrace self-publishing services.
- Finally, from David Lee King, a list of social media policies you can co-opt for your library.
These are all over the place, as I’ve been bouncing back and forth between the multiple interests I’ve been dipping into.
- Why UX is not about Design by Patrizia Bertini; discuss
- I really enjoyed this article, Start Small: In search of the minimum viable product. It talks about creating something that solves a problem (and introduced me to Unsplash, a CC0-licensed image site). I felt the principle could be applied beyond building a product, for example to managing a team.
- Better Sharing Through Licenses? Measuring the Influence of Creative Commons Licenses on the Usage of Open Access Monographs. I read this after a recent keynote by Amy Buckland about Open Access; it also triggered my post on Creative Commons licensing.
- SonicPi – really cool free live coding synth to turn data sets into music. Major thanks to a Hackfest run by William Denton (https://www.miskatonic.org/music/access2015/)