Researchers, of varying technical abilities, are increasingly applying data science tools and methods to digital collections. As a result, new ways are emerging for processing and analyzing the digital collections’ raw material — the data.
For example, instead of pondering a single digital item at a time – such as a news story, photo or weather report – a researcher can process massive quantities of digital items at once to find patterns, trends and connections. Visualizing such results can be revelatory. Ian Milligan, assistant professor in the Department of History at the University of Waterloo, said, “Visualizations can, at a glance, tell you more than if you get mired down in the weeds of reading document after document after document.”
The NEH Chronicling America Data Challenge is an example of building data visualizations and tools from a large, publicly available data set. Recently, the National Endowment for the Humanities invited people to “Create a web-based tool, data visualization or other creative use of the information found in the (Library of Congress’s) Chronicling America historic newspaper database.” The results are diverse and imaginative. According to the NEH website,
- “America’s Public Bible tracks Biblical quotations in American newspapers to see how the Bible was used for cultural, religious, social, or political purposes”
- “American Lynching…explores America’s long and dark history with lynching, in which newspapers acted as both a catalyst for public killings and a platform for advocating for reform”
- “Historical Agricultural News, a search tool site for exploring information on the farming organizations, technologies, and practices of America’s past”
- “Chronicling Hoosier tracks the origins of the word ‘Hoosier’”
- “USNewsMap.com discovers patterns, explores regions, investigates how stories and terms spread around the country, and watches information go viral before the era of the internet”
- “Digital APUSH…uses word frequency analysis…to discover patterns in news coverage.”
The explicit purpose of the Library of Congress’s Archives Unleashed 2.0 Web Archive Datathon was exploratory — open-ended discovery. The data came from a variety of sources, such as the Internet Archive’s crawl of the Cuban web domain. Participants divided into teams, each with a general objective for what to do with the data. The room bustled with people clacking laptop keys, poking at screens, bunching around whiteboards and scrawling rapidly on easel pads. At one table, a group queried website data related to the Supreme Court nominations of Justice Samuel Alito and Justice John Roberts. They showed off a word-cloud view of their results and pointed out that the cloud for the archived House of Representatives websites prominently displayed the words “Washington Post,” while the word cloud for the Senate prominently displayed “The New York Times.” The group was clearly excited by the discovery. This was solid data, not conjecture. But what did it mean? And who cares?
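The counting behind such a word cloud is simple in principle. Here is a minimal sketch in Python; the sample pages and the stopword list are invented for illustration, and a real pipeline would extract text from archived HTML first:

```python
from collections import Counter
import re

def top_terms(documents, stopwords=frozenset({"the", "a", "of", "and", "to"}), n=5):
    """Count word frequencies across a set of harvested pages.

    A word cloud is essentially a visual rendering of these counts:
    the more often a term appears, the larger it is drawn.
    """
    counts = Counter()
    for text in documents:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)

# Toy stand-ins for pages harvested from congressional websites
house_pages = [
    "The Washington Post reported on the nomination.",
    "Quoting the Washington Post coverage of the hearings.",
]
print(top_terms(house_pages))
```

Running the same count over each chamber's archived pages and comparing the top terms is what surfaced the Post-versus-Times pattern at the datathon.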
Well, it was an intriguing fact, one that invites further research. And surely someone might be curious enough to research it someday to figure out the “why” of it. And the results of that research, in turn, might open a new avenue of information that the researcher didn’t know was available or even relevant.
Events such as hackathons and the upcoming Collections as Data conference bring together librarians, archivists, digital humanities professionals, data scientists, artists, scholars — people from disparate backgrounds, evidence that computation of large data sets in research is blurring the lines between disciplines. There are a lot of best practices to be shared.
A variety of digital research centers, scholars’ labs, digital humanities labs, learning labs and visualization labs are opening in libraries, universities and other institutions. But, despite their variety, these data labs are coalescing around identifiable, standardized components that include
- A work space
- Hardware resources
- Network access
- Databases and data sets
- Teaming researchers with technologists
- Powerful processing capability
- Software resources and tools
- Repositories for end-result data sets.
A work space
A quiet room or rooms should be available for brainstorming. Whiteboards and easel pads enable people to quickly jot down ideas and diagram half-formed thoughts. A brain dump, no matter how unfocused, contains bits of value that may clump into solid ideas and strategies. The room also needs enough tables, chairs and power outlets.
Hardware resources
The lab should provide computer workstations, monitors, laptops, conference phones and possibly a webcam for video teleconferencing.
Network access
Because of the constant flow of network requests and transactions, some of which move potentially large files, Wi-Fi must be consistent and reliable, and wired networks should be optimized for the highest bandwidth possible.
Databases and data sets
The data may need to be cleaned. Web harvesting, for example, grabs almost everything related to the seed URL – even with some filtering – and the archive often includes web pages that the researcher does not care about. Databases and data sets that will be accessed over the network should be small enough to move about easily; a researcher can also download large databases in advance of the scheduled work time.
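A first cleaning pass often just filters harvested records down to the URLs in scope. A minimal sketch, assuming records have already been extracted as (url, payload) pairs (real pipelines would read WARC files, and `example.gov` is a placeholder domain):

```python
import re

def clean_harvest(records, keep_pattern=r"^https?://(www\.)?example\.gov/"):
    """Drop harvested records whose URLs fall outside the scope of interest.

    `records` is a list of (url, payload) pairs pulled from a crawl.
    Trackers, ad servers, and off-site assets swept up by the harvester
    are discarded before analysis begins.
    """
    keep = re.compile(keep_pattern)
    return [(url, payload) for url, payload in records if keep.match(url)]

records = [
    ("https://example.gov/press/2006/alito.html", "..."),
    ("https://ads.tracker.example.net/pixel.gif", "..."),
    ("https://www.example.gov/hearings/index.html", "..."),
]
print(len(clean_harvest(records)))  # the two in-scope records survive
```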
Teaming researchers with technologists
In a complementary collaboration between a researcher or subject matter expert and an information technologist, the researcher conveys what she would like to query the data for and the technologist makes it happen. The researcher may analyze the results and suggest ways for the technologist to refine them. Some workshops, such as Ian Milligan’s web archiving analysis workshop, require their researchers to take a Data Carpentry workshop, an overview of the computation, programming and analysis methods a data researcher might need. The researcher could then either conduct data analyses herself or become conversant enough in data analysis methods to better understand her options and communicate with the technologist.
Powerful processing capability
Processing large data sets places a heavy load on computational power, so a lab needs ample processing muscle. At the Archives Unleashed event, it took one group ten hours to process their query. Milligan is a big proponent of cloud processing and storage, using powerful network systems supported and maintained by others. He said, “We started out using machines on the ground and we found the issue was to have real sustainable storage that’s backed up and not risky to use, that’s going to have to live in network storage anyway. We found that we’re moving data all over the place and we do some of our stuff on our server itself and when we have to spin up other machines, it’s so much quicker to actually move stuff — especially when you’re working with terabytes of web archives — until you get to that last mile of the actual Ethernet cable coming to your workstation. That’s turning out to be the mass bottleneck. Our federation of provincial computing organizations has a sort of Amazon AWS-esque dashboard where we can spin up virtual machines. We have a big server at York University and we sometimes use Jimmy Lin’s old cluster down at Maryland. So the physical equipment turns out not to be that important when we have so many network resources to draw on.”
Software resources and tools
As data labs spring up, newer and better tools are appearing too. Data labs may offer a gamut of tools for
- Content and file management
- Data visualization
- Document processing
- Geospatial analysis
- Image editing
- Moving image editing
- Network analysis
- Text mining
- Version control
The Digital Research Tools site is a good place to begin for a comprehensive overview of available tools.
Repositories for end-result data sets
The data set at the end of a project may be of value to other researchers and the researcher might want her project to be discoverable. The data set should include metadata to describe the project and how to repeat the work in order to arrive at the same data set and conclusions. The repository where the data set resides should have long-term preservation reliability.
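As a sketch of what such descriptive metadata might look like, here is a hypothetical record for a deposited end-result data set. The field names follow no particular metadata standard; they simply illustrate the kind of information needed to find the project and repeat the work:

```python
import json

# Hypothetical descriptive record for a deposited end-result data set.
record = {
    "title": "Supreme Court nomination coverage: term frequencies by chamber",
    "creator": "Jane Researcher",
    "source_collection": "Library of Congress congressional web archives",
    "processing_steps": [
        "filter crawl to House and Senate domains",
        "extract plain text from HTML",
        "compute per-chamber term frequencies",
    ],
    "software": "custom Python scripts, deposited alongside the data",
    "license": "CC0-1.0",
}
print(json.dumps(record, indent=2))
```

Listing the processing steps (and archiving the scripts themselves) is what lets a later researcher re-run the work and arrive at the same data set.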
Data science is drifting out of the server rooms and into the general public. The sharp differences among professions and areas of interest are getting fuzzier all the time as researchers increasingly use information technology tools. Archaeologists practice 3D modelling. Humanities scholars practice data visualization. Students of all kinds query databases.
For the near future, interacting with data is a specialized skill that requires a basic understanding of data science and knowledge of its tools, whether through training or teaming up with knowledgeable technologists. But eventually the relevant instruction should be made widely available, whether in person or by video, and tools need to be simplified, especially as API-enabled databases proliferate and more sources of data become available.
In time, computationally enhanced research will not be a big deal and cultural institutions’ data resources and growing digital collections will be ready for researchers to access, use and enjoy.
In this episode of the podcast, user experience and interface designer Mark Dodgson joins me to talk about the kind of work Bluespark — a design agency that has sort of found itself popular in the library and higher-ed web niche — does. I kind of let him just talk about their process. It’s pretty fascinating and super instructive.
Notes
- Indiana University Libraries and their descending hero search
- How Changing our Estimation Process Took our Project Endgame from WTF? to FTW!
Help us out and say something nice. Your sharing and positive reviews are the best marketing we could ask for.
This blog post is part of our summer series featuring chapter updates from across the Open Knowledge Network and was written by the team of Open Knowledge Ireland.
What is OK Ireland and what do we do?
Open Knowledge Ireland is a team of 9 volunteers who envision an information age where everyone, not just a few, has access to and the ability to use the massive amounts of information and data generated by entities such as our government or public service.
We believe everyone should have access to this information and data to be able to make better decisions, receive better services and ensure money is spent in the right places. Our goal is to make taxpayer-supported information openly available, so that it can be used and re-used without the public having to pay for it again.
In so doing we want to ensure that vital research can happen. We want people to be able to leverage information to hold powerful institutions to account, whether in health care, the charity sector, or through Freedom of Information requests in the public service.
In June we organised and ran an event dedicated to Knowledge Preservation in the 21st century: https://ti.to/open-knowledge-ireland/knowledge-preservation/. The event was attended by 20 enthusiasts. Kalpana Shankar, Stan Nazarenko and Rufus Pollock shared their visions of how knowledge and information can and should be preserved today and what the current challenges are. (Photos: https://www.flickr.com/photos/139932355@N08/sets/72157669330777481)
In August we were delighted to help our friends and colleagues from OpenStreetMap to map the Kingdom of Lesotho.
To see a list of our past events click here.
A notable highlight from the last few months has been our work on hospital waiting list data. For a more extensive look at the activities we have initiated, see here.
In May we presented the findings from our Hospital Waiting List Project at the all-Ireland conference ‘Knowledge For Health’ organised by the Institute of Public Health (IPH), which operates on both sides of the island of Ireland. The reason we took on this project is that people with illnesses requiring them to visit a hospital (bad enough in itself!) are currently waiting up to 18 months and more to be seen by specialist doctors and consultants. No one in Europe should have to wait so long for a consultation on what may prove to be a severe or life-threatening illness.
As a way of reducing waiting times to see specialist doctors in hospitals, we would like waiting times to be publicly (= ‘openly’) available so that the public, journalists, and social media can hold service providers accountable where waiting times are unusually high. This would also allow experts to use the data to perform sophisticated analyses that could help improve waiting lists.
While advocating for open data, we realise that for the data to be useful and to help answer real questions, users need to be sure that the data is authentic and that it will be accessible tomorrow or ten years from now. We believe that the InterPlanetary File System (IPFS) has great potential to facilitate the preservation of the authenticity and accessibility of public data.
IPFS is a peer-to-peer distributed file system that seeks to connect all computing devices with the same system of files where each file is given a unique fingerprint called a cryptographic hash.
IPFS provides historical versioning (like git) and makes it simple to set up resilient networks for mirroring of data.
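The content-addressing idea at the heart of IPFS can be sketched in a few lines. Note this is a simplification: real IPFS identifiers are base58-encoded multihashes (the Qm… string below), not bare SHA-256 hex digests, but the core property is the same:

```python
import hashlib

def content_address(data: bytes) -> str:
    """Return a fingerprint derived from the bytes themselves.

    Because the address depends only on the content, any change to the
    data yields a different address, and any copy fetched from any peer
    can be verified against the address it was requested by.
    """
    return hashlib.sha256(data).hexdigest()

a = content_address(b"waiting list, hospital X, May 2016")
b = content_address(b"waiting list, hospital X, June 2016")
print(a != b)  # different content, different address
```

This is why content addressing supports both authenticity (a tampered file no longer matches its address) and long-term accessibility (any surviving mirror can serve a verifiable copy).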
At the conference, we demonstrated that the hospital waiting list data could and should be permanently and publicly available via the IPFS. See here for the examples of hospital waiting list data we presented. (https://ipfs.io/ipfs/QmT66oHDwzb8dU5vnZt3Ez5aStcWCjbqjNE2pA25ShTjmM/)
What are we working on next?
Plans for the future:
- Hospital Waiting List: OK Ireland continues to work with the Irish government on making Hospital Waiting List data open, linking it with Wikimedia data, and displaying it on OpenStreetMap. As all of us can and might become ill, we believe that making health data accessible and comprehensible to everyone is the best way to demonstrate the potential value of open data.
We aim to get existing data on waiting times released into Data.gov.ie – to do so it is likely that a tender will have to be announced to get this work under way. We are therefore looking to draft what this work might look like and what a project plan & costs would look like.
- Developing a sustainable fundraising strategy: We are struggling, as are many non-profits, to secure funds. Are there proven methods & tools that the Open Knowledge International Network could share to support us in developing a strategic plan for fundraising? For example, how could we leverage prominent personalities on the global level locally? Where should a strategic fundraising plan focus? And how do we go about sustaining a constant output of fundraising applications?
And our upcoming events:
- During Open Access Week (October 24–30, 2016) Open Knowledge Ireland and the Institute of Public Health (IPH) are co-organising an event which is dedicated to Open Data, Open Access, and Social Justice. The event will take place on Tuesday, 25 October at Pearse Street Library. More information to follow.
If you want to contact us:
If you found the above interesting and/or want to learn more about anything we talked about here, please feel free to email, tweet, or facebook us.
DuraSpace News: REMINDER Call for Expressions of Interest in Hosting Open Repositories Conference: 2018 and 2019
From William Nixon and Elin Stangeland on behalf of the Open Repositories Steering Committee
Glasgow, Scotland – The Open Repositories Steering Committee seeks Expressions of Interest (EoI) from candidate host organizations for the 2018 and 2019 Open Repositories Annual Conference series. The call is issued for two years this time to enable better planning ahead of the conferences and to secure a good geographical distribution over time. Proposals from all geographic areas will be given consideration.
FOR IMMEDIATE RELEASE
Duluth, Georgia–September 13, 2016
Equinox is happy to announce that yet another library has successfully migrated to Virginia Evergreen. Halifax County-South Boston Public Library System is the ninth library system to join Virginia Evergreen, which boasts close to thirty branches in total. Halifax County-South Boston includes two branches: Halifax Public Library and South Boston Public Library.
Jay Stephens, Director of Halifax Public Library, remarked, “It has been a pleasure working with Equinox. Everyone is very knowledgeable and willing to share that knowledge to help out.” He later added, “The training was awesome! Mary rocks!”
In response, Equinox Training Service Librarian Mary Jinglewski had this to say about the migration: “I greatly enjoyed training with Halifax County South Boston Library System. They have a great community of caring staff members and I am excited that they’ll be a part of Virginia Evergreen moving forward!”
Equinox handled the migration from start to finish and will continue to support Halifax County-South Boston along with the rest of Virginia Evergreen.
About Equinox Software, Inc.
Equinox was founded by the original developers and designers of the Evergreen ILS. We are wholly devoted to the support and development of open source software in libraries, focusing on Evergreen, Koha, and the FulfILLment ILL system. We wrote over 80% of the Evergreen code base and continue to contribute more new features, bug fixes, and documentation than any other organization. Our team is fanatical about providing exceptional technical support. Over 98% of our support ticket responses are graded as “Excellent” by our customers. At Equinox, we are proud to be librarians. In fact, half of us have our ML(I)S. We understand you because we *are* you. We are Equinox, and we’d like to be awesome for you. For more information on Equinox, please visit http://www.esilibrary.com.
Evergreen is an award-winning ILS developed with the intent of providing an open source product able to meet the diverse needs of consortia and high transaction public libraries. However, it has proven to be equally successful in smaller installations including special and academic libraries. Today, over 1,500 libraries across the US and Canada are using Evergreen including NC Cardinal, SC Lends, and B.C. Sitka.
For more information about Evergreen, including a list of all known Evergreen installations, see http://evergreen-ils.org.
It all started in Boston…
In 2010, for the inaugural Lucene Revolution in Boston MA, I tried to weasel out of giving a prepared talk by proposing a Live Q&A style session where I’d be put on the spot with tough, challenging, unusual questions about Solr & Lucene — live, on stage. I don’t remember what my original session title was, but the conference organizer realized it sounded a lot like the “Stump The Chump” segment of the popular “Car Talk” radio show, hosted by Boston’s own Click & Clack, and insisted that be the title we use.
If you’ve never seen our version of “Stump the Chump,” it’s a little different than Click & Clack’s original radio call-in format. In addition to being live in front of hundreds of rowdy convention goers, we also have a panel of judges who have had a chance to see and think about many of the questions in advance, because folks like you are free to submit questions via email prior to the conference (even if you can’t attend in person). The judges take every opportunity to mock The Chump (i.e., me) anytime I flounder, and ultimately the panel will award prizes to the people whose questions do the best job of “Stumping The Chump.”
As my boss Cassandra (a Boston native, and this year’s Stump the Chump moderator) would say: “It’s a Wicked Pissa!”
You can see for yourself by checking out the videos from the past events like Lucene/Solr Revolution 2015 in Austin TX, or Lucene/Solr Revolution Dublin 2013. If you want a real blast from the past, check out the video from the last time “Stump The Chump” was in Boston: Lucene Revolution 2012. (Regrettably, there is no video from that first Stump The Chump in 2010)
(And don’t forget to register for the conference ASAP if you plan on attending! The registration price will be increasing on September 16th.)
Presenter: Jaclyn McKewan
Tuesday November 8, 2016
11:00 am – 12:30 pm Central Time
This course has been re-scheduled from a previous date.
Become a lean, mean productivity machine!
In this 90 minute webinar we’ll discuss free online tools that can improve your organization and productivity, both at work and home. We’ll look at to-do lists, calendars, and other programs. We’ll also explore ways these tools can be connected, as well as the use of widgets on your desktop and mobile device to keep information at your fingertips. Perfect for any library workers who spend a significant portion of their day at a computer.
Webinar takeaways will include:
- Keep track of regular repeating tasks by letting your to-do list remember for you
- Connect your calendars and to-do lists
- Use mobile and desktop widgets to keep information at your fingertips
Jaclyn McKewan is the Digital Services Coordinator at WNYLRC, where she has worked since 2008. Her job duties include managing the Ask Us 24/7 virtual reference program, New York Heritage Digital Collections, and internal networking/IT.
Social Media For My Institution: from “mine” to “ours”
Instructor: Plamen Miltenoff
Starting Wednesday October 19, 2016, running for 4 weeks
Register Online, page arranged by session date (login required)
Beyond Usage Statistics: How to use Google Analytics to Improve your Repository
Presenter: Hui Zhang
Tuesday, October 11, 2016
11:00 am – 12:30 pm Central Time
Register Online, page arranged by session date (login required)
Questions or Comments?
For questions or comments, contact LITA at (312) 280-4268 or Mark Beatty, firstname.lastname@example.org
I sometimes hear about archives which scan for and remove malware from the content they ingest. It is true that archives contain malware, but this isn't a good idea:
- Most content in archives is never accessed by a reader who might be a target for malware, so most of the malware scan effort is wasted. It is true that increasingly these days data mining accesses much of an archive's content, but it does so in ways that are unlikely to activate malware.
- At ingest time, the archive doesn't know what aspects of the content future scholars will be interested in. In particular, it doesn't know that the scholars aren't studying the history of malware. By modifying the content during ingest, it may be destroying its usefulness to future scholars.
- Scanning and removing malware during ingest doesn't guarantee that the archive contains no malware, just that it doesn't contain any malware known at the time of ingest. If an archive wants to protect readers from malware, it should scan and remove it as the preserved content is being disseminated, creating a safe surrogate for the reader. This will guarantee that the reader sees no malware known at access time, likely to be a much more comprehensive set.
See, for example, the Internet Archive's Malware Museum, which contains access surrogates of malware which has been defanged.
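A toy sketch of that access-time defanging follows. The signature list and the replace-based "scan" are stand-ins for a real, regularly updated anti-malware engine, and the byte string only loosely resembles the EICAR test pattern; it is purely illustrative:

```python
def known_malware_signatures():
    # Stand-in for a regularly updated signature feed. Scanning at access
    # time means the reader is protected against everything known today,
    # not just what was known at ingest.
    return [b"EICAR-STANDARD-ANTIVIRUS-TEST"]

def safe_surrogate(preserved_bytes: bytes) -> bytes:
    """Build a defanged copy for the reader; the preserved original is untouched."""
    out = preserved_bytes
    for signature in known_malware_signatures():
        out = out.replace(signature, b"[removed: known malware]")
    return out

original = b"<html>EICAR-STANDARD-ANTIVIRUS-TEST</html>"
surrogate = safe_surrogate(original)
print(surrogate)
```

The key design point is that dissemination produces a surrogate while the archive keeps the original bytes intact, preserving them for scholars who might one day study the malware itself.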
The rules are simple:
- Find your favorite piece of copyright-free material from DPLA, Europeana, Trove, or DigitalNZ
- Create a sweet gif
- Submit it for a chance to win some nifty prizes
To find out more about the 2016 competition, including available prizes and submission rules, visit https://dp.la/info/gif-it-up/. In the lead up to the October 1st kick-off, here are some fun and easy ways that you can start your source material exploration and build your gif-making skills.
Join our free gif-making workshops
Interested in participating in this year’s competition but aren’t sure how to make a gif? Looking to sharpen your existing gif-making skills with some more advanced techniques? Look no further! We’ve enlisted the help of some gif experts to teach you how to get started with gifs using open materials and beyond.
Workshop #1: GIF-Making 101, Wed, September 21, 3pm – 4pm Eastern
Ever wondered how to make an animated gif? Join gif-making experts Shaelyn Amaio (Consultant at Lord Cultural Resources) and Derek Tulowitzky (Web, Social Media, and Outreach Manager at the Muncie Public Library) for an hour long webinar workshop on how to make gifs using open materials found in DPLA and other digital libraries. The workshop will cover what gifs are, how to find suitable materials in DPLA and elsewhere, and how to make a simple gif. This workshop is the first part of a two-part series leading up to the GIF IT UP 2016 competition (October 1-31, 2016). Part two will cover advanced gif-making techniques. Attendees are encouraged but not required to attend both sessions.
Join us for a hands-on, hour long workshop on how to use photo editing software to perform advanced gif-making techniques, such as how to use frame animation in order to make objects disappear and then reappear, move around, and change color. This workshop will be led by two seasoned gif-making vets, Richard Naples (Outreach and Education Technical Information Specialist at the Smithsonian Institution) and Darren Cole (Digital Engagement Specialist at the National Archives and Records Administration’s Office of Innovation). This workshop is part two of a two-part series leading up to the GIF IT UP 2016 competition (October 1-31, 2016). Part one will provide a basic introduction to gifs and the materials used to make them. Attendees are encouraged but not required to attend both sessions.
GIF IT UP is all about exploring DPLA and the other participating digital libraries for the perfect piece of open content. If you’re not sure what type of material you should be looking for when creating a gif, here are some helpful suggestions to get you started.
That’s just a taste of the types of materials that can be found in DPLA and the other participating digital libraries. To explore the many open collections available for the competition, check out our list of select public domain and open collections for re-use.
Need Inspiration? Check out past competition galleries
This is the third year of GIF IT UP, so we have an awesome array of gifs from our previous couple of competitions that may help get your creative juices flowing.
2015
Following the inaugural GIF IT UP in 2014, the competition returned in 2015, seeking innovative and endlessly looping uses of archival videos and images. In 2015 the challenge expanded internationally with support from Europeana and Trove and featured an esteemed line-up of judges and cool prizes. Check out last year’s winning gifs.
2014
Over the course of Fall 2014, DPLA and DigitalNZ held GIF IT UP, an international competition to find the best GIFs reusing public domain and openly licensed digital video, images, text, and other material available via our search portals. Check out the 2014 winning gifs.
The proposals submission deadline for LITA programs at the 2017 ALA Annual conference has been extended two weeks until September 23, 2016.
The LITA Program Planning Committee (PPC) is now accepting innovative and creative proposals for the 2017 Annual American Library Association Conference. We’re looking for 60- and 90-minute conference presentations. In addition to program session proposals, we are also eager to see your proposals for half-day or full-day preconferences to help participants develop skills through interactive learning. The focus should be on technology in libraries, whether that’s use of, new ideas for, trends in, or interesting/innovative projects being explored – it’s all for you to propose.
When and Where is the Conference?
The 2017 Annual ALA Conference will be held in Chicago, IL, from June 22nd through 27th.
What kind of topics are we looking for?
We’re looking for programs of interest to all library/information agency types that inspire technological change and adoption, and/or generally go above and beyond the everyday.
We regularly receive many more proposals than we can program into the 20 slots available to LITA at the ALA Annual Conference. These great ideas and programs all come from contributions like yours. We look forward to hearing the great ideas you will share with us this year.
This link from the 2016 ALA Annual conference scheduler shows the great LITA programs from this past year.
When are proposals due?
September 23, 2016
How do I submit a proposal?
Fill out this form bit.ly/litacfpannual2017
Program descriptions should be 150 words or less.
When will I have an answer?
The committee will begin reviewing proposals after the submission deadline; notifications will be sent out on October 3, 2016.
Do I have to be a member of ALA/LITA? or a LITA Interest Group (IG) or a committee?
No! We welcome proposals from anyone who feels they have something to offer regarding library technology. Unfortunately, we are not able to provide financial support for speakers. Because of the limited number of programs, LITA IGs and Committees will receive preference where two equally well-written programs are submitted. Presenters may be asked to combine programs or work with an IG/Committee where similar topics have been proposed.
Got another question?
Please feel free to email Nicole Sump-Crethar (PPC chair) (email@example.com)
The Digital Public Library of America is pleased to announce that Michael Della Bitta is joining its staff as Developer for Data and Usage Analytics, beginning September 12, 2016.
In this role, Della Bitta will work with DPLA’s Technology Team to process, evaluate, and share information about how DPLA’s diverse collections are discovered, used and shared through our user-facing platforms, social media and APIs. Della Bitta will also play a key role in improving data ingestion systems and supporting DPLA’s ongoing work to assess and provide meaningful feedback to our partners about data quality to enhance discoverability and use of collections internally, across our partner network, and by our broad community of users.
“Michael’s experience in digital libraries, high volume data processing and analysis, and interest in serving the cultural heritage sector make him a valuable member of the DPLA Team,” said Director of Technology Mark Matienzo. “I believe his background will serve him well in making our operations run more smoothly and efficiently, and will benefit the DPLA Network as a whole.”
Prior to joining DPLA, Michael worked in software development and publishing in the startup, library, and education spaces for nearly twenty years. He most recently worked as a data and analytics developer, architect, and engineering manager at the content marketing company ScribbleLive. Before that, he was a developer and architect on the repository and Digital Gallery teams at The New York Public Library, and built content management, online learning, and semantic metadata applications at Columbia University. Michael holds a B.A. in Philosophy from Bates College.
The 10th Islandora CLAW Community Sprint finished up last week. Running August 22nd to September 5th, this sprint was mostly about learning and design, with "homework" tickets to read up on specifications, and long discussions about how various pieces of CLAW should work. You can do a little homework of your own and follow the discussions about ORE and IIIF.
The MVP for this sprint was Everyone. We had some really great discussions, both in GitHub issues and via IRC (#islandora on freenode).
Danny Lamb (Islandora Foundation)
Nick Ruest (York University)
Jared Whiklo (University of Manitoba)
Diego Pino (Metro.org)
Melissa Anez (Islandora Foundation)
Ed Fugikawa (University of Wyoming)
Nat Kanthan (University of Toronto Scarborough)
Kirsta Stapelfeldt (University of Toronto Scarborough)
Kim Pham (University of Toronto Scarborough)
Bryan Brown (Florida State University)
Next up is CLAW Sprint 11, running September 19th – October 3rd. A few issues are listed here, with more to come. Non-developers may be interested in signing on for Homework Ticket #360, where we will be exploring the Drupal 8 UI. You can sign up for the sprint here.
This blog post is part of our summer series featuring updates from chapters across the Open Knowledge Network and was written by the team of Open Knowledge Austria.
The last two months have been very vibrant for Open Knowledge Austria. We co-organized the monthly Vienna Open Data Meetup, but made no other public appearances because we were busy setting up some major projects. We also held our bi-annual plenary meeting and the election of the new board.
First the projects:
- Our project OpenDataPortal, in cooperation with Wikimedia AT, got funded by the Austrian Ministry for Mobility, Innovation, and Technology (BMVIT). In the so-called “Data Pioneers” project, we work with companies on open innovation strategies around using and sharing open data. The central goal is to work out use-cases and narratives for companies in order to get them to open up some of their own data. We will organize two workshops and one hackathon in the next months and we will guide the companies during the opening process.
- We will organize our second data literacy event for kids, for the first time under the branding of the German project “Jugend hackt”. The 3-day event will take place in Linz at the beginning of November and will show children between 12 and 18 how to code with open data.
Second, our governance:
The brand new board of Open Knowledge Austria for the next two years is:
- Stefan Kasberger – @stefankasberger, firstname.lastname@example.org
- Christopher Kittel – @chris_kittel, email@example.com
- Clara Landler – @clara_l, firstname.lastname@example.org
They are now planning the activities for the next half year, setting up one working group for a funding strategy and one for a community strategy, and working out a guideline for better handling of projects. The current situation in numbers does not look good: we have 1 person employed for 5 hours a week, a few thousand euros to survive the next months, and about 30 volunteer members. The good thing: it’s getting better each day, but there are still huge challenges in front of us.
The next months will be the most active time of Open Knowledge Austria so far. The very active and well-organized Open Science working group will organize a hackathon and a meetup around OpenKnowledgeMaps and disseminate their past activities and involvements, like the Vienna Principles and the copyright recommendations from the OANA (Open Access Network Austria) working groups. Additionally, Michela Vignoli, who coordinates the Open Science group alongside Peter Kraker, was nominated to the EU Commission’s Open Science Policy Platform for her involvement in the YEAR network and will also represent the interests of the Open Science community.
We will also start an Austrian City Open Data Census, and, prompted by the disastrous last presidential elections, start “Offene Wahlen Österreich”, a project about Open Data in elections. Alongside the already mentioned Jugend hackt and BMVIT events, we will as usual co-organize the Vienna Open Data Meetups, and co-organize a panel about net-political processes at the European level at the Elevate Festival in Graz. And last but not least: the above-mentioned working groups on funding and community strategy will start their activity; input is welcome.
In terms of collaboration, we can offer expertise in the fields of Open Science, Knowledge Discovery, Content Mining, Open Data Repositories, Data Literacy, and Data Science Training. If there is interest in the outcomes of the funding and community strategy work, just ask, and we will try to translate them when they are done. In general, we are always happy about international cooperation and are looking forward to requests and feedback from other Open Knowledge chapters.
This competition will be an opportunity for the next wave of developers to show their skills to the world — and to companies like ours. — Dick Hardt, ActiveState (quote taken from SC Track page)
All code contains bugs, and all projects have features that users would like but which aren’t yet implemented. Open source projects tend to get more of these as their user communities grow and start requesting improvements to the product. As your open source project grows, it becomes harder and harder to keep track of and prioritise all of these potential chunks of work. What do you do?
The answer, as ever, is to make a to-do list. Different projects have used different solutions, including mailing lists, forums and wikis, but fairly quickly a whole separate class of software evolved: the bug tracker, which includes such well-known examples as Bugzilla, Redmine and the mighty JIRA.
Bug trackers are built entirely around such requests for improvement, and typically track them through workflow stages (planning, in progress, fixed, etc.) with scope for the community to discuss and add various bits of metadata. In this way, it becomes easier both to prioritise problems against each other and to use the hive mind to find solutions.
Unfortunately most bug trackers are big, complicated beasts, more suited to large projects with dozens of developers and hundreds or thousands of users. Clearly a project of this size is more difficult to manage and requires a certain feature set, but the result is that the average bug tracker is non-trivial to set up for a small single-developer project.
The SC Track category asked entrants to propose a better bug tracking system. In particular, the judges were looking for something easy to set up and configure without compromising on functionality.
The winning entry was a bug tracker called Roundup, proposed by Ka-Ping Yee. Here we have another tool which is still in active use and development today. Given that there is now a huge range of options available in this area, including the mighty GitHub, this is no small achievement.
These days, of course, GitHub has become something of a de facto standard for open source project management. Although ostensibly a version control hosting platform, each GitHub repository also comes with a built-in issue tracker, which is well integrated with the “pull request” workflow system that allows contributors to submit bug fixes and features themselves.
GitHub’s competitors, such as GitLab and Bitbucket, include similar features. Not everyone wants to work this way, though, so it’s good to see that there is still a healthy ecosystem of open source bug trackers, and that Software Carpentry is still having an impact.
I’m so insanely happy to be at the 2016 edition of the Ars Electronica Festival. I’ve wanted to attend for a long time and this year things came together. The festival is as good as I expected.
The scale is large – seemingly endless talks, workshops, and exhibits spread throughout the city of Linz (Austria). I won’t attempt any type of comprehensive overview but will share my personal highlights.
Portrait on the fly by Christa Sommerer and Laurent Mignonneau is an interactive piece – stand in front of it (a monitor with a camera on top) and you see an outline of yourself. You quickly realize the outline is made of buzzing flies. The monitor/camera is surrounded by fantastic printed portraits generated by the installation.
Body Pressure (this is a placeholder name until I track down the correct name) – lay down on a large deflated bag and feel it lift you toward another inflating bag. The two inflated forms come together gently squeezing you between. This piece is beautiful.
Face Cartography by Daniel Boschung – a robot moves a camera around to 600 different vantage points of a subject’s face. The photos are stitched together into a tremendously high resolution photo. The shoot takes 20 minutes to produce one portrait – is this a snapshot?
BitterCoin by Martin Nadal and Cesar Escudero Andaluz – an incredibly slow but deeply charming bitcoin miner made from an old calculator.
Loopers by Yasuaki Kakehi and Michinari Kono – 12 magnetic worms crawl back and forth to create rhythmic clicks.
Docker containers are designed to be ephemeral. You can destroy one and spin up an exact replica in seconds. Everything that defines the container can be found in the Dockerfile that declares how to build it.
This model does not, however, explain what to do with persistent data, like databases or uploaded media. In a production environment, I would recommend delegating these tasks to an external service, like Amazon’s RDS or S3.
For local development, though, you can use volumes for storing persistent data. Volumes come in two main flavors: data volumes and host directory mounts.
The latter is perhaps the most straightforward. You connect a directory in your container to a directory on your host machine, so they are essentially sharing the file system. Indeed, when you’re actively working on code, this is the simplest way to share your local code with your running containers. Mount the root directory of your project as a volume in your container, and anytime you update code, your container will also have the updates.

# docker run --rm -it -v="/your/local/dir:/srv/www/public" nginx:stable-alpine /bin/sh
This runs a container that has its /srv/www/public directory shared with your host system’s /your/local/dir directory. Updates you make to files either on your local system or in the container are automatically shared with the other.
Data volumes do not map directly to your host filesystem. Docker stores the data somewhere, and you generally don’t need to know where that is. When using Docker for Mac, one of the key differences is that a host mount shares files using osxfs (which currently has some performance issues) while a data volume stores its data inside the Docker virtual machine (which is consequently much more performant for I/O). While I use host mounts for things like uploaded media, I prefer to use a data volume for storing databases.

# docker volume create --name=mysqldata
Once the volume is created, you can mount it into one or more containers.

# docker run --rm -v="mysqldata:/var/lib/mysql" mysql:5.5
The contents of our volume “mysqldata” will be available to MySQL in the /var/lib/mysql directory. The data volume itself doesn’t have a directory name (in contrast to the prior best practice of using a directory within a data-only container). I think of a volume as a single directory that can be mounted wherever I want in a container.
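Both flavors can also be declared together in a docker-compose.yml. Here is a minimal sketch of the same setup; the service names, host path, and the empty-password setting are illustrative assumptions, not taken from the post:

```yaml
version: "2"
services:
  web:
    image: nginx:stable-alpine
    volumes:
      # host directory mount: the container shares your working copy
      - /your/local/dir:/srv/www/public
  db:
    image: mysql:5.5
    environment:
      # local development only; the mysql image requires some password setting
      MYSQL_ALLOW_EMPTY_PASSWORD: "yes"
    volumes:
      # named data volume: Docker manages where the bytes actually live
      - mysqldata:/var/lib/mysql

volumes:
  mysqldata: {}
```

Because “mysqldata” is a named volume, the database contents survive `docker-compose down` and container rebuilds, while the nginx container always sees your latest local code.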
Three significant changes will be coming as part of MarcEdit’s Sunday update. These impact the MARCEngine, Regular Expression Processing, and the Linked Data Platform.
In May 2016, I had to make some changes to the MARCEngine to remove some of the convenience functions that allowed mnemonic processing to occur on UTF8 data, or HTML entity processing on MARC8 data. Neither of these should happen, and there were issues that came up related to titles that actually included HTML entities. So, this kind of processing had to be removed.
Over the past few months, I’ve been re-evaluating how these functions used to work, and have been bringing them back. Sunday will mark the reintroduction of many of these functions (though not the HTML entity translation when it’s not specifically appropriate). The upside is that, coupled with the new encoding work, the tool will be able to support more mixed-encoding use cases.
Regular Expression Processing
The most significant change to the Regular Expression engine is how multi-line processing works in the Replace Function. To protect users, I had set up the match-any character “.” to match all characters except the newline character. This was done to keep users from accidentally deleting data. However, it meant that the multi-line processing option really only worked when fields were side by side. By removing this limitation, users working with the multi-line option will now have full access to the entire record with Regular Expression processing. And with that, a word of warning: be careful. Multi-line processing is the easiest way to accidentally delete bibliographic data through greedy matches.
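MarcEdit’s engine is .NET, but the dot-versus-newline distinction works the same way in most regex flavors. As a quick illustration in Python (the sample text and patterns are hypothetical, not MarcEdit’s own), re.DOTALL plays the role of the multi-line mode described above:

```python
import re

# two pseudo-MARC "fields" on separate lines
text = "=245  Title field\n=500  Note field"

# Default mode: "." refuses to cross the newline, so a pattern
# spanning two fields finds nothing.
print(re.search(r"Title.+Note", text))             # no match (None)

# DOTALL is the analogue of the multi-line mode: "." now matches
# newlines too, so the pattern can span fields.
print(re.search(r"Title.+Note", text, re.DOTALL))  # match spans the line break

# The greedy-match hazard the post warns about: ".*" swallows
# everything between the first and last anchor, whole fields included.
print(re.search(r"=245.*field", text, re.DOTALL).group(0))
```

A replace built on that last pattern would silently consume the entire 500 field along the way, which is exactly the kind of accidental data loss the warning is about.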
Additionally, I’ve added an option to the Replace Function dialog that makes it easier to know that MarcEdit has this option. Right now, users have to know that you need to add a /m to the end of your expression to initiate the multi-line mode. You can still do that – but for users that don’t know that this is the case, a new option has been added to turn on the Multiline option (see below).
The MultiLine Evaluation will be enabled and selectable when the Use regular expressions option is checked.
Linked Data Platform Changes
The Linked Data Platform will be seeing two significant changes. The first change is occurring in the code to support linking fields like the 880. This has meant adding a new special processing instruction, “linking”, which can now be used to perform reconciliation against data in these fields. This is particularly important for non-Latin scripts: Asian languages, Arabic, Hebrew, etc.
The second change is in the rules file itself. I’ve profiled the 880 field, as well as a wide range of other collections.
Finally, I’ve added a note to the Main Window that helps users find their rules file for editing, and points to the knowledge-base articles and videos explaining the process. (see below)
Other changes that will be made in this update:
- ISSN Report – tweaked the process due to a bug.
- MARCCompare – added a new output type; in addition to the HTML output, there will be a diff file output
- Preferences Window/Language – Added a help icon that points to the knowledge-base articles related to fonts and font recommendations.
These changes will be part of the Sunday update, and at this point, look like they will be applicable to all versions of MarcEdit (though, I do have a few UI tweaks that I need to complete on the mac side).
If you have questions, let me know. Otherwise, these changes will be made available on 9/12/2016 [evening].
Chapter 2 of Nicolini (2012) provides a quick tour through a few thousand years of philosophy to highlight the deep roots of practice theory.
When describing the types of activity of the human mind, Aristotle added to Plato’s episteme (scientific knowledge) two more categories: phronesis (practical wisdom) and techne (art or skill). Phronesis in particular was a non-inferential, non-deductive and highly improvisational form of knowledge. Nicolini draws on the work of Nussbaum (1986) in his reading of Aristotle. He reminds us that Greek society at the time was highly segmented by slavery, and that it was a luxury of the ruling class to be able to dedicate one’s life to learning (episteme). Those who mastered practice, the artisans, were second-class citizens at best. So there was a hierarchy to episteme, phronesis and techne.
In the centuries following Aristotle this aspect of his work was all but lost until Marx and Nietzsche rediscovered it, and turned the hierarchy on its head, with practice becoming the fundamental principle. Marx’s focus on human activity can be found in his discussion of praxis, which eventually becomes production in his later writing. Production is a word that he used to cover all human material practices. Marx’s philosophy hinged on the importance of putting ideas into practice in the world, as Nicolini says:
[Marx] makes clear the aim of science is not that of producing theoretical knowledge but more of obtaining practical mastery of the world in order to satisfy the practical needs of mankind.
I can’t help but be reminded of the American Pragmatists here too (Peirce, James and Dewey) and was a bit surprised that Nicolini doesn’t mention them at all. Shrug. At any rate, it’s clear that Nicolini sees Marx as opening up a new space for thought, a space that Nietzsche and Heidegger would later fill. Quite a bit of the chapter is also devoted to Heidegger’s idea of Dasein, or being in the world, which quite a few later practice theorists draw on. Heidegger positions everyday practices prior to representation – echoing Marx’s inversion of knowledge and practice. Heidegger introduces the idea of breakdown, which makes everyday practices visible. Breakdown is an idea that gets used a great deal in infrastructure studies. To understand it Nicolini borrows Heidegger’s thought experiment of hammering a nail:
The hammer belongs to the environment and can be unthinkingly used by the carpenter. The carpenter does not need to ‘think a hammer’ in order to drive in a nail. His or her capacity to act depends upon the familiarity with the act of hammering. His/her use of the practical item ‘hammer’ is its significance to him/her in the setting ‘hammering’ and ‘carpentry’ … The hammer as such acquires a separate ‘existence’ only when it breaks or is lost: that is, when its unreflective use becomes problematic. (Nicolini, p. 34)
I first encountered Heidegger’s idea of breakdown a few years ago when I read Winograd & Flores (1986), which applied the idea to the context of computing and design. Since then it’s popped up in the context of infrastructure studies as well as work centered on repair. It might be useful to return to learn more about the origins in Heidegger’s work, perhaps through Dreyfus (1991) who Nicolini references quite a bit.
The chapter ends with a discussion of Wittgenstein and practice theory. I first encountered Wittgenstein’s work when I was trying to understand the semantic web agenda while working at the Library of Congress a few years ago. His earlier and later career make for such a fascinating embodiment of the problems of philosophy, particularly where sense-making and mathematics intersect (Halpin, 2011). So it was fun to find another parallel between practice theory and my own interests.
Nicolini references Shotter (1996) when drawing attention to three ways in which Wittgenstein’s work informs practice theory:
- meaning is found outside in social practices, not in internal contemplation
- the function of following rules (or not following rules) as practices that can only be understood through hints, tips and examples.
- practices provide a criterion of truth, and understanding is demonstrated by being able to extrapolate rules/practices further–how to go on
Wittgenstein’s idea of forms of life is also influential (Johannessen, 1988), because it brings attention to specific ways of acting and day-to-day performances. It is through the study of these that rules or practices can be observed.

References
Dreyfus, H. L. (1991). Being-in-the-world: A commentary on Heidegger’s Being and Time, Division I. MIT Press.
Halpin, H. (2011). Sense and reference on the web. Minds and Machines, 21(2), 153–178. Retrieved from https://www.era.lib.ed.ac.uk/bitstream/handle/1842/3796/Halpin2009.pdf
Johannessen, K. S. (1988). The concept of practice in Wittgenstein’s later philosophy. Inquiry, 31(3), 357–369.
Nicolini, D. (2012). Practice theory, work, and organization: An introduction. Oxford University Press.
Nussbaum, M. (1986). The fragility of goodness: Luck and ethics in Greek tragedy and philosophy. Cambridge University Press.
Shotter, J. (1996). Problems of theoretical psychology. In C. W. Tolman, F. Cherry, R. van Hezewijk, & I. Lubek (Eds.), Problems of theoretical psychology (pp. 3–12). Captus Press.
Winograd, T., & Flores, F. (1986). Understanding computers and cognition: A new foundation for design. Intellect Books.
As I previously mentioned I’m taking a look at Practice Theory as part of my coursework this semester. First up is reading Davide Nicolini’s Practice Theory, Work, and Organization. This post is about chapter one, but I may cover multiple chapters in subsequent posts.
Even though Nicolini is writing a book about Practice Theory he resists the urge to try to establish a single, monolithic, unified practice theory, and has structured his book around six separate strands:
- social praxeology: Bourdieu and Giddens
- practice as tradition
- Activity Theory
- Schatzki’s work building on Heidegger and Wittgenstein
- critical discourse analysis
He cites Schatzki (2001) when mentioning this emphasis on practice theories rather than a unified practice theory. This same work is cited pretty heavily elsewhere in the first chapter, and seeing that Nicolini dedicates a whole chapter to him, Schatzki is clearly important to Nicolini. If anyone is looking for a Wikipedia page to write, there appears to be no article for Schatzki, despite the fact that he is mentioned prominently in the article for Practice Theory. Maybe that can be something I create a stub for when I get to the chapter on Schatzki…
Nicolini also distinguishes between weak and strong practice theory. In weak practice theory the techniques of paying attention to the mundane details of activity are used in an effort to catalog and describe various practices in a particular domain or context. Strong practice theory does this as well, but goes an extra step in trying to explain how the practices are generated in various contexts over time; it takes identified practices as the unit of analysis and builds ontological analyses upon them.
It strikes me that in my own first tentative steps in using practice theory I’ve definitely been more in the weak camp. I’ve applied an attention to mundane details to identify practices, but haven’t done much analysis of how those practices are sustained over time–the ontological work. I’m hopeful that this book will give me the tools to help me shift towards trying more of that. While there is clearly an ordering to weak/strong, I wonder if there may be hidden humanistic benefits to a weak approach–where letting the reader infer connections, rather than explicitly giving them, could be useful. I can’t help but be reminded of hard and soft sci-fi.

References
Schatzki, T. R. (2001). The practice turn in contemporary theory. In T. R. Schatzki, K. K. Cetina, & E. von Savigny (Eds.). Routledge.