A couple Saturday mornings ago, I was on the couch listening to records and reading a book when Christina Harlow and MJ Suhonos asked me about collecting #WomensMarch tweets. Little did I know at the time #WomensMarch would be the largest volume collection I have ever seen. By the time I stopped collecting a week later, we'd amassed 14,478,518 unique tweet ids from 3,582,495 unique users, and at one point hit around 1 million tweets in a single hour.
This put #WomensMarch well over 1% of the overall Twitter stream, which causes dropped tweets if you're collecting from the Filter API, so I used the strategy of collecting with both the Filter and Search APIs. (If you're curious about learning more about this, check out Kevin Driscoll and Shawn Walker's "Big Data, Big Questions | Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data," and Jiaul H. Paik and Jimmy Lin's "Do Multiple Listeners to the Public Twitter Sample Stream Receive the Same Tweets?") I've included the search and filter logs in the dataset. If you run grep "WARNING" WomensMarch_filter.log, or grep "WARNING" WomensMarch_filter.log | wc -l to count the hits, you'll get a sense of the scale of dropped tweets. For a number of hours on January 22, I was seeing around 1.6 million cumulative dropped tweets!
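If you want more than a raw count, the warnings can be bucketed by hour with a few lines of Python. This is a quick sketch, not a polished tool; it assumes twarc's log lines begin with Python logging's default "YYYY-MM-DD HH:MM:SS" timestamp prefix, which may differ in your setup:

    import re
    from collections import Counter

    warnings_per_hour = Counter()
    with open("WomensMarch_filter.log") as log:
        for line in log:
            if "WARNING" not in line:
                continue
            # grab the date and hour from the log line's timestamp prefix
            match = re.match(r"(\d{4}-\d{2}-\d{2} \d{2})", line)
            if match:
                warnings_per_hour[match.group(1)] += 1

    for hour, count in sorted(warnings_per_hour.items()):
        print(hour, count)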
I collected from around 11AM EST on January 21, 2017 to 11AM EST on January 28, 2017 with the Filter API, and did two Search API queries. Final count before deduplication looked like this:

    $ wc -l WomensMarch_filter.json WomensMarch_search_01.json WomensMarch_search_02.json
       7906847 WomensMarch_filter.json
       1336505 WomensMarch_search_01.json
       9602777 WomensMarch_search_02.json
      18846129 total
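The deduplication itself is simple: stream the three files and keep the first occurrence of each tweet id. Here's a minimal standalone sketch (not twarc's own utility); roughly 14.5 million ids fit comfortably in an in-memory set:

    import fileinput
    import json

    # usage: python dedupe.py WomensMarch_*.json > WomensMarch.json
    seen = set()
    for line in fileinput.input():
        tweet = json.loads(line)
        if tweet["id_str"] not in seen:
            seen.add(tweet["id_str"])
            print(line, end="")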
Final stats: 14,478,518 tweets in a 104GB JSON file!
Below I'll give a quick overview of the dataset using utilities from Documenting the Now's twarc, along with utilities described inline. This is the same approach Ian Milligan and I took in our 2016 Code4Lib Journal article, "An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter." This is probably all that I'll have time to do with the dataset. Please feel free to use it in your own research. It's licensed CC-BY, so please have at it! :-)
...and if you want access to other Twitter datasets to analyse, check out http://www.docnow.io/catalog/.

Users

    Tweets  Username
     5,375  paparcura
     4,703  latinagirlpwr
     1,903  ImJacobLadder
     1,236  unbreakablepenn
     1,212  amForever44
     1,178  BassthebeastNYC
     1,170  womensmarch
     1,017  WhyIMarch
       982  TheLifeVote
       952  zerocomados
3,582,495 unique users.
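Counts like these come straight out of the line-oriented JSON. A minimal sketch with collections.Counter (the deduplicated file name here is hypothetical):

    import json
    from collections import Counter

    users = Counter()
    with open("WomensMarch.json") as tweets:
        for line in tweets:
            users[json.loads(line)["user"]["screen_name"]] += 1

    for screen_name, count in users.most_common(10):
        print(count, screen_name)
    print(len(users), "unique users")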
Retweets

146,370 retweets:
Yes we can.
Yes we did.
Thank you for being a part of the past eight years. pic.twitter.com/mjmr4RkxpV
Thanks for standing, speaking & marching for our values @womensmarch. Important as ever. I truly believe we're always Stronger Together.— Hillary Clinton (@HillaryClinton) January 21, 2017
I'm here today to honor our democracy & its enduring values. I will never stop believing in our country & its future. #Inauguration— Hillary Clinton (@HillaryClinton) January 20, 2017
Congratulations to the women marching today. We must go forward to ensure full reproductive justice for all women. #WomensMarch— Bernie Sanders (@SenSanders) January 21, 2017
Hi everybody! Back to the original handle. Is this thing still on? Michelle and I are off on a quick vacation, then we'll get back to work.— Barack Obama (@BarackObama) January 20, 2017
URLs

2,403,637 URLs tweeted, with 527,350 of those being unique URLs.
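The unique URL list is pulled from each tweet's entities. Here's a rough sketch that writes the deduplicated list the bash script below feeds to the Internet Archive (file names match the ones used in this post; entities.urls is the standard structure in Twitter's JSON):

    import json

    urls = set()
    with open("WomensMarch.json") as tweets:
        for line in tweets:
            for url in json.loads(line).get("entities", {}).get("urls", []):
                if url.get("expanded_url"):
                    urls.add(url["expanded_url"])

    with open("WomensMarch_urls_uniq.txt", "w") as out:
        for url in sorted(urls):
            out.write(url + "\n")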
I've also set up a little bash script to feed all the unique URLs to Internet Archive:

    #!/bin/bash

    URLS=/path/to/WomensMarch_urls_uniq.txt
    index=0

    cat $URLS | while read line; do
      curl -s -S "https://web.archive.org/save/$line" > /dev/null
      let "index++"
      echo "$index/527350 submitted to Internet Archive"
      sleep 1
    done
Images

6,153,894 embedded image URLs tweeted, with 390,298 of those being unique URLs.
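Image URLs come from the media entities rather than entities.urls. A similar sketch, assuming photos live under extended_entities (with entities as a fallback for older tweets):

    import json

    media_urls = set()
    with open("WomensMarch.json") as tweets:
        for line in tweets:
            tweet = json.loads(line)
            entities = tweet.get("extended_entities") or tweet.get("entities", {})
            for media in entities.get("media", []):
                if media.get("type") == "photo":
                    media_urls.add(media["media_url"])

    print(len(media_urls), "unique image URLs")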
I'll be creating an image montage for #WomensMarch similar to what I did for #elxn42 and #panamapapers. It'll take some time, and I have to gather resources to make it happen, since we're looking at about 5 times the number of images for #WomensMarch.
Voting in the 2017 ALA election will run March 13 – April 5, and results will be announced on April 12. Note that eligible members will be sent their voting credentials via email over a three-day period, March 13-15. Check the main ALA website for information about the general ALA election.
The Board thanks the LITA Nominating Committee for all of their work: Rachel Vacek (Chair), Emily E. Clasper, and Melissa S. Stoner. Thank you to the candidates for agreeing to serve.
Code4Lib Journal: Bridging Technologies to Efficiently Arrange and Describe Digital Archives: the Bentley Historical Library’s ArchivesSpace-Archivematica-DSpace Workflow Integration Project
Code4Lib Journal: The Devil’s Shoehorn: A case study of EAD to ArchivesSpace migration at a large university
Code4Lib Journal: Python, Google Sheets, and the Thesaurus for Graphic Materials for Efficient Metadata Project Workflows
I participated in the “#1Lib1Ref” campaign again this year, recording my experience and talking through why I think it’s important.
I. We provide the highest level of service to all library users… — ALA Code of Ethics
That’s what public libraries do, right? Provide service to everyone, respectfully and professionally — and without conditioning that respect on checking your papers. If you walk through those doors, you’re welcome here.
When you’re standing in the international arrivals area at Logan, you’re in a waiting area between a pair of large double doors exiting from Customs and the doors to the outside world. We stood in a crowd of hundreds, chanting “Let Them In!” Sometimes, some mysterious number of minutes after a flight arrival, the doors would open, and tired people and their luggage would pour through, from Zurich, Port-au-Prince, Heathrow, anywhere.
And the Code of Ethics ran through my head because that’s what we were chanting, wasn’t it? That anyone who walks through those doors is welcome here. Let them in.
Library values are American values. And if you have a stake in America, don’t let anyone build an America that’s less than what we as a profession stand for.
Apologies, but after our announcement, just before Christmas, of dates for Hydra Connect 2017, it became apparent that they clashed with a PASIG conference which, at that point, had not been widely advertised. This would have represented a conflict for a significant number of our Hydra community members.
Accordingly, the dates for Hydra Connect 2017 have been changed. It will still be hosted by Northwestern University but the dates are now Monday November 6th – Thursday November 9th, 2017. This year we have made the decision to use a conference hotel and the event will take place at the Hilton Orrington near the University. Please update your calendars!
Further information via emails and the Hydra wiki in due course!
Open Knowledge Foundation: Brazil’s Public Spending project is looking for leaders in various regions of Brazil to increase participation in the budgeting process.
The website is part of a wider campaign, built on OKI’s OpenSpending technical architecture, to find, recruit, and support new leaders who want to work on transparency, particularly public spending, in Brazilian municipalities. Support will be provided by mentors specializing in law, transparency, technology, and open data. The goal is to increase transparency in budget execution, bidding processes, and contract management in cities.
So that leaders can achieve concrete results, the OK Brazil team will develop a timeline with each of them, drawing on the existing legal framework, the support of mentors, and digital tools to increase transparency and participation in the budgeting process.

“The new website shows how to organize the missions and actions of the new leaders, empowers civil society to monitor public spending, and gives academics and journalists access to city budget data,” says Lucas Ansei, a developer and one of the mentors of the new website.
According to Thiago Rondon, coordinator of the OK Brazil team, the mentors will play a fundamental role in developing the leaders. “They’re specialists with experience in this area and will support the leaders through online conferences, offering direction so that the actions of these new leaders have a meaningful impact.”
Another goal of this new phase of the project is to reach out to city mayors all over the country, with the intention of getting them both to sign the Public Spending Brazil Commitment Letter and to carry out the concrete actions set out in the letter.

Be a leader of the Open Spending project in 2017
According to Thiago, there will be an initial action agenda that works like a step-by-step manual, so that anyone can help increase transparency in the city where they live. “We want to empower people to do this on their own. To amplify outreach, we will have local leaders in pilot cities who will receive direct support from OK Brazil.”
Those who want to participate as a local leader of the Public Spending project can sign up on the website. During this first phase, the OK Brazil team will select 15 local leaders based on their responses to the application form.
Users have high expectations these days. The hours spent in elegant web apps like Netflix and Spotify seem to be sharpening the collective sense of design. What was once the pinnacle is now the convention, and as Don Norman said, “Conventions are slow to be adopted and, once adopted, slow to go away.” So we thought it would be fun to emulate some of our favorite sites in a lightweight concept discovery layer we call Libre. Below are some of the expectations we prioritized in the design.[1]

#1: Things worth doing also look cool
First, we wanted to elevate books to the same “cool status” as other media. Thanks to Netflix and Spotify, that meant choosing a dark theme with white lettering and neon trim. Because of the ready association with the national library symbol, we chose blue for the secondary color.

#2: The most useful things are also the most visible
The intent in a known-item search (33-60% of all queries[2]) is rapid visual confirmation, so we highlighted title, author, and cover image. In more serendipitous browsing, the intent is evaluation, so average rating and a synopsis are prioritized second.[3]

#3: All the answers are here
Several friends of mine have revealed, at one point or another, that they didn’t know the library was free. While this can seem shocking, it’s bad design to assume that the user knows everything they need to know: immigrants may never have had access to a public library before, and the less tech-savvy might need to know that borrowing ebooks is legal. Hence, we avoided jargon like “Place Hold,” listed requirements, and explained the basic premise of a library in fine print beneath the main call-to-action.

#4: Browsing is always assisted
Other sites deliver personalized recommendations by capturing reams of personal data. Content-based recommendations like “Nebula Award Winners” or “NYT Bestsellers – Fiction” assist users in a similar way, though. Offering a compelling alternative is more important at the library than anywhere else online, since the title a user came looking for could be out on loan already. We wanted to keep our users from leaving in frustration if they encountered an unavailable title.

#5: I can bring friends
A site without sharing is a city without roads. Even if the features aren’t used too often, we decided that it was important to offer up multiple options for users to save, share, and otherwise show off their discoveries. We distinguish subtly between casual users, who might know to post or tweet, and the power user, who may want to embed a free link on his book review blog, for instance.
1: Our work in this article focuses on a popular reading use case, and will therefore seem more applicable to public libraries. Still, we hope our friends in academia get something out of it too.
2: EBSCO and Ex Libris are at odds over this figure. EBSCO says “Just under 30” and Ex Libris “over 50.” Both of them exclude author searches from their definition of “known-item” entirely, which seems to me a mistake. Often an author search is an easier route to a known item: for instance, when the title is so long as to be annoying to type or so short as to be ambiguous. Therefore, I inflate their estimates by about 5%.
3: Notably absent are Format and Availability. These are currently displayed after the user clicks “See at the Library.” A more robust implementation might have them both appear on the page.
It can be difficult to have a conversation on Twitter, but people somehow seem to manage. You can reply to someone’s tweet, and other people can reply to your replies, which forms a conversation thread of sorts. But the display of the thread is difficult to interpret.
What’s worse is that there is no Twitter API call to get the replies to a given tweet. If you have the JSON for a tweet in hand you can use the in_reply_to_status_id property to fetch the tweet that it is responding to. But the converse is not true: there is no straightforward way to get the tweets that are in response to a given tweet. If I’m wrong about that please let me know. For a much more thorough discussion and analysis of these constraints see Alexander Nwala’s Tweet Visibility Dynamics in a Tweet Conversation Graph.
It’s a bit of a hack but you can use Twitter’s Search API to programmatically scan through tweets directed at a given user (e.g. to:barackobama), and inspect them to see if any are in response to a given tweet. You can also stop scanning when you arrive at tweets that are older than the tweet you are looking for responses to, since to my knowledge it’s impossible to reply to a tweet from the future. Yeah, that was my dry attempt at a joke. The big caveat here is that Twitter’s Search API only allows you to retrieve tweets from the last week. So this technique will only work for fetching conversation threads from the last week.
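In outline, the heuristic looks something like this. This is a rough sketch of the idea, not twarc’s actual implementation, and it assumes twarc’s Twarc client with your own API keys filled in:

    from twarc import Twarc

    t = Twarc("consumer_key", "consumer_secret", "access_token", "access_token_secret")

    def replies_to(tweet):
        # scan recent tweets directed at the author, newest first
        for candidate in t.search("to:" + tweet["user"]["screen_name"]):
            if candidate.get("in_reply_to_status_id_str") == tweet["id_str"]:
                yield candidate
            # once we reach tweets older than the original we can stop:
            # you can't reply to a tweet from the future
            if candidate["id"] < tweet["id"]:
                break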
In the Documenting the Now project we are building tools to help researchers study Twitter. We’ve added a command to twarc that performs this heuristic to rebuild the reply thread for a given tweet identifier. So to get the replies to this tweet:
let’s make this shit huge https://t.co/iP8IOY3CqB — laura olin (@lauraolin) January 25, 2017
you can run this command:

    % twarc replies 82407791092769177 > replies.json
This will only get the initial set of replies to the tweet. If you want to get the entire conversation thread you can use the --recursive option:

    % twarc replies 82407791092769177 --recursive > replies.json
That will get the replies to the replies, and will also walk up the conversation chain if the supplied tweet identifier is itself a reply to another tweet. In addition, it will follow quote tweets.
To demonstrate that it’s working we’ve added a little utility called network.py that will read a set of tweets and write out the network of conversation as GEXF for loading into Gephi, as DOT for use with Graphviz, or as a standalone HTML file that uses D3 to visualize the conversation in your browser. Here’s how you run it:

    % ./network.py replies.json replies.html
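network.py does the conversions for you, but the underlying graph is simple: every tweet is a node, and every in_reply_to_status_id_str is an edge pointing at the parent. A minimal networkx sketch (not network.py’s actual code) that writes GEXF for Gephi:

    import json

    import networkx as nx

    graph = nx.DiGraph()
    with open("replies.json") as tweets:
        for line in tweets:
            tweet = json.loads(line)
            graph.add_node(tweet["id_str"], screen_name=tweet["user"]["screen_name"])
            parent = tweet.get("in_reply_to_status_id_str")
            if parent:
                graph.add_edge(tweet["id_str"], parent)

    nx.write_gexf(graph, "replies.gexf")  # open in Gephi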
and here’s what the D3 visualization looks like for that tweet above. Try clicking on the nodes in the graph to see the tweets that the node represents. You can see the quote is colored yellow, and the original tweet (the one with no parent) is colored red.
Paul Butler also recently added the ability to drag and drop a file of tweets generated with the twarc replies command into his Treeverse (https://paulgb.github.io/Treeverse/). Treeverse is a Chrome extension that provides a much more usable display of a conversation thread. Here’s a screenshot of that same set of replies.
The nice thing about the D3 visualization is that it’s possible to restyle the presentation using CSS. You can also use it to visualize networks of tweets that weren’t acquired with the replies command. For example, here is a visualization that was generated from a search for the #datarefuge hashtag a few days ago. I recorded it as a video on a large screen because there were so many nodes.
If you get a chance to try any of this or have any thoughts about it I’d love to hear from you.