Planet Code4Lib - http://planet.code4lib.org

Mark E. Phillips: Comparing Web Archives: EOT2008 and EOT2012 – What disappeared

Fri, 2016-06-17 14:30

This is the fourth post in a series that looks at the End of Term Web Archives captured in 2008 and 2012.  In previous posts I’ve looked at the when, what, and where of these archives.  In doing so I pulled together the domain names from each of the archives to compare them.

My thought was that I could look at which domains had content in the EOT2008 or EOT2012 and compare these domains to get some very high level idea of what content was around in 2008 but was completely gone in 2012.  Likewise I could look at new content domains that appeared since 2008.  For this post I’m limiting my view to just the domains that end in .gov or .mil because they are generally the focus of these web archiving projects.
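A minimal sketch of how that kind of per-domain tally could be produced from a list of captured URLs. It assumes a "domain" is the last two labels of the hostname (for example nasa.gov for www.nasa.gov), which matches how the domains appear in this post but is an assumption about the method, not a description of the tooling actually used. The function names and sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def registered_domain(url):
    """Return the last two labels of the URL's hostname, or None if unavailable."""
    host = urlsplit(url).hostname
    if not host:
        return None
    labels = host.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return None
    return ".".join(labels[-2:])


def count_gov_mil(urls):
    """Count URLs per domain, keeping only domains that end in .gov or .mil."""
    counts = Counter()
    for url in urls:
        domain = registered_domain(url)
        if domain and domain.endswith((".gov", ".mil")):
            counts[domain] += 1
    return counts


# Hypothetical sample URLs, not taken from either archive.
sample_urls = [
    "http://www.nasa.gov/missions/index.html",
    "http://science.nasa.gov/",
    "https://www.navy.mil/news",
    "http://example.com/page",
]
domain_counts = count_gov_mil(sample_urls)
print(domain_counts.most_common())
```

Collapsing hostnames to two labels keeps subdomains such as science.nasa.gov under a single entry, which is what makes the per-domain URL counts comparable across the two archives.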

Comparing EOT2008 and EOT2012

There are 1,647 unique domain names in the EOT2008 archive and 1,944 unique domain names in the EOT2012 archive, which is an increase of 18%. Between the two archives there are 1,236 domain names that are common.  There are 411 domains that exist in the EOT2008 that are not present in EOT2012, and 708 new domains in EOT2012 that didn’t exist in EOT2008.

Domains in EOT2008 and EOT2012
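The comparison itself comes down to set operations on the two lists of domain names. The sketch below is illustrative only: the function name is hypothetical and the inputs are a tiny made-up subset that borrows a few domain names mentioned in this post.

```python
def compare_domains(earlier, later):
    """Split two collections of domain names into shared, dropped, and new sets."""
    earlier, later = set(earlier), set(later)
    return {
        "common": earlier & later,        # present in both archives
        "only_earlier": earlier - later,  # present earlier, missing later
        "only_later": later - earlier,    # new in the later archive
    }


# Hypothetical miniature inputs, not the full domain lists.
eot2008 = {"nasa.gov", "loc.gov", "geodata.gov", "nifl.gov"}
eot2012 = {"nasa.gov", "loc.gov", "consumerfinance.gov"}

result = compare_domains(eot2008, eot2012)
for name, domains in result.items():
    print(name, sorted(domains))
```

With the full lists, the sizes of the three sets would correspond to the 1,236 common, 411 dropped, and 708 new domains reported above.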

The EOT2008 dataset consists of 160,212,141 URIs and the EOT2012 dataset comes in at 194,066,940 URIs.  When you look at the URLs in the 411 domains that are present in EOT2008 and missing in EOT2012 you get 3,784,308 which is just 2% of the total number of URLs.  When you look at the EOT2012 domains that were only present in 2012 compared to 2008 you see 5,562,840 URLs (3%) that were harvested from domains that only existed in the EOT2012 archive.
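The percentages quoted so far can be rechecked with a few lines of arithmetic. This sketch uses only the figures given in the post; the variable and function names are made up for illustration.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Figures quoted in the post.
domains_2008, domains_2012 = 1_647, 1_944
uris_2008, uris_2012 = 160_212_141, 194_066_940
urls_only_2008, urls_only_2012 = 3_784_308, 5_562_840

domain_growth = percent(domains_2012 - domains_2008, domains_2008)
share_only_2008 = percent(urls_only_2008, uris_2008)
share_only_2012 = percent(urls_only_2012, uris_2012)

print(f"domain growth: {domain_growth:.1f}%")
print(f"URLs in 2008-only domains: {share_only_2008:.1f}% of EOT2008")
print(f"URLs in 2012-only domains: {share_only_2012:.1f}% of EOT2012")
```

Rounded to whole numbers, these agree with the 18%, 2%, and 3% reported above.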

The table below lists the thirty domains with the most captured URLs that were present in the EOT2008 collection but not in EOT2012.

Domain                      Count
geodata.gov                 812,524
nifl.gov                    504,910
stat-usa.gov                398,961
tradestatsexpress.gov       243,729
arnet.gov                   174,057
acqnet.gov                  171,493
dccourts.gov                161,289
web-services.gov            137,202
metrokc.gov                 132,210
sdi.gov                     91,887
davie-fl.gov                88,123
belmont.gov                 87,332
aftac.gov                   84,507
careervoyages.gov           57,192
women-21.gov                56,255
egrpra.gov                  54,775
4women.gov                  45,684
4woman.gov                  42,192
nypa.gov                    36,099
nhmfl.gov                   27,569
darpa.gov                   21,454
usafreedomcorps.gov         18,001
peacecore.gov               17,744
californiadesert.gov        15,172
arpa.gov                    15,093
okgeosurvey1.gov            14,595
omhrc.gov                   14,594
usafreedomcorp.gov          14,298
uscva.gov                   13,627
odci.gov                    12,920

The thirty domains with the most URLs in EOT2012 that weren’t present in EOT2008 are listed in the table below.

Domain                      Count
militaryonesource.mil       859,843
consumerfinance.gov         237,361
nrd.gov                     194,215
wh.gov                      179,233
pnnl.gov                    132,994
eia.gov                     112,034
transparency.gov            109,039
nationalguard.mil           108,854
acus.gov                    93,810
404.gov                     82,409
savingsbondwizard.gov       76,867
treasuryhunt.gov            76,394
fedshirevets.gov            75,529
onrr.gov                    75,484
veterans.gov                75,350
broadbandmap.gov            72,889
saferproducts.gov           65,387
challenge.gov               63,808
healthdata.gov              63,105
marinecadastre.gov          62,882
fatherhood.gov              62,132
edpubs.gov                  58,356
transportationresearch.gov  58,235
cbca.gov                    56,043
usbonds.gov                 55,102
usbond.gov                  54,847
phe.gov                     53,626
ussavingsbond.gov           53,563
scienceeducation.gov        53,468
mda.gov                     53,010

Shared domains that changed

There are 1,236 domains present in both the EOT2008 and EOT2012 archives.  I thought it would be interesting to compare those domains and see which ones changed the most.  Below are the fifty shared domains that changed the most between EOT2008 and EOT2012.

Domain              EOT2008     EOT2012     Change       Absolute Change  % Change
house.gov           13,694,187  35,894,356  22,200,169   22,200,169       162%
senate.gov          5,043,974   9,924,917   4,880,943    4,880,943        97%
gpo.gov             8,705,511   3,888,645   -4,816,866   4,816,866        -55%
nih.gov             5,276,262   1,267,764   -4,008,498   4,008,498        -76%
nasa.gov            6,693,542   3,063,382   -3,630,160   3,630,160        -54%
navy.mil            94,081      3,611,722   3,517,641    3,517,641        3,739%
usgs.gov            4,896,493   1,690,295   -3,206,198   3,206,198        -65%
loc.gov             5,059,848   7,587,179   2,527,331    2,527,331        50%
hhs.gov             2,361,866   366,024     -1,995,842   1,995,842        -85%
osd.mil             180,046     2,111,791   1,931,745    1,931,745        1,073%
af.mil              230,920     2,067,812   1,836,892    1,836,892        795%
ed.gov              2,334,548   510,413     -1,824,135   1,824,135        -78%
lanl.gov            2,081,275   309,007     -1,772,268   1,772,268        -85%
usda.gov            2,892,923   1,324,049   -1,568,874   1,568,874        -54%
congress.gov        1,554,199   40,338      -1,513,861   1,513,861        -97%
noaa.gov            5,317,872   3,985,633   -1,332,239   1,332,239        -25%
epa.gov             1,628,517   327,810     -1,300,707   1,300,707        -80%
uscourts.gov        1,484,240   184,507     -1,299,733   1,299,733        -88%
dol.gov             1,387,724   88,557      -1,299,167   1,299,167        -94%
census.gov          1,604,505   328,014     -1,276,491   1,276,491        -80%
dot.gov             1,703,935   554,325     -1,149,610   1,149,610        -67%
usbg.gov            1,026,360   6,724       -1,019,636   1,019,636        -99%
doe.gov             1,164,955   268,694     -896,261     896,261          -77%
vaccines.mil        5,665       856,188     850,523      850,523          15,014%
fdlp.gov            991,747     156,499     -835,248     835,248          -84%
uspto.gov           980,215     155,428     -824,787     824,787          -84%
bts.gov             921,756     130,730     -791,026     791,026          -86%
cdc.gov             1,014,213   264,500     -749,713     749,713          -74%
lbl.gov             743,472     4,080       -739,392     739,392          -99%
faa.gov             945,446     206,500     -738,946     738,946          -78%
treas.gov           838,243     99,411      -738,832     738,832          -88%
fema.gov            903,393     172,055     -731,338     731,338          -81%
clinicaltrials.gov  919,490     196,642     -722,848     722,848          -79%
army.mil            2,228,691   2,936,308   707,617      707,617          32%
nsf.gov             760,976     65,880      -695,096     695,096          -91%
prc.gov             740,176     75,682      -664,494     664,494          -90%
doc.gov             823,825     173,538     -650,287     650,287          -79%
fueleconomy.gov     675,522     79,943      -595,579     595,579          -88%
nbii.gov            577,708     391         -577,317     577,317          -100%
defense.gov         687         575,776     575,089      575,089          83,710%
usajobs.gov         3,487       551,217     547,730      547,730          15,708%
sandia.gov          736,032     210,429     -525,603     525,603          -71%
nps.gov             706,323     191,102     -515,221     515,221          -73%
defenselink.mil     502,023     1,868       -500,155     500,155          -100%
fws.gov             625,180     132,402     -492,778     492,778          -79%
ssa.gov             609,784     125,781     -484,003     484,003          -79%
archives.gov        654,689     175,585     -479,104     479,104          -73%
fnal.gov            575,167     1,051,926   476,759      476,759          83%
change.gov          486,798     24,820      -461,978     461,978          -95%
buyusa.gov          490,179     37,053      -453,126     453,126          -92%
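For reference, the Change, Absolute Change, and % Change columns can be derived from the two count columns as sketched below. The helper name is hypothetical, and rounding the percentage to a whole number is an assumption that happens to reproduce the three rows used here.

```python
def change_row(domain, count_2008, count_2012):
    """Return (domain, change, absolute change, percent change as a whole number)."""
    change = count_2012 - count_2008
    pct = round(100 * change / count_2008)
    return domain, change, abs(change), pct


# Counts copied from three rows of the table above.
shared = {
    "navy.mil": (94_081, 3_611_722),
    "house.gov": (13_694_187, 35_894_356),
    "gpo.gov": (8_705_511, 3_888_645),
}

rows = [change_row(d, a, b) for d, (a, b) in shared.items()]
rows.sort(key=lambda r: r[2], reverse=True)  # largest absolute change first

for domain, change, abs_change, pct in rows:
    print(f"{domain:<12} {change:>12,} {abs_change:>12,} {pct:>7,}%")
```

Sorting on the absolute change rather than the signed change is what lets large drops such as gpo.gov rank alongside large gains such as house.gov.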

Only 11 of the 50 (22%) resulted in more content harvested in EOT2012 than in EOT2008.

Of the eleven domains that had more content harvested for them in EOT2012, five (navy.mil, osd.mil, vaccines.mil, defense.gov, and usajobs.gov) increased by over 1,000% in the amount of content.  I don’t know if this is necessarily the result of an increase in attention to these sites, more content on the sites, or a different organization of the sites that made them easier to harvest.  I suspect it is some combination of all three of those things.

Summary

It should be expected that there are going to be domains that come into and go out of existence on a regular basis in a large web space like the federal government.  One of the things that I think is rather challenging to identify is a list of domains that were present at one given time within an organization.  For example, “what domains did the federal government have in 1998?”  It seems like one way to come up with that answer is to use web archives. We see based on the analysis in this post that there are 411 domains that were present in 2008 that we weren’t able to capture in 2012.  Take a look at that list of the top thirty: did you recognize any of those? How many other initiatives, committees, agencies, task forces, and citizen education portals existed at one point that are now gone?

If you have questions or comments about this post,  please let me know via Twitter.

OCLC Dev Network: Consuming Linked Data Using JavaScript

Fri, 2016-06-17 12:00

Learn about how to use JavaScript to consume linked data from a specific graph URL.

William Denton: TPL shills for Google

Fri, 2016-06-17 01:15

The Toronto Public Library does many things right, but they also do some important things wrong, and lending Google-provided internet-connected wifi hotspots to poor people is wrong.

Here’s TPL’s news release: Toronto Public Library, Mayor Tory and Google Canada Announce Wi-Fi Hotspot Lending Program. It quotes TPL chief Vickery Bowles:

“We need programs like this one to help close the digital divide. People who can’t afford broadband Internet at home are at a significant disadvantage when it comes to school, looking for jobs or accessing government services and education,” said Vickery Bowles, City Librarian. “Internet access is essential in our digital world. We’re proud to pilot this program and hopeful we can increase its reach in the future.”

Everyone should have Internet access. Absolutely. But not provided free by Google or Facebook or another company that profits by monitoring its users.

Every library story needs a cat.

The Toronto Public Library is trying to do the right thing, but teaming up with Google is another example of TPL doing something that harms, rather than helps, its users’ privacy. The Kitchener Public Library did it better, even if on a smaller scale, by buying internet sticks and lending them on its own.

Here’s the post from Google Canada’s “official blog,” purportedly written by Vickery Bowles: Toronto Public Library launches WiFi Lending Program with grant from Google.org and the City of Toronto:

Another hotspot borrower is a university student from out-of-town who regularly uses the library’s wi-fi to study and complete his course work. Now, he’s able to access the Internet outside of library hours at home. A hotspot was also borrowed by a single mother on disability who is using the device to submit benefit forms, communicate by email with her caseworker and browse health-related information online.

Touching stories. Of course these folks should have internet access. But the Toronto Public Library shouldn’t be helping Google make money (I realize it’s Google’s “charitable arm” providing the money, but get real): it should be an active, engaged library advocating for the public welfare of all Torontonians, for example by working with the Library Freedom Project while advocating for free city-wide municipal wifi. In the meantime, instead of taking money from Google, it should be buying its own devices and net connections, and making online privacy part of its information literacy work.

If I could get one of those hotspots I’d run a Tor exit node on it.

HangingTogether: Managing the Collective Collection: research libraries and legal deposit networks in the UK and Ireland

Fri, 2016-06-17 00:52

Brian Lavoie and I have just wrapped up a project examining the collective collection of a number of important research libraries in the UK and Ireland. The key findings from our analysis are published in a new OCLC Research report, Strength in Numbers: The Research Libraries UK (RLUK) Collective Collection. We were very pleased to have an opportunity to collaborate with colleagues from the RLUK consortium on this project, our first foray into extending our collective collections research to geographies outside of North America. RLUK has a role in the UK that is comparable to that of ARL in the US, and its analogues in other places: it provides ‘coordination capacity’ for research libraries that share a number of similar challenges and operate within (broadly) comparable institutional circumstances.  As in our Right-scaling Stewardship project, undertaken in partnership with research libraries in the Committee on Institutional Cooperation (CIC) a few years ago, the insights and engagement of libraries in situ provided invaluable context for our analysis of WorldCat bibliographic and holdings data.

In the US, university libraries have a long history of engagement with WorldCat as a core bibliographic utility for cataloging and resource sharing operations. In the UK, participation in WorldCat has – until recently – been more opportunistic and instrumental in nature. This has meant that coverage of UK library holdings in WorldCat is somewhat uneven. In preparation for the collective collection study, a number of RLUK libraries undertook major record-loading projects to bring their WorldCat holdings up to date. As a result, we now have a much better picture of the distribution of library resource across research libraries in the UK. The figure below, not included in our report, gives an idea of the regional concentration of RLUK library collections.

Geographic concentration of RLUK holdings in WorldCat (January 2016)

It is no surprise that the greatest concentration of resource is found in and around London – there are many RLUK institutions in the area, including the British Library. The regional concentrations that are shown in green (Oxford, Cambridge, Dublin, Edinburgh) reflect the depth of legal deposit collections in those locations. Edinburgh is a particularly bright spot, given it includes both the National Library of Scotland and the University of Edinburgh. The very prominent concentration in the North of England represents holdings of the British Library that are managed in Yorkshire. Setting aside the legal deposit libraries, the median collection size of RLUK libraries in WorldCat (as of January 2016) was about 827,000 titles; the average size was about 1.1 million titles.

An interesting question that arose in the course of this project was the degree to which the existing legal deposit libraries in the UK and Ireland might serve as preservation hubs within the larger RLUK network. As more academic libraries are looking to manage down locally held print inventory (transferring materials to offsite repositories, increasing reliance on shared print agreements etc.), there is growing interest in identifying latent print preservation capacity that might be more effectively leveraged, so that the total cost of stewarding research and heritage collections can be reduced and the overall scope of the collective collection increased. In the UK context, where universities (and their libraries) are largely supported by public funding, it is reasonable to ask whether investments in national legal deposit collections, and de facto preservation collections in some university libraries, can be used to support some rationalization of print collections management across the larger higher education sector.

While a close study of bi-lateral duplication rates within the RLUK group (e.g., duplication between individual legal deposit libraries and other university libraries) was beyond the scope of our collective collection project, we did do some preliminary investigation. The project advisory board, composed of representatives from 11 RLUK libraries, was especially interested to know if a large share of scarce or distinctive resources in the collective collection were concentrated in the legal deposit libraries. If so, they reasoned, other libraries might relegate or even de-select local copies of those resources, with confidence that legal deposit partners would uphold preservation and access responsibilities. We looked at titles held in fewer than 5 RLUK institutions and found that 73% or more were held in at least one legal deposit repository.  The percentage of titles held in legal deposit libraries increased with the overall duplication rate: for titles held by 4 libraries in the RLUK group, the legal deposit duplication rate rose to 95%. Among the legal deposit libraries, the British Library – the largest of all the legal deposit collections within the RLUK – was the most frequent source of duplication, with Oxford University a close second.
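To make the measure concrete, here is a minimal sketch of how a "share of scarce titles held by at least one legal deposit library" figure could be computed. It is not the method used in the report: the function name, the library codes, and the holdings data are all hypothetical, and real WorldCat holdings would need considerably more preparation.

```python
def legal_deposit_coverage(holdings, legal_deposit, max_holders=4):
    """Fraction of scarce titles (held by at most max_holders libraries)
    that are held by at least one legal deposit library."""
    scarce = [libs for libs in holdings.values() if 0 < len(libs) <= max_holders]
    if not scarce:
        return 0.0
    covered = sum(1 for libs in scarce if libs & legal_deposit)
    return covered / len(scarce)


# Hypothetical holdings: title identifier -> set of holding library codes.
holdings = {
    "title-1": {"BL"},
    "title-2": {"LIB-A", "LIB-B"},
    "title-3": {"OXF", "LIB-A", "LIB-B", "BL"},
    "title-4": {"LIB-A"},
    "title-5": {"BL", "OXF", "CAM", "NLS", "LIB-A", "LIB-B"},  # not scarce
}
# Hypothetical codes standing in for the legal deposit libraries.
legal_deposit = {"BL", "OXF", "CAM", "NLS", "TCD"}

coverage = legal_deposit_coverage(holdings, legal_deposit)
print(f"{coverage:.0%} of scarce titles are held by a legal deposit library")
```

Setting max_holders to 4 corresponds to "held in fewer than 5 institutions"; filtering to titles with exactly four holders would give the per-level rate discussed above.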

What this suggests is that while the legal deposit network in the UK represents a significant source of preservation and access infrastructure, the capacity of individual legal deposit libraries to contribute to shared RLUK preservation goals will vary. Put another way, the collective capacity of these libraries is greater than the sum of their individual institutional capacities. What the optimal allocation of preservation responsibility across the RLUK group will ultimately look like will depend on a number of factors. Rick Lugg, Ruth Fisher and colleagues at OCLC Sustainable Collection Services will be working with subsets of the RLUK group (including members of the White Rose Libraries consortium) in the coming year to examine how print preservation responsibilities can be shared among UK research libraries.

From a research perspective, we are interested in delineating patterns that reveal where library collaboration can increase system-wide efficiency and maximize institutional benefit, without necessarily prescribing specific choices or courses of action for individual libraries. This recent collaboration with RLUK libraries provided a wealth of opportunities to explore how WorldCat data can be used to support group-scale approaches to collection management. Not all of the research it motivated is reflected in the final report, but it has helped to illuminate some lines of inquiry that we hope to explore in the future.

About Constance Malpas

Constance Malpas is a Research Scientist at OCLC. Her work focuses on data-driven analysis of library collections and services, with a special emphasis on strategic planning and managing institutional change. She has a particular interest in the organization of knowledge and research practices in the sciences.

HangingTogether: Managing the Collective Collection: research libraries and legal deposit networks in the UK and Ireland

Fri, 2016-06-17 00:52

Brian Lavoie and I have just wrapped up a project examining the collective collection of a number of important research libraries in the UK and Ireland. The key findings from our analysis are published in a new OCLC Research report, Strength in Numbers: The Research Libraries UK (RLUK) Collective CollectionWe were very pleased to have an opportunity to collaborate with colleagues from the RLUK consortium on this project, our first foray into extending our collective collections research to geographies outside of North America. RLUK has a role in the UK that is comparable to that of ARL in the US, and its analogues in other places: it provides ‘coordination capacity’ for research libraries that share a number of similar challenges and operate within (broadly) comparable institutional circumstances.  As in our Right-scaling Stewardship project, undertaken in partnership with research libraries in the Committee on Institutional Cooperation (CIC) a few years ago, the insights and engagement of libraries in situ provided invaluable context for our analysis of WorldCat bibliographic and holdings data.

In the US, universities libraries have a long history of engagement with WorldCat as a core bibliographic utility for cataloging and resource sharing operations. In the UK, participation in WorldCat has – until recently – been more opportunistic and instrumental in nature. This has meant that coverage of UK library holdings in WorldCat is somewhat uneven. In preparation for the collective collection study, a number of RLUK libraries undertook major record-loading projects to bring their WorldCat holdings up to date. As a result, we now have a much better picture of the distribution of library resource across research libraries in the UK. The figure below, not included in our report, gives an idea of the regional concentration of RLUK library collections.

Geographic concentration of RLUK holdings in WorldCat (January 2016)

It is no surprise that the greatest concentration of resource is found in and around London – there are many RLUK institutions in the area, including the British Library. The regional concentrations that are shown in green (Oxford, Cambridge, Dublin, Edinburgh) reflect the depth of legal deposit collections in those locations. Edinburgh is a particularly bright spot, given it includes both the National Library of Scotland and the University of Edinburgh. The very prominent concentration in the North of England represents holdings of the British Library that are managed in Yorkshire. Setting aside the legal deposit libraries, the median collection size of RLUK libraries in WorldCat (as of January 2016) was about 827,000 titles; the average size was about 1.1 million titles.

An interesting question that arose in the course of this project was the degree to which the existing legal deposit libraries in the UK and Ireland might serve as preservation hubs within the larger RLUK network. As more academic libraries are looking to manage down locally held print inventory (transferring materials to offsite repositories, increasing reliance on shared print agreements etc.), there is growing interest in identifying latent print preservation capacity that might be more effectively leveraged, so that the total cost of stewarding research and heritage collections can be reduced and the overall scope of the collective collection increased. In the UK context, where universities (and their libraries) are largely supported by public funding, it is reasonable to ask whether investments in national legal deposit collections, and de facto preservation collections in some university libraries, can be used to support some rationalization of print collections management across the larger higher education sector.

While a close study of bi-lateral duplication rates within the RLUK group (e.g., duplication between individual legal deposit libraries and other university libraries) was beyond the scope of our collective collection project, we did do some preliminary investigation. The project advisory board, composed of representatives from 11 RLUK libraries, was especially interested to know if a large share of scarce or distinctive resources in the collective collection were concentrated in the legal deposit libraries. If so, they reasoned, other libraries might relegate or even de-select local copies of those resources, with confidence that legal deposit partners would uphold preservation and access responsibilities. We looked at titles held in fewer than 5 RLUK institutions and found that 73% or more were held in at least one legal deposit repository.  The percentage of titles held in legal deposit libraries increased with the overall duplication rate: for titles held by 4 libraries in the RLUK group, the legal deposit duplication rate rose to 95%. Among the legal deposit libraries, the British Library – the largest of all the legal deposit collections within the RLUK – was the most frequent source of duplication, with Oxford University a close second.

What this suggests is that while the legal deposit network in the UK represents a significant source of preservation and access infrastructure, the capacity of individual legal deposit libraries to contribute to shared RLUK preservation goals will vary. Put another way, the collective capacity of these libraries is greater than the sum of their individual institutional capacities. The optimal allocation of preservation responsibility across the RLUK group will ultimately depend on a number of factors. Rick Lugg, Ruth Fisher and colleagues at OCLC Sustainable Collection Services will be working with subsets of the RLUK group (including members of the White Rose Libraries consortium) in the coming year to examine how print preservation responsibilities can be shared among UK research libraries.

From a research perspective, we are interested in delineating patterns that reveal where library collaboration can increase system-wide efficiency and maximize institutional benefit, without necessarily prescribing specific choices or courses of action for individual libraries. This recent collaboration with RLUK libraries provided a wealth of opportunities to explore how WorldCat data can be used to support group-scale approaches to collection management. Not all of the research it motivated is reflected in the final report, but it has helped to illuminate some lines of inquiry that we hope to explore in the future.

About Constance Malpas

Constance Malpas is a Research Scientist at OCLC. Her work focuses on data-driven analysis of library collections and services, with a special emphasis on strategic planning and managing institutional change. She has a particular interest in the organization of knowledge and research practices in the sciences.

District Dispatch: Copyright questions got you down?

Thu, 2016-06-16 18:38

Help is on the way, both in person and online. At ALA Annual 2016, the Copyright Answer booth is back. The booth is open on Saturday and Sunday only, from 10am to 4pm, conveniently located in the Orange County Convention Center outside of the Exhibit Hall and staffed by copyright librarian experts. Stop by and pour your heart out. What copyright issue is troubling you? Feel like venting? We have been there and done that. We will listen, and provide non-legal, but well-informed advice.

If you can’t make it to Orlando, don’t pout.  The Copyright Advisory Network (CAN) is back.  Any time, day or night, post your copyright question or issue on the CAN forum.  Within 24 hours (except holidays and weekends), one of our knowledgeable copyright scholars will respond to your query.  We also have online copyright tools for figuring out when materials are in the public domain, conducting a fair use analysis and more. By the way,  it’s all FREE. You will never be asked to pay a fee or take a drug test.

These services are brought to you by fellow librarians who are members of the Office for Information Technology Policy (OITP) Copyright Education Subcommittee. Our primary mission is to help librarians who find themselves on the copyright front lines navigate the minefield.

The post Copyright questions got you down? appeared first on District Dispatch.

LITA: Reminder/Shameless Plug for LITA President’s Program in Orlando

Thu, 2016-06-16 15:53

by Thomas Dowling

LITA members–and anyone else–attending ALA Annual in Orlando, please join us for the LITA Awards and President’s Program on Sunday afternoon, 3pm to 4pm, in the Orange County Convention Center, room W109B.

Our featured speaker will be Dr. Safiya Noble, who will speak about how the landscape of information is rapidly shifting as new imperatives and demands push to the fore increasing investment in digital technologies, despite the consequences of increased surveillance and lack of privacy, which are changing our information engagements. Dr. Noble’s talk is co-sponsored by ALA’s Office for Diversity, Literacy, and Outreach Services, and the Black Caucus of the American Library Association.

If you can fit it all in to your schedule, I invite you to binge watch our Sunday Afternoon With LITA event, starting with Top Tech Trends (1pm to 2pm, Convention Center, W109B), continuing with the President’s Program, and concluding with the LITA Happy Hour, 5:30pm, Sam & Bubbe’s Lobby Bar at the Rosen Centre Hotel.  In addition to good company and good cheer, Happy Hour is the start to our year-long 50th anniversary celebration!

Mark E. Phillips: Comparing Web Archives: EOT2008 and EOT2012 – Where

Thu, 2016-06-16 15:36

This post carries on in the analysis of the End of Term web archives for 2008 and 2012. Previous posts in this series discuss when content was harvested and what kind of content was harvested and included in the archives.

In this post we will look at where content came from, specifically the data held in the top level domains, domain names and sub-domain names.

Top Level Domains

The first thing to look at is the top level domains for all of the URLs in the CDX files.

In the EOT2008 archive there are a total of 241 unique TLDs.  In the EOT2012 archive there are a total of 251 unique TLDs.  This is a modest increase of 4.15% from EOT2008 to EOT2012.

The EOT2008 and EOT2012 archives share 225 TLDs between the two archives.  There are 16 TLDs that are unique to the EOT2008 archive and 26 TLDs that are unique to the EOT2012 archive.
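A comparison like this comes down to pulling the last hostname label out of each URL and then doing set arithmetic. The sketch below is not the code used for this analysis. It assumes whitespace-separated CDX rows with the original URL in the third field, which is common but should be checked against the header line of the actual files. The function names and sample rows are made up.

```python
from urllib.parse import urlsplit


def tld_from_cdx_line(line, url_field=2):
    """Return the last hostname label of the URL in one CDX row, or None."""
    fields = line.split()
    if len(fields) <= url_field:
        return None
    try:
        host = urlsplit(fields[url_field]).hostname
    except ValueError:  # malformed URL, e.g. a broken IPv6 literal
        return None
    if not host:
        return None
    # hostname is already lowercased; drop a trailing dot if present.
    return host.rstrip(".").rsplit(".", 1)[-1]


def compare_sets(a, b):
    """Return (shared, only_in_a, only_in_b) for two sets of labels."""
    return a & b, a - b, b - a


# Made-up CDX rows, used only to exercise the functions.
sample_2008 = [
    "gov,example)/ 20081101000000 http://www.example.gov/ text/html 200 AAAA - - 100 0 a.warc.gz",
    "yu,example)/ 20081101000000 http://example.yu/ text/html 200 BBBB - - 100 100 a.warc.gz",
]
sample_2012 = [
    "gov,example)/ 20121101000000 http://www.example.gov/ text/html 200 CCCC - - 100 0 b.warc.gz",
    "io,example)/ 20121101000000 http://example.io/ text/html 200 DDDD - - 100 100 b.warc.gz",
]

tlds_2008 = {t for t in map(tld_from_cdx_line, sample_2008) if t}
tlds_2012 = {t for t in map(tld_from_cdx_line, sample_2012) if t}
shared, only_2008, only_2012 = compare_sets(tlds_2008, tlds_2012)
```

The reported counts are consistent with this kind of set arithmetic: 225 shared plus 16 unique gives the 241 TLDs in EOT2008, and 225 plus 26 gives the 251 in EOT2012. Rows without a usable hostname are skipped here, so oddities such as a "null" TLD would need separate handling.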

TLDs unique to EOT2008

Unique to 2008  URLs from TLD
null            18,772
www             583
yu              357
labs            20
webteam         16
cg              10
security        8
ssl             8
b               8
css             7
web             6
dev             4
education       4
misc            2
secure          2
campaigns       2

TLDs unique to EOT2012

Unique to 2012  URLs from TLD
whois           17,500
io              7,935
pn              987
sy              541
lr              478
so              418
nr              363
tf              291
xxx             258
re              186
xn--p1ai        171
bi              153
dm              120
tel             78
ck              65
ax              64
sx              54
tg              50
ki              48
gg              25
kn              25
gp              24
pm              20
fk              18
cf              7
wf              3

I believe that the “null” TLD from EOT2008 is an artifact of the crawling process and possibly represents rows in the CDX file that correspond to metadata records in the warc/arcs from 2008.  I will have to do some digging to confirm.

Change in TLD

Next up we take a look at the 225 TLDs that are shared between the archives. First up are the fifteen most changed based on the increase or decrease in the number of URLs from that TLD

TLD  eot2008      eot2012      Change       Absolute Change  % change
com  7,809,711    45,594,482   37,784,771   37,784,771       483.8%
gov  137,829,050  109,141,353  -28,687,697  28,687,697       -20.8%
mil  3,555,425    16,223,861   12,668,436   12,668,436       356.3%
net  653,187      9,269,406    8,616,219    8,616,219        1319.1%
edu  3,552,509    2,442,626    -1,109,883   1,109,883        -31.2%
int  135,939      685,168      549,229      549,229          404.0%
uk   70,262       594,020      523,758      523,758          745.4%
ly   95           503,457      503,362      503,362          529854.7%
org  5,108,645    5,588,750    480,105      480,105          9.4%
us   840,516      474,156      -366,360     366,360          -43.6%
co   2,839        211,131      208,292      208,292          7336.8%
be   4,019        203,178      199,159      199,159          4955.4%
jp   23,896       220,602      196,706      196,706          823.2%
me   35           182,963      182,928      182,928          522651.4%
tv   10,373       191,736      181,363      181,363          1748.4%

The change in the first two is interesting.  There was an increase of over 37 million URLs (484%) for the com TLD between EOT2008 and EOT2012.  There was also a decrease (-21%) of over 28 million URLs for the gov TLD.  The mil TLD also increased by 356% between the EOT2008 and EOT2012 harvests with an increase of over 12 million URLs.

You can see that .ly and .me increased by some serious percentage,  529,855% and 522,651% respectively.
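The columns in a table like this one are simple to derive from the two counts. The sketch below shows one way to compute and rank them; it is not the code used for this analysis. The function name is made up, and the three input rows are copied from the table above.

```python
def describe_change(before, after):
    """Return (change, absolute change, percent change) between two counts."""
    change = after - before
    # Percent change is undefined when the starting count is zero.
    pct = change / before * 100 if before else None
    return change, abs(change), pct


# (eot2008, eot2012) URL counts taken from the table above.
counts = {
    "com": (7_809_711, 45_594_482),
    "gov": (137_829_050, 109_141_353),
    "ly": (95, 503_457),
}

rows = [(tld, *describe_change(b, a)) for tld, (b, a) in counts.items()]

# Rank by absolute change, then by percent change.
by_absolute = sorted(rows, key=lambda r: r[2], reverse=True)
by_percent = sorted((r for r in rows if r[3] is not None),
                    key=lambda r: r[3], reverse=True)
```

Ranking on the two keys gives very different pictures: a TLD like ly that starts from a tiny base tops the percentage ranking while contributing far fewer URLs than com or gov.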

Taking a look at just the percent of change, here are the five most changed based on that percentage

TLD  eot2008  eot2012  Change   Absolute Change  % change
ly   95       503,457  503,362  503,362          529854.7%
me   35       182,963  182,928  182,928          522651.4%
gl   129      49,733   49,604   49,604           38452.7%
gd   9        3,273    3,264    3,264            36266.7%
cat  43       11,703   11,660   11,660           27116.3%

I have a feeling that the majority of the ly, me, gl, and gd TLD content came in as redirect URLs from link shortening services.

Domain Names

There are 87,889 unique domain names in the EOT2008 archive.  This increases dramatically in the EOT2012 archive to 186,214, which is an increase of 112% in the number of domain names.

There are 30,066 domain names that are shared between the two archives.  There are 57,823 domain names that are unique to the EOT2008 archive and 156,148 domain names that are unique to the EOT2012 archive.

Here is a table showing thirty of the domains that were only present in the EOT2008 archive ordered by the number of URLs from that domain.

Domain                         Count
geodata.gov                    812,524
nifl.gov                       504,910
stat-usa.gov                   398,961
tradestatsexpress.gov          243,729
arnet.gov                      174,057
acqnet.gov                     171,493
dccourts.gov                   161,289
meish.org                      147,261
web-services.gov               137,202
metrokc.gov                    132,210
sdi.gov                        91,887
davie-fl.gov                   88,123
belmont.gov                    87,332
aftac.gov                      84,507
careervoyages.gov              57,192
women-21.gov                   56,255
egrpra.gov                     54,775
4women.gov                     45,684
4woman.gov                     42,192
nypa.gov                       36,099
secure-banking.com             33,059
nhmfl.gov                      27,569
darpa.gov                      21,454
usafreedomcorps.gov            18,001
peacecore.gov                  17,744
californiadesert.gov           15,172
federaljudgesassoc.org         15,126
arpa.gov                       15,093
transportationfortomorrow.org  14,926
okgeosurvey1.gov               14,595

Here is the same kind of table but this time for the EOT2012 dataset.

Domain                 Count
militaryonesource.mil  859,843
yfrog.com              682,664
staticflickr.com       640,606
akamaihd.net           384,769
4sqi.net               350,707
foursquare.com         340,492
adf.ly                 334,767
pinterest.com          244,293
consumerfinance.gov    237,361
nrd.gov                194,215
wh.gov                 179,233
t.co                   175,033
youtu.be               172,301
sndcdn.com             161,039
pnnl.gov               132,994
eia.gov                112,034
transparency.gov       109,039
nationalguard.mil      108,854
acus.gov               93,810
nrsc.org               85,925
mzstatic.com           84,202
404.gov                82,409
savingsbondwizard.gov  76,867
treasuryhunt.gov       76,394
mynextmove.org         75,927
fedshirevets.gov       75,529
onrr.gov               75,484
veterans.gov           75,350
broadbandmap.gov       72,889
ntm-a.com              71,126

Those are pretty long tables but I think they start to point at some interesting things from this analysis.  The first list shows domains that were present and harvested in 2008 but weren’t harvested in 2012.  In looking at the list, some of them (metrokc.gov, davie-fl.gov, okgeosurvey1.gov) were most likely out of scope for “Federal Web” but got captured because of the gov TLD.

In the EOT2012 list you start to see artifacts from an increase in attention to social media site capture for the EOT2012 project.  Sites like yfrog.com, staticflickr.com, adf.ly, t.co, youtu.be, foursquare.com, and pinterest.com probably came from that increased attention.

Here is a list of the twenty most changed domains from EOT2008 to EOT2012.  This number is based on the absolute change in the number of URLs captured for each of the archives.

Domain                 EOT2008     EOT2012     Change      Absolute Change  % Change
house.gov              13,694,187  35,894,356  22,200,169  22,200,169       162%
facebook.com           11,895      7,503,640   7,491,745   7,491,745        62,982%
dvidshub.net           1,097       5,612,410   5,611,313   5,611,313        511,514%
senate.gov             5,043,974   9,924,917   4,880,943   4,880,943        97%
gpo.gov                8,705,511   3,888,645   -4,816,866  4,816,866        -55%
nih.gov                5,276,262   1,267,764   -4,008,498  4,008,498        -76%
nasa.gov               6,693,542   3,063,382   -3,630,160  3,630,160        -54%
navy.mil               94,081      3,611,722   3,517,641   3,517,641        3,739%
usgs.gov               4,896,493   1,690,295   -3,206,198  3,206,198        -65%
loc.gov                5,059,848   7,587,179   2,527,331   2,527,331        50%
flickr.com             157,155     2,286,890   2,129,735   2,129,735        1,355%
youtube.com            346,272     2,369,108   2,022,836   2,022,836        584%
hhs.gov                2,361,866   366,024     -1,995,842  1,995,842        -85%
osd.mil                180,046     2,111,791   1,931,745   1,931,745        1,073%
af.mil                 230,920     2,067,812   1,836,892   1,836,892        795%
ed.gov                 2,334,548   510,413     -1,824,135  1,824,135        -78%
granicus.com           782         1,785,724   1,784,942   1,784,942        228,253%
lanl.gov               2,081,275   309,007     -1,772,268  1,772,268        -85%
usda.gov               2,892,923   1,324,049   -1,568,874  1,568,874        -54%
googleusercontent.com  2           1,560,457   1,560,455   1,560,455        78,022,750%

You see big increases in facebook.com (+62,982%), flickr.com (+1,355%), youtube.com (584%) and googleusercontent.com (78,022,750%) in content from EOT2008 to EOT2012.

Other increases that are notable include dvidshub.net which is the domain for a site called Defense Video & Imagery Distribution System that increased by 511,514%, navy.mil (3,739%), osd.mil (1,073%), af.mil (795%).  I like to think this speaks to a desired increase in attention to .mil content in the EOT2012 project.

Another domain that stands out to me is granicus.com which I was unaware of but after a little looking turns out to be one of the big cloud service providers for the federal government (or at least it was according to the EOT2012 dataset).

.gov and .mil subdomains

The last piece I wanted to look at related to domain names was to see what sort of changes there were in the gov and mil portions of the EOT2008 and EOT2012 crawls.  This time I wanted to look at the subdomains.

I filtered my dataset a bit so that I was only looking at the .mil and .gov content.

In the EOT2008 archive there were a total of 16,072 unique subdomains and in EOT2012 there were 22,477 subdomains.  This is an increase of 40% between the two archive projects.

The EOT2008 has 5,371 subdomains unique to its holdings and EOT2012 has 11,776 unique subdomains.
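Filtering to .gov and .mil hostnames, counting URLs per hostname, and then listing the busiest hostnames seen in only one archive could be sketched as follows. This is an illustration under assumptions, not the code used for this analysis: the function names and sample URLs are made up, and hostnames are compared exactly as they appear, with no normalization such as stripping a leading "www.".

```python
from collections import Counter
from urllib.parse import urlsplit

FEDERAL_TLDS = ("gov", "mil")


def hostname_counts(urls, tlds=FEDERAL_TLDS):
    """Count URLs per hostname, keeping only hostnames under the given TLDs."""
    counts = Counter()
    for url in urls:
        try:
            host = urlsplit(url).hostname
        except ValueError:  # skip malformed URLs
            continue
        if host and host.rsplit(".", 1)[-1] in tlds:
            counts[host] += 1
    return counts


def top_unique(counts_a, counts_b, n=30):
    """Return the n busiest hostnames present in counts_a but not counts_b."""
    only_a = Counter({h: c for h, c in counts_a.items() if h not in counts_b})
    return only_a.most_common(n)


# Made-up URLs, used only to exercise the functions.
urls_2008 = [
    "http://gos2.geodata.gov/a",
    "http://gos2.geodata.gov/b",
    "http://www.example.com/",
    "http://loc.gov/",
]
urls_2012 = ["http://loc.gov/x", "http://wh.gov/"]

counts_2008 = hostname_counts(urls_2008)
counts_2012 = hostname_counts(urls_2012)
unique_2008 = top_unique(counts_2008, counts_2012)
```

Swapping the two arguments to top_unique gives the corresponding list for the other archive.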

Subdomains that had the most content (based on URLs downloaded) and which are only present in EOT2008 are presented below.  (Limited to the top 30)

EOT2008 Subdomain                    Count
gos2.geodata.gov                     809,442
boucher.house.gov                    772,759
kendrickmeek.house.gov               685,368
citizensbriefingbook.change.gov      446,632
stat-usa.gov                         305,936
nifl.gov                             285,833
scidac-new.ca.sandia.gov             247,451
tradestatsexpress.gov                243,729
hpcf.nersc.gov                       221,626
gopher.info.usaid.gov                219,051
novel.nifl.gov                       218,962
dli2.nsf.gov                         206,932
contractorsupport.acf.hhs.gov        188,841
pnwin.nbii.gov                       188,591
faq.acf.hhs.gov                      184,212
ccdf.acf.hhs.gov                     182,606
arnet.gov                            174,018
regulations.acf.hhs.gov              171,762
acqnet.gov                           171,493
dccourts.gov                         161,289
employers.acf.hhs.gov                139,141
search.info.usaid.gov                137,816
web-services.gov                     137,202
earth2.epa.gov                       136,441
cjtf7.army.mil                       134,507
ncweb-north.wr.usgs.gov              134,486
opre.acf.hhs.gov                     133,689
childsupportenforcement.acf.hhs.gov  132,023
modis-250m.nascom.nasa.gov           128,810
casd.uscourts.gov                    124,146

Here is the same sort of data for the EOT2012 dataset

EOT2012 Subdomain                     Count
militaryonesource.mil                 698,035
uscodebeta.house.gov                  387,080
democrats.foreignaffairs.house.gov    312,270
gulflink.fhpr.osd.mil                 262,246
coons.senate.gov                      257,721
democrats.energycommerce.house.gov    243,341
consumerfinance.gov                   225,815
dcmo.defense.gov                      217,255
nrd.gov                               187,267
wh.gov                                179,103
usaxs.xray.aps.anl.gov                178,298
democrats.budget.house.gov            175,109
democrats.edworkforce.house.gov       162,077
apps.militaryonesource.mil            157,144
naturalresources.house.gov            155,918
purl.fdlp.gov                         154,718
media.dma.mil                         137,581
algreen.house.gov                     131,388
democrats.transportation.house.gov    129,345
democrats.naturalresources.house.gov  124,808
hanabusa.house.gov                    123,794
pitts.house.gov                       122,402
visclosky.house.gov                   122,223
garamendi.house.gov                   114,221
vault.fbi.gov                         113,873
green.house.gov                       113,040
sewell.house.gov                      112,973
levin.house.gov                       111,971
eia.gov                               111,889
hahn.house.gov                        111,024

This last table is a little long,  but I found the data pretty interesting to look at.   The table below shows the biggest change for domains and subdomains that were shared between the EOT2008 and EOT2012 archives. I’ve included the top forty entries for that list.

Subdomain/Domain              EOT2008    EOT2012    Change      Absolute Change  % Change
listserv.access.gpo.gov       2,217,565  7,487      -2,210,078  2,210,078        -100%
carter.house.gov              1,898,462  29,680     -1,868,782  1,868,782        -98%
catalog.gpo.gov               1,868,504  34,040     -1,834,464  1,834,464        -98%
loc.gov                       63,534     1,875,264  1,811,730   1,811,730        2,852%
gpo.gov                       52,427     1,796,925  1,744,498   1,744,498        3,327%
bensguide.gpo.gov             90,280     1,790,017  1,699,737   1,699,737        1,883%
edocket.access.gpo.gov        1,644,578  7,822      -1,636,756  1,636,756        -100%
nws.noaa.gov                  103,367    1,676,264  1,572,897   1,572,897        1,522%
navair.navy.mil               220        1,556,320  1,556,100   1,556,100        707,318%
congress.gov                  1,525,467  356        -1,525,111  1,525,111        -100%
cha.house.gov                 1,366,520  109,192    -1,257,328  1,257,328        -92%
usbg.gov                      1,026,360  6,724      -1,019,636  1,019,636        -99%
dol.gov                       1,052,335  41,909     -1,010,426  1,010,426        -96%
resourcescommittee.house.gov  1,008,655  335        -1,008,320  1,008,320        -100%
calvert.house.gov             20,530     1,014,416  993,886     993,886          4,841%
fdlp.gov                      989,415    1,554      -987,861    987,861          -100%
lcweb2.loc.gov                466,623    1,451,708  985,085     985,085          211%
cramer.house.gov              1,011,872  60,879     -950,993    950,993          -94%
ed.gov                        1,141,069  241,165    -899,904    899,904          -79%
vaccines.mil                  5,638      856,113    850,475     850,475          15,085%
clinicaltrials.gov            919,362    193,158    -726,204    726,204          -79%
army.mil                      4,831      725,934    721,103     721,103          14,927%
boehner.house.gov             7,472      695,625    688,153     688,153          9,210%
nces.ed.gov                   702,644    31,922     -670,722    670,722          -95%
prc.gov                       739,849    75,682     -664,167    664,167          -90%
navy.mil                      1,481      654,254    652,773     652,773          44,077%
house.gov                     818,095    172,066    -646,029    646,029          -79%
fueleconomy.gov               675,522    79,943     -595,579    595,579          -88%
fema.gov                      636,005    53,321     -582,684    582,684          -92%
frwebgate.access.gpo.gov      621,361    55,097     -566,264    566,264          -91%
siadapp.dmdc.osd.mil          43         559,076    559,033     559,033          1,300,077%
fdsys.gpo.gov                 548,618    28         -548,590    548,590          -100%
tiger.census.gov              549,046    750        -548,296    548,296          -100%
rs6.loc.gov                   550,489    6,695      -543,794    543,794          -99%
bennelson.senate.gov          16,203     553,698    537,495     537,495          3,317%
crapo.senate.gov              28,569     540,928    512,359     512,359          1,793%
eia.doe.gov                   508,675    1,629      -507,046    507,046          -100%
epa.gov                       623,457    117,794    -505,663    505,663          -81%
defenselink.mil               502,006    1,866      -500,140    500,140          -100%
access.gpo.gov                472,373    3,110      -469,263    469,263          -99%

I find this table interesting for a number of reasons.  First you see quite a bit more decline than I have seen in my other tables like this.  In fact 26 of the 40 subdomains/domains (65%) on this list decreased from EOT2008 to EOT2012.

In looking at the list I can also see the transition of some of the sites within GPO, for example access.gpo.gov going down 99% in captured content, fdsys.gpo.gov going down by nearly 100%, and bensguide.gpo.gov increasing by 1,883%.

Wrapping Up

I like to think that it helps to justify some of the work that the partners of the End of Term project are committing to the project when you see that there are large numbers of domains and subdomains that existed in 2008 but that weren’t crawled again in 2012 (and we can only assume they weren’t around in 2012).

There are a few more things I want to look at in this work so stay tuned.

If you have questions or comments about this post,  please let me know via Twitter.

Mark E. Phillips: Comparing Web Archives: EOT2008 and EOT2012 – Where

Thu, 2016-06-16 15:36

This post carries on in the analysis of the End of Term web archives for 2008 and 2012. Previous posts in this series discuss when content was harvested and what kind of content was harvested and included in the archives.

In this post we will look at where content came from, specifically the data held in the top level domains, domain names and sub-domain names.

Top Level Domains

The first thing to look at is the top level domains for all of the URLs in the CDX files.

In the EOT2008 archive there are a total of 241 unique TLDs.  In the EOT2012 archive there are a total of 251 unique TLDs.  This is a modest increase of 4.15% from EOT2008 to EOT2012.

The EOT2008 and EOT2012 archives share 225 TLDs between the two archives.  There are 16 TLDs that are unique to the EOT2008 archive and 26 TLDs that are unique to the EOT2012 archive.

TLDs unique to EOT2008

Unique to 2008 URLs from TLD null 18,772 www 583 yu 357 labs 20 webteam 16 cg 10 security 8 ssl 8 b 8 css 7 web 6 dev 4 education 4 misc 2 secure 2 campaigns 2

TLDs unique to EOT2012

Unique to 2012 URLs from TLD whois 17,500 io 7,935 pn 987 sy 541 lr 478 so 418 nr 363 tf 291 xxx 258 re 186 xn--p1ai 171 bi 153 dm 120 tel 78 ck 65 ax 64 sx 54 tg 50 ki 48 gg 25 kn 25 gp 24 pm 20 fk 18 cf 7 wf 3

I believe that the “null” TLD from EOT2008 is an artifact of the crawling process and possibly represents rows in the CDX file that correspond to metadata records in the warc/arcs from 2008.  I will have to do some digging to confirm.

Change in TLD

Next up we take a look at the 225 TLDs that are shared between the archives. First up are the fifteen most changed based on the increase or decrease in the number of URLs from that TLD

TLD eot2008 eot2012 Change Absolute Change % change com 7,809,711 45,594,482 37,784,771 37,784,771 483.8% gov 137,829,050 109,141,353 -28,687,697 28,687,697 -20.8% mil 3,555,425 16,223,861 12,668,436 12,668,436 356.3% net 653,187 9,269,406 8,616,219 8,616,219 1319.1% edu 3,552,509 2,442,626 -1,109,883 1,109,883 -31.2% int 135,939 685,168 549,229 549,229 404.0% uk 70,262 594,020 523,758 523,758 745.4% ly 95 503,457 503,362 503,362 529854.7% org 5,108,645 5,588,750 480,105 480,105 9.4% us 840,516 474,156 -366,360 366,360 -43.6% co 2,839 211,131 208,292 208,292 7336.8% be 4,019 203,178 199,159 199,159 4955.4% jp 23,896 220,602 196,706 196,706 823.2% me 35 182,963 182,928 182,928 522651.4% tv 10,373 191,736 181,363 181,363 1748.4%

Interesting is the change in the first two.  There was an increase of over 37 million URLs (484%) for the com TDL between EOT2008 and EOT2012.  There was also a decrease (-21%) or over 28 million URLs for the gov TLD.  The mil TLD also increased by 356% between the EOT2008 and EOT2012 harvests with an increase of over 12 million URLs.

You can see that .ly and .me increased by some serious percentage,  529,855% and 522,651% respectively.

Taking a look at just the percent of change, here are the five most changed based on that percentage

TLD eot2008 eot2012 Change Absolute Change % change ly 95 503,457 503,362 503,362 529854.7% me 35 182,963 182,928 182,928 522651.4% gl 129 49,733 49,604 49,604 38452.7% gd 9 3,273 3,264 3,264 36266.7% cat 43 11,703 11,660 11,660 27116.3%

I have a feeling that at the majority of the ly, me, gl, and gd TLD content came in as redirect URLs from link shortening services.

Domain Names

There are 87,889 unique domain names in the EOT2008 archive, this increases dramatically in the EOT2012 archive to 186,214 which is an increase of 118% in the number of domain names.

There are 30,066 domain names that are shared between the two archives.  There are 57,823 domain names that are unique to the EOT2008 archive and 156.148 domain names that are unique to the EOT2012 archive.

Here is a table showing thirty of the domains that were only present in the EOT2008 archive ordered by the number of URLs from that domain.

TLD Count geodata.gov 812,524 nifl.gov 504,910 stat-usa.gov 398,961 tradestatsexpress.gov 243,729 arnet.gov 174,057 acqnet.gov 171,493 dccourts.gov 161,289 meish.org 147,261 web-services.gov 137,202 metrokc.gov 132,210 sdi.gov 91,887 davie-fl.gov 88,123 belmont.gov 87,332 aftac.gov 84,507 careervoyages.gov 57,192 women-21.gov 56,255 egrpra.gov 54,775 4women.gov 45,684 4woman.gov 42,192 nypa.gov 36,099 secure-banking.com 33,059 nhmfl.gov 27,569 darpa.gov 21,454 usafreedomcorps.gov 18,001 peacecore.gov 17,744 californiadesert.gov 15,172 federaljudgesassoc.org 15,126 arpa.gov 15,093 transportationfortomorrow.org 14,926 okgeosurvey1.gov 14,595

Here is the same kind of table but this time for the EOT2012 dataset.

TLD Count militaryonesource.mil 859,843 yfrog.com 682,664 staticflickr.com 640,606 akamaihd.net 384,769 4sqi.net 350,707 foursquare.com 340,492 adf.ly 334,767 pinterest.com 244,293 consumerfinance.gov 237,361 nrd.gov 194,215 wh.gov 179,233 t.co 175,033 youtu.be 172,301 sndcdn.com 161,039 pnnl.gov 132,994 eia.gov 112,034 transparency.gov 109,039 nationalguard.mil 108,854 acus.gov 93,810 nrsc.org 85,925 mzstatic.com 84,202 404.gov 82,409 savingsbondwizard.gov 76,867 treasuryhunt.gov 76,394 mynextmove.org 75,927 fedshirevets.gov 75,529 onrr.gov 75,484 veterans.gov 75,350 broadbandmap.gov 72,889 ntm-a.com 71,126

Those are pretty long tables but I think they start to point at some interesting things from this analysis.  The domains that were present and harvested in 2008 and that weren’t harvested in 2012.  In looking at the list, some of them (metrokc.gov, davie-fl.gov, okgeosurvey1.gov) were most likely out of scope for “Federal Web” but got captured because of the gov TLD.

In the EOT2012 list you start to see artifacts from an increase in attention to social media site capture for the EOT2012 project.  Sites like yfrog.com, staticflickr.com, adf.ly, t.co, youtu.be, foursquare.com, pintrest.com probably came from that increased attention.

Here is a list of the twenty most changed domains from EOT2008 to EOT2012.  This number is based on the absolute change in the number of URLs captured for each of the archives.

Domain EOT2008 EOT2012 Change Abolute Change % Change house.gov 13,694,187 35,894,356 22,200,169 22,200,169 162% facebook.com 11,895 7,503,640 7,491,745 7,491,745 62,982% dvidshub.net 1,097 5,612,410 5,611,313 5,611,313 511,514% senate.gov 5,043,974 9,924,917 4,880,943 4,880,943 97% gpo.gov 8,705,511 3,888,645 -4,816,866 4,816,866 -55% nih.gov 5,276,262 1,267,764 -4,008,498 4,008,498 -76% nasa.gov 6,693,542 3,063,382 -3,630,160 3,630,160 -54% navy.mil 94,081 3,611,722 3,517,641 3,517,641 3,739% usgs.gov 4,896,493 1,690,295 -3,206,198 3,206,198 -65% loc.gov 5,059,848 7,587,179 2,527,331 2,527,331 50% flickr.com 157,155 2,286,890 2,129,735 2,129,735 1,355% youtube.com 346,272 2,369,108 2,022,836 2,022,836 584% hhs.gov 2,361,866 366,024 -1,995,842 1,995,842 -85% osd.mil 180,046 2,111,791 1,931,745 1,931,745 1,073% af.mil 230,920 2,067,812 1,836,892 1,836,892 795% ed.gov 2,334,548 510,413 -1,824,135 1,824,135 -78% granicus.com 782 1,785,724 1,784,942 1,784,942 228,253% lanl.gov 2,081,275 309,007 -1,772,268 1,772,268 -85% usda.gov 2,892,923 1,324,049 -1,568,874 1,568,874 -54% googleusercontent.com 2 1,560,457 1,560,455 1,560,455 78,022,750%

You see big increases in facebook.com (+62,982%), flickr.com (+1,355%), youtube.com (584%) and googleusercontent.com (78,022,750%) in content from EOT2008 to EOT2012.

Other increases that are notable include dvidshub.net which is the domain for a site called Defense Video & Imagery Distribution System that increased by 511,514%, navy.mil (3,739%), osd.mil (1,073%), af.mil (795%).  I like to think this speaks to a desired increase in attention to .mil content in the EOT2012 project.

Another domain that stands out to me is granicus.com which I was unaware of but after a little looking turns out to be one of the big cloud service providers for the federal government (or at least it was according to the EOT2012 dataset).

.gov and .mil subdomains

The last piece I wanted to look at related to domain names was to see what sort of changes there were in the gov and mil portions of the EOT2008 and EOT2012 crawls.  This time I wanted to look at the subdomains.

I filtered my dataset a bit so that I was only looking at the .mil and .gov content.

In the EOT2008 archive there were a total of 16,072 unique subdomains and in EOT2012 there were 22,477 subdomains.  This is an increase of 40% between the two archive projects.

The EOT2008 has 5,371 subdomains unique to its holdings and EOT2012 has 11,776 unique subdomains.

Subdomains that had the most content (based on URLs downloaded) and which are only present in EOT2008 are presented below.  (Limited to the top 30)

EOT2008 Subdomain Count gos2.geodata.gov 809,442 boucher.house.gov 772,759 kendrickmeek.house.gov 685,368 citizensbriefingbook.change.gov 446,632 stat-usa.gov 305,936 nifl.gov 285,833 scidac-new.ca.sandia.gov 247,451 tradestatsexpress.gov 243,729 hpcf.nersc.gov 221,626 gopher.info.usaid.gov 219,051 novel.nifl.gov 218,962 dli2.nsf.gov 206,932 contractorsupport.acf.hhs.gov 188,841 pnwin.nbii.gov 188,591 faq.acf.hhs.gov 184,212 ccdf.acf.hhs.gov 182,606 arnet.gov 174,018 regulations.acf.hhs.gov 171,762 acqnet.gov 171,493 dccourts.gov 161,289 employers.acf.hhs.gov 139,141 search.info.usaid.gov 137,816 web-services.gov 137,202 earth2.epa.gov 136,441 cjtf7.army.mil 134,507 ncweb-north.wr.usgs.gov 134,486 opre.acf.hhs.gov 133,689 childsupportenforcement.acf.hhs.gov 132,023 modis-250m.nascom.nasa.gov 128,810 casd.uscourts.gov 124,146

Here is the same sort of data for the EOT2012 dataset

EOT2012 Subdomain Count militaryonesource.mil 698,035 uscodebeta.house.gov 387,080 democrats.foreignaffairs.house.gov 312,270 gulflink.fhpr.osd.mil 262,246 coons.senate.gov 257,721 democrats.energycommerce.house.gov 243,341 consumerfinance.gov 225,815 dcmo.defense.gov 217,255 nrd.gov 187,267 wh.gov 179,103 usaxs.xray.aps.anl.gov 178,298 democrats.budget.house.gov 175,109 democrats.edworkforce.house.gov 162,077 apps.militaryonesource.mil 157,144 naturalresources.house.gov 155,918 purl.fdlp.gov 154,718 media.dma.mil 137,581 algreen.house.gov 131,388 democrats.transportation.house.gov 129,345 democrats.naturalresources.house.gov 124,808 hanabusa.house.gov 123,794 pitts.house.gov 122,402 visclosky.house.gov 122,223 garamendi.house.gov 114,221 vault.fbi.gov 113,873 green.house.gov 113,040 sewell.house.gov 112,973 levin.house.gov 111,971 eia.gov 111,889 hahn.house.gov 111,024

This last table is a little long,  but I found the data pretty interesting to look at.   The table below shows the biggest change for domains and subdomains that were shared between the EOT2008 and EOT2012 archives. I’ve included the top forty entries for that list.

Subdomain/Domain              EOT2008    EOT2012    Change      Absolute Change  % Change
listserv.access.gpo.gov       2,217,565  7,487      -2,210,078  2,210,078        -100%
carter.house.gov              1,898,462  29,680     -1,868,782  1,868,782        -98%
catalog.gpo.gov               1,868,504  34,040     -1,834,464  1,834,464        -98%
loc.gov                       63,534     1,875,264  1,811,730   1,811,730        2,852%
gpo.gov                       52,427     1,796,925  1,744,498   1,744,498        3,327%
bensguide.gpo.gov             90,280     1,790,017  1,699,737   1,699,737        1,883%
edocket.access.gpo.gov        1,644,578  7,822      -1,636,756  1,636,756        -100%
nws.noaa.gov                  103,367    1,676,264  1,572,897   1,572,897        1,522%
navair.navy.mil               220        1,556,320  1,556,100   1,556,100        707,318%
congress.gov                  1,525,467  356        -1,525,111  1,525,111        -100%
cha.house.gov                 1,366,520  109,192    -1,257,328  1,257,328        -92%
usbg.gov                      1,026,360  6,724      -1,019,636  1,019,636        -99%
dol.gov                       1,052,335  41,909     -1,010,426  1,010,426        -96%
resourcescommittee.house.gov  1,008,655  335        -1,008,320  1,008,320        -100%
calvert.house.gov             20,530     1,014,416  993,886     993,886          4,841%
fdlp.gov                      989,415    1,554      -987,861    987,861          -100%
lcweb2.loc.gov                466,623    1,451,708  985,085     985,085          211%
cramer.house.gov              1,011,872  60,879     -950,993    950,993          -94%
ed.gov                        1,141,069  241,165    -899,904    899,904          -79%
vaccines.mil                  5,638      856,113    850,475     850,475          15,085%
clinicaltrials.gov            919,362    193,158    -726,204    726,204          -79%
army.mil                      4,831      725,934    721,103     721,103          14,927%
boehner.house.gov             7,472      695,625    688,153     688,153          9,210%
nces.ed.gov                   702,644    31,922     -670,722    670,722          -95%
prc.gov                       739,849    75,682     -664,167    664,167          -90%
navy.mil                      1,481      654,254    652,773     652,773          44,077%
house.gov                     818,095    172,066    -646,029    646,029          -79%
fueleconomy.gov               675,522    79,943     -595,579    595,579          -88%
fema.gov                      636,005    53,321     -582,684    582,684          -92%
frwebgate.access.gpo.gov      621,361    55,097     -566,264    566,264          -91%
siadapp.dmdc.osd.mil          43         559,076    559,033     559,033          1,300,077%
fdsys.gpo.gov                 548,618    28         -548,590    548,590          -100%
tiger.census.gov              549,046    750        -548,296    548,296          -100%
rs6.loc.gov                   550,489    6,695      -543,794    543,794          -99%
bennelson.senate.gov          16,203     553,698    537,495     537,495          3,317%
crapo.senate.gov              28,569     540,928    512,359     512,359          1,793%
eia.doe.gov                   508,675    1,629      -507,046    507,046          -100%
epa.gov                       623,457    117,794    -505,663    505,663          -81%
defenselink.mil               502,006    1,866      -500,140    500,140          -100%
access.gpo.gov                472,373    3,110      -469,263    469,263          -99%
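
The three derived columns follow directly from the two counts. Here is a minimal Python sketch, assuming Change is the EOT2012 count minus the EOT2008 count and % Change is that difference relative to the EOT2008 count, rounded to a whole percent. The function and variable names are illustrative and are not taken from the original analysis.

```python
# Counts copied from four rows of the table: (name, EOT2008 URIs, EOT2012 URIs).
rows = [
    ("loc.gov", 63_534, 1_875_264),
    ("listserv.access.gpo.gov", 2_217_565, 7_487),
    ("navair.navy.mil", 220, 1_556_320),
    ("access.gpo.gov", 472_373, 3_110),
]


def change_columns(eot2008, eot2012):
    """Return (change, absolute change, percent change as a whole number).

    Assumes eot2008 > 0, which holds for names present in both archives.
    """
    change = eot2012 - eot2008
    pct = round(change / eot2008 * 100)
    return change, abs(change), pct


# Build the full rows and order them by absolute change, largest first.
table = [(name, a, b, *change_columns(a, b)) for name, a, b in rows]
table.sort(key=lambda r: r[4], reverse=True)

for name, a, b, change, abs_change, pct in table:
    print(f"{name:<26}{a:>11,}{b:>11,}{change:>12,}{abs_change:>11,}{pct:>10,}%")
```

Sorting on the absolute change is what lets large drops such as listserv.access.gpo.gov sit next to large gains such as loc.gov in the same ranking.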

I find this table interesting for a number of reasons.  First, you see quite a bit more decline than I have seen in my other tables like this.  In fact, 26 of the 40 subdomains/domains (65%) on this list decreased from EOT2008 to EOT2012.
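
The share of declining entries can be re-checked from the Change column. Here is a short Python sketch, with the 40 signed values copied from the table above in the same order. The variable names are illustrative.

```python
# Signed "Change" values (EOT2012 minus EOT2008) for the 40 table entries.
changes = [
    -2_210_078, -1_868_782, -1_834_464, 1_811_730, 1_744_498,
    1_699_737, -1_636_756, 1_572_897, 1_556_100, -1_525_111,
    -1_257_328, -1_019_636, -1_010_426, -1_008_320, 993_886,
    -987_861, 985_085, -950_993, -899_904, 850_475,
    -726_204, 721_103, 688_153, -670_722, -664_167,
    652_773, -646_029, -595_579, -582_684, -566_264,
    559_033, -548_590, -548_296, -543_794, 537_495,
    512_359, -507_046, -505_663, -500_140, -469_263,
]

# A negative change means fewer URIs were captured in EOT2012 than in EOT2008.
declined = sum(1 for c in changes if c < 0)
share = declined * 100 / len(changes)

print(f"{declined} of {len(changes)} entries decreased ({share:.0f}%)")
```

Run as written, this counts 26 decreases out of 40 entries, which is 65%.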

Looking at the list, I can also see the transition of some of the sites within GPO: for example, access.gpo.gov went down 99% in captured content and fdsys.gpo.gov went down by 100%, while bensguide.gpo.gov increased by 1,883%.

Wrapping Up

I like to think that it helps to justify some of the work that the partners of the End of Term project are committing to the project when you see that there are large numbers of domains and subdomains that existed in 2008 but that weren’t crawled again in 2012 (and we can only assume they weren’t around in 2012).

There are a few more things I want to look at in this work so stay tuned.

If you have questions or comments about this post, please let me know via Twitter.

David Rosenthal: Bruce Schneier on the IoT

Thu, 2016-06-16 15:00
John Leyden at The Register reports that Government regulation will clip coders' wings, says Bruce Schneier. He spoke at Infosec 2016:
Government regulation of the Internet of Things will become inevitable as connected kit in arenas as varied as healthcare and power distribution becomes more commonplace, ... “Governments are going to get involved regardless because the risks are too great. When people start dying and property starts getting destroyed, governments are going to have to do something,” ... The trouble is we don’t yet have a good regulatory structure that might be applied to the IoT. Policy makers don’t understand technology and technologists don’t understand policy. ... “Integrity and availability are worse than confidentiality threats, especially for connected cars. Ransomware in the CPUs of cars is gonna happen in two to three years,” ... technologists and developers ought to design IoT components so they worked even when they were offline and failed in a safe mode."
Not to mention the problem that the DMCA places researchers who find vulnerabilities in the IoT at risk of legal sanctions, despite the recent rule change. So much for the beneficial effects of government regulation.

This post will take over from Gadarene swine as a place to collect the horrors of the IoT. Below the fold a list of some of the IoT lowlights in the 17 weeks since then.

Schneier pointed to cars as vulnerable, and indeed both the Nissan Leaf:
when Nissan put together the companion app for its Leaf electric vehicle—the app will turn the climate control on or off—it decided not to bother requiring any kind of authentication. When a Leaf owner connects to their car via a smartphone, the only information that Nissan's APIs use to target the car is its VIN—the requests are all anonymous.
and the Mitsubishi Outlander:
the Outlander uses wifi to connect the car directly with a smartphone, which is less secure and allowed Monroe to disable the alarm and then open the car. Describing the hack methodology and solutions, Munro speculates that the car’s insecure software system was probably a result of cost-cutting by Mitsubishi. “I assume that it’s been designed like this to be much cheaper for Mitsubishi than [the more secure] GSM/web service/mobile app based solution,”
failed to include any security at all in their connected car systems. In both cases the researchers had to go public before the company admitted that they had a problem. This is not a good strategy:
Only one in four respondents to the survey could remember an incidence of car hacking occurring in the last year. That’s a dramatic drop from just a few months earlier, when a survey by the same firm performed just days after WIRED’s car hacking exposé in July found that 72 percent of ... consumers—were aware of the Jeep hack when asked about it specifically.
"Only" a quarter of car buyers remembered that Jeeps were hackable a year later. It'd take a lot of advertising dollars to be that effective. Among the authors commenting on the risks of connected cars were Jean-Louis Gassée, Jonathan Gitlin and Josh Corman at the Building IoT conference:
Corman zeroed in on our increasingly connected cars and medical devices as key targets. The consequences of mass compromising of connected vehicles, for example, would be confidence in vehicle manufacturers, transport infrastructure and knock-on effects at the GDP level.
Speaking of medical devices, Cory Doctorow at BoingBoing reported on a paper in World Neurosurgery that discusses the dystopian security issues posed by brain implants. He also reported that Automated drug cabinets have 1400+ critical vulns that will never be patched.

Connected homes were equally problematic. Thermostats:
More than 30 users of Hive, which is owned by British Gas, have complained their heating has been turned up to the maximum level by the iPhone app without their instruction, the Daily Mail reports.
lightbulbs:
Matthew Garrett "bought some awful light bulbs so you don't have to." And you really, really shouldn't buy the iRainbow light bulb set: the controller box runs all sorts of insecure services, including an open WiFi hotspot that lets anyone into your home network.
thermostats:
Nest in fact pushed out a buggy software update for its Learning Thermostat in January 2016 that led to some of the devices not maintaining temperature.
home automation hubs:
The extraordinary decision of Nest to brick its $300 Revolv home automation hub has served as a wake-up call to the tech industry. Both customers and the broader internet of things (IoT) industry were appalled when Nest removed all support for the device, making it as useful as a tub of hummus, as one angry consumer memorably noted. The result has been a series of articles, blog posts and public discussions over how to ensure that the next generation of internet and smart-home products continues to work in an open environment and are not locked down to specific companies.
entire home automation systems such as Samsung's SmartThings ecosystem - two separate vulnerabilities discovered by researchers at U. Mich provide the bad guys capabilities such as:
unlock doors, modify home access codes, create false smoke detector alarms, or put security and automation devices into vacation mode.
security cameras:
The IP cameras that you bought to secure your physical space suddenly turn into a vast cloud network designed to share your pictures and videos far and wide. The best part? It’s all plug-and-play, no configuration necessary!
and of course the home routers without which they wouldn't function:
the US Federal Trade Commission settled charges that alleged the hardware manufacturer failed to protect consumers as required by federal law. The settlement resolves a complaint that said the 2014 mass compromise was the result of vulnerabilities that allowed attackers to remotely log in to routers and, depending on user configurations, change security settings or access files stored on connected devices.
all featured in the roll of dishonor. Were their manufacturers grateful for the help security researchers gave them in making their products less insecure? In some cases yes, in others they responded by hurling legal threats at the researchers.

District Dispatch: Caribbean librarians visit the Washington Office

Thu, 2016-06-16 13:45

Visiting librarians from the Caribbean with ALA Washington Office staff.

On Tuesday, the American Library Association (ALA) was pleased to receive a delegation of librarians and archivists from the Caribbean. These visitors are invited to the United States under the auspices of the International Visitor Leadership Program of the U.S. Department of State. The delegation included:

  • Ryllis Mannix, Antigua and Barbuda
  • Joseph Prosper, Antigua and Barbuda
  • Junior Browne, Barbados
  • Grace Haynes, Barbados
  • Vernanda Raymond, Dominica
  • Claudette Paula Bartholomew Frederick, Grenada
  • Evauntay Bridgewater, Saint Kitts and Nevis
  • Petrine Clarke Whyte, Saint Kitts and Nevis
  • Donna Mason Mclean, Saint Vincent and the Grenadines

Accompanying the delegation were international visitor liaisons Mr. Jason Brown and Ms. Elka Charren.

The central interest of the visitors concerned intellectual property and we did indeed have an energetic discussion of those issues. We touched on the Google Books and Georgia State cases as well as the details concerning the digitization of local content and the intellectual property issues. Not surprisingly, a number of the policy issues are actually not so different—we heard very familiar challenges and themes.

The delegation will be spending several weeks in the United States, including a visit to the ALA Annual Conference in Orlando, so perhaps you’ll see them there!

The ALA representatives in this meeting were Alan S. Inouye, Carrie Russell, and Brian Clark. We thoroughly enjoyed the time together and look forward to future meetings with representatives from around the world as we fulfill one of the responsibilities of the Washington Office—to represent ALA and U.S. libraries with international delegations.

The post Caribbean librarians visit the Washington Office appeared first on District Dispatch.

Islandora: Islandoracon 2017!

Thu, 2016-06-16 13:18

The Islandora Foundation is thrilled to announce the second Islandoracon, to be held at the lovely LIUNA Station in Hamilton, Ontario. Islandoracon2017 is sponsored in part by our local host, McMaster University. We will have a lot more information for you in the weeks and months to come, but for now, please save the date so you can join us.

 

FOSS4Lib Recent Releases: Avalon Media System - 5.0

Thu, 2016-06-16 12:07

Last updated June 16, 2016. Created by Peter Murray on June 16, 2016.

Package: Avalon Media System
Release Date: Monday, June 13, 2016

FOSS4Lib Recent Releases: Islandora - 7.x-1.7

Thu, 2016-06-16 08:01

Last updated June 16, 2016. Created by Peter Murray on June 16, 2016.

Package: Islandora
Release Date: Wednesday, June 15, 2016
