Feed aggregator

FOSS4Lib Upcoming Events: Islandora Camp FL 2016

planet code4lib - Fri, 2015-08-21 13:34
Date: Wednesday, March 9, 2016 - 08:00 to Friday, March 11, 2016 - 17:00
Supports: Islandora, Fedora Repository

Last updated August 21, 2015. Created by Peter Murray on August 21, 2015.

From the announcement:

Islandora is going to Florida! March 9 - 11, 2016, join us on the campus of Florida Gulf Coast University in Fort Myers, Florida.

FOSS4Lib Upcoming Events: Islandora Camp CT 2015

planet code4lib - Fri, 2015-08-21 13:31
Date: Tuesday, October 20, 2015 - 08:00 to Friday, October 23, 2015 - 17:00
Supports: Islandora, Fedora Repository

Last updated August 21, 2015. Created by Peter Murray on August 21, 2015.

From the announcement:

Islandora is heading to Connecticut! October 20 - 23, join us on the beautiful campus of the University of Connecticut Graduate Business Learning Center, in downtown Hartford.

DuraSpace News: INVITATION: DSpace User Group Meeting in Tübingen, Germany

planet code4lib - Fri, 2015-08-21 00:00

From Pascal-Nicolas Becker, University Library Tübingen

Winchester, MA  The German DSpace User Group was reconstituted in October of 2014. The DSpace German-speaking community now has the opportunity to talk about DSpace topics face-to-face.

SearchHub: Solr as an Apache Spark SQL DataSource

planet code4lib - Thu, 2015-08-20 20:55
Join us for our upcoming webinar, Solr & Spark for Real-Time Big Data Analytics. You’ll learn more about how to use Solr as an Apache Spark SQL DataSource and how to combine data from Solr with data from other enterprise systems to perform advanced analysis tasks at scale. Full details and registration…

Part 1 of 2: Read Solr results as a DataFrame

This post is the first in a two-part series where I introduce an open source toolkit created by Lucidworks that exposes Solr as a Spark SQL DataSource. The DataSource API provides a clean abstraction layer for Spark developers to read and write structured data from/to an external data source. In this first post, I cover how to read data from Solr into Spark. In the next post, I’ll cover how to write structured data from Spark into Solr.

To begin, you’ll need to clone the project from github and build it using Maven:

    git clone
    cd spark-solr
    mvn clean package -DskipTests

After building, run the twitter-to-solr example to populate Solr with some tweets. You’ll need your own Twitter API keys, which can be created by following the steps documented here.

Start Solr running in Cloud mode and create a collection named “socialdata” partitioned into two shards:

    bin/solr -c && bin/solr create -c socialdata -shards 2

The remaining sections in this document assume Solr is running in cloud mode on port 8983 with embedded ZooKeeper listening on localhost:9983.

Also, to ensure you can see tweets as they are indexed in near real-time, you should enable auto soft-commits using Solr’s Config API. Specifically, for this exercise, we’ll commit tweets every 2 seconds.

    curl -XPOST http://localhost:8983/solr/socialdata/config \
      -d '{"set-property":{"updateHandler.autoSoftCommit.maxTime":"2000"}}'

Now, let’s populate Solr with tweets using Spark streaming:

    $SPARK_HOME/bin/spark-submit --master $SPARK_MASTER \
      --conf "spark.executor.extraJavaOptions=-Dtwitter4j.oauth.consumerKey=? -Dtwitter4j.oauth.consumerSecret=? -Dtwitter4j.oauth.accessToken=? -Dtwitter4j.oauth.accessTokenSecret=?" \
      --class com.lucidworks.spark.SparkApp \
      ./target/spark-solr-1.0-SNAPSHOT-shaded.jar \
      twitter-to-solr -zkHost localhost:9983 -collection socialdata

Replace $SPARK_MASTER with the URL of your Spark master server. If you don’t have access to a Spark cluster, you can run the Spark job in local mode by passing:

    --master local[2]

However, when running in local mode, there is no executor, so you’ll need to pass the Twitter credentials in the spark.driver.extraJavaOptions parameter instead of spark.executor.extraJavaOptions.

Tweets will start flowing into Solr; be sure to let the streaming job run for a few minutes to build up a few thousand tweets in your socialdata collection. You can kill the job using ctrl-C.

Next, let’s start up the Spark Scala REPL shell to do some interactive data exploration with our indexed tweets:

    cd $SPARK_HOME
    ADD_JARS=$PROJECT_HOME/target/spark-solr-1.0-SNAPSHOT-shaded.jar bin/spark-shell

$PROJECT_HOME is the location where you cloned the spark-solr project. Next, let’s load the socialdata collection into Spark by executing the following Scala code in the shell:

    val tweets = sqlContext.load("solr",
      Map("zkHost" -> "localhost:9983", "collection" -> "socialdata")
      ).filter("provider_s='twitter'")

On line 1, we use the sqlContext object loaded into the shell automatically by Spark to load a DataSource named “solr”. Behind the scenes, Spark locates the solr.DefaultSource class in the project JAR file we added to the shell using the ADD_JARS environment variable.

On line 2, we pass configuration parameters needed by the Solr DataSource to connect to Solr using a Scala Map. At a minimum, we need to pass the ZooKeeper connection string (zkHost) and collection name. By default, the DataSource matches all documents in the collection, but you can pass a Solr query to the DataSource using the optional “query” parameter. This allows you to restrict the documents seen by the DataSource using a Solr query.

On line 3, we use a filter to only select documents that come from Twitter (provider_s='twitter').

At this point, we have a Spark SQL DataFrame object that can read tweets from Solr. In Spark, a DataFrame is a distributed collection of data organized into named columns. Conceptually, DataFrames are similar to tables in a relational database except they are partitioned across multiple nodes in a Spark cluster. The following diagram depicts how a DataFrame is constructed by querying our two-shard socialdata collection in Solr using the DataSource API:

It’s important to understand that Spark does not actually load the socialdata collection into memory at this point. We’re only setting up to perform some analysis on that data; the actual data isn’t loaded into Spark until it is needed to perform some calculation later in the job. This allows Spark to perform the necessary column and partition pruning operations to optimize data access into Solr.

Every DataFrame has a schema. You can use the printSchema() function to get information about the fields available for the tweets DataFrame:

    tweets.printSchema()

Behind the scenes, our DataSource implementation uses Solr’s Schema API to determine the fields and field types for the collection automatically.

    scala> tweets.printSchema()
    root
     |-- _indexed_at_tdt: timestamp (nullable = true)
     |-- _version_: long (nullable = true)
     |-- accessLevel_i: integer (nullable = true)
     |-- author_s: string (nullable = true)
     |-- createdAt_tdt: timestamp (nullable = true)
     |-- currentUserRetweetId_l: long (nullable = true)
     |-- favorited_b: boolean (nullable = true)
     |-- id: string (nullable = true)
     |-- id_l: long (nullable = true)
     ...

Next, let’s register the tweets DataFrame as a temp table so that we can use it in SQL queries:

    tweets.registerTempTable("tweets")

For example, we can count the number of retweets by doing:

    sqlContext.sql("SELECT COUNT(type_s) FROM tweets WHERE type_s='echo'").show()

If you check your Solr log, you’ll see the following query was generated by the Solr DataSource to process the SQL statement (note I added the newlines between parameters to make it easier to read the query):

    q=*:*&
    fq=provider_s:twitter&
    fq=type_s:echo&
    distrib=false&
    fl=type_s,provider_s&
    cursorMark=*&
    start=0&
    sort=id+asc&
    collection=socialdata&
    rows=1000

There are a couple of interesting aspects of this query. First, notice that the provider_s field filter we used when we declared the DataFrame translated into a Solr filter query parameter (fq=provider_s:twitter). Solr will cache an efficient data structure for this filter that can be reused across queries, which improves performance when reading data from Solr to Spark. In addition, the SQL statement included a WHERE clause that also translated into an additional filter query (fq=type_s:echo). Our DataSource implementation handles the translation of SQL clauses to Solr specific query constructs. On the backend, Spark handles the distribution and optimization of the logical plan to execute a job that accesses data sources.

Even though there are many fields available for each tweet in our collection, Spark ensures that only the fields needed to satisfy the query are retrieved from the data source, which in this case is only type_s and provider_s. In general, it’s a good idea to only request the specific fields you need access to when reading data in Spark.

The query also uses deep-paging cursors to efficiently read documents deep into the result set. If you’re curious how deep paging cursors work in Solr, please read:

Also, matching documents are streamed back from Solr, which improves performance because the client side (Spark task) does not have to wait for a full page of documents (1000) to be constructed on the Solr side before receiving data. In other words, documents are streamed back from Solr as soon as the first hit is identified.

The last interesting aspect of this query is the distrib=false parameter. Behind the scenes, the Solr DataSource will read data from all shards in a collection in parallel from different Spark tasks. In other words, if you have a collection with ten shards, then the Solr DataSource implementation will use 10 Spark tasks to read from each shard in parallel. The distrib=false parameter ensures that each shard will only execute the query locally instead of distributing it to other shards.

However, reading from all shards in parallel does not work for Top N type use cases where you need to read documents from Solr in ranked order across all shards. You can disable the parallelization feature by setting the parallel_shards parameter to false. When set to false, the Solr DataSource will execute a standard distributed query. Consequently, you should use caution when disabling this feature, especially when reading very large result sets from Solr.

Not only SQL

Beyond SQL, the Spark API exposes a number of functional operations you can perform on a DataFrame. For example, if we wanted to determine the top authors based on the number of posts, we could use the following SQL:

    sqlContext.sql("select author_s, COUNT(author_s) num_posts from tweets where type_s='post' group by author_s order by num_posts desc limit 10").show()

Or, you can use the DataFrame API to achieve the same:

    // desc() is defined in org.apache.spark.sql.functions
    import org.apache.spark.sql.functions.desc
    tweets.filter("type_s='post'").groupBy("author_s").count().
      orderBy(desc("count")).limit(10).show()

Another subtle aspect of working with DataFrames is that you as a developer need to decide when to cache the DataFrame based on how expensive it was to create it. For instance, if you load tens of millions of rows from Solr and then perform some costly transformation that trims your DataFrame down to 10,000 rows, then it would be wise to cache the smaller DataFrame so that you won’t have to re-read millions of rows from Solr again. On the other hand, caching the original millions of rows pulled from Solr is probably not very useful, as that will consume too much memory. The general advice I follow is to cache DataFrames when you need to reuse them for additional computation and they require some computation to generate.

Wrap-up

Of course, you don’t need the power of Spark to perform a simple count operation as I did in my example. However, the key takeaway is that the Spark SQL DataSource API makes it very easy to expose the results of any Solr query as a DataFrame. Among other things, this allows you to combine data from Solr with data from other enterprise systems, such as Hive or Postgres, to perform advanced data analysis tasks at scale.

Another advantage of the DataSource API is that it allows developers to interact with a data source using any language supported by Spark. For instance, there is no native R interface to Solr, but using Spark SQL, a data scientist can pull data from Solr into an R job seamlessly.

In the next post, I’ll cover how to write a DataFrame to Solr using the DataSource API.

Join Tim for our upcoming webinar, Solr & Spark for Real-Time Big Data Analytics. You’ll learn more about how to use Solr as a Spark SQL DataSource and how to combine data from Solr with data from other enterprise systems to perform advanced analysis tasks at scale. Full details and registration…
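To make the filter push-down and column pruning described in this post a little more concrete, here is a minimal sketch in plain Java of how a per-shard request string like the one in the Solr log could be assembled. This is not the toolkit's code: the class name ShardQuerySketch and the method buildParams are hypothetical, and the sketch skips URL encoding. It only mirrors the parameters visible in the logged query.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: NOT the spark-solr toolkit's implementation. It only
// reproduces the parameters visible in the Solr log excerpt quoted in the post.
class ShardQuerySketch {

    /**
     * Builds the parameter string for one per-shard request.
     * filters: pushed-down filters already in Solr syntax, e.g. "type_s:echo"
     * fields:  the only fields the SQL statement needs (column pruning)
     */
    static String buildParams(String collection, List<String> filters,
                              List<String> fields, int rows) {
        List<String> params = new ArrayList<>();
        params.add("q=*:*");                        // match everything; filters narrow it
        for (String f : filters) {
            params.add("fq=" + f);                  // one filter query per pushed-down clause
        }
        params.add("distrib=false");                // run against the local shard only
        params.add("fl=" + String.join(",", fields));
        params.add("cursorMark=*");                 // "*" marks the start of a cursor
        params.add("start=0");
        params.add("sort=id+asc");                  // cursors need a sort on the unique key
        params.add("collection=" + collection);
        params.add("rows=" + rows);                 // page size per cursor request
        return String.join("&", params);
    }
}
```

Calling buildParams("socialdata", ...) with the two filters and two fields from the example yields the logged query on a single line. A real client would also URL-encode the values and send back the next cursor mark returned by Solr on each subsequent page.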

The post Solr as an Apache Spark SQL DataSource appeared first on Lucidworks.

District Dispatch: Start-ups, start your engines

planet code4lib - Thu, 2015-08-20 17:25


Yesterday marked the 12th annual Start-Up Day Across America – a “holiday” dedicated to raising awareness of the importance of local entrepreneurship. Organized under the auspices of the Congressional Caucus on Innovation and Entrepreneurship, National Start-Up Day represents an opportunity for owners of local businesses to meet directly with their elected federal representatives and share their ideas and concerns about the direction of the innovation economy.

The American Library Association believes strongly in the foundational purpose of National Start-Up Day: The advancement of our economy through the encouragement of the entrepreneurial spirit. In fact, you might say that every day is National Start-Up day for America’s libraries. Libraries of all types provide a host of services and resources that can help entrepreneurs and aspiring entrepreneurs at every stage of their efforts to bring an innovative idea to fruition.

According to the ALA/University of Maryland Digital Inclusion Survey, most public libraries (over 99%) report providing economic/workforce services. Of those, about 48% report providing entrepreneurship and small business development services. These services range from providing programming and informational resources on financial analysis, customer relations, supply chain management and marketing, to working directly with actors in the financial sector to help patrons gain access to seed capital and consulting services.

For example, New York Public Library’s Science, Industry and Business Library offers one-on-one small business counseling through SCORE (a small business non-profit associated with the Small Business Administration), and the Brooklyn and Houston Public Libraries are partners in business plan competitions that offer seed capital to local entrepreneurs. These are just a few examples of how libraries are doing their part to create synergies that grow our economy by fostering innovation.

Furthermore, libraries are not just places to launch a business – they’re also places to grow a business. Last Spring, Larra Clark of the ALA Office for Information Technology Policy organized and participated in a program highlighting the growth of co-working areas in libraries at the annual South by Southwest (SXSW) festival. To accommodate the increasing number of self-employed, temp and freelance workers in our communities, numerous libraries offer dedicated work spaces. Startups use the spaces to build and launch new businesses and enterprises using robust digital technologies and resources. Jonathan Marino – one of Larra’s collaborators at SXSW – runs MapStory, an interactive and collaborative platform for mapping change over time, out of the co-working space at D.C. Public Library. Jonathan is just one of numerous entrepreneurs who rely on library resources to operate their respective ventures on a daily basis.

Even if you’re not ready to monetize or market your product, you can come to the library to bring your product into the physical world for the first time. As makerspaces sprout up in libraries across the country, people of all ages with nothing more than a budding idea and a library card are becoming engineers; they’re using their library’s 3D printer, laser cutter and/or CNC router to build a prototype of an item they hope may one day galvanize consumers.

The point of all of this is not just that libraries do lots of stuff to help entrepreneurs (although we do, and we’re proud of that). It’s also that the library community doesn’t have a single, narrow vision for advancing the innovation economy – we help individuals in all areas, of all ages and backgrounds advance their own visions. In short, librarians encourage the diverse communities we serve to ignite the innovation economy in a diversity of ways.

The ALA hopes that National Start-Up Day Across America enriches the discourse on small business policy and encourages entrepreneurs in every part of our country to continue to drive our economy forward.

The post Start-ups, start your engines appeared first on District Dispatch.

Jonathan Rochkind: Blacklight Community Survey

planet code4lib - Thu, 2015-08-20 17:18

I’ve created a Google Docs survey targeted at organizations that have Blacklight installations (or vendor-hosted BL installations on their behalf? Is that a thing?).

Including Blacklight-based stacks like Hydra.

The goal of the survey is to learn more about how Blacklight is being used in “the wild”, particularly (but not only) the software stacks people are using BL with.

If you host (or have hosted, or plan to host) a Blacklight-based application, it would be great if you filled out the survey!


District Dispatch: Building bridges at the “Department of Opportunity”

planet code4lib - Thu, 2015-08-20 15:51

This week, the Department of Housing and Urban Development (HUD) (aka the Department of Opportunity) gathered national partners and local government and housing leaders from 28 communities to begin the real work of ConnectHome. Launched in July, the demonstration project aims to connect more than 275,000 households in 28 communities to low-cost internet, devices and technology training. It was my pleasure to participate on behalf of ALA and libraries (along with Metropolitan New York Library Council Director Nate Hill) and discuss the commitment and power of libraries and librarians in closing the digital divide and boosting creation/making as well as access/consumption.

Zach Leverenz, CEO of EveryoneOn, and Robert Greenblum, senior policy advisor to HUD Secretary Julián Castro, opened the convening and served as masters of ceremony throughout the day. “If you’re not digitally literate in the 21st century, you’re illiterate,” Greenblum said, recalling an early conversation with leaders at the 80/20 Foundation in Austin. Greenblum and other HUD officials highlighted the power of 28 communities working at the same time to build broadband connections and technology skills that boost educational and economic opportunity. At the national level, HUD is setting an agency goal around broadband adoption, including developing metrics for measuring progress on closing the digital divide.


“Empowerment” was the one-word description of the impact of digital inclusion work underway in Austin, according to Sylvia Blanco, executive vice president for the Housing Authority of the City of Austin (HACA). In a city with 92% broadband adoption, only about 30% of public housing residents had some sort of computing device, and only 28% of these residents were connected to the internet. Blanco and local Austin partners in the “Unlocking the Connection” program will serve as “peer mentors” for the ConnectHome initiative. (Of note, I also learned that the Institute of Museum and Library Services’ Building Digital Communities framework served as the template for Austin’s digital inclusion strategic plan.)

Besides meeting new and ongoing collaborators, a favorite thing about gatherings like this is that when I make a presentation about the opportunities afforded by libraries, I am immediately approached by audience members who want to tell me about the great staff in their local libraries. In this case, I heard about the leadership of Denver Public Library Director Michelle Jeske and Rockford (IL) Public Library Director Lynn Stainbrook. While ConnectHome offers a new avenue for serving community residents, librarians have already made a mark through early learning opportunities, afterschool programs and technology training. This suite of programs and services is a hallmark of libraries.

At the risk of overusing the metaphor (I know, too late!), this week also is one of connecting the dots. Affordability is a key barrier for accessing the internet, and the Federal Communications Commission is currently considering how to address this through the Lifeline program. The ALA will file comments in this public comment period, joining with others in the civil rights and digital inclusion community to advocate for updating this universal service program to include broadband. Less than half of people with incomes under $25,000 have home broadband access, hobbling equitable access to information in the digital society and undermining economic and innovation goals.

But, as the ConnectHome effort clearly recognizes, broadband access must be married with robust adoption efforts. The Lifeline program is one part of a larger effort needed to scaffold digital opportunity, including relevant content, context and digital literacy training. We hope this message will be reflected in the Broadband Opportunity Council report that goes to President Obama this month.

And all of this activity is part of the concerted effort to increase awareness of libraries as part of the solution for advancing national policy priorities among decision makers and influencers. Digital inclusion and innovation are threaded throughout the National Policy Agenda for Libraries and The E’s of Libraries®.

The ConnectHome attention now turns to local convenings, which will take place in the 28 communities around the country through the end of October. We invite local librarians and other partners in this program to share their impressions and plans from these meetings so we can continue to learn from each other and move the country forward.

The post Building bridges at the “Department of Opportunity” appeared first on District Dispatch.

SearchHub: Introducing Anda: a New Crawler Framework in Lucidworks Fusion

planet code4lib - Thu, 2015-08-20 15:01

Introduction

Lucidworks Fusion 2.0 ships with roughly 30 out-of-the-box connector plugins to facilitate data ingestion from a variety of common datasources. Ten of these connectors are powered by a new general-purpose crawler framework called Anda, created at Lucidworks to help simplify and streamline crawler development. Connectors to each of the following Fusion datasources are powered by Anda under the hood:

  • Web
  • Local file
  • Box
  • Dropbox
  • Google Drive
  • SharePoint
  • JavaScript
  • JIRA
  • Drupal
  • Github

Inspiration for Anda came from the realization that most crawling tasks have quite a bit in common across crawler implementations. Much of the work entailed in writing a crawler stems from generic requirements unrelated to the exact nature of the datasource being crawled, which indicated the need for some reusable abstractions. The crawler functionalities listed below are implemented entirely within the Anda framework code. Their behavior is quite configurable via properties in Fusion datasources, but the developer of a new Anda crawler needn’t write any code to leverage these features:

  • Starting, stopping, and aborting crawls
  • Configuration management
  • Crawl-database maintenance and persistence
  • Link-legality checks and link-rewriting
  • Multithreaded-ness and thread-pool management
  • Throttling
  • Recrawl policies
  • Deletion
  • Alias handling
  • De-duplication of similar content
  • Content splitting (e.g. CSV and archives)
  • Emitting content

Instead, Anda reduces the task of developing a new crawler to providing the Anda framework with access to your data. Developers provide this access by implementing one of two Java interfaces that form the core of the Anda Java API: Fetcher and FS (short for filesystem). These interfaces provide the framework code with the necessary methods to fetch documents from a datasource and discern their links, enabling traversal to additional content in the datasource. Fetcher and FS are designed to be as simple to implement as possible, with almost all of the actual traversal logic relegated to framework code.

Developing a Crawler

With so many generic crawling tasks, it’s just inefficient to write an entirely new crawler from scratch for each additional datasource. So in Anda, the framework itself is essentially the one crawler, and we plug in access to the data that we want it to crawl. The Fetcher interface is the more generic of two ways to provide that access.

Writing a Fetcher

    public interface Fetcher extends Component<Fetcher> {
        public Content fetch(String id, long lastModified, String signature) throws Exception;
    }

Fetcher is a purposefully simple Java interface that defines a method fetch() to fetch one document from a datasource. There’s a WebFetcher implementation of Fetcher in Fusion that knows how to fetch web pages (where the id argument to fetch() will be a web page URL), a GithubFetcher for Github content, etc. The fetch() method returns a Content object containing the content of the “item” referenced by id, as well as any links to additional “items”, whatever they may be. The framework itself is truly agnostic to the exact type of “items”/datasource in play—dealing with any datasource-specific details is the job of the Fetcher.

A Fusion datasource definition provides Anda with a set of start-links (via the startLinks property) that seed the first calls to fetch() in order to begin the crawl, and traversal continues from there via links returned in Content objects from fetch(). Crawler developers simply write code to fetch one document and discern its links to additional documents, and the framework takes it from there. Note that Fetcher implementations should be thread-safe, and the fetchThreads datasource property controls the size of the framework’s thread-pool for fetching.
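As a rough illustration of this division of labor, the sketch below pairs a toy Fetcher with a single-threaded traversal loop. Everything in it is a simplified stand-in: the Content class, the MapFetcher class and the crawl() method are hypothetical, and the Component parent interface is omitted because its methods are not shown in this post. The only thing taken from the post is the shape of fetch().

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins: the real Anda Content, Component and Fetcher types
// are not shown in full in this post, so everything here is simplified.
class FetcherSketch {

    /** Stand-in for Anda's Content: one item's body plus its outgoing links. */
    static final class Content {
        final String id;
        final String body;
        final List<String> links;
        Content(String id, String body, List<String> links) {
            this.id = id; this.body = body; this.links = links;
        }
    }

    /** Same method shape as the interface quoted above (minus Component). */
    interface Fetcher {
        Content fetch(String id, long lastModified, String signature) throws Exception;
    }

    /** Toy Fetcher over an in-memory "site"; it never mutates its state. */
    static final class MapFetcher implements Fetcher {
        private final Map<String, List<String>> site;
        MapFetcher(Map<String, List<String>> site) { this.site = site; }

        @Override
        public Content fetch(String id, long lastModified, String signature) {
            List<String> links = site.getOrDefault(id, new ArrayList<>());
            return new Content(id, "body of " + id, links);
        }
    }

    /** What the framework does for you, reduced to a single-threaded loop. */
    static List<String> crawl(Fetcher fetcher, List<String> startLinks) {
        Deque<String> queue = new ArrayDeque<>(startLinks);
        Set<String> seen = new HashSet<>(startLinks);
        List<String> emitted = new ArrayList<>();
        while (!queue.isEmpty()) {
            String id = queue.poll();
            Content c;
            try {
                c = fetcher.fetch(id, 0L, null);
            } catch (Exception e) {
                continue; // the real framework records failures; this sketch skips them
            }
            emitted.add(c.id);
            for (String link : c.links) {
                if (seen.add(link)) queue.add(link); // enqueue each link only once
            }
        }
        return emitted;
    }
}
```

The point of the sketch is that MapFetcher knows nothing about queues, visited sets or threads. In Anda those concerns, plus the thread pool sized by fetchThreads, live in the framework.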

Incremental Crawling

The additional lastModified and signature arguments to fetch() enable incremental crawling. Maintenance and persistence of a crawl-database is one of the most important tasks handled completely by the framework, and values for lastModified (a date) and signature (an optional String value indicating any kind of timestamp, e.g. ETag in a web-crawl) are returned as fields of Content objects, saved in the crawl-database, and then read from the crawl-database and passed to fetch() in re-crawls. A Fetcher should use these metadata to optionally not read and return an item’s content when it hasn’t changed since the last crawl, e.g. by setting an If-Modified-Since header along with the lastModified value in the case of making HTTP requests. There are special “discard” Content constructors for the scenario where an unchanged item didn’t need to be fetched.
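The following sketch shows one way the lastModified check could look inside a fetch implementation. It is only an illustration under stated assumptions: the post does not give the signatures of the "discard" Content constructors, so Content.discard() and Content.of() are invented here, as are the Source class and the convention that lastModified is 0 on a first crawl.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins: the post mentions "discard" Content constructors but
// not their signatures, so Content.discard() and Content.of() are invented.
class IncrementalFetchSketch {

    static final class Content {
        final String id;
        final String body;        // null when the item was discarded as unchanged
        final long lastModified;  // the framework would persist this value
        final boolean discard;
        private Content(String id, String body, long lastModified, boolean discard) {
            this.id = id; this.body = body;
            this.lastModified = lastModified; this.discard = discard;
        }
        static Content of(String id, String body, long lastModified) {
            return new Content(id, body, lastModified, false);
        }
        static Content discard(String id, long lastModified) {
            return new Content(id, null, lastModified, true);
        }
    }

    /** A toy datasource: item id -> body and modification time (epoch millis). */
    static final class Source {
        final Map<String, String> bodies = new HashMap<>();
        final Map<String, Long> modified = new HashMap<>();
        void put(String id, String body, long time) {
            bodies.put(id, body);
            modified.put(id, time);
        }
    }

    /**
     * lastModified is the value saved from the previous crawl; this sketch
     * assumes 0 means "never crawled". Unchanged items are answered with a
     * discard, so their body is never read. The id must exist in the source.
     */
    static Content fetch(Source source, String id, long lastModified) {
        long current = source.modified.get(id);
        if (lastModified > 0 && current <= lastModified) {
            return Content.discard(id, lastModified);
        }
        return Content.of(id, source.bodies.get(id), current);
    }
}
```

For an HTTP datasource the same decision can be delegated to the server: java.net.HttpURLConnection offers setIfModifiedSince(long), and a 304 response (HttpURLConnection.HTTP_NOT_MODIFIED) would then map to the discard case.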

Emitting Content

Content objects returned from fetch() might be discards in incremental crawls, but those containing actual content will be emitted to the Fusion pipeline for processing and to be indexed into Solr. The crawler developer needn’t write any code in order for this to happen. The pipelineID property of all Fusion datasources configures the pipeline through which content will be processed, and the user can configure the various stages of that pipeline using the Fusion UI.


Fetcher extends another interface called Component, used to define its lifecycle and provide configuration. Configuration properties themselves are defined using an annotation called @Property, e.g.:

    @Property(title="Obey robots.txt?", type=Property.Type.BOOLEAN, defaultValue="true")
    public static final String OBEY_ROBOTS_PROP = "obeyRobots";

This example from WebFetcher (the Fetcher implementation for Fusion web crawling) defines a boolean datasource property called obeyRobots, which controls whether WebFetcher should heed directives in robots.txt when crawling websites (disable this setting with care!). Fields with @Property annotations for datasource properties should be defined right in the Fetcher class itself, and the title= attribute of a @Property annotation is used by Fusion to render datasource properties in the UI.
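To show how annotated constants like this can be discovered at runtime, here is a small self-contained sketch. It defines its own Property annotation that imitates the three attributes used above. Fusion's real annotation, and the way Fusion actually reads it, may well differ. ToyWebFetcher and titles() are hypothetical names introduced only for this illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical: a locally defined Property annotation that imitates the three
// attributes used in the example above. Fusion's own annotation may differ.
class PropertySketch {

    @Retention(RetentionPolicy.RUNTIME)   // keep the annotation visible to reflection
    @Target(ElementType.FIELD)
    @interface Property {
        enum Type { BOOLEAN, STRING, INTEGER }
        String title();
        Type type();
        String defaultValue();
    }

    /** Stand-in for a Fetcher class that declares its datasource properties. */
    static final class ToyWebFetcher {
        @Property(title = "Obey robots.txt?", type = Property.Type.BOOLEAN, defaultValue = "true")
        public static final String OBEY_ROBOTS_PROP = "obeyRobots";
    }

    /** Collects property name -> UI title, as a UI renderer might. */
    static Map<String, String> titles(Class<?> fetcherClass) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Field f : fetcherClass.getDeclaredFields()) {
            Property p = f.getAnnotation(Property.class);
            if (p == null) continue;                      // not a datasource property
            try {
                out.put((String) f.get(null), p.title()); // static field: no instance needed
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return out;
    }
}
```

Keeping the property name in a constant and the presentation details in the annotation means the crawler code and the UI can both refer to "obeyRobots" without duplicating the string.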

Error Handling

Lastly, it’s important to notice that fetch() is allowed to throw any Java exception. Exceptions are persisted, reported, and handled by the framework, including logic to decide how many times fetch() must consecutively fail for a particular item before that item will be deleted in Solr. Most Fetcher implementations will want to catch and react to certain errors (e.g. retrying failed HTTP requests in a web crawl), but any hard failures can simply bubble up through fetch().
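The consecutive-failure logic can be pictured with a sketch like the one below. It is not Anda's implementation: the threshold parameter, the Outcome values and the FetchCall interface are all hypothetical, chosen only to show the idea of counting failures per item and resetting the count on success.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of framework-side bookkeeping. The threshold name and
// the outcomes are invented for illustration; the post does not specify them.
class FailureTrackerSketch {

    enum Outcome { OK, RETRY_LATER, DELETE }

    /** Minimal functional shape of a fetch that may throw anything. */
    interface FetchCall {
        void run(String id) throws Exception;
    }

    private final int maxConsecutiveFailures;           // hypothetical setting
    private final Map<String, Integer> failures = new HashMap<>();

    FailureTrackerSketch(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** Runs one fetch and decides what to do with the item afterwards. */
    Outcome attempt(String id, FetchCall call) {
        try {
            call.run(id);
            failures.remove(id);                         // success resets the streak
            return Outcome.OK;
        } catch (Exception e) {
            int count = failures.merge(id, 1, Integer::sum);
            if (count >= maxConsecutiveFailures) {
                failures.remove(id);
                return Outcome.DELETE;                   // give up: remove from the index
            }
            return Outcome.RETRY_LATER;                  // keep the item, try again later
        }
    }
}
```

This version is single-threaded for brevity. A tracker shared by several fetch threads would need a concurrent map or external synchronization.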

What’s next?

At present, Anda’s sweet spot is quick and easy development of crawlers, a term that usually connotes something a bit more specific than “connector”. That items link to other items is currently a core assumption of the Anda framework. Web pages have links to other web pages and filesystems have directories linking to other files, yielding structures that clearly require crawling.

We’re working towards enabling additional ingestion paradigms, such as iterating over result-sets (e.g. from a traditional database) instead of following links to define the traversal. Mechanisms to seed crawls in such a fashion are also under development. For now, it may make sense to develop connectors whose ingestion paradigms are less about crawling (e.g. the Slack and JDBC connectors in Fusion) using the general Fusion connectors framework. Stay tuned for future blog posts covering new methods of content ingestion and traversal in Anda.

An Anda SDK with examples and full documentation is also underway, and this blog post will be updated as soon as it’s available. Please contact Lucidworks in the meantime.

Download Fusion

Additional Reading

Fusion Anda Documentation

Planned upcoming blog posts (links will be posted when available):

Web Crawling in Fusion
The Anda-powered Fusion web crawler provides a number of options to control how web pages are crawled and indexed, control speed of crawling, etc.

Content De-duplication with Anda
De-duplication of similar content is a complex but generalizable task that we’ve tackled in the Anda framework, making it available to any crawler developed using Anda.

Anda Crawler Development Deep-dive
Writing a Fetcher is one of the two ways to provide Anda with access to your data; it’s also possible to implement an interface called FS (short for filesystem). Which one you choose will depend chiefly on whether the target datasource can be modeled in terms of a standard filesystem. If a datasource generally deals in files and directories, then writing an FS may be easier than writing a Fetcher.

The post Introducing Anda: a New Crawler Framework in Lucidworks Fusion appeared first on Lucidworks.

Open Knowledge Foundation: The 2015 Global Open Data Index is around the corner – these are the new datasets we are adding to it!

planet code4lib - Thu, 2015-08-20 14:57

After two months, 82 ideas for datasets, 386 voters, thirteen civil society organisation consultations and very active discussions on the Index forum, we have finally arrived at a consensus on which datasets will be included in the 2015 Global Open Data Index (GODI).

This year, as part of our objective to ensure that the Global Open Data Index is more than a simple measurement tool, we started a discussion with the open data community and our partners in civil society to help us determine which datasets are of high social and democratic value and should be assessed in the 2015 Index. We believe that by making the choice of datasets a collaborative decision, we will be able to raise awareness of and start a conversation around the datasets required for the Index to truly become a civil society audit of the open data revolution. The process included a global survey, a civil society consultation and a forum discussion (read more in a previous blog post about the process).

The community had some wonderful suggestions, making deciding on fifteen datasets no easy task. To narrow down the selection, we started by eliminating the datasets that were not suitable for global analysis. For example, some datasets are collected at the city level and therefore cannot be easily compared at a national level. Secondly, we looked to see if there was a global standard that would allow us to easily compare between countries (such as UN requirements for countries). Finally, we tried to find a balance between financial datasets, environmental datasets, geographical datasets and datasets pertaining to the quality of public services. We consulted with experts from different fields and refined our definitions before finally choosing the following datasets:

  1. Government procurement data (past and present tenders) – This dataset is crucial for monitoring government contracts be it to expose corruption or to ensure the efficient use of public funds. Furthermore, when combined with budget and spending data, contracting data helps to provide a full and coherent picture of public finance. We will be looking at both tenders and awards.
  2. Water quality – Water is life and it belongs to all of us. Since this is an important and basic building block of society, having access to data on drinking water may assist us not only in monitoring safe drinking water but also in helping to provide it everywhere.
  3. Weather forecast – Weather forecast data is not only one of the most commonly used datasets in mobile and web applications, it is also of fundamental importance for agriculture and disaster relief. Having both weather predictions and historical weather data helps not only to improve quality of life, but also to monitor climate change. As such, through the Index, we will measure whether governments openly publish data on both the 5-day forecast and historical figures.
  4. Land ownership – Land ownership data can help citizens understand their urban planning and development as well as assist in legal disputes over land. In order to assess this category, we are using national cadastres (maps showing the land registry).
  5. Health performance data – While this was one of the most popular datasets requested during the consultation, it was challenging to define what would be the best dataset(s) to assess health performance (see the forum discussion). We decided to use this category as an opportunity to test ideas about what to evaluate. After numerous discussions and debates, we decided that this year we would use the following as proxy indicators of health performance:
      Location of public hospitals and clinics.
      Data on infectious disease rates in a country.
    That being said, we are actively seeking and would greatly appreciate your feedback! Please use the country level comment section to suggest any other datasets that you encounter that might also be a good measure of health performance (for example, from number of beds to budgets). This feedback will help us to learn and define this data category even better for next year’s Index.



In addition to the new datasets, we refined the definitions of some of the existing datasets, using our new dataset definition guidelines. These were written in order both to produce a more accurate measurement and to create more clarity about what we are looking for with each dataset. The guidelines suggest at least 3 key data characteristics for each dataset, define how often each dataset needs to be updated in order to be considered timely, and suggest the level of aggregation acceptable for each dataset. The following datasets were changed in order to meet the guidelines:

Election results – Data should be reported at the polling station level so as to allow civil society to monitor election results better and uncover false reporting. In addition, we added indicators such as number of registered voters, number of invalid votes and number of spoiled ballots.

National map – In addition to the scale of 1:250,000, we added features such as markings of national roads, national borders, streams, rivers, lakes and mountains.

Pollutant emissions – We defined the specific pollutants that should be included in the datasets.

National Statistics – GDP, unemployment and population have been selected as the indicators that must be reported.

Public Transport – We refined the definition so it will examine only national level services (as opposed to inter cities ones). We are also not looking for real-time data, but for timetables.

Location datasets (previously Postcodes) – Postcode data is incredibly valuable for all kinds of business and civic activity; however, 60 countries in the world do not have a postcode system and as such, this dataset has been problematic in the past. For these countries, we have suggested examining a different dataset, administrative boundaries. While it is not as specific as postcodes, administrative boundaries can help to enrich different datasets and create better geographical analysis.

Adding datasets and changing definitions has been part of the ongoing iterations and improvements that we have made to the Index this year. While it has been a challenge, we are hoping that these improvements help to create a fairer and more accurate assessment of open data progress globally. Your feedback plays an essential role in shaping and improving the Index going forward, so please do share it with us.

The full descriptions of this year’s datasets can be found here.

Galen Charlton: Evergreen 2.9: now with fewer zombies

planet code4lib - Thu, 2015-08-20 01:57

While looking to see what made it into the upcoming 2.9 beta release of Evergreen, I had a suspicion that something unprecedented had happened. I ran some numbers, and it turns out I was right.

Evergreen 2.9 will feature fewer zombies.

Considering that I’m sitting in a hotel room taking a break from Sasquan, the 2015 World Science Fiction Convention, zombies may be an appropriate theme.

But to put it more mundanely, and to reveal the unprecedented bit: more files were deleted in the course of developing Evergreen 2.9 (as compared to the previous stable version) than entirely new files were added.

To reiterate: Evergreen 2.9 will ship with fewer files, even though it includes numerous improvements, including a big chunk of the cataloging section of the web staff client.

Here’s a table counting the number of new files, deleted files, and files that were renamed or moved from the last release in a stable series to the first release in the next series.

Between release…   … and release   Entirely new files   Files deleted   Files renamed
rel_1_6_2_3        rel_2_0_0       1159                 75              145
rel_2_0_12         rel_2_1_0       201                  75              176
rel_2_1_6          rel_2_2_0       519                  61              120
rel_2_2_9          rel_2_3_0       215                  137             2
rel_2_3_12         rel_2_4_0       125                  30              8
rel_2_4_6          rel_2_5_0       143                  14              1
rel_2_5_9          rel_2_6_0       83                   31              4
rel_2_6_7          rel_2_7_0       239                  51              4
rel_2_7_7          rel_2_8_0       84                   30              15
rel_2_8_2          master          99                   277             0
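To make the "more deleted than added" claim easy to check, here is a small sketch that takes the figures from the table above and reports every release pair where deletions outnumber new files. The numbers are copied from the table; the variable and function names are just illustrative.

```python
# (from_release, to_release, new_files, deleted_files, renamed_files),
# copied from the table in the post.
RELEASES = [
    ("rel_1_6_2_3", "rel_2_0_0", 1159, 75, 145),
    ("rel_2_0_12", "rel_2_1_0", 201, 75, 176),
    ("rel_2_1_6", "rel_2_2_0", 519, 61, 120),
    ("rel_2_2_9", "rel_2_3_0", 215, 137, 2),
    ("rel_2_3_12", "rel_2_4_0", 125, 30, 8),
    ("rel_2_4_6", "rel_2_5_0", 143, 14, 1),
    ("rel_2_5_9", "rel_2_6_0", 83, 31, 4),
    ("rel_2_6_7", "rel_2_7_0", 239, 51, 4),
    ("rel_2_7_7", "rel_2_8_0", 84, 30, 15),
    ("rel_2_8_2", "master", 99, 277, 0),
]


def shrinking_releases(rows):
    """Return (from, to, net change) for rows where deletions exceed new files."""
    return [
        (src, dst, new - deleted)
        for src, dst, new, deleted, _renamed in rows
        if new < deleted
    ]


print(shrinking_releases(RELEASES))
# prints [('rel_2_8_2', 'master', -178)]
```

Only the last row qualifies: 99 new files against 277 deleted, a net change of -178 files.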

The counts were made using git diff --summary --find-renames FROM..TO | awk '{print $1}' | sort | uniq -c and ignoring file mode changes. For example, to get the counts between release 2.8.2 and the master branch as of this post, I did:

$ git diff --summary --find-renames origin/tags/rel_2_8_2..master|awk '{print $1}'|sort|uniq -c
     99 create
    277 delete
      1 mode
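For anyone who would rather do the tallying step outside the shell, the awk/sort/uniq part of the pipeline above can be sketched in Python. This assumes the text of `git diff --summary` has already been captured into a string; the function name and the sample paths below are made up for illustration.

```python
from collections import Counter


def tally_summary(summary_text):
    """Count the first word of each non-blank line of `git diff --summary` output.

    Mirrors awk '{print $1}' | sort | uniq -c: each summary line starts with
    the kind of change (create, delete, rename, mode).
    """
    counts = Counter()
    for line in summary_text.splitlines():
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


# Hypothetical sample input; the paths are invented for this example.
SAMPLE = """\
 create mode 100644 docs/new_feature.txt
 delete mode 100644 old/legacy_a.js
 delete mode 100644 old/legacy_b.js
 mode change 100644 => 100755 tools/run.sh
"""

for kind, count in sorted(tally_summary(SAMPLE).items()):
    print(f"{count:7d} {kind}")
```

As in the post, lines counted under "mode" are file mode changes and would be ignored when comparing new and deleted files.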

Why am I so excited about this? It means that we’ve made significant progress in getting rid of old code that used to serve a purpose, but no longer does. Dead code may not seem so bad — it just sits there, right? — but like a zombie, it has a way of going after developers’ brains. Want to add a feature or fix a bug? Zombies in the code base can sometimes look like they’re still alive — but time spent fixing bugs in dead code is, of course, wasted. For that matter, time spent double-checking whether a section of code is a zombie or not is time wasted.

Best for the zombies to go away — and kudos to Bill Erickson, Jeff Godin, and Jason Stephenson in particular for removing the remnants of Craftsman, script-based circulation rules, and JSPac from Evergreen 2.9.

DuraSpace News: UPDATE: SHARE Research Information Systems Task Group

planet code4lib - Thu, 2015-08-20 00:00

Winchester, MA  The SHARE Research Information Systems Task Group, led by DuraSpace CEO Debra Hanken Kurtz, will write a brief white paper that surfaces key considerations concerning the quality and completeness of research activity administrative data.

SearchHub: If They Can’t Find It, They Can’t Buy It

planet code4lib - Wed, 2015-08-19 20:13
Sarath Jarugula, VP Partners & Alliances at Lucidworks has a blog post up on IBM’s blog, If They Can’t Find It, They Can’t Buy It: How to Combine Traditional Knowledge with Modern Technical Advances to Drive a Better Commerce Experience: “Search is at the heart of every ecommerce experience. Yet most ecommerce vendors fail to deliver the right user experience. Getting support for the most common types of search queries can be a challenge for even the largest online retailers. Let’s take a look at how traditional online commerce and retail is being transformed by technology advances across search, machine learning, and analytics.” Read the full post on IBM’s blog. Join us for our upcoming webinar Increase Conversion With Better Search.

The post If They Can’t Find It, They Can’t Buy It appeared first on Lucidworks.

FOSS4Lib Upcoming Events: Fedora 4 Workshop at eResearch Australasia

planet code4lib - Wed, 2015-08-19 18:45
Date: Friday, October 23, 2015 - 08:00 to 17:00
Supports: Fedora Repository

Last updated August 19, 2015. Created by Peter Murray on August 19, 2015.

From the announcement:

Harvard Library Innovation Lab: Link roundup August 19, 2015

planet code4lib - Wed, 2015-08-19 18:11

We found some cool stuff you might like.

Michael Itkoff :: How To

Vintage exercise how-to GIFs – mesmerizing

Delight Your Inner Kid With This Giant Lite-Brite | Mental Floss

A really big Lite-Brite

Locking the Web Open: A Call for a Distributed Web

All the pieces are in place for a better web. Let’s build it.

Looking for a Breakthrough? Study Says to Make Time for Tedium

“Moving innovation forward requires effort and time not directly related to the idea itself”

Kodak’s First Digital Moment

Tools, like cameras, are built by linking together complex chains of logic.

Roy Tennant: Where Your Favorite Programming Language Ranks

planet code4lib - Wed, 2015-08-19 16:04

Every programmer knows that any time you want to start a religious war, just ask everyone what their favorite programming language is and why. This will almost certainly touch off an ever-more-heated exchange as to why one’s particular choice should be every thinking person’s obvious selection. It may even devolve so far as to include name calling. But hey, we’re all friends here so no need to be nasty about our favorite tools.

And why use opinion when we have actual usage data? In this case, the popularity of various languages on Stack Overflow, a popular programming discussion site, and GitHub, the now “go to” code repository. Thus we have the twin lenses of what people say that they do and what they actually do.

The result isn’t really all that surprising overall. All the usual suspects appear at the top of the scattergram: Java, JavaScript, PHP, Python, flavors of C, Ruby, etc. But I have to say that I’m gratified that Perl, my classic tool (I’m now also dabbling in Python), is still fairly popular.

Why CSS (Cascading Style Sheets) and XML are there is anyone’s guess, as the last time I checked they weren’t programming languages (XSLT justifiably appears). But whatever.

Check out your favorites and see where they fall on the curve.


Note: Thanks to Lorcan Dempsey for pointing this out.

DPLA: Unexpected: Animals do the most amazing things

planet code4lib - Wed, 2015-08-19 15:30

We’ve always had a strange relationship with animals. Some are beloved family members; we farm, hunt, and fish others; and we are awestruck by some for their natural beauty and power. Whatever we think of them, we love to photograph them. And that’s been the case since the camera started to capture their likenesses in the 19th century.

Dogs hold a particular place in our hearts. These sled dogs from the 1870s were part of the winter mail line near Lake Superior.

Gems of Lake Superior scenery, No. 95 [Stereograph], ca.1870s. Childs, B. F. (Brainard F.) (ca. 1841-1921). Courtesy of The New York Public Library.

Dogs are especially photogenic when they are doing tricks. Especially when they carry kittens, children, and tiny cans of dog food in their carts.

St. Bernard Lodge, P.O. Mill Creek, California, 1946. Eastman, Jervie Henry. Courtesy of the University of California, Davis, Library via the California Digital Library.


Publicity at Hollywood dog training school, Southern California, 1935. Courtesy of the University of Southern California Libraries.


Apparently, harnessing and riding animals of all sorts was, in the early era of photography, an American pastime.

Cawston Ostrich Farm Postcard: Anna Held Riding an Ostrich. Courtesy of the South Pasadena Public Library via the California Digital Library.


Frank Buck’s Jungleland from the New York World’s Fair (1939-1940). Courtesy of The New York Public Library.


Boy riding catfish, 1941. Douglass, Neal. Courtesy of the Austin History Center, Austin Public Library via The Portal to Texas History.


A Young Girl in a Goat-Drawn Wagon, 1926. Courtesy of the Private Collection of T. Bradford Willis via The Portal to Texas History.


Children Riding a Deer-Drawn Wagon. Courtesy of the Private Collection of T. Bradford Willis via The Portal to Texas History.


Photographs of animals riding other animals warm our hearts, too.

Horse & dog pals – winter time. Copyright (c) Leslie Jones. This work is licensed for use under a Creative Commons Attribution Non-Commercial No Derivatives License (CC BY-NC-ND). Courtesy of the Boston Public Library via Digital Commonwealth.


Monkey riding a goat, 1935. Copyright (c) Leslie Jones. This work is licensed for use under a Creative Commons Attribution Non-Commercial No Derivatives License (CC BY-NC-ND). Courtesy of the Boston Public Library via Digital Commonwealth.


And finally, this donkey on wheels just leaves us speechless.

Charles “Chick” Hoover and his roller skating donkey, Pinky, in Banning, California, ca. 1958. Courtesy of the Banning Library District via the California Digital Library.

LITA: Attend the 2015 LITA Forum

planet code4lib - Wed, 2015-08-19 12:00

Don’t Miss the 2015 LITA Forum
Minneapolis, MN
November 12-15, 2015

Registration is Now Open!

Join us in Minneapolis, Minnesota, at the Hyatt Regency Minneapolis for the 2015 LITA Forum, a three-day education and networking event featuring 2 preconferences, 3 keynote sessions, more than 55 concurrent sessions and 15 poster presentations. This year’s Forum includes content and planning collaboration with LLAMA. It’s the 18th annual gathering of the highly regarded LITA Forum for technology-minded information professionals. Meet with your colleagues involved in new and leading edge technologies in the library and information technology field. Registration is limited in order to preserve the important networking advantages of a smaller conference. Attendees take advantage of the informal Friday evening reception, networking dinners and other social opportunities to get to know colleagues and speakers.

Keynote Speakers:

  • Mx A. Matienzo, Director of Technology for the Digital Public Library of America
  • Carson Block, Carson Block Consulting Inc.
  • Lisa Welchman, President of Digital Governance Solutions at ActiveStandards.

The Preconference Workshops:

  • So You Want to Make a Makerspace: Strategic Leadership to support the Integration of new and disruptive technologies into Libraries: Practical Tips, Tricks, Strategies, and Solutions for bringing making, fabrication and content creation to your library.
  • Beyond Web Page Analytics: Using Google tools to assess searcher behavior across web properties.

Comments from past attendees:

“Best conference I’ve been to in terms of practical, usable ideas that I can implement at my library.”
“I get so inspired by the presentations and conversations with colleagues who are dealing with the same sorts of issues that I am.”
“After LITA I return to my institution excited to implement solutions I find here.”
“This is always the most informative conference! It inspires me to develop new programs and plan initiatives.”

Forum Sponsors:

EBSCO, Ex Libris, Optimal Workshop, OCLC, Innovative, BiblioCommons, Springshare, A Book Apart and Rosenfeld Media.

Get all the details, register and book a hotel room at the 2015 Forum Web site.

See you in Minneapolis.

William Denton: Music, Code and Data: Hackfest and Happening at Access 2015

planet code4lib - Wed, 2015-08-19 01:09

Access is the annual Canadian conference about libraries and technology. The 2015 conference is in Toronto (the program looks great). As usual at Access, before the conference starts there’s a one-day hackfest. Katie Legere and I are running a special hackfest about music and sonification, to be followed by a concert after the hackfest is over. It’s a hackfest! It’s a happening! It’s code and data and music! Full details: Music, Code and Data: Hackfest and Happening at Access 2015.

DuraSpace News: REGISTER: Fedora 4 Workshop at eResearch Australasia in October

planet code4lib - Wed, 2015-08-19 00:00

Winchester, MA  A one-day Fedora 4 Training Workshop will be held on October 23, 2015 in Brisbane, Queensland, Australia. The event coincides with the eResearch Australasia Conference and will take place in the same venue–the Brisbane Convention & Exhibition Centre. The workshop is being generously subsidized by the University of New South Wales (UNSW) so the cost for attending is only $80AUD. Register here.

