Sindice reindexed: find your datasets (much faster)

Having streamlined several procedures inside Sindice, rebuilding the Sindice index from scratch now takes just a few hours.

Over the weekend, we built a new Sindice index based on the latest updates to SIREn and improvements to the pipelines. It is now in production and sports the following enhancements:

Ranking

  • big documents are no longer ranked first (sorry guys, this was an issue in the last few weeks)
  • properties are weighted differently

Preprocessing

  • Improved support of encoded URIs: Encoded characters in URIs are decoded prior to indexing, and both versions of the URI, the decoded and the encoded one, are indexed. As a consequence, if you look for a URI (or keyword) with special characters (e.g., Knud_Möller), it will also match encoded URIs (e.g., http://dblp.l3s.de/d2r/resource/authors/Knud_M%C3%B6ller).
  • Improved tokenisation of URI localnames: Previously a URI localname was tokenised, but the non-tokenised version of the localname was not indexed. For example, given the URI http://rdf.data-vocabulary.org/#startDate, we now index as part of the local name: start, date, and startDate (the latter was not possible previously).
  • Improved support of mailto URIs: Previously a URI such as mailto:test@test.com was improperly tokenised and indexed. Now you can search for either mailto:test@test.com or test@test.com. Both will match the mailto URI.
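
To make the three preprocessing changes above more concrete, here is a rough Python sketch of what such steps could look like. It is not the actual Sindice or SIREn code (SIREn has its own analysers); the function names uri_variants, localname_tokens and mailto_terms are hypothetical, and the camelCase splitting rule is an assumption made for illustration only.

```python
import re
from urllib.parse import unquote


def uri_variants(uri):
    """Return the forms of a URI to index: the original and, if it differs,
    the percent-decoded one (hypothetical helper)."""
    decoded = unquote(uri)  # decodes %C3%B6 to 'ö' (UTF-8 by default)
    return [uri] if decoded == uri else [uri, decoded]


def localname_tokens(uri):
    """Split the local name on camelCase boundaries and also keep the
    untokenised local name (assumed splitting rule, ASCII letters only)."""
    name = re.split(r"[#/]", uri)[-1]  # text after the last '#' or '/'
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    tokens = [p.lower() for p in parts]
    if name and name not in tokens:
        tokens.append(name)  # the full local name, e.g. 'startDate'
    return tokens


def mailto_terms(uri):
    """Index a mailto URI both with and without its scheme prefix."""
    prefix = "mailto:"
    if uri.lower().startswith(prefix):
        return [uri, uri[len(prefix):]]
    return [uri]


if __name__ == "__main__":
    print(uri_variants("http://dblp.l3s.de/d2r/resource/authors/Knud_M%C3%B6ller"))
    print(localname_tokens("http://rdf.data-vocabulary.org/#startDate"))
    print(mailto_terms("mailto:test@test.com"))
```

Indexing both variants of each term is what lets a query match regardless of whether the user types the encoded or decoded form, the full local name or one of its parts, or an e-mail address with or without the mailto: prefix.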

Query Language and processing

  • Improved support of special characters in URIs. For example, a tilde ‘~’ in a URI previously invalidated the query.
  • Various bug fixes as reported by users.
  • Improved query processing for Ntriple queries, with huge performance benefits in certain situations.

Index Data Structure

  • Improved index compression.

Those are the technical details.

In practice, the thing I love the most is that my favorite queries (the group-by-dataset ones, i.e. the queries for finding open web datasets) now easily run 10 times faster :) .

As always, follow the development of our core open source Semantic Information Retrieval Engine SIREn on GitHub: https://github.com/rdelbru/SIREn.

Written by Giovanni Tummarello on May 31, 2011. Post filed under Announcements, Sindice.