Sindice reindexed: find your datasets (much faster)
Having streamlined several procedures inside Sindice, rebuilding the Sindice index from scratch now takes just a few hours.
Over the weekend, we built a new Sindice index based on the latest updates to SIREn and improvements to the pipelines. It is now in production and sports the following enhancements:
Ranking
- Big documents no longer come first (sorry guys, this was an issue in recent weeks).
- Properties are now weighted differently.
Preprocessing
- Improved support of encoded URIs: encoded characters in URIs are now decoded prior to indexing, and both versions of the URI, the decoded and the encoded one, are indexed. As a consequence, if you look for a URI (or keyword) containing special characters (e.g., Knud_Möller), it will also match encoded URIs (e.g., http://dblp.l3s.de/d2r/resource/authors/Knud_M%C3%B6ller).
- Improved tokenisation of URI localnames: previously a URI localname was tokenised, but the non-tokenised version of the localname was not indexed. For example, given the URI http://rdf.data-vocabulary.org/#startDate, we now index as part of the local name: start, date, and startDate (the latter was not possible previously).
- Improved support of mailto URIs: previously the URI mailto:test@test.com was improperly tokenised and indexed. Now you can search for either mailto:test@test.com or test@test.com. Both will match the mailto URI.
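To make the three preprocessing changes above more concrete, here is a minimal Python sketch of the kind of index-time term expansion they describe. It is only an illustration, not SIREn's actual analyzer code: the function names (uri_variants, localname_tokens, mailto_terms) and the exact splitting rules are assumptions made for this example.

```python
import re
from urllib.parse import unquote, urldefrag


def uri_variants(uri):
    """Return the URI as given plus its percent-decoded form, if different."""
    decoded = unquote(uri)  # decodes %XX escapes as UTF-8
    return [uri] if decoded == uri else [uri, decoded]


def localname_tokens(uri):
    """Tokenise the localname and also keep the untokenised localname."""
    base, fragment = urldefrag(uri)
    # Assumed rule: localname is the fragment, else the last path segment.
    name = fragment if fragment else base.rstrip("/").rsplit("/", 1)[-1]
    # Mark camelCase boundaries, then split on non-word characters and "_".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    tokens = [t.lower() for t in re.split(r"[\W_]+", spaced) if t]
    if name not in tokens:
        tokens.append(name)  # the full localname, e.g. "startDate"
    return tokens


def mailto_terms(uri):
    """Index a mailto URI both with and without its scheme prefix."""
    prefix = "mailto:"
    if uri.startswith(prefix):
        return [uri, uri[len(prefix):]]
    return [uri]


print(uri_variants("http://dblp.l3s.de/d2r/resource/authors/Knud_M%C3%B6ller"))
print(localname_tokens("http://rdf.data-vocabulary.org/#startDate"))
print(mailto_terms("mailto:test@test.com"))
```

Indexing every variant as a term is what lets a query for either form hit the same document, at the cost of a slightly larger index.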
Query Language and processing
- Improved support of special characters in URIs. For example, a tilde ‘~’ in a URI previously invalidated the query.
- Various bug fixes as reported by users.
- Improved query processing for Ntriple queries, with huge performance benefits in certain situations.
Index Data Structure
- Improved index compression.
Those are the technical details.
In practice, the thing I love most is that my favorite queries (the group-by-dataset ones, i.e. the queries for finding open web datasets) now easily run 10 times faster.
As always, follow the development of our core open source Semantic Information Retrieval Engine, SIREn, on GitHub: https://github.com/rdelbru/SIREn.