Searching infinite amounts of Web Data: The new Sindice Index and Frontend

Several goals have kept the Sindice team constantly busy over the past year or so. Luckily, we’re now getting close to deploying them, and today we’re happy to begin by introducing SIREn, the new Sindice core index, along with its new supporting frontend and API.

SIREn: Sindice’s own semantic search engine

SIREn (Semantic Information Retrieval Engine) is the new engine serving queries on our frontend and API. The difference between SIREn and any of the well-known RDF databases is that it uses the same principles found in web search engines rather than techniques derived from the database world; in other words, it is an Information Retrieval based engine.

Typically, information retrieval is associated with full-text queries, and SIREn supports these too. On top of this, however, SIREn adds support for certain structured queries, making it possible, for example, to look for documents that contain specific triple patterns, or text within given properties only.
So, while SIREn does not implement full SPARQL, and likely never will, you get benefits that are highly appealing for a search engine:

  • Theoretically infinite scalability in a cluster setting, with very low hardware demand per amount of data indexed (compared to RDF databases)
  • Fast, incremental indexing, with no slowdowns as data size grows
  • Advanced ranking of query results, with top-K retrieval
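
To make the IR-style approach more concrete, here is a minimal Python sketch. It is not SIREn's actual implementation: it is a toy inverted index keyed by (property, term), so that a keyword can be restricted to a given property and results come back as a top-K list. The class name, the scoring by raw term count, and the sample documents are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


class ToyPropertyIndex:
    """Toy inverted index: (property, term) -> {doc_id: term count}."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(int))

    def add(self, doc_id, prop, text):
        # Index every lowercased token of the literal under its property.
        for term in text.lower().split():
            self.postings[(prop, term)][doc_id] += 1

    def search(self, prop, term, k=10):
        # A single posting-list lookup; no scan over the whole data set.
        hits = self.postings.get((prop, term.lower()), {})
        # Top-K by count (hypothetical score), ties broken by doc id.
        return heapq.nsmallest(k, hits.items(), key=lambda h: (-h[1], h[0]))


# Hypothetical sample data.
index = ToyPropertyIndex()
index.add("doc1", "foaf:name", "Alice Smith")
index.add("doc2", "rdfs:comment", "Alice wrote to Alice")
index.add("doc3", "foaf:name", "Bob Jones")

print(index.search("foaf:name", "alice"))
print(index.search("rdfs:comment", "alice"))
```

The point of the sketch is that adding a document only appends to posting lists, which is why this style of index lends itself to incremental updates.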

On top of this, SIREn is not programmed from scratch; rather, it is built as a plugin for Apache Solr, thus inheriting all the enterprise-strength features it provides.
To give you an idea of the hardware requirements: our index of more than 20b triples fits on 1 master server machine, which we then split in 2 for query throughput and duplicated in a master-slave configuration for smooth service during updates.

Keeping in mind that we’re talking about very different query capabilities, we can report that, in terms of performance, SIREn can index 20b triples in 1 day on 2 machines. As a term of comparison, and without getting into precise details, let us assure you that this is, at the moment, 4 to 6 times faster than what we have seen achieved in practice with the best DB-related technologies at hand, which were running on twice as much hardware while indexing half the data (10b).
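
For a rough sense of scale, the short calculation below converts the figure above into a sustained indexing rate. It reads the claim as 20b triples in one full 24-hour day across 2 machines and assumes the load is spread evenly; both are simplifying assumptions, not measured values.

```python
TRIPLES = 20_000_000_000        # 20b triples, as quoted above
SECONDS_PER_DAY = 24 * 60 * 60  # assumes a full 24-hour indexing run
MACHINES = 2                    # assumes an even split across machines

total_rate = TRIPLES / SECONDS_PER_DAY
per_machine_rate = total_rate / MACHINES

print(f"{total_rate:,.0f} triples/s overall")
print(f"{per_machine_rate:,.0f} triples/s per machine")
```

Under those assumptions, this works out to roughly 231,000 triples per second overall, or about 116,000 per machine.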

SIREn is open source, released under the Apache licence. For scientific details, please refer to: R. Delbru, S. Campinas, G. Tummarello. “Searching Web Data: an Entity Retrieval and High-Performance Indexing Model.” To appear in Journal of Web Semantics, 2011.

Test-driving SIREn: the new Sindice frontend and API

The best way to see (most of) the features of SIREn in action is to take a test drive with the new Sindice frontend, which serves queries on the live-updated Sindice dataset (230M RDF documents at the time of writing), for example from the “advanced” view.

From the advanced view you can create complex queries, e.g. combining keyword queries with more advanced ones such as those requiring certain triple patterns (e.g. <s> <p> <o> → * foaf:knows http://g1o.net/foaf.rdf#me) or imposing other requirements such as domain name, format, data etc. The full documentation is available online.
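
As an illustration of what a triple pattern with a wildcard means, here is a small Python sketch. It is not the Sindice query syntax or API: the function names, the tuple representation of triples, and the example.org documents are hypothetical, and the linear scan stands in for what a real engine would answer from its index.

```python
def matches(pattern, triple):
    """True if each pattern slot is '*' or equals the triple's slot."""
    return all(p == "*" or p == t for p, t in zip(pattern, triple))


def documents_matching(pattern, documents):
    """Return the sorted ids of documents containing a matching triple."""
    return sorted(
        doc_id
        for doc_id, triples in documents.items()
        if any(matches(pattern, t) for t in triples)
    )


# Hypothetical documents, each a list of (subject, predicate, object).
documents = {
    "http://example.org/a.rdf": [
        ("http://example.org/a#me", "foaf:knows", "http://g1o.net/foaf.rdf#me"),
    ],
    "http://example.org/b.rdf": [
        ("http://example.org/b#me", "foaf:name", "Bob"),
    ],
}

# The pattern from the text: any subject that foaf:knows the given URI.
pattern = ("*", "foaf:knows", "http://g1o.net/foaf.rdf#me")
print(documents_matching(pattern, documents))
```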

This release also includes:

  • Result filters, shown on the left of the results pages, to allow quick query refinement.
  • Group by dataset (this is a first version). It turns document search into dataset search, showing you how many results there are per second-level domain (often equivalent to a dataset). For example, if you are looking for datasets containing events, you can find out that www.songkick.com provides more than 700K documents containing event-related information.
  • An improved ranking for search results. The main ranking logic is based on the TF-IDF methodology. However, we have added some specific rules under the hood to boost certain results: if your keywords match the URL of the document, its label, or one of its class or property URIs, the document will be ranked higher.
  • Better support for multilingual characters (based on UAX#29 Unicode Text Segmentation, thanks to Lucene) and diacritical marks. With respect to diacritical marks, you can now search for either ‘café’ or ‘cafe’, and Sindice will return results for documents about either “café” or “cafe”. In the first case (‘café’), documents containing ‘café’ will be ranked higher.
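
The diacritics behaviour in the last item can be sketched in a few lines of Python. This is only an approximation under stated assumptions, not how Sindice or Lucene implement it: tokens are split on whitespace rather than by UAX#29 segmentation, and the rule that exact matches always sort first (for either spelling of the query) is a simplification chosen for the example.

```python
import unicodedata


def fold(text):
    """Lowercase and strip combining marks, e.g. 'Café' -> 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def exact(text):
    """Lowercase and compose, so equal accented forms compare equal."""
    return unicodedata.normalize("NFC", text.lower())


def search(query, documents):
    """Match on the folded form; rank exact (unfolded) matches first."""
    q_fold, q_exact = fold(query), exact(query)
    hits = []
    for doc_id, text in documents.items():
        tokens = text.split()  # simplification: not UAX#29 segmentation
        if any(fold(tok) == q_fold for tok in tokens):
            is_exact = any(exact(tok) == q_exact for tok in tokens)
            hits.append((not is_exact, doc_id))  # False sorts before True
    return [doc_id for _, doc_id in sorted(hits)]


# Hypothetical two-document collection.
docs = {"a": "best cafe in town", "b": "un café noir"}
print(search("café", docs))
print(search("cafe", docs))
```

Both queries return both documents; only the order changes, with the document that contains the exact spelling listed first.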

To build applications on top of Sindice, the full capabilities of SIREn are available via a new API (v2).

Also, if you’re at SemTech this July, you can talk to us directly at our presentations: one about SIREn, and another where we’ll give a sneak peek at Sindice Site Services, a cool service still in alpha.

Written by Giovanni Tummarello on May 15, 2011. Post filed under Sindice.
