Searching infinite amounts of Web Data: The new Sindice Index and Frontend
SIREn: Sindice’s own semantic search engine
SIREn (Semantic Information Retrieval Engine) is the new engine serving queries on our frontend and API. Unlike the well-known RDF databases, SIREn is an Information Retrieval based engine: it applies the principles found in web search engines rather than techniques derived from the database world.
Information retrieval is typically associated with full-text queries, which SIREn supports. On top of this, SIREn adds support for certain structured queries, for example looking for documents that contain specific triple patterns, or text within given properties only.
So, while SIREn does not implement full SPARQL, and likely never will, you get benefits that are highly appealing for a search engine:
- Theoretically infinite scalability in a cluster setting, with very low hardware demand per amount of data indexed (compared to RDF databases)
- Fast, incremental indexing. No slowdowns as data size grows.
- Advanced ranking of query results, top K
On top of this, SIREn is not programmed from scratch; it is built as a plugin for Apache Solr, and so inherits all the enterprise-strength features Solr provides.
To give you an idea of the hardware required to index in excess of 20 billion triples: it all fits on 1 master server machine, which we then split in 2 for query throughput and duplicated in a master-slave configuration for smooth service during updates.
Keeping in mind that we are talking about very different query capabilities, we can report that, in terms of performance, SIREn can index 20 billion triples in 1 day on 2 machines. By way of comparison, and without getting into precise details, let us assure you that this is, at the moment, 4 to 6 times faster than what we have seen achieved in practice with the best database-related technologies at hand, running on twice as much hardware while indexing half the data (10 billion triples).
SIREn is open source, released under the Apache licence. For scientific details, please refer to “R. Delbru, S. Campinas, G. Tummarello. Searching Web Data: an Entity Retrieval and High-Performance Indexing Model.” To appear in Journal of Web Semantics, 2011.
Test-driving SIREn: the new Sindice frontend and API
The best way to see (most of) the features of SIREn in action is to take a test drive with the new Sindice frontend, which serves queries on the live-updated Sindice dataset (230M RDF documents at the time of writing), e.g. from the “advanced” view:
From here you can create complex queries, e.g. combining keyword queries with more advanced ones such as those requiring certain triple patterns (e.g. <s> <p> <o> → * foaf:knows http://g1o.net/foaf.rdf#me) or having other requirements such as domain name, format, data etc. The full documentation is available online.
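To make the triple-pattern idea concrete, here is a minimal Python sketch of wildcard pattern matching over a toy in-memory collection. It is an illustration only: it does not use SIREn's index structures or the Sindice API, and the names `matches`, `search`, and the sample documents are hypothetical. `None` plays the role of `*`.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
# None acts as the "*" wildcard in any position of the pattern.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def matches(triple: Triple, pattern: Pattern) -> bool:
    """Return True if every non-wildcard pattern position equals the triple's term."""
    return all(p is None or p == t for t, p in zip(triple, pattern))


def search(
    docs: Dict[str, List[Triple]],
    pattern: Pattern,
    keyword: Optional[str] = None,
) -> List[str]:
    """Return URLs of documents holding at least one triple that matches the pattern.

    If a keyword is given, the document must also mention it
    (case-insensitively) in at least one object value.
    """
    hits = []
    for url, triples in docs.items():
        if not any(matches(t, pattern) for t in triples):
            continue
        if keyword is not None and not any(
            keyword.lower() in obj.lower() for _, _, obj in triples
        ):
            continue
        hits.append(url)
    return hits


# Hypothetical sample data; only the g1o.net URI comes from the query example.
DOCS = {
    "http://example.org/alice.rdf": [
        ("http://example.org/alice#me", "foaf:knows", "http://g1o.net/foaf.rdf#me"),
        ("http://example.org/alice#me", "foaf:name", "Alice"),
    ],
    "http://example.org/bob.rdf": [
        ("http://example.org/bob#me", "foaf:name", "Bob"),
    ],
}

# Equivalent of "* foaf:knows http://g1o.net/foaf.rdf#me"
pattern = (None, "foaf:knows", "http://g1o.net/foaf.rdf#me")
print(search(DOCS, pattern))
print(search(DOCS, pattern, keyword="alice"))
```

A real engine would answer such patterns from an inverted index rather than by scanning every document; the linear scan here only shows what the query means.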
This release also includes:
- Result filters, shown on the left of the result pages, to allow quick query refinement.
- Group by dataset (a first version). It turns document search into dataset search, showing how many results there are per second-level domain (often equivalent to a dataset). For example, if you are looking for datasets containing events, you can find out that www.songkick.com provides more than 700K documents containing event-related information.
- Improved ranking of search results. The main ranking logic is based on the TF-IDF methodology, but we have added some rules under the hood to boost certain results: if your keywords match the URL of the document, its label, or one of its class or property URIs, the document is ranked higher.
- Better support for multilingual characters (based on UAX#29 Unicode Text Segmentation, thanks to Lucene) and diacritical marks. You can now search for either ‘café’ or ‘cafe’, and Sindice will return results for documents about either “café” or “cafe”. In the first case (‘café’), documents containing ‘café’ will be ranked higher.
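The accent-insensitive matching described in the last item can be approximated with Unicode normalization. The sketch below is not Sindice's or Lucene's implementation: `fold`, `score`, `rank`, the toy documents, and the boost value of 2.0 are all assumptions chosen for illustration. It matches on the accent-folded form and gives a higher score when the accented form also matches exactly.

```python
import unicodedata
from typing import Dict, List, Tuple


def fold(text: str) -> str:
    """Lowercase and strip combining marks, so 'Café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def score(query: str, doc_tokens: List[str], exact_boost: float = 2.0) -> float:
    """Add 1.0 per accent-folded match, or exact_boost when accents also agree."""
    q_exact = unicodedata.normalize("NFC", query.lower())
    q_folded = fold(query)
    total = 0.0
    for token in doc_tokens:
        if fold(token) != q_folded:
            continue
        t_exact = unicodedata.normalize("NFC", token.lower())
        total += exact_boost if t_exact == q_exact else 1.0
    return total


def rank(query: str, docs: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Return (doc id, score) pairs for matching documents, best first."""
    scored = [(doc_id, score(query, tokens)) for doc_id, tokens in docs.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)


# Hypothetical toy documents, already tokenized.
DOCS = {
    "doc-accented": ["le", "café", "central"],
    "doc-plain": ["internet", "cafe", "guide"],
    "doc-other": ["tea", "house"],
}

print(rank("café", DOCS))
print(rank("cafe", DOCS))
```

Both queries return both matching documents; only the order changes. Boosting the unaccented document for the query ‘cafe’ is a design choice of this sketch, not something the description above specifies.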
To build applications on top of Sindice, the full capabilities of SIREn are available via a new API (v2).
Also, if you are at SemTech this July, you can talk to us directly at our presentations: one about SIREn, and another where we will give a sneak peek at Sindice Site Services, a cool service still in alpha.