How to get some unexpected big data satisfaction.
First: build an infrastructure to process millions of documents. Instead of just doing it home-brew, however, do your big data homework, no shortcuts. Second: unclog a long-standing clogged pipe.
The feeling is that of “it all makes sense”, and it happened to us the other day when we started the dataset indexing pipeline with a queue of a dozen large datasets (see later for the explanation). After doing that, we just sat back and watched the Sindice infrastructure, which usually takes 1-2 million documents per day, reason over and index 50-100 times as much in the same timeframe, no sweat.
The outcome showed up on the homepage shortly after:
But what exactly is this “Dataset Pipeline”?
Well, it’s simply a dataset-by-dataset approach to ingesting datasets.
In other words, we decided to do away with basic URL-to-URL crawling for the LOD cloud websites and are instead taking a dataset-by-dataset approach, choosing only the datasets we consider valuable to index.
This, unfortunately, is a bit of manual work. Yes, it is sad that LOD (an initiative born to make data more accessible) hasn’t really succeeded at adopting a way to collect the “deep semantic web” in the same way that the simple sitemaps.org initiative has for the normal web. But that is the case.
So what we do now in terms of data acquisition is:
- For normal websites that expose structured data (RDF, RDFa, Microformats, Schema.org, etc.), we support Sitemaps. Just submit yours and you’ll see it indexed in Sindice. For these websites we can be in sync (24h delay or less): just expose a simple sitemap “lastmod” field and we’ll get your latest data.
- For RDF datasets (e.g. LOD) you have to ask and send us a pointer to the dump :) . We’ll add you to a manual watchlist and possibly to the list of those that also make it to SPARQL. Refreshes can be scheduled at regular intervals.
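To illustrate the sitemap-based sync mentioned above, here is a minimal sketch of how a crawler might use the “lastmod” field to re-fetch only what changed since its last visit. The sitemap XML and URLs below are hypothetical examples, not real Sindice code or data.

```python
# Sketch: use the sitemap <lastmod> field to find URLs that changed
# since the last crawl. Hypothetical example data, not real Sindice code.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_modified_since(sitemap_xml: str, last_crawl: datetime):
    """Return the URLs whose <lastmod> is newer than our last crawl."""
    root = ET.fromstring(sitemap_xml)
    fresh = []
    for url in root.iter(SITEMAP_NS + "url"):
        loc = url.findtext(SITEMAP_NS + "loc")
        lastmod = url.findtext(SITEMAP_NS + "lastmod")
        if loc and lastmod:
            modified = datetime.fromisoformat(lastmod).replace(tzinfo=timezone.utc)
            if modified > last_crawl:
                fresh.append(loc)
    return fresh

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/a</loc><lastmod>2012-06-01</lastmod></url>
  <url><loc>http://example.org/b</loc><lastmod>2012-05-01</lastmod></url>
</urlset>"""

print(urls_modified_since(sitemap, datetime(2012, 5, 15, tzinfo=timezone.utc)))
# only http://example.org/a is newer than the last crawl
```

This is the whole trick behind the “24h delay or less” sync: one date field per URL is enough for a crawler to skip everything that hasn’t changed.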
Which brings us to the next question: which data makes it to the homepage search and which to the SPARQL endpoint? Is it the same?
No, it’s not the same.
The frontpage search of Sindice is based on SIREn and indexes everything (including the full materialization of the reasoning). This index is also updated multiple times a day, e.g. in response to pings or sitemap updates. As a downside, while our frontpage advanced search (plus the cache API) can certainly be enough to implement some very cool apps (see http://sig.ma, based on a combination of these APIs), they certainly won’t suffice to answer all possible questions.
The SPARQL endpoint, on the other hand, has full query capabilities but contains only a selected subset of the data in Sindice. The Sindice SPARQL dataset consists of the first few hundred websites (in terms of ranking) – see a list here – and, at the moment, the following RDF LOD datasets:
- dbpedia, triples = 698.420.083
- europeana, triples = 117.118.947
- cordis, triples = 7.107.144
- eures, triples = 7.101.481
- ookaboo, triples = 51.088.319
- geonames, triples = 114.390.681
- worldbank, triples = 87.351.631
- omim, triples = 358.222
- sider, triples = 18.884.280
- goa, triples = 11.008.490
- chembl, triples = 137.449.790 (and by the way, also see hcls.sindicetech.com for bioscience data and assisted tools)
- ctd, triples = 103.354.444
- drugbank, triples = 456.112
- dblp, triples = 81.986.947
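Since the endpoint speaks the standard SPARQL protocol, any HTTP client can query it. Here is a hedged sketch of building such a request; the endpoint URL and the query are placeholders for illustration, not the actual Sindice endpoint address.

```python
# Sketch: build a standard SPARQL-protocol GET request URL.
# The endpoint address below is a placeholder, not the real Sindice endpoint.
from urllib.parse import urlencode

def sparql_request_url(endpoint: str, query: str,
                       fmt: str = "application/sparql-results+json") -> str:
    """Return the GET URL for a SPARQL protocol query with a result format hint."""
    return endpoint + "?" + urlencode({"query": query, "format": fmt})

query = """
SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10
"""
url = sparql_request_url("http://sparql.example.org/sparql", query)
print(url)
```

Fetching that URL (e.g. with `urllib.request.urlopen`) would then return the results in the requested JSON serialization.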
There is, however, some “impedance mismatch” between the SIREn and SPARQL approaches. E.g. SIREn is a document index (in Sindice we arbitrarily limit the size of documents to 5MB). So you might ask: how do we manage to index, in our frontpage search, big datasets that come in “one chunk” like the above?
The answer is we first “slice them” per entity.
The above-mentioned Dataset Pipeline contains a MapReduce job that turns a single big dataset into millions of small RDF files containing ‘entity descriptions’, much like what you get resolving a URI that belongs to a LOD dataset. This way we generated (out of the above datasets) the descriptions for approximately 100 million entities, which became the 100M documents that went through the pipeline in well less than one day (on our DERI/Sindice.com 11-machine Hadoop cluster).
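The slicing step above can be sketched in a few lines. This is a single-process toy version of the idea, not the actual MapReduce job: group the triples of one big N-Triples dump by subject, so each subject (entity) ends up with its own small document. Real N-Triples parsing is more involved; this assumes one well-formed triple per line.

```python
# Toy sketch of "slice per entity": group an N-Triples dump by subject so
# each entity gets its own small document (the subject plays the role of
# the map key in the real MapReduce job). Not the actual Sindice code.
from collections import defaultdict

def slice_by_entity(ntriples: str) -> dict:
    """Map each subject URI to the list of triples describing it."""
    docs = defaultdict(list)
    for line in ntriples.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        subject = line.split(" ", 1)[0]  # first term of the triple
        docs[subject].append(line)
    return dict(docs)

dump = """
<http://example.org/a> <http://example.org/p> "1" .
<http://example.org/b> <http://example.org/p> "2" .
<http://example.org/a> <http://example.org/q> "3" .
"""
docs = slice_by_entity(dump)
print(len(docs))  # two entities
```

Each value in the resulting map is exactly the kind of small, per-entity RDF document that fits comfortably under a document-size limit like the 5MB one mentioned above.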
A hint of what’s next: all the above in a comprehensible UI.
From the standard Sindice search you’ll be able to see what’s updated, what’s not, what’s in SPARQL, what’s not, what’s in each dataset (with links to in-depth RDF analytics for each site) and more. You’ll also be able to “request” that a normal website (e.g. your site) be included in the list of those that make it to the SPARQL endpoint.
Until that happens (likely August), may we recommend you subscribe to the http://sindicetech.com Twitter feed. Via SindiceTech, in the next weeks and months, we’ll make selected parts of the Sindice.com infrastructure available to developers and enterprises (e.g. you might have seen our recent SparQLed release).
Thanks go to the KISTI institute (Korea), which is now supporting some of the work on Sindice.com, and to OpenLink for kindly providing Virtuoso Cluster edition, which powers our SPARQL endpoint. Thanks also go to the LOD2 and LATC EU projects, to SFI, and others, which we list here.