Sindice now supports Efficient Data discovery and Sync
Until now, semantic web search engines and semantic aggregation services have added datasets by hand, or have relied on "random walk" style crawls that offer no guarantees of data completeness or freshness.
After quite some work, we are happy to announce that Sindice now supports effective large scale data acquisition with *efficient syncing* capabilities, based on an already existing standard (a specific use of the sitemap protocol).
For example, if you publish 300,000 products using RDFa or any other method you prefer (microformats, 303 redirects, etc.), then by making sure you comply with the proposed method, Sindice will now guarantee:
a) to crawl your dataset completely (this might take some time, since we do it "politely"),
b) but to crawl it only once, and from then on fetch just the updated URLs on a daily basis (a timely data update guarantee!).
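The mechanism relies on the standard sitemap protocol: each URL entry carries a `<lastmod>` timestamp, so a crawler can skip anything unchanged since its last visit. A minimal sketch (the domain and product URLs are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One entry per product page; lastmod lets crawlers skip unchanged URLs -->
  <url>
    <loc>http://example.com/products/widget-1</loc>
    <lastmod>2009-06-01</lastmod>
  </url>
  <url>
    <loc>http://example.com/products/widget-2</loc>
    <lastmod>2009-06-15</lastmod>
  </url>
</urlset>
```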
So this is not "crawling" anymore, but rather a live, "DB like" connection between remote, diverse datasets, all based on HTTP. In our opinion this is a *very* important step forward for semantic web data aggregation infrastructures.
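The "crawl once, then sync" model above can be sketched from the consumer side as follows. This is an illustrative sketch, not Sindice's actual implementation; the sitemap content and URLs are made up:

```python
from datetime import date
from xml.etree import ElementTree

# Standard sitemap protocol namespace
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def changed_urls(sitemap_xml: str, last_sync: date) -> list[str]:
    """Return only the URLs whose <lastmod> is newer than our last sync."""
    root = ElementTree.fromstring(sitemap_xml)
    urls = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # Re-fetch if the page changed since last sync (or has no lastmod).
        if lastmod is None or date.fromisoformat(lastmod) > last_sync:
            urls.append(loc)
    return urls

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/p/1</loc><lastmod>2009-06-01</lastmod></url>
  <url><loc>http://example.com/p/2</loc><lastmod>2009-06-20</lastmod></url>
</urlset>"""

# Only the page modified after our last sync needs to be re-fetched.
print(changed_urls(sitemap, date(2009, 6, 10)))
# → ['http://example.com/p/2']
```

A daily job that re-downloads only the sitemap and the URLs it reports as changed is what turns a one-off crawl into an ongoing sync.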
The specification we support (and how to make sure you are being properly indexed) is published here (pretty simple stuff, actually!), and the results can be seen on websites that are already implementing it (you might already be doing so without knowing it).
Why not make sure that your site can be effectively kept in sync today?
As always, we look forward to comments, suggestions and ideas on how to better serve your data needs (and yes, we will also support the OpenLink dataset sync proposal once its specs are finalized). Feel free to ask specific questions about this or any other Sindice related issue on our dev forum: http://sindice.com/main/forum
On behalf of the team, http://sindice.com/main/about. Special credits for this go to Tamas Benko and Robert Fuller.
P.S. We are still interested in hiring selected researchers and developers.