Sindice, its startup company and 12 billion+ live triples SPARQL endpoint

Today we’re happy to share two big pieces of news:

1) Today we’re launching our startup company, set to drive the commercialization of Sindice services. See the press release below. Hurray!

2) We’re making our main Sindice SPARQL endpoint public. It contains the entire Sindice dataset, which currently indexes over 12 billion triples and is updated live as data comes in. This is the result of over 12 months of engineering. See the technical post (below the press release) :)

——————————————

Sindice.com: Leading Semantic Data Technology Providers Establish Sindice Ltd.

Sindice launches database-like access to the Web of Data for developers and businesses

Galway, Ireland – 14th June 2011 – For immediate release

An ever-growing number of Web sites contain rich meta-data about the page contents, e.g. product prices and features in shopping sites, opening hour information for restaurants, or artist details for concerts. This data is also set to explode as search engines recently agreed on metadata terminology with the schema.org initiative. The resulting “Web of Data” allows for new mobile applications, browser functionality, or search engines. A critical bottleneck for business solutions, however, is the lack of a consolidated and customizable way of accessing all of this data at Web scale.

This business need is the target for Sindice.com, the main service provided by Sindice LTD. With the rich meta-data from more than 260 million pages in its index, Sindice.com currently offers the world’s largest repository of continuously updated Web of Data content, and it’s making it available to current and new generation horizontal and vertical market applications. Key service features include: a focused crawler which allows quick but sustainable data indexing and delta detection, and proprietary data indexes which are updated in realtime.

Today we also announce the latest addition to the Sindice APIs: a Web Scale SPARQL access point. A long awaited feature for data oriented Web developers, this functionality allows third parties to develop novel applications for desktop, mobile, browser, or browser hosted extensions that leverage structured data from a vast collection of Web sites in parallel, eliminating the need to crawl, extract, transform, cleanse, and load on a per Web application basis.

Sindice LTD Solidly Backed by Leading Semantic Data Technology Providers

Sindice LTD is a newly established enterprise with its headquarters in Galway, Ireland. It is a joint venture by the founders of the Sindice.com team (DERI research institute at the National University of Ireland Galway and the FBK research institute from Trento Italy) led by Dr. Giovanni Tummarello and Dr. Renaud Delbru,  OpenLink Software Inc., the makers of the Virtuoso Database for InterWeb scale Linked Data and other leading-edge data integration technology, and Hepp Research GmbH, leading experts in RDFa technology for e-commerce applications, already supported by most major search engines. Sindice LTD will have privileged access to relevant expertise and intellectual property from all of its members.

“Sindice.com turns any existing Web site – offering semantic markup – into a table in a giant Web-scale database”, says Dr. Giovanni Tummarello, the CEO of Sindice LTD. “Based on this, Sindice.com offers advanced search functionality, data cleansing and data consolidation services and private dataspaces. Sindice.com is also focusing on quite practical products set to immediately reward those who have and are investing in semantic markup already e.g. adopting GoodRelations for ecommerce pages, Opengraph or the new Schema.org.”, he adds.

Professor Martin Hepp, the inventor and lead developer of the GoodRelations standard for e-commerce, stresses the potential for targeted shopping, location-based services, and B2B scenarios: “The potential of semantic markup for e-commerce is now firmly recognized by all three major search companies on the planet – after Yahoo (2008) and Google (2010), Bing has recently also announced plans to support the GoodRelations standard. Today’s search engines, however, harvest only the tip of the iceberg of this data solely for a better rendering of the search results. Sindice’s technology allows much more sophisticated novel commerce applications”. Prof. Hepp is backing Sindice LTD via his Hepp Research GmbH, a semantic-data consulting firm.

“Highly scalable and semantically rich linked data spaces has been a pragmatic pursuit at OpenLink Software for a long time”, says Kingsley Idehen, its Founder and CEO.  “Tuning and delivering linked data spaces as big, data-oriented cloud services is extremely challenging. With Sindice, we deliver live Data as a Service (DaaS), in a way that will provide tremendous strategic advantages to the use of Semantic technologies at InterWeb and Enterprise scales”, he explains. “Web-scale semantic dataspaces have been discussed for ages”, he continues, “but tuning the technology and allowing cloud scalability is still extremely difficult. Sindice LTD will provide effortlessly scalable live semantic dataspaces to enable a new wave of data intelligence in applications and within enterprises”

Contact Details

Sindice LTD
DERI IDA Business Park Lower Dangan, Galway Ireland

http://sindice.com

info@sindice.com

OpenLink Software Inc.
10 Burlington Mall Road, Suite 265, Burlington, MA 01803, U.S.A.

http://virtuoso.openlinksw.com/

general.information@openlinksw.com

Hepp Research GmbH
Karlstrasse 8, 88212 Ravensburg, Germany
contact@heppresearch.com

http://www.heppresearch.com/

 

———————

OK, that said, let’s get to the technical details.

Our SPARQL endpoint sparql.sindice.com is now open in public beta.

Let’s take it out for a spin!

For example, let’s take this page, http://www.deri.ie/about/team/, which happens to be marked up with RDFa (check it). What are the most common names of people working at DERI?

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT  distinct ?name, count(?name) as ?conte
WHERE {
  GRAPH <http://www.deri.ie/about/team/> {
    ?person
      a foaf:Person ;
      foaf:givenName ?name ;
      foaf:phone ?phone .
  }
} ORDER BY  DESC(?conte) ?name

http://sparql.sindice.com/sparql?default-graph-uri=&query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0ASELECT++distinct+%3Fname%2C+count%28%3Fname%29+as+%3Fconte%0D%0AWHERE+{%0D%0A++GRAPH+%3Chttp%3A%2F%2Fwww.deri.ie%2Fabout%2Fteam%2F%3E+{%0D%0A++++%3Fperson%0D%0A++++a+foaf%3APerson+%3B%0D%0A+++++foaf%3AgivenName+%3Fname%3B%0D%0A++++++++foaf%3Aphone+%3Fphone+.%0D%0A++}+%0D%0A}+ORDER+BY++DESC%28%3Fconte%29+%3Fname%0D%0A&format=text%2Fhtml&debug=on
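The link above is just the query text URL-encoded into a GET request. As a rough sketch of how a program could build such a link, here is a small Python example. The endpoint path and the parameter names (default-graph-uri, query, format, debug) are copied from the link above; the function name build_query_url is made up for this illustration. Python encodes braces and line breaks slightly differently from the link above, so the result decodes to the same query text but is not character-for-character identical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint path copied from the links in this post.
ENDPOINT = "http://sparql.sindice.com/sparql"

# The "most common names" query from above, kept as plain text.
QUERY = """PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT  distinct ?name, count(?name) as ?conte
WHERE {
  GRAPH <http://www.deri.ie/about/team/> {
    ?person
      a foaf:Person ;
      foaf:givenName ?name ;
      foaf:phone ?phone .
  }
} ORDER BY  DESC(?conte) ?name"""


def build_query_url(query, result_format="text/html"):
    """Return a GET URL carrying `query` (hypothetical helper name)."""
    params = {
        "default-graph-uri": "",   # left empty, as in the link above
        "query": query,
        "format": result_format,
        "debug": "on",
    }
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return ENDPOINT + "?" + urlencode(params)


url = build_query_url(QUERY)
print(url)

# Round trip: decoding the URL gives back exactly the query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

Pasting the printed URL into a browser is equivalent to typing the query into the endpoint's web form.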

Or why don’t we look up the phone numbers of everybody named “Brian”?

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT  ?person ?phone
WHERE {
  GRAPH <http://www.deri.ie/about/team/> {
    ?person a foaf:Person ;
      foaf:givenName "Brian" ;
      foaf:phone ?phone .
  }
}

http://sparql.sindice.com/sparql?default-graph-uri=&query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0ASELECT++%3Fperson+%3Fphone%0D%0AWHERE+{%0D%0AGRAPH+%3Chttp%3A%2F%2Fwww.deri.ie%2Fabout%2Fteam%2F%3E+{%0D%0A%3Fperson+a+foaf%3APerson+%3B%0D%0Afoaf%3AgivenName+%22Brian%22+%3B%0D%0Afoaf%3Aphone+%3Fphone+.%0D%0A}%0D%0A}&format=text%2Fhtml&debug=on
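The link above asks for an HTML table (format=text/html). A program would more likely want machine-readable results. The sketch below assumes the endpoint can also return the standard W3C SPARQL Query Results JSON Format when a different format value is requested; that is an assumption, not something shown in this post. The sample response is invented for illustration: the URI and phone number are placeholders, not data from the endpoint, and the function name rows_from_sparql_json is made up.

```python
import json

# Invented sample in the W3C SPARQL Query Results JSON Format.
# The values are placeholders, NOT real data from sparql.sindice.com.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["person", "phone"]},
  "results": {"bindings": [
    {"person": {"type": "uri", "value": "http://example.org/people/brian"},
     "phone":  {"type": "uri", "value": "tel:+000000000"}}
  ]}
}
"""


def rows_from_sparql_json(text):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # A variable can be unbound in a given row, hence the .get() calls.
        rows.append({v: binding.get(v, {}).get("value") for v in variables})
    return rows


rows = rows_from_sparql_json(SAMPLE_RESPONSE)
for row in rows:
    print(row["person"], row["phone"])
```

Each row ends up as a simple mapping from variable name to value, which is usually all an application needs.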

But the cool stuff obviously comes when you start joining other datasets, e.g. papers we wrote, places we attended, etc.

For example, just a few months ago the Linked Data workshop (LDOW2011) took place. Its homepage, http://events.linkeddata.org, is also marked up with RDFa.

So we can ask Sindice: is there anyone on the current DERI team page who presented a paper at that workshop and, if so, what is the title of the paper and what is their current DERI profile picture?

In naive SPARQL, that looks something like this:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT  ?name, ?workshop_paper_title, ?deri_picture
WHERE {
  GRAPH <http://events.linkeddata.org/ldow2011/> {
    ?ldowPerson a foaf:Person ;
      foaf:name ?name .
    ?paper dc:creator ?ldowPerson ;
      dc:title ?workshop_paper_title .
  }
  GRAPH <http://www.deri.ie/about/team/> {
    ?deriPerson a foaf:Person ;
      foaf:givenName ?givenName ;
      foaf:familyName ?familyName ;
      foaf:img ?deri_picture .
  }

  FILTER (regex(?name, ?givenName))
  FILTER (regex(?name, ?familyName))
}

http://sparql.sindice.com/sparql?default-graph-uri=&query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0APREFIX+dc%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2F%3E%0D%0ASELECT++%3Fname%2C+%3Fworkshop_paper_title%2C+%3Fderi_picture%0D%0AWHERE+{%0D%0A++++GRAPH+%3Chttp%3A%2F%2Fevents.linkeddata.org%2Fldow2011%2F%3E+{%0D%0A++++++%3FldowPerson+a+foaf%3APerson%3B%0D%0A+++++++++++++++++++++foaf%3Aname+%3Fname.%0D%0A+++++++%3Fpaper+dc%3Acreator+%3FldowPerson%3B%0D%0A++++++++++++++++++dc%3Atitle+%3Fworkshop_paper_title.%0D%0A%0D%0A+++++}%0D%0A++++GRAPH+%3Chttp%3A%2F%2Fwww.deri.ie%2Fabout%2Fteam%2F%3E+{%0D%0A+++++++++%3FderiPerson+a+foaf%3APerson%3B%0D%0A++++++++++++++++foaf%3AgivenName+%3FgivenName%3B%0D%0A+++++++++++++++foaf%3AfamilyName+%3FfamilyName%3B%0D%0A+++++++++++++++foaf%3Aimg+%3Fderi_picture.%0D%0A+++++}%0D%0A+++++FILTER+%28regex%28%3Fname%2C+%3FgivenName%29%29%0D%0A+++++FILTER+%28regex%28%3Fname%2C+%3FfamilyName%29%29%0D%0A}&format=text%2Fhtml&debug=on
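The two FILTERs in the query use each DERI name part as a regular expression and accept a match anywhere in the workshop name, which is what makes the query naive. The Python sketch below mirrors that matching logic outside the endpoint, using invented names rather than data from either graph. It shows the kind of false positive this can produce, next to one possible stricter variant (escaped, whole-word matching). The stricter variant is only an illustration of the trade-off, not something the query above or the endpoint does.

```python
import re


def naive_match(full_name, given, family):
    """Mirror the two FILTERs: each name part is used as a regex pattern
    and may match anywhere inside the full name."""
    return bool(re.search(given, full_name)) and bool(re.search(family, full_name))


def whole_word_match(full_name, given, family):
    """Stricter variant: escape each part and require word boundaries."""
    def has_word(part):
        pattern = r"\b" + re.escape(part) + r"\b"
        return re.search(pattern, full_name) is not None
    return has_word(given) and has_word(family)


# Hypothetical names, invented for illustration only.
print(naive_match("Ann Lee", "Ann", "Lee"))              # True
print(naive_match("Annabel Leeson", "Ann", "Lee"))       # True: a false positive
print(whole_word_match("Annabel Leeson", "Ann", "Lee"))  # False
print(whole_word_match("Ann Lee", "Ann", "Lee"))         # True
```

For a one-page team list the naive version is usually good enough, but the false positive above is worth keeping in mind when joining larger graphs on names.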

But the best part is that all the data is LIVE: it is updated every few minutes, and the endpoint currently holds and refreshes more than 12 billion triples. By the way, that is fewer triples than Siren, our frontend, currently holds, because we also feed Siren with materialized reasoning results.

Imagine the implications: any website that complies with Sindice’s effective data acquisition practices can become a data source for you to use and interlink. You can therefore join linked data sources, content management systems, e-commerce websites, sites offering microformats, etc. at will, and pretty much in real time as new data appears or is updated on the Web.

Of course this is a beta, so there will certainly be issues to report; we look forward to hearing your feedback. Rest assured we’re constantly working to improve the service and are on our way to a pretty exciting roadmap.

Gio on behalf of the team.

Written by Giovanni Tummarello on Jun 14, 2011. Post filed under Sindice.

3 comments

  1. Comment by Ivan Herman on June 15, 2011 

    Congratulations, guys…

  2. Pingback from Geoffnet » Blog Archive » Data on Demand on August 27, 2011 

    [...] Sindice LTD Established + his 12 billion triples SPARQL endpoint (sindice.com) [...]

  3. Comment by dipendra singh on January 12, 2012 

    Great Work guys…..
    Can Sindice SPARQL Endpoints be queried from desktop application just like typing query on http://sparql.sindice.com/.
