Sindice Blog

  • Sindice now supports Efficient Data discovery and Sync

    Posted on Jul 9, 2010 with 1 comment

    So far semantic web search engines and semantic aggregation services have been inserting datasets by hand or have been based on “random walk” like crawls with no data completeness or freshness guarantees.

    After quite some work, we are happy to announce that Sindice now supports effective large-scale data acquisition with *efficient syncing* capabilities based on existing standards (a specific use of the sitemap protocol).

    For example, if you publish 300,000 products using RDFa or whatever else you prefer (microformats, 303s, etc.) and make sure you comply with the proposed method, Sindice will now guarantee:

    a) to crawl your dataset completely (this might take some time since we do it “politely”)

    b) ..but to crawl you only once and then fetch just the updated URLs on a daily basis! (so data updates are guaranteed to be timely)

    So this is not “crawling” anymore, but rather a live, “DB-like” connection between remote, diverse datasets, all based on HTTP. In our opinion this is a *very* important step forward for semantic web data aggregation infrastructures.
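
    To make the daily update step concrete, here is a minimal sketch of how a consumer could pick out only the changed URLs from a sitemap. It assumes the publisher keeps the standard sitemap `<lastmod>` field up to date; the exact requirements are in the publishing specification linked in this post, and the function name, sample data and example.org URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_changed_since(sitemap_xml, last_sync):
    """Return the <loc> of every entry whose <lastmod> is on or after last_sync."""
    changed = []
    for entry in ET.fromstring(sitemap_xml).iter(NS + "url"):
        loc = entry.findtext(NS + "loc")
        lastmod = entry.findtext(NS + "lastmod")
        if loc is None:
            continue
        # Entries without <lastmod> are refetched, to stay on the safe side.
        if lastmod is None or date.fromisoformat(lastmod.strip()[:10]) >= last_sync:
            changed.append(loc.strip())
    return changed


# Hypothetical sitemap with three product pages.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/product/1</loc><lastmod>2010-07-08</lastmod></url>
  <url><loc>http://example.org/product/2</loc><lastmod>2010-06-01</lastmod></url>
  <url><loc>http://example.org/product/3</loc></url>
</urlset>"""

# Only product 1 (recently modified) and product 3 (no date) need a refetch.
print(urls_changed_since(SAMPLE, date(2010, 7, 1)))
```

    With 300,000 products and a handful of daily changes, only those few URLs would be fetched again, which is what makes the sync cheap for both sides.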

    The specification we support (and how to make sure you’re being properly indexed) is published here (pretty simple stuff, actually!)

    http://sindice.com/developers/publishing

    and results can be seen from websites which are already implementing it (you might indeed already be doing so without knowing..)

    http://sindice.com/search?q=domain:www.scribd.com+date:last_week&qt=term

    Why not make sure that your site can be effectively kept in sync today?

    As always, we look forward to comments, suggestions and ideas on how to better serve your data needs (e.g. yes, we’ll also support the Openlink dataset sync proposal once the specs are finalized). Feel free to ask specific questions about this or any other Sindice-related issue on our dev forum http://sindice.com/main/forum

    Giovanni,

    on behalf of the team http://sindice.com/main/about. Special credits for this to Tamas Benko and Robert Fuller.

    p.s. we’re still interested in hiring selected researchers and developers

  • Sindice planned downtime this weekend

    Posted on Jun 9, 2010 with no comments

    Hi. Due to an expansion of one of our datacentres (and the electrical work that this implies), Sindice and related services such as sig.ma will be down from 1730 GMT+1, 11-Jun-2010 (Friday) to 1730 GMT+1, 12-Jun-2010 (Saturday). This major upgrade will give us increased room to grow the Sindice infrastructure over time. On 27-May-2010 we hit a major milestone in Sindice of having indexed 100 million documents (over 6.5 billion triples) from the semantic web. At current rates of data acquisition, we expect to hit the 200 million document mark before Christmas – so we’ll need that extra room!

    If you have any further queries on this downtime, please leave a comment or contact us via the Sindice Developers group.

    Update: Thanks to the efforts of NUIG’s ISS team, sindice.com is back online ahead of schedule. All services (including sig.ma) should now be operational.

  • Any23 v0.4.0 Released

    Posted on May 27, 2010 with no comments

    Dear All,
    the Sindice FBK team is proud to announce the Any23 0.4.0 release.

    In this new release we paid particular attention to data validation and correction;
    in particular, we can now extract Open Graph Protocol[1] metadata even
    when it is affected by syntactical errors[2].

    We’ve also added full support for the N-Quads[3] format.
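
    For readers who have not met N-Quads: a statement is an N-Triples statement plus an optional fourth term naming the graph (context) it came from. The snippet below is only a simplified illustration of that shape in Python; it is not Any23 code, the regular expression covers just the common term forms, and the example.org URLs are made up.

```python
import re

# Simplified term matcher: IRIs, blank nodes, and literals with an
# optional datatype or language tag. Not a full N-Quads grammar.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:[A-Za-z0-9_-]+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z-]+)?'
)


def parse_nquad(line):
    """Split one statement into (subject, predicate, object, graph or None)."""
    terms = TERM.findall(line)
    if len(terms) == 3:
        terms.append(None)  # plain triple: no graph label given
    if len(terms) != 4:
        raise ValueError("not an N-Quads statement: %r" % line)
    return tuple(terms)


quad = parse_nquad(
    '<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> '
    '"Alice"@en <http://example.org/doc1> .'
)
print(quad[3])  # the graph label, here the document the triple came from
```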

    As usual, everybody is invited to adopt this new release and
    report any bugs encountered[4].

    A live demo is running at [5], please feel free to try it.

    We’re planning the 0.5.0 milestone, so if you are waiting for
    a particular fix or improvement, please submit it to us using our issue tracker[4].

    Below is an extract of the 0.4.0 release notes [6]:
    • The any23-service module has been separated from the any23-core module, the Ant build system has been dropped. [Issue 44]
    • Added support for HTML metadata (RDFa / Microformats) validation and correction (validator). [Issue 77]
    • Added flag to disable the nesting relationship property enrichment. [Issue 67]
    • Improved coverage of Microformat tests. [Issue 65]
    • Improved documentation. [Issue 44]
    • Various code consolidation. [Issues 68, 69, 70, 71, 72, 73, 74, 77]

    Thanks for supporting our work.

    The Any23 Developers Team

  • Any23 v0.3.0 Released

    Posted on Apr 23, 2010 with no comments

    Dear All,

    we’re pleased to announce the Any23 0.3.0 release.

    Please keep in mind that this is a beta, so everybody using Any23 in a development
    setting is invited to migrate to this latest version and to report any bugs
    in our issue tracker [1].

    As usual we have a live demo running at [2], please feel free to try it.

    We’re planning Milestone 0.4, so if you are waiting for the fix of a
    particular issue, please verify that it is open and, if needed, add a comment to request a higher priority.

    Finally, below you’ll find an extract of the 0.3.0 release notes [3]:

    1. Added detection and enrichment of nested microformats. [Issue #61]
    2. Added detection and support of N-Quads as input and output format. [Issue #7]
    3. General Improvements in RDFa extraction. [Issue #12, Issue #14]
    4. Added support of Turtle embedded in HTML script tag. [Issue #62]
    5. Improvement in encoding support. [Issue #43]
    6. Improvement in Core API. [Issue #27]
    7. Improved support for Species Microformat. [Issue #63]

    Thanks for supporting our work.

    The Any23 Developers Team

    [1] http://code.google.com/p/any23/issues/list
    [2] http://any23.org/
    [3] http://any23.googlecode.com/svn/trunk/any23-core/RELEASE-NOTES.txt

  • Any23 v0.2 Released

    Posted on Feb 19, 2010 with no comments
    We are proud to announce a new release of Any23: Anything to Triples
    Any23 is a Java library that parses RDF from a variety of Web document formats.
    The currently supported input formats are RDFa, RDF/XML, Turtle, N3, N-Triples,
    and a number of Microformats.
    Any23 is an Open Source project that originated from code created within the Sindice project
    and is now used both inside Sindice and in related projects, e.g. Sig.ma.
    Any23 comes with a handy command-line tool for parsing RDF and converting between formats.
    We have also set up a demo service where you can try any23 online and use a REST API to convert
    between different RDF formats, similar in spirit to triplr.org:
    The major new features in this release are:
    • Redesigned Java API
      - Input from string, stream, file, or URI
      - Allow choosing which extractors to use
      - Report origin of triples (document/extractor) to client processors
      - Various processors/serializers for extracted triples
    • Added flexible command-line tool for easy testing
    • Vastly improved website and documentation
    • Media type and encoding detection via Apache Tika
    • Switched RDF library from Jena to Sesame
    • Added Maven build
    • Better RDF extraction from Microformats
    • Extractors come with example file to document typical in- and output
    • Major refactoring
    • Lots and lots of bugfixes
    The following people have contributed to this release:
    Michele Mostarda and Davide Palmisano (FBK, Trento, Italy, Web of Data Unit (WED) );
    Richard Cyganiak and Jurgen Umbrich (DERI, NUI Galway, Ireland);
    Michele Catasta (EPFL, Lausanne, Switzerland), Giovanni Tummarello.
    This release is the first result of the joint effort between Fondazione Bruno Kessler and DERI,
    that recently started working together on Sindice. We strongly believe that Any23 could benefit from the wide
    Open Source community, especially considering the license under which it has been released.
    We think that the new Any23 v0.2, now integrated in the Sindice ingestion pipeline,
    will impact on the quality of the indexed data.
    This release joins the other Open Source components of Sindice, notably the Semantic Information Retrieval Index (SIREN), available at http://siren.sindice.com .
  • Sindice downtime notice

    Posted on Feb 8, 2010 with no comments

    We’re talking about downtime again on the Sindice project but this time it’s not due to climate change or some other unforeseen event. Our datacentre cooling system is getting a major upgrade. Unfortunately, due to the scale of this upgrade (the existing cooling system which takes up a significant area of the datacentre needs to be removed, and a new set of units need to be installed) – we need to bring down everything running in the datacentre.

    While it is technically possible to host a second Sindice site in a second datacentre (we do have access to such a datacentre) – it’s not a scenario we have examined in detail to date due to the overall high availability of our current datacentre. In addition, Sindice currently runs on 14 servers – meaning we’d have to have 14 servers sitting around in standby mode. While we’d like to have 100% availability through fail-over, we prefer to offer higher performance on a day-to-day basis by using those 14 servers in other ways (including a new Hadoop cluster which we’re in the process of bringing online – more about that in a future posting).

    In summary, the current schedule for our datacentre is to be down from 0000 GMT, 05-Mar-2010 (Friday) to 2000 GMT, 08-Mar-2010 (Monday). We’re hoping the people working on the datacentre upgrades can get things done sooner but the schedule is pretty aggressive already. Obviously if we get things back up sooner, we’ll let you know.

    If you have any further queries on this downtime, please leave a comment or contact us via the Sindice Developers group.

  • Sindice outage

    Posted on Nov 25, 2009 with no comments

    Since I started working at DERI in January of this year, one of my jobs has been to improve the Sindice infrastructure to make it more robust and fault-tolerant. The Sindice infrastructure consists of a total of 15 servers (including two small Hadoop clusters). We’ve taken various actions to improve availability, including adopting a basic release process, production-hardening the code, monitoring individual subsystems and tuning systems which exhibited higher failure rates. We’ve been using Zabbix as our monitoring framework since April. It indicates that the overall availability of Sindice’s search functionality since 20-Apr-2009 has been 94.2%. This means that Sindice has been unavailable for a total of about 12 days. For a system built on bleeding edge components and developed as a set of research projects first, and a coherent production infrastructure second, this isn’t bad but we are continuously working on improving this (as to whether we’ll ever achieve the mythical five nines, that’s a discussion for another day). About half of that downtime is accounted for by planned outages for maintenance, system upgrades and infrastructure maintenance.
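
    As a quick sanity check of the availability figure, the downtime can be recomputed from the percentage. This is only a back-of-the-envelope sketch: it assumes the measurement window runs from 20-Apr-2009 to the date of this post (25-Nov-2009), which may differ slightly from the window Zabbix actually reported on.

```python
from datetime import date


def downtime_days(availability, start, end):
    """Days of unavailability implied by an availability ratio over [start, end]."""
    return (1.0 - availability) * (end - start).days


# Assumption: the window ends on the post date.
window_start = date(2009, 4, 20)
window_end = date(2009, 11, 25)

days_down = downtime_days(0.942, window_start, window_end)
print(round(days_down, 1))  # 12.7, in line with the "about 12 days" above
```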

    The most recent outage we suffered (for about 2.5 days) started on 18-Nov-2009 and the source of the outage was somewhat more serious than a component or server failure. Those of you from Ireland will be aware that after very heavy rain, the country experienced severe flooding in various areas. Unfortunately, this excessive rainfall caused part of the ground floor of the DERI building to flood. Thanks to the quick response from DERI staff members and NUI Galway facilities people, all systems in our data centre were shut down before any damage was caused (water and electricity don’t mix all that well). The Sindice infrastructure is entirely located in the DERI building at this time (while we share some facilities with the main NUI Galway data centre, we don’t currently have a fully replicated infrastructure for the Sindice project). Once the cause of the flooding had been addressed and the data centre had been fully dried out, we took some time to verify that all of the electrical infrastructure was intact before we proceeded to restart the Sindice systems on Friday morning. I’m happy to report that all systems came back up without problems and Sindice (and related projects including Sig.ma) resumed operations.

    Obviously, we’ve learned some lessons during this outage – we’re currently identifying various measures we can put in place to avoid similar problems in the future and we’re also taking the opportunity to review the overall disaster recovery and business continuity measures we have in place for DERI and the services provided by DERI such as Sindice. As a result of this incident, we believe we’ll be able to deliver a more robust and reliable service and maybe even reach two or three nines availability for Sindice in 2010!

  • New: Inspector, Full Cache API – all with Online Data Reasoning

    Posted on Oct 12, 2009 with 1 comment

    We’re happy to release today 2 distinct yet interplaying features in Sindice: The Sindice Inspector and the Sindice Cache API (both including Sindice’s Online Data Reasoning).

    A) Sindice Inspector

    - Takes anything with structured data in it (RDF, RDFa, Microformats), and provides several handy ways to view it:

    • in a “Sigma” based view
    • a novel card/frame based view
    • a SVG based interactive graph view (a la google map)
    • sortable triples, with prettyprint namespace support
    • full ontology tree view for Online Data Reasoning debugging

    - Does live Online Data Reasoning: allows a data publisher to see which ontologies are implicitly or explicitly imported (directly or indirectly, via other ontologies), and

    • visualizes the full closure of inferred statements using different colors.
    • provides a tree of the ontologies in use and their dependencies.

    Ways to use it:

    • a tool from Sindice.com (the inspect tab on the homepage) or the Inspector Homepage
    • a Bookmarklet (drag it to your bookmark bar and use while browsing)
    • an API either raw (Any23 output, no reasoning) or with reasoning
    • to send links to structured data files around, every visualization has its own permalink.

    Examples:

    • Sortable Triples with Reasoning Closure (try)
    • Ontology Import Tree of Axel Polleres’s DERI foaf file (try; notice how the GEO ontology and the DCTerms elements are only imported indirectly via other ontologies, yet contribute to the reasoning)
    • Graph of the RDF representation of Axel’s Facebook public profile (from microformats) (try).
    • All the data in an eventful.com page (try)

    B) Sindice Cache API

    Tired of your favorite data being offline now and then? Convinced that you can’t really do a linked data application without a network safety net? Rejoice :-) . With the Sindice Cache API you can:

    • access and retrieve any of the currently 64 million RDF sources in Sindice with a simple REST api;
    • access and retrieve the full set of inferred triples created by Online Data Reasoning (instant access to the precomputed closure, 0 wait time);
    • visualize the cache with the same handy tools as available in the inspector. Just try from any Sindice result page.

    Feel free to use the Sindice Cache with reasoning as a fallback service when data is not available, and as a way to add full recursive ontology importing + reasoning support to your application (with none of the massive pain associated with the full procedure). A document you need is not in the cache? Just ping the URL in and it will be available within minutes.
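
    The fallback pattern itself is simple. The sketch below shows it in Python with the two fetchers passed in as plain functions; it deliberately does not show the real Cache API calls (see the API documentation for those), and every name in it is a made-up stand-in.

```python
def fetch_with_fallback(url, fetch_live, fetch_cached):
    """Try the live source first; on a network-level failure, use the cache."""
    try:
        return fetch_live(url), "live"
    except OSError:
        # URLError, ConnectionError and timeouts are all OSError subclasses.
        return fetch_cached(url), "cache"


# Stand-ins for a real HTTP client and a real cache client (hypothetical).
def offline_source(url):
    raise ConnectionError("host unreachable")


def cached_copy(url):
    return "<cached RDF for %s>" % url


body, origin = fetch_with_fallback(
    "http://example.org/foaf.rdf", offline_source, cached_copy
)
print(origin)  # cache
```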

    ———————————-

    But what is Online Data Reasoning exactly? Background and implementation

    When RDF/RDFa data is put out there on the web, the “explicit” information given in the data is just part of the story. Reasoning is a process by which the vocabularies used in the data (ontologies) are analyzed to extract new pieces of information (otherwise implicit) – e.g. giving a “date of birth” to an “agent” would make that a “person”.

    Typically, Semantic Web software has relied on manually importing the ontologies it needed. If ontologies are published using W3C best practices, however, it becomes possible to do the entire process automatically: ontological properties can be resolved (i.e. as they are resolvable URIs, they are fetched over HTTP) and therefore everything can be imported automatically …

    … but ontologies can import other ontologies, and form circles and so on.

    Implementing the process efficiently (with maximal reuse of previous results across files), scalably (with algorithms and frameworks based on Hadoop) and correctly (e.g. no “contamination” of reasoning results between different data files) was not – believe me – an easy task.

    We invested in it, however, since we believe it is important to answer queries based also on the “implicit” part of the document: if nothing else, it allows the markup on web pages to be considerably more concise.

    A glimpse behind the curtain

    The implementation of the large scale reasoning pipeline is based on [1] and engineered on top of the Hadoop platform.

    Ontologies in RDF/RDFa documents (but also in microformats which have been converted to RDF in some form) can be “included” either explicitly, with owl:imports declarations, or implicitly, by using property and class URIs that link directly to the data describing the ontology itself. As ontologies might refer to other ontologies, the import process then needs to be iterated recursively until the dependency graph of ontologies is complete. For example, Fig. 1 shows the dependency graph of ontologies of a document Doc. The document imports a first ontology, A, through the use of the property A#property, and a second ontology, B, through the use of the class B#class. In addition, the ontology B imports the ontology C through an owl:imports assertion.

    Fig. 1: A document and its dependency graph of ontologies (in bold) materialised. The dashed circles represent inferred ontological assertions in their own context.

    In general, for each document one would have to recursively fetch the ontologies, create a model composed of these and the original document, and only at this point compute the deductive closure.
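
    The recursive import step can be sketched in a few lines of Python. This is an illustration of the idea only, not the Hadoop-based implementation: the import graph is a hard-coded dictionary instead of HTTP fetches, it reuses the names from Fig. 1, and the extra C → B import is an addition of ours to show that cycles do not cause endless recursion.

```python
def import_closure(direct_imports, imports):
    """Return every ontology reachable from the ones a document uses directly."""
    seen = set()
    stack = list(direct_imports)
    while stack:
        onto = stack.pop()
        if onto in seen:
            continue  # already handled: this is what breaks import cycles
        seen.add(onto)
        stack.extend(imports.get(onto, ()))
    return seen


# Fig. 1: Doc uses A and B directly, and B imports C.
# The C -> B edge is hypothetical, added only to create a cycle.
IMPORTS = {"A": [], "B": ["C"], "C": ["B"]}

print(sorted(import_closure(["A", "B"], IMPORTS)))  # ['A', 'B', 'C']
```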

    Clearly, doing this in isolation for each file is bound to be very time consuming and in general inefficient, since a lot of processing time will be spent recalculating deductions which could instead be reused for possibly large classes of other documents during the indexing procedure. To reuse previous inference results, a simple strategy has traditionally been to put several (if not all) of the ontologies together, then compute their deductive closure and reuse it across all the documents to be indexed. While this simple approach is computationally convenient, it sometimes turns out to be inappropriate, since data publishers can reuse or extend ontology terms with divergent points of view.

    For example, if an ontology other than the FOAF vocabulary itself extends foaf:name as an inverse functional property (i.e. turns the name into a primary key), an inferencing agent should not consider this axiom outside the scope of the documents that reference this particular ontology. Doing so would severely decrease the precision of semantic querying, by diluting the results with many false positives. For this reason, a fundamental requirement of the procedure that we developed has been to confine ontological assertions and reasoning tasks into contexts in order to track the provenance of inference results. Coming back to Fig. 1, the inferred ontological assertions are stored in “virtual contexts” (dashed circles), and the dashed links symbolise the origin of the inferred assertions, i.e. the set of ontologies that lead to such assertions. By tracking the provenance of each single assertion, we are able to restrict inference to a particular context and prevent one ontology from altering the semantics of other ontologies on a global scale.

    This context-dependent Online Data Reasoning mechanism allows Sindice to avoid the deduction of undesirable assertions in documents, a common risk when working with the Web of Data. However, this context mechanism does not restrict the freedom of expression of data publishers. Data publishers are still allowed to reuse and extend ontologies in any manner, but the consequences of their modifications will be confined to their own context, i.e. their published documents, and will not alter the intended semantics of the other documents on the Web.
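
    One way to picture the confinement is to key every reasoning result by the set of ontologies it was derived from. The Python sketch below is a toy illustration under that assumption, not the actual pipeline: the axioms are simplified stand-ins for the real vocabularies, "custom" is an invented ontology name, and memoisation stands in for the reuse of partial results across documents.

```python
from functools import lru_cache

# Toy axioms per ontology (illustrative only).
ONTOLOGY_AXIOMS = {
    "foaf": frozenset({("foaf:name", "rdfs:subPropertyOf", "rdfs:label")}),
    "custom": frozenset(
        {("foaf:name", "rdf:type", "owl:InverseFunctionalProperty")}
    ),
}


@lru_cache(maxsize=None)
def axioms_in_context(ontologies):
    """Axioms visible to a document: only those of the ontologies it imports.

    The argument is a frozenset, so documents sharing the same import set
    share one cached result, and nothing leaks between different sets.
    """
    visible = set()
    for name in ontologies:
        visible |= ONTOLOGY_AXIOMS[name]
    return frozenset(visible)


IFP_AXIOM = ("foaf:name", "rdf:type", "owl:InverseFunctionalProperty")
plain_doc = frozenset({"foaf"})               # uses FOAF only
extending_doc = frozenset({"foaf", "custom"})  # also imports the extension

print(IFP_AXIOM in axioms_in_context(plain_doc))      # False
print(IFP_AXIOM in axioms_in_context(extending_doc))  # True
```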

    When all the fragments of data, ontologies and partial reasoning results have been identified, these are all put inside instances of OntoText OWLIM where the actual final inference occurs to produce the triples which will then be indexed. Big thanks to all at Ontotext for supporting us with issues when we encountered them.

    The technique presented here is mainly focused on the TBox level (ontology level), since it considers only import relations between ontologies as dependency relationships. But the context-dependent reasoning mechanism can be extended to the ABox level: instead of following import relations, it would be possible, for example, to follow relations such as owl:sameAs between instances.

    A bit about the performance

    from Robert Fuller [DERI]

    date Wed, Aug 26, 2009 at 11:15 AM

    subject Re: do we have numbers about the latest implementation of the reasoner?

    Hi,

    Based on the current running of dump splitter, which is processing dbpedia.org

    Yesterday we processed 2.5million documents which works out to (just under) 30 per second across 3 hadoop nodes, so that’s 10 per second on each node. I think there are 4 concurrent jobs running on each node, which works out for a single document 400ms to apply reasoning and update the index and hbase.

    I hope this helps.

    Rob.

    It does indeed.
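
    For anyone who wants to redo the arithmetic in Rob’s mail, here it is spelled out. The inputs are the figures quoted above (2.5 million documents in a day, 3 nodes, 4 concurrent jobs per node, the last one being Rob’s estimate); the variable names are ours.

```python
DOCS_PER_DAY = 2_500_000
NODES = 3
JOBS_PER_NODE = 4          # Rob's estimate of concurrent jobs per node
SECONDS_PER_DAY = 24 * 60 * 60

docs_per_second = DOCS_PER_DAY / SECONDS_PER_DAY    # just under 30
docs_per_second_per_node = docs_per_second / NODES  # roughly 10

# With several jobs working in parallel on a node, each document takes
# JOBS_PER_NODE times longer than the node's per-document interval.
ms_per_document = 1000 * JOBS_PER_NODE / docs_per_second_per_node

print(round(docs_per_second, 1))   # 28.9
print(round(ms_per_document))      # 415, i.e. the "400ms" ballpark
```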

    Credits:

    Reasoning Services and methodology [1] – Renaud Delbru, Michele Catasta, Robert Fuller

    Data extraction – Any23 library – http://code.google.com/p/any23/ – Richard Cyganiak, Jurgen Umbrich, Michele Catasta

    User interface and frontend programming – Szymon Danielczyk

    The Card/Frame and SVG visualization courtesy of http://rhizomik.net/ – thanks especially to Roberto Garcia who supported us during his visit this summer with his wife Rosa and little Victor :-) hi guys.

    [1] R. Delbru, A. Polleres, G. Tummarello and S. Decker. Context Dependent Reasoning for Semantic Documents in Sindice. In Proceedings of the 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS). Karlsruhe, Germany, 2008.

  • Position opening: Senior Engineer at Sindice.com

    Posted on Aug 22, 2009 with no comments

    The Sindice Team at DERI.ie now has an opening for a senior software engineer to work on the Sindice.com infrastructure (and other related projects such as http://sig.ma and others of the Data Intensive Infrastructures research group). The working location is Galway, Ireland.

    We require strong Java development skills, proven experience with enterprise development practices and frameworks, great flexibility, a pragmatic problem-solving mindset and a positive attitude. Knowledge of semantic web technologies and experience with Lucene and Hadoop are considered important pluses.

    Please write directly to g.tummarelloatgmail.com

    thanks for forwarding this
    Giovanni

  • Sig.ma – Live views on the Web of Data

    Posted on Jul 22, 2009 with 5 comments

    Today we release Sig.ma, Hurray \o/ !
    Sig.ma is a pretty advanced application implemented on top of Sindice which gives very visual and interactive access to the “Web of Data” as a whole. The best thing to do, really, is to watch the screencast. Bear with the first 60 seconds, where I introduce the Web of Data; it’s pretty fast after that.

    While the demo is probably.. agreeably cool :-) there is more to talk about.

    While Sig.ma is by no means the first data aggregator for the Semantic Web, its contribution is to show that the sum is really bigger than its single parts, and that exciting possibilities lie in a holistic approach to automatic semistructured data discovery and consolidation.

    In Sig.ma, elements such as large scale semantic web indexing, logic reasoning, data aggregation heuristics, pragmatic ontology alignments and, last but not least, user interaction and refinement, all play together to provide entity descriptions which become live, embeddable data mash ups.

    An interesting example:

    When we first saw the B&W pictures (e.g. see the demo) pop up automatically the first time we ran Sig.ma, we were really excited: that DERI data had been there forever, yet had never been meaningfully used or integrated.. let alone automatically! That DERI RDF file does not reuse the right URIs for people, doesn’t use Inverse Functional Properties such as “emails”, and uses only one of many ways to say “author”.

    But here it was! That file was there, discovered automatically and contributing marvelously to the mashup, providing information about papers (including technical reports that would not be listed otherwise), an extra picture, the phone number, a confirmation of the personal homepage, research projects and more.

    Note: this doesn’t mean that the DERI file is bad at all actually. It’s simply not unrealistically great, in other words it was created with a realistic effort, the same that we can expect from any data publisher.

    There was no way to get that very useful data with classic Semantic Web inference and rule consolidation alone. What it took instead was a mix of semantic web practices and tricks with pragmatic elements of soft computing (quite basic ones, indeed).

    In our opinion it all makes sense and inspires the following thoughts:

    1. A little semantics might in fact go a long way: there is no way there could be something comparable to Sig.ma had we not had a large core of semantically structured data (the Web of Data itself). Publish way more, please! Be it in whatever format can be consolidated to RDF.
    2. … it goes in fact an even longer way when the user is involved and can, with pragmatic actions (e.g. “reject” or “approve”), steer and validate the results.
    3. For data publishers: just like on the HTML web, you can simply care only about your own site. If you don’t reuse other people’s URIs, or you don’t put in “sameAs” links, or you don’t really use the ontology everyone else is using, then.. it will most likely work all the same, and for most applications!
    4. .. but overdescribe
      Be verbose with your semantic descriptions, more than you would be for a human. A well described entity will be the best possible “entity” identifier that one could think of. It will automatically generate invisible but robust links to other entity descriptions. So don’t just write name = fooguy; make sure you expose all you have (and are willing to share) and let aggregation engines use this data to at least do the best consolidation possible. Good descriptions will also make you show up more often in semantic aggregations, foster new applications and make people more likely to integrate with you.
    5. For data consumers: we are really working for you, and willing to do the hard work.
      This is again very similar to the HTML world. How difficult is it to make sense of all the broken HTML out there? Very! How many people really have to do it? Just a few, the browser makers. Others can reuse their efforts and concentrate on other aspects. Sig.ma and Sindice are engines that do the hard part for you as a Web of Data developer. We provide open services and open source components (heck, at the end of the week we’re even releasing our index open source :-) , next the reasoning engine). If there is interest and a market, others will come and there will be more choice.

    So let me conclude with a good-fortune  Sig.ma of Stefan Decker :-) (50 sources, sigma “Stefan Decker” + add info “Stefan Decker DERI”, with a  couple of manual sources added or deleted)


    And the rest follows from the small FAQ in the Sig.ma about page. Cheers!

    Why is this potentially revolutionary?

    As appropriate data sources become available (pages annotated with RDFa or Microformats), Sig.ma is in a different league in terms of information richness and precision compared to methods solely based on web text analysis.

    Sig.ma can be used by humans and software agents alike to obtain structured data about any entity.

    Is Sigma noise free?

    Not yet. Sig.ma still employs heuristics for many aspects and has to deal with heterogeneous data in the current Web of Data – a very early stage environment! What we can say however is:

    1. Sig.ma is interactive and can learn from its usage: when a user deletes a piece of information or a source, Sigma writes it down and that piece of information is less likely to show back at a later time.
    2. We have deliberately chosen very simple strategies at this point to test the general idea more than advanced strategies: the potential for improvement is tremendous.
    3. The Web of Data itself is very new: until very recently there was basically no way to see this data in action and markup has been done on a best effort-hacker enthusiastic-leap of faith way. Now that Google and Yahoo are starting to recognize the value of page markup, it is realistic to expect improvements in data coverage and quality.

    Why does my phone number/picture/favourite movie not appear?

    Pages exposing RDF, RDFa or Microformats will appear. If you or your company want information to be found on the web of data, it is very simple to mark up your HTML using RDFa, then submit it to Sindice. You will find it returned by Sig.ma within 10-15 minutes.

    How is Sig.ma built? Can I build applications like Sig.ma?

    Sig.ma is enabled by Sindice, an index of the web of data. Thanks to Sindice, Sig.ma can accurately locate sources of web data using not only text but also precise attribute value searches and more. Sindice is alive and growing, constantly finding new information, receiving “pings” and immediately adding new documents etc. Where to start? Please write on our forum.

    Acknowledgements

    Sig.ma and Sindice are built at DERI mainly within the OKKaM Project (ICT-215032) but also with the support of the Science Foundation Ireland under Grant No. SFI/02/CE1/I131, of the ROMULUS project(ICT-217031) and the iMP project.

    R&D by  Michele Catasta, Richard Cyganiak, Szymon Danielczyk and Giovanni Tummarello.
