Sindice Blog
-
Sindice downtime notice
We’re talking about downtime again on the Sindice project but this time it’s not due to climate change or some other unforeseen event. Our datacentre cooling system is getting a major upgrade. Unfortunately, due to the scale of this upgrade (the existing cooling system, which takes up a significant area of the datacentre, needs to be removed, and a new set of units needs to be installed), we need to bring down everything running in the datacentre.
While technically possible to host a second Sindice site in a second datacentre (we do have access to such a datacentre) – it’s not a scenario we have examined in detail to date due to the overall high availability of our current datacentre. In addition, Sindice currently runs on 14 servers – meaning we’d have to have 14 servers sitting around in standby mode. While we’d like to have 100% availability through fail-over, we prefer to offer the higher performance on a day to day basis using those 14 servers in other ways (including a new Hadoop cluster which we’re in the process of bringing online – more about that in a future posting).
In summary, the current schedule for our datacentre is to be down from 0000 GMT, 05-Mar-2010 (Friday) to 2000 GMT, 08-Mar-2010 (Monday). We’re hoping the people working on the datacentre upgrades can get things done sooner but the schedule is pretty aggressive already. Obviously if we get things back up sooner, we’ll let you know.
If you have any further queries on this downtime, please leave a comment or contact us via the Sindice Developers group.
-
Sindice outage
Since I started working at DERI in January of this year, one of my jobs has been to improve the Sindice infrastructure to make it more robust and fault-tolerant. The Sindice infrastructure consists of a total of 15 servers (including two small Hadoop clusters). We’ve taken various actions to improve availability, including adopting a basic release process, production-hardening the code, monitoring individual subsystems and tuning systems which exhibited higher failure rates. We’ve been using Zabbix as our monitoring framework since April. It indicates that the overall availability of Sindice’s search functionality since 20-Apr-2009 has been 94.2%. This means that Sindice has been unavailable for a total of about 12 days. For a system built on bleeding-edge components and developed as a set of research projects first, and a coherent production infrastructure second, this isn’t bad, but we are continuously working on improving it (as to whether we’ll ever achieve the mythical five nines, that’s a discussion for another day). About half of that downtime is accounted for by planned outages for system upgrades and infrastructure maintenance.
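As a rough sanity check on those figures, here is the arithmetic as a small Python sketch. It is a back-of-the-envelope calculation, not output from Zabbix, and the function names are made up for illustration.

```python
def downtime_days(availability_pct: float, window_days: float) -> float:
    """Days of unavailability implied by an availability percentage."""
    return window_days * (1 - availability_pct / 100)


def window_days_for(availability_pct: float, downtime: float) -> float:
    """Length of the monitoring window implied by a given amount of downtime."""
    return downtime / (1 - availability_pct / 100)


# About 12 days down at 94.2% availability implies a window of roughly 207 days.
print(round(window_days_for(94.2, 12)))

# Yearly downtime budgets for the various "nines".
for label, pct in [("two nines", 99.0), ("three nines", 99.9), ("five nines", 99.999)]:
    hours = downtime_days(pct, 365) * 24
    print(f"{label}: about {hours:.2f} hours of downtime per year")
```

Two nines leaves a budget of about 3.65 days per year, so a single multi-day outage is enough to use most of it.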
The most recent outage we suffered (for about 2.5 days) started on 18-Nov-2009 and the source of the outage was somewhat more serious than a component or server failure. Those of you from Ireland will be aware that after very heavy rain, the country experienced severe flooding in various areas. Unfortunately, this excessive rainfall caused part of the ground floor of the DERI building to flood. Thanks to the quick response from DERI staff members and NUI Galway facilities people, all systems in our data centre were shut down before any damage was caused (water and electricity don’t mix all that well). The Sindice infrastructure is entirely located in the DERI building at this time (while we share some facilities with the main NUI Galway data centre, we don’t currently have a fully replicated infrastructure for the Sindice project). Once the cause of the flooding had been addressed and the data centre had been fully dried out, we took some time to verify that all of the electrical infrastructure was intact before we proceeded to restart the Sindice systems on Friday morning. I’m happy to report that all systems came back up without problems and Sindice (and related projects including Sig.ma) resumed operations.
Obviously, we’ve learned some lessons during this outage – we’re currently identifying various measures we can put in place to avoid similar problems in the future and we’re also taking the opportunity to review the overall disaster recovery and business continuity measures we have in place for DERI and the services provided by DERI such as Sindice. As a result of this incident, we believe we’ll be able to deliver a more robust and reliable service and maybe even reach two or three nines availability for Sindice in 2010!
-
New: Inspector, Full Cache API – all with Online Data Reasoning
We’re happy to release today 2 distinct yet interplaying features in Sindice: The Sindice Inspector and the Sindice Cache API (both including Sindice’s Online Data Reasoning).
A) Sindice Inspector
- Takes anything with structured data on it (RDF, RDFa, Microformats), and provides several handy ways to view it:
- in a “Sigma” based view
- a novel card/frame based view
- an SVG based interactive graph view (a la Google Maps)
- sortable triples, with prettyprint namespace support
- full ontology tree view for Online Data Reasoning debugging
- Does live Online Data Reasoning: allows a data publisher to see which ontologies are implicitly or explicitly imported (directly or indirectly, via other ontologies) and
- visualizes the full closure of inferred statements using different colors.
- provides a tree of the ontologies in use and their dependencies.
Ways to use it:
- a tool from Sindice.com (the inspect tab on the homepage) or the Inspector Homepage
- a Bookmarklet (drag it to your bookmark bar and use while browsing)
- an API either raw (Any23 output, no reasoning) or with reasoning
- to send links to structured data files around, every visualization has its own permalink.
Examples:
- Sortable Triples with Reasoning Closure (try)
- Ontology Import Tree of Axel Polleres’s DERI foaf file (try; notice how the GEO ontology and the DCTerms elements are only imported indirectly via other ontologies yet still contribute to the reasoning)
- Graph of the RDF representation of Axel’s Facebook public profile (from microformats) (try).
- All the data in an eventful.com page (try)
B) Sindice Cache API
Tired of your favorite data being offline now and then? Convinced that you can’t really do a linked data application without any network safety net? Rejoice.
With the Sindice Cache API you can:
- access and retrieve any of the currently 64 million RDF sources in Sindice with a simple REST API;
- access and retrieve the full set of inferred triples created by Online Data Reasoning (instant access to the precomputed closure, 0 wait time);
- visualize the cache with the same handy tools as available in the inspector. Just try from any Sindice result page.
Feel free to use Sindice Cache with reasoning as a fallback service when data is not available and as a way to add full recursive ontology importing + reasoning support to your application (with none of the massive pain associated with the full procedure). A document you need is not in the cache? Just ping the URL in and it will be available within minutes.
———————————-
But what is Online Data Reasoning exactly? Background and implementation
When RDF/RDFa data is put out there on the web, the “explicit” information given in the data is just part of the story. Reasoning is a process by which the vocabularies used in the data (ontologies) are analyzed to extract new pieces of information (otherwise implicit), e.g. giving a “date of birth” to an “agent” would make that agent a “person”.
Typically, the ontologies needed by Semantic Web software have been imported manually. If ontologies are published using W3C best practices, however, it becomes possible to do the entire process automatically: ontological properties can be resolved (i.e. as they are resolvable URIs, they are HTTP fetched) and therefore all can be imported automatically …
… but ontologies can import other ontologies, and form circles and so on.
Implementing the process efficiently (with maximal reuse of previous results across files), scalably (with algorithms and frameworks based on Hadoop) and correctly (e.g. no “contamination” of reasoning results between different data files) was not, believe me, an easy task.
We invested in it, however, since we believe it is important to answer queries based also on the “implicit” part of the document: if nothing else, it allows the markup on web pages to be considerably more concise.
A glimpse behind the curtain
The implementation of the large scale reasoning pipeline is based on [1] and engineered on top of the Hadoop platform.
Ontologies in RDF/RDFa documents (but also in microformats which have been converted to RDF in some form) can be “included” either explicitly with owl:imports declarations or implicitly by using property and class URIs that link directly to the data describing the ontology itself. As ontologies might refer to other ontologies, the import process then needs to be recursively iterated until the dependency graph of ontologies is complete. For example, Fig. 1 shows the dependency graph of ontologies of a document Doc. The document imports a first ontology, A, through the use of the property A#property and a second ontology, B, through the use of the class B#class. In addition, the ontology B imports the ontology C with an owl:imports assertion.

Fig. 1: A document and its dependency graph of ontologies (in bold) materialised. The dashed circles represent inferred ontological assertions in their own context.
In general, for each document one would have to recursively fetch the ontologies, create a model composed of these and the original document, and only at this point compute the deductive closure.
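To make the recursive import step concrete, here is a minimal Python sketch over the graph of Fig. 1. It is an illustration only: a plain dictionary stands in for what would really be HTTP fetches of ontology URIs, and the names `IMPORTS` and `ontology_closure` are made up for this example.

```python
from collections import deque

# Hypothetical import graph mirroring Fig. 1: what each resource refers to,
# either through owl:imports or through the property/class URIs it uses.
IMPORTS = {
    "Doc": ["A", "B"],
    "A": [],
    "B": ["C"],
    "C": [],
}


def ontology_closure(start, imports):
    """Return every ontology reachable from `start`, visiting each one once."""
    seen = set()
    queue = deque(imports.get(start, []))
    while queue:
        onto = queue.popleft()
        if onto in seen:  # guards against circular imports
            continue
        seen.add(onto)
        queue.extend(imports.get(onto, []))
    return seen


print(sorted(ontology_closure("Doc", IMPORTS)))
```

The `seen` set is what keeps circular imports from looping forever; only once the closure is complete would the model be assembled and the deductive closure computed.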
Clearly, doing this in isolation for each file is bound to be very time consuming and in general inefficient, since a lot of processing time will be used to recalculate deductions which could instead be reused for possibly large classes of other documents during the indexing procedure. To reuse previous inference results, a simple strategy has traditionally been to put several (if not all) of the ontologies together, then compute and reuse their deductive closure across all the documents to be indexed. While this simple approach is computationally convenient, it turns out to be sometimes inappropriate, since data publishers can reuse or extend ontology terms with divergent points of view.
For example, if an ontology other than the FOAF vocabulary itself extends foaf:name as an inverse functional property (i.e. treats the name as a primary key), an inferencing agent should not consider this axiom outside the scope of the document that references this particular ontology. Doing so would severely decrease the precision of semantic querying, by diluting the results with many false positives. For this reason, a fundamental requirement of the procedure that we developed has been to confine ontological assertions and reasoning tasks into contexts in order to track the provenance of inference results. Coming back to Fig. 1, the inferred ontological assertions are stored in “virtual contexts” (dashed circles), and the dashed links symbolise the origin of the inferred assertions, i.e. the set of ontologies that lead to such assertions. By tracking the provenance of each single assertion, we are able to restrict inference to a particular context and prevent one ontology from altering the semantics of other ontologies on a global scale.
This context-dependent Online Data Reasoning mechanism allows Sindice to avoid the deduction of undesirable assertions in documents, a common risk when working with the Web of Data. However, this context mechanism does not restrict the freedom of expression of data publishers. Data publishers are still allowed to reuse and extend ontologies in any manner, but the consequences of their modifications will be confined to their own context, i.e. their published documents, and will not alter the intended semantics of the other documents on the Web.
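The foaf:name example can be sketched as a toy in Python. This is only an illustration of the confinement idea, not the actual Sindice implementation: the ontology names, the `AXIOMS` table and both functions are hypothetical.

```python
# Toy axioms per ontology (hypothetical content). "ext" is a third-party
# ontology that redefines foaf:name as an inverse functional property.
AXIOMS = {
    "foaf": {("foaf:name", "rdf:type", "owl:DatatypeProperty")},
    "ext": {("foaf:name", "rdf:type", "owl:InverseFunctionalProperty")},
}


def context_axioms(doc_ontologies):
    """Axioms visible to one document: only those of the ontologies it imports."""
    visible = set()
    for onto in doc_ontologies:
        visible |= AXIOMS[onto]
    return visible


def global_axioms():
    """The naive alternative: merge every ontology for every document."""
    return set().union(*AXIOMS.values())


IFP = ("foaf:name", "rdf:type", "owl:InverseFunctionalProperty")

# A document that only uses FOAF never sees the third-party axiom...
print(IFP in context_axioms(["foaf"]))
# ...while a document that imports "ext" opts in to it.
print(IFP in context_axioms(["foaf", "ext"]))
# With a single global closure, every document would be affected.
print(IFP in global_axioms())
```

In the global case two unrelated people who happen to share a name could be merged into one entity, which is exactly the kind of false positive the contexts are there to prevent.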
When all the fragments of data, ontologies and partial reasoning results have been identified, these are all put inside instances of Ontotext OWLIM, where the actual final inference occurs to produce the triples which will then be indexed. Big thanks to all at Ontotext for supporting us with issues when we encountered them.
The technique presented here is mainly focused on the TBox level (ontology level), since it considers only import relations between ontologies as dependency relationships. But the context-dependent reasoning mechanism can be extended to the ABox level. Instead of following import relations, it would be possible, for example, to follow relations such as owl:sameAs between instances.
A bit about the performance
from Robert Fuller [DERI]
date Wed, Aug 26, 2009 at 11:15 AM
subject Re: do we have numbers about the latest implementation of the reasoner?
Hi,
Based on the current running of dump splitter, which is processing dbpedia.org
Yesterday we processed 2.5million documents which works out to (just under) 30 per second across 3 hadoop nodes, so that’s 10 per second on each node. I think there are 4 concurrent jobs running on each node, which works out for a single document 400ms to apply reasoning and update the index and hbase.
I hope this helps.
Rob.
It does indeed.
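For the curious, Rob’s numbers can be re-derived in a few lines of Python. This is just the arithmetic from the email; the function name is made up, and the email rounds the per-node rate up to 10 per second, which is where its 400ms comes from.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_document_ms(docs_per_day, nodes, jobs_per_node):
    """Average wall-clock time one job spends on one document, in milliseconds."""
    docs_per_second = docs_per_day / SECONDS_PER_DAY
    per_job_rate = docs_per_second / (nodes * jobs_per_node)
    return 1000 / per_job_rate


print(f"{2_500_000 / SECONDS_PER_DAY:.1f} documents per second overall")
print(f"{per_document_ms(2_500_000, nodes=3, jobs_per_node=4):.0f} ms per document")
```

Without the rounding, the figure comes out at roughly 415 ms per document.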
Credits:
Reasoning Services and methodology [1] – Renaud Delbru, Michele Catasta, Robert Fuller
Data extraction – Any23 library – http://code.google.com/p/any23/ – Richard Cyganiak, Jurgen Umbrich, Michele Catasta
User interface and frontend programming – Szymon Danielczyk
The Card/Frame and SVG visualization courtesy of http://rhizomik.net/ – thanks especially to Roberto Garcia who supported us during his visit this summer with his wife Rosa and little Victor
[1] R. Delbru, A. Polleres, G. Tummarello and S. Decker. Context Dependent Reasoning for Semantic Documents in Sindice. In Proceedings of the 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS). Karlsruhe, Germany, 2008.
-
Position opening: Senior Engineer at Sindice.com
The Sindice Team at DERI.ie now has an opening for a senior software engineer to work on the Sindice.com infrastructure (and other related projects such as http://sig.ma and others of the Data Intensive Infrastructures research group). Working location is Galway, Ireland.
We require strong Java development skills, proven experience with enterprise development practices and frameworks, great flexibility, a pragmatic problem solving mindset and a positive attitude. Knowledge of semantic web technologies and experience with Lucene and Hadoop are considered important pluses.
Please write directly to g.tummarelloatgmail.com
thanks for forwarding this
Giovanni
-
Sig.ma – Live views on the Web of Data
Today we release Sig.ma, Hurray \o/ !
Sig.ma is a pretty advanced application implemented on top of Sindice which gives very visual and interactive access to the “Web of Data” as a whole. The best thing to do, really, is watch the screencast. Bear with the first 60 seconds where I introduce the Web of Data; it’s pretty fast after that.
While the demo is probably.. agreeably cool, there is more to talk about.
While Sig.ma is by no means the first data aggregator for the Semantic Web, its contribution is to show that the whole is really greater than the sum of its parts and that exciting possibilities lie in a holistic approach to automatic semistructured data discovery and consolidation.
In Sig.ma, elements such as large scale semantic web indexing, logic reasoning, data aggregation heuristics, pragmatic ontology alignments and, last but not least, user interaction and refinement, all play together to provide entity descriptions which become live, embeddable data mash ups.
An interesting example:
When we first saw the B&W pictures (e.g. see the demo) pop up automatically the first time we ran Sig.ma, we were really excited: that DERI data had been there forever yet never meaningfully used or integrated.. let alone automatically! That DERI RDF file does not reuse the right URIs for people, doesn’t use Inverse Functional Properties such as “emails”, and uses only one of many ways to say “author”.
But here it was! That file was there, discovered automatically and contributing marvelously to the mashup providing information about papers, (including technical reports that would not be listed otherwise) an extra picture, the phone number, a confirmation of the personal homepage, research projects and more.
Note: this doesn’t mean that the DERI file is bad at all actually. It’s simply not unrealistically great, in other words it was created with a realistic effort, the same that we can expect from any data publisher.
There was no way to get that very useful data with classic Semantic Web inference and rule consolidation alone. All it took instead was a mix of semantic web practices and tricks with pragmatism and elements of soft computing (quite basic ones indeed).
In our opinion it all makes sense and inspires the following thoughts:
- A little semantics might in fact go a long way: there is no way there could be something comparable to Sig.ma had we not had a large core of semantically structured data (the Web of Data itself). Publish way more please! Be this in whatever format can be consolidated to RDF.
- … it goes in fact an even longer way when the user is involved and can, with pragmatic actions (e.g. “reject” or “approve”), steer and validate the results.
- For data publishers: just like on the HTML web, you can simply care only about your site. If you don’t reuse other people’s URIs, or you don’t put “sameAs” links, or you don’t really use the ontology everyone else is using, then.. it can most likely work all the same, and for most applications!
- .. but overdescribe. Be verbose with your semantic descriptions, more than you would be for a human. A well described entity will be the best possible “entity” identifier that one could think of. It will automatically generate invisible but robust links to other entity descriptions. So don’t just write name = fooguy; make sure you expose all you have (and are willing to share) and let aggregation engines use this data to at least do the best consolidation possible. Good descriptions will also make you show up more often in semantic aggregations, foster new applications and make people more likely to integrate with you.
- For data consumers: We are working for you really and willing to do the hard work. This is again very similar to the HTML world. How difficult is it to make sense of all the broken HTML out there? Very! How many people really have to do it? Just a few, the browser makers. Others can reuse their efforts and concentrate on other aspects. Sig.ma and Sindice are engines that do the hard part for you as a Web of Data developer. We provide open services and open source components (heck, at the end of the week we’re even releasing our index as open source, and next the reasoning engine). If there is interest and a market, others will come and there will be more choice.
So let me conclude with a good-fortune Sig.ma of Stefan Decker
(50 sources, sigma “Stefan Decker” + add info “Stefan Decker DERI”, with a couple of manual sources added or deleted)
And the rest follows from the small FAQ in the Sig.ma about page. Cheers!
Why is this potentially revolutionary?
As appropriate data sources become available (pages annotated with RDFa or Microformats), Sig.ma is in a different league in terms of information richness and precision compared to methods solely based on web text analysis.
Sig.ma can be used by humans and software agents alike to obtain structured data about any entity.
Is Sigma noise free?
Not yet. Sig.ma still employs heuristics for many aspects and has to deal with heterogeneous data in the current Web of Data, a very early stage environment! What we can say however is:
- Sig.ma is interactive and can learn from its usage: when a user deletes a piece of information or a source, Sig.ma writes it down and that piece of information is less likely to show up again at a later time.
- We have deliberately chosen very simple strategies at this point to test the general idea more than advanced strategies: the potential for improvement is tremendous.
- The Web of Data itself is very new: until very recently there was basically no way to see this data in action, and markup has been done in a best-effort, hacker-enthusiast, leap-of-faith way. Now that Google and Yahoo are starting to recognize the value of page markup, it is realistic to expect improvements in data coverage and quality.
Why does my phone number/picture/favourite movie not appear?
Pages exposing RDF, RDFa or Microformats will appear. If you or your company want information to be found on the web of data, it is very simple to mark up your HTML using RDFa, then submit it to Sindice. You will find it returned by Sig.ma within 10-15 minutes.
How is Sig.ma built? Can I build applications like Sig.ma?
Sig.ma is enabled by Sindice, an index of the web of data. Thanks to Sindice, Sig.ma can accurately locate sources of web data using not only text but also precise attribute value searches and more. Sindice is alive and growing, constantly finding new information, receiving “pings” and immediately adding new documents etc. Where to start? Please write on our forum.
Acknowledgements
Sig.ma and Sindice are built at DERI mainly within the OKKaM Project (ICT-215032) but also with the support of the Science Foundation Ireland under Grant No. SFI/02/CE1/I131, of the ROMULUS project(ICT-217031) and the iMP project.
R&D by Michele Catasta, Richard Cyganiak, Szymon Danielczyk and Giovanni Tummarello.
-
Bring on them pings!
In standard web search, we seldom need to obtain as results pages that were published just a few minutes ago.
This is very different for semantic indexes that want to help you build applications that respond to “information out there” as quickly as possible. A way to make this happen is to efficiently process large amounts of “pings”.
Pinging Sindice is simple… submit your URL and find your page in the index 15 minutes later. Go to http://sindice.com/main/submit and paste in your URL, or use the ping API described at http://sindice.com/developers/api
At least that’s the theory. In practice this has historically never worked too well in Sindice.
Up to two weeks ago.
As normal, the story starts earlier. Three months ago I answered “yes” when Giovanni asked “can you fix our ping manager?”. I didn’t know what I was putting myself up for… but it’s fixed now and Giovanni is a happy man.
Here’s how it works:
- Pings are received by a servlet and inserted into a MySQL database with a status of ‘queued’.
- Every 5 minutes the ping manager wakes up and looks for any queued pings. For each ping:
  - change the ping status to ‘in_progress’
  - fetch the pinged URL using Any23
  - apply TBox reasoning, recursively loading referenced ontologies and fetching any missing ontologies as required
  - apply ABox reasoning
  - update the Sindice page repository
  - update the Sindice master index (SIREn)
  - change the ping status to ‘processed’.
- Every 20 minutes copy the master (writable) index to the slave (read-only) index.
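The status handling in the steps above can be sketched in a few lines of Python. This is a simplified illustration, not our production code: an in-memory SQLite database stands in for MySQL, the table layout is assumed, and `fetch_and_index` is a hypothetical placeholder for the fetching, reasoning and indexing steps.

```python
import sqlite3


def make_db():
    """Create the (assumed) pings table in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE pings (id INTEGER PRIMARY KEY, url TEXT, status TEXT)")
    return db


def receive_ping(db, url):
    """What the servlet does: store the ping as 'queued'."""
    db.execute("INSERT INTO pings (url, status) VALUES (?, 'queued')", (url,))
    db.commit()


def fetch_and_index(url):
    """Placeholder for: fetch with Any23, TBox and ABox reasoning,
    page repository update and SIREn index update."""
    return True


def run_once(db, process=fetch_and_index):
    """One wake-up of the ping manager: work through everything currently queued."""
    rows = db.execute("SELECT id, url FROM pings WHERE status = 'queued'").fetchall()
    for ping_id, url in rows:
        db.execute("UPDATE pings SET status = 'in_progress' WHERE id = ?", (ping_id,))
        db.commit()
        process(url)
        db.execute("UPDATE pings SET status = 'processed' WHERE id = ?", (ping_id,))
        db.commit()
    return len(rows)


db = make_db()
receive_ping(db, "http://example.org/foaf.rdf")
print(run_once(db), "ping(s) processed")
```

Committing the ‘in_progress’ status before the expensive work starts means a ping that gets stuck halfway is visible in the database rather than silently lost.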
Some of the changes we’ve made recently to improve the ping manager are:
- Precise monitoring of each stage of the ping manager in our master control panel
- Reports and triggers when something goes wrong
- We added a ‘slow queue’ for large sets of pings. While we typically receive about 5000 pings each day, we have received as many as 100,000 from a single site! In these cases we move the large set of pings into the ‘slow queue’ allowing them to be processed with a lower priority than individual pings or smaller data sets.
- We’ve made many refactorings to process large ontologies and to limit ontology fetching recursion.
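The ‘slow queue’ idea boils down to routing by batch size. Here is a minimal sketch in Python; the class name and the threshold are assumptions for illustration, since this post does not state the real cut-off.

```python
from collections import deque


class PingQueues:
    def __init__(self, slow_threshold=1000):
        # slow_threshold is an assumed cut-off, not the value used in Sindice
        self.slow_threshold = slow_threshold
        self.fast = deque()
        self.slow = deque()

    def submit(self, urls):
        """Route a batch of pings by size: big batches go to the slow queue."""
        target = self.slow if len(urls) > self.slow_threshold else self.fast
        target.extend(urls)

    def next_ping(self):
        """Serve the fast queue first; touch the slow queue only when it is empty."""
        if self.fast:
            return self.fast.popleft()
        if self.slow:
            return self.slow.popleft()
        return None


queues = PingQueues(slow_threshold=2)
queues.submit(["http://big.example/1", "http://big.example/2", "http://big.example/3"])
queues.submit(["http://small.example/1"])
print(queues.next_ping())  # the single ping is served before the large batch
```

With a strict priority like this, a site that sends 100,000 pings cannot delay somebody’s single ping, although the large batch only progresses while the fast queue is empty.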
As a work in progress, the upcoming improvements include:
- Batch processing of large sets of pings. We will fetch the pings with the Sindice crawler and process them in our Hadoop cluster. Note that this is already what happens when Sindice processes large sites via semantic sitemaps ( http://sw.deri.org/2007/07/sitemapextension/ ) or blocks of crawled data. Once this is done, the single machine limit on ping volume is overcome.
- Loopback from the ping manager into the crawler. We will use the pings as seeds for the Sindice crawler, which will navigate sites honouring the robots.txt file.
- Auto refresh of pings. We will refresh our index automatically after your site changes in order to maintain up-to-date results.
Fifteen years in the software game and I’m still learning every day. Here are three important lessons I have learned or re-learned in the past few months:
- Caching is magic. Applying reasoning is a fairly expensive operation, but we’ve found that very many documents use the same set of ontologies. By caching each set of TBox reasoning results we save tons of computational overhead, far more than anticipated. Right now we are using JCS (http://jakarta.apache.org/jcs/) which is working great. Later I hope we can move to memcached (http://www.danga.com/memcached/) in order to a) easily share the cache between servers and b) automatically expire the cached results.
- Reason in a sandbox. Renaud’s contextual reasoning approach[1] provides good and reliable results without any risk of contamination. Renaud is working on a blog article that will explain in practice what it does and what you can do to make your data work nicely with it.
- Java profilers are very useful. YourKit is great. (http://www.yourkit.com/java/profiler/ )
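To illustrate the caching lesson, here is a minimal Python sketch of a cache keyed by the set of ontologies a document uses. It is an illustration under assumptions: a plain dictionary stands in for JCS, there is no expiry, and `TBoxCache` and `expensive_reasoning` are made-up names.

```python
class TBoxCache:
    """Cache TBox reasoning results per distinct set of ontologies."""

    def __init__(self, compute):
        self._compute = compute  # the expensive reasoning step
        self._store = {}
        self.hits = 0
        self.misses = 0

    def closure_for(self, ontologies):
        # frozenset makes the key independent of the order of the ontologies
        key = frozenset(ontologies)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._compute(key)
        return self._store[key]


def expensive_reasoning(ontologies):
    """Placeholder for the real TBox reasoning over a set of ontologies."""
    return {f"inferred-from-{name}" for name in ontologies}


cache = TBoxCache(expensive_reasoning)
cache.closure_for(["foaf", "dc"])
cache.closure_for(["dc", "foaf"])  # same set in a different order: cache hit
cache.closure_for(["foaf", "sioc"])
print(cache.hits, "hit(s),", cache.misses, "miss(es)")
```

Because so many documents share the same handful of ontologies, the number of distinct keys stays small compared to the number of documents, which is where the savings come from.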
So don’t forget or be shy: make sure your RDF producing app gets indexed right away by pinging us when new information is made available.
Keep them pings coming! We are now ready for plenty of them.
And special thanks to Michele, Renaud, Stephen, Giovanni and the rest of the great team here at DERI, it’s a pleasure working with ye!
[1] R. Delbru, A. Polleres, G. Tummarello and S. Decker. Context Dependent Reasoning for Semantic Documents in Sindice. In Proceedings of the 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS). Karlsruhe, Germany, 2008. [slides]
-
New Cluster and Sindice downtime (2 June 2009)
We’re getting today a new mighty UPS to support our refurbished-but-still-cool 40 machines/80TB map reduce cluster for large dataset processing. The cluster is based on hardware coming from a decommissioned ICHEC cluster (thanks again guys).
For purely numerical computations, as they were previously used, these machines are not considered cost effective anymore. For IO-bound scenarios, however, they still have a lot to give. So we fitted them with plenty of new HDs and we look forward to scaling up our current operations and trying entirely new things as well.
Unfortunately, as a result of this new installation, the DERI data center will be shut down for a few hours which includes the 10 or so current Sindice servers.
Sorry for the inconvenience and the late notification. Thanks in advance to Michele and Stephen who’ll be taking care of this tonight.
Will post pictures and some stats once it’s all in place.
-
Physically attending next
.. time to update the “meet us at” information on the homepage. It has been stuck for some time.
Rest assured, however, that as static as the homepage unfortunately seems, things inside are probably more active than ever, and we’ll be posting about it this coming week.
In the meantime, you might be wondering where we can be met in person.
Your first choice should be, of course, lovely Galway, Ireland: DERI Institute, ground floor, first corridor to the left. Hotel and restaurant prices are seriously going down due to the recession, so it’s even more pleasant to stay for a weekend.
As an alternative, if you’re in Europe, we’re often around. For different projects we often visit Brussels, Rome, Barcelona and London. As far as the US goes, we plan to be at the following venues:
- Semantic Technology Conference 09, San Jose 14-18 June
  Meet Giovanni Tummarello or director Stefan Decker. Among other things, we’ll be giving a tutorial on the web of data for the pragmatic developer.
- IJCAI 2009, Pasadena CA, 11-17 July
- International Semantic Web Conference 2009, Washington 25-29 October 2009
  Interested in talking about the Web of Data and how to leverage it? So are we.
-
Sindice downtime, new staff and a cool use case
Dear all, apologies for the downtime these past 2 days. Basically, 2 AC units went down in the DERI data-center, which caused excessive moisture, which caused the UPS to blow smoke (a UPS way of saying “go on without me”), which brought the data-center down.
Systems are now restored and we’re back in operation. In unrelated news, I’d like to welcome our 2 new full time engineers at the DERI Data Intensive Infrastructure group (http://di2.deri.org), Robert Fuller and Stephen Mulcahy. They’ll be multiplexing between several projects at DI2 but the effects will be visible on Sindice soon, e.g. with the new Sindice release (Beta 2), possibly in a couple of months.
Ah, I was happy to find this in the wild: Sindice used to search Twine pages having certain topics. That’s the idea folks: putting RDF into web pages is now so simple and direct with RDFa, and more and more sites are doing it (e.g. also to beat microformat accessibility issues). Producing RDFa from your own application is straightforward, and thanks to Sindice all your instances can be connected in a snap (and for free) with each other and with others that are using the same markup.
-
Sindice: Developer and Cluster Infrastructure Responsible needed
To further develop Sindice and for a new project for large scale analysis of Web Data, the data intensive infrastructure group at DERI (yes, us) currently has openings for 2 positions:
- Scientific Developer. We’re looking for a highly skilled and motivated individual to help us develop research demonstrators and prototypes. We require advanced and proven experience with the Java technology stack. Experience with Lucene, Hadoop and/or Ajax development is a strong plus.
- Linux Cluster Responsible. We are building a 500 core cluster starting soon. The candidate will be a whiz at building, improving and managing it, its users and the experiments that will run on it.
We offer:
- Highly competent team members to work with. A very supportive environment.
- Work which is strictly connected to research on new topics and new projects, and which allows opportunities for experimentation and deployment of new ideas.
- For academia, a very competitive salary.
- Location is the DERI institute in Galway, Ireland. Galway is a very picturesque city in the West of Ireland.
Positions available immediately, salary on application, please write to giovanni.tummarello@deri.org
Update 1/1/2009: These positions have been filled, thanks to all those who responded.













