Sindice Blog

  • Bring on them pings!

    Posted on Jun 9, 2009 with 1 comment

    In standard web search, we seldom need results to include pages that were published just a few minutes ago.

    This is very different for a semantic index that wants to help you build applications that respond to “information out there” as quickly as possible. One way to make this happen is to efficiently process large amounts of “pings”.

    Pinging Sindice is simple: submit your URL and find your page in the index 15 minutes later. Go to http://sindice.com/main/submit and paste in your URL, or use the ping API described at http://sindice.com/developers/api

    At least that’s the theory. In practice, this has historically never worked too well in Sindice.

    Up to two weeks ago.

    As usual, the story starts earlier. Three months ago I answered “yes” when Giovanni asked “can you fix our ping manager?”. I didn’t know what I was letting myself in for… but it’s fixed now and Giovanni is a happy man.

    Here’s how it works:

    1. Pings are received by a servlet and inserted into a MySQL database with a status of ‘queued’.
    2. Every 5 minutes the ping manager wakes up and looks for any queued pings. For each ping:
      - change the ping status to ‘in_progress’
      - fetch the pinged URL using any23
      - apply TBox reasoning, recursively loading referenced ontologies and fetching any missing ones as required
      - apply ABox reasoning
      - update the Sindice page repository
      - update the Sindice master index (SIREn)
      - change the ping status to ‘processed’.
    3. Every 20 minutes copy the master (writable) index to the slave (read-only) index.
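
    The status transitions in steps 1 and 2 can be sketched roughly as follows. This is only an illustration, not the actual ping manager: an in-memory SQLite database stands in for MySQL, and the table layout, the function names and the `process` callback (which would do the fetching, reasoning and index updates) are all assumptions made up for the example.

```python
import sqlite3


def make_db():
    # In-memory SQLite stands in for the MySQL database (assumption).
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE pings (id INTEGER PRIMARY KEY, url TEXT, status TEXT)"
    )
    return db


def receive_ping(db, url):
    # Step 1: what the servlet does, store the ping as 'queued'.
    db.execute("INSERT INTO pings (url, status) VALUES (?, 'queued')", (url,))
    db.commit()


def run_ping_manager(db, process):
    # Step 2: one wake-up of the ping manager over all queued pings.
    rows = db.execute(
        "SELECT id, url FROM pings WHERE status = 'queued' ORDER BY id"
    ).fetchall()
    for ping_id, url in rows:
        db.execute(
            "UPDATE pings SET status = 'in_progress' WHERE id = ?", (ping_id,)
        )
        db.commit()
        # Hypothetical callback: fetch, reason, update repository and index.
        process(url)
        db.execute(
            "UPDATE pings SET status = 'processed' WHERE id = ?", (ping_id,)
        )
        db.commit()
    return len(rows)


def statuses(db):
    # Helper for inspecting the queue state, in insertion order.
    return [row[0] for row in db.execute("SELECT status FROM pings ORDER BY id")]
```

    Committing after each status change means that a crash in the middle of processing leaves the ping visibly ‘in_progress’ rather than silently ‘queued’, which is the kind of thing a monitoring panel can then report on.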

    Some of the changes we’ve made recently to improve the ping manager are:

    1. Precise monitoring of each stage of the ping manager in our master control panel
    2. Reports and triggers when something goes wrong
    3. We added a ‘slow queue’ for large sets of pings. While we typically receive about 5000 pings each day, we have received as many as 100,000 from a single site! In these cases we move the large set of pings into the ‘slow queue’ allowing them to be processed with a lower priority than individual pings or smaller data sets.
    4. We’ve made many refactorings to process large ontologies and to limit ontology fetching recursion.
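
    To make the ‘slow queue’ idea in item 3 concrete, here is a small sketch of two queues where large batches are only served when no ordinary pings are waiting. The class name and the threshold value are invented for the example; the real cut-off used by Sindice is not stated here.

```python
from collections import deque

# Hypothetical cut-off: batches this large go to the slow queue.
SLOW_QUEUE_THRESHOLD = 1000


class PingQueues:
    def __init__(self, threshold=SLOW_QUEUE_THRESHOLD):
        self.threshold = threshold
        self.fast = deque()
        self.slow = deque()

    def submit(self, urls):
        # A batch at or above the threshold is routed to the slow queue.
        target = self.slow if len(urls) >= self.threshold else self.fast
        target.extend(urls)

    def next_ping(self):
        # Slow pings are served only when the fast queue is empty.
        if self.fast:
            return self.fast.popleft()
        if self.slow:
            return self.slow.popleft()
        return None
```

    With this arrangement a site that submits 100,000 pings at once cannot delay a single ping from somebody else by more than the one item currently being processed.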

    As a work in progress, the upcoming improvements include:

    1. Batch processing of large sets of pings. We will fetch the pings with the Sindice crawler and process them in our Hadoop cluster. Note that this is what already happens when Sindice processes large sites via semantic sitemaps ( http://sw.deri.org/2007/07/sitemapextension/ ) or blocks of crawled data. Once this is done, the single-machine limit on ping volume will be overcome.
    2. Loopback from the ping manager into the crawler. We will use the pings as seeds for the Sindice crawler, which will navigate sites honouring the robots.txt file.
    3. Auto refresh of pings. We will refresh our index automatically after your site changes in order to maintain up-to-date results.

    Fifteen years in the software game and I’m still learning every day. Here are three important lessons I have learned or re-learned in the past few months:

    1. Caching is magic. Applying reasoning is a fairly expensive operation, but we’ve found that very many documents use the same set of ontologies. By caching each set of TBox reasoning results we save tons of computational overhead, far more than anticipated. Right now we are using JCS (http://jakarta.apache.org/jcs/), which is working great. Later I hope we can move to memcached (http://www.danga.com/memcached/) in order to a) easily share the cache between servers and b) automatically expire the cached results.
    2. Reason in a sandbox. Renaud’s contextual reasoning approach[1] provides good and reliable results without any risk of contamination. Renaud is working on a blog article that will explain in practice what it does and what you can do to make your data work nicely with it.
    3. Java profilers are very useful. YourKit is great. (http://www.yourkit.com/java/profiler/ )
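
    The caching idea in lesson 1 can be illustrated with a tiny sketch. It is not the JCS setup we run; it only shows the principle of keying the cache on the set of ontologies a document uses, so that documents sharing the same ontologies share one reasoning result. The class name and the `reason` callback are assumptions for the example.

```python
class TBoxCache:
    def __init__(self, reason):
        # Hypothetical expensive function: ontology set -> reasoning result.
        self._reason = reason
        self._results = {}
        self.hits = 0
        self.misses = 0

    def closure_for(self, ontology_uris):
        # A frozenset key ignores ordering and duplicates in the input.
        key = frozenset(ontology_uris)
        if key in self._results:
            self.hits += 1
        else:
            self.misses += 1
            self._results[key] = self._reason(key)
        return self._results[key]
```

    An in-process dictionary like this never expires entries and cannot be shared between servers, which are exactly the two reasons given above for wanting to move to memcached later.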

    So don’t forget or be shy: make sure your RDF producing app gets indexed right away by pinging us when new information is made available.

    Keep them pings coming! We are now ready for plenty of them.

    And special thanks to Michele, Renaud, Stephen, Giovanni and the rest of the great team here at DERI, it’s a pleasure working with ye!


    [1]  R. Delbru, A. Polleres, G. Tummarello and S. Decker. Context Dependent Reasoning for Semantic Documents in Sindice. In Proceedings of the 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS). Karlsruhe, Germany, 2008. [slides]

  • New Cluster and Sindice downtime (2 June 2009)

    Posted on Jun 2, 2009 with no comments

    Today we’re getting a mighty new UPS to support our refurbished-but-still-cool 40-machine/80TB MapReduce cluster for large dataset processing. The cluster is based on hardware coming from a decommissioned ICHEC cluster (thanks again guys).

    For purely numerical computations, which is what they were previously used for, these machines are not considered cost-effective anymore. For IO-bound scenarios, however, they still have a lot to give. So we fitted them with plenty of new HDs and we look forward to scaling up our current operations and trying entirely new things as well.

    Unfortunately, as a result of this new installation, the DERI data center will be shut down for a few hours, and that includes the 10 or so current Sindice servers.

    Sorry for the inconvenience and the late notification. Thanks in advance to Michele and Stephen who’ll be taking care of this tonight.

    Will post pictures and some stats once it’s all in place :-)

  • Physically attending next

    Posted on May 30, 2009 with no comments

    .. time to upgrade the  “meet us at” information on the homepage. It has been stuck for some time.

    Rest assured, however, that as static as the homepage unfortunately seems, work inside is probably more active than ever, and we’ll be posting about it this coming week.

    In the meantime you might be wondering where we can be met in person.

    Your first choice should be, of course, lovely Galway, Ireland: DERI Institute, ground floor, first corridor to the left. Hotel and restaurant prices are seriously going down due to the recession, so it’s even more pleasant to stay for a weekend.

    As an alternative, if you’re in Europe, we’re often around. For different projects we often visit Brussels, Rome, Barcelona and London. As far as the US goes, we plan to be at the following venues:

    Interested in talking about the Web of Data and how to leverage it? So are we :-)

  • Sindice downtime, new staff and a cool use case

    Posted on Feb 4, 2009 with 2 comments

    Dear all, apologies for the downtime these past 2 days. Basically, 2 AC units went down in the DERI data center, which caused excessive moisture, which caused the UPS to blow smoke (a UPS’s way of saying “go on without me”), which brought the data center down. :-)

    Systems are now restored and we’re back in operation. In unrelated news, I’d like to welcome our 2 new full-time engineers at the DERI Data Intensive Infrastructure group (http://di2.deri.org), Robert Fuller and Stephen Mulcahy. They’ll be multiplexing between several projects at DI2, but the effects will be visible on Sindice soon, e.g. with the new Sindice release (Beta 2), possibly in a couple of months.

    Ah, I was happy to find this in the wild: Sindice used to search Twine pages having certain topics. That’s the idea, folks: putting RDF into web pages is now so simple and direct with RDFa, and more and more sites are doing it (e.g. also to beat microformat accessibility issues). Producing RDFa from your own application is straightforward, and thanks to Sindice all your instances can be connected in a snap (and for free) with each other and with others that are using the same markup.

  • Sindice: Developer and Cluster Infrastructure Responsible needed

    Posted on Dec 5, 2008 with no comments

    To further develop Sindice and for a new project for large-scale analysis of Web Data, the Data Intensive Infrastructure group at DERI (yes, us) currently has openings for 2 positions:

    • Scientific Developer. We’re looking for a highly skilled and motivated individual to help us develop research demonstrators and prototypes. We require advanced and proven experience in the Java technology stack. Experience with Lucene, Hadoop and/or Ajax development is a strong plus.
    • Linux Cluster Responsible. We are building a 500-core cluster starting soon. The candidate will be a whiz at building, improving, and managing it, its users and the experiments that it will run.

    We offer:

    • Highly competent team members to work with. A very supportive environment.
    • Work that is closely connected to research on new topics and new projects, and that offers opportunities for experimentation and deployment of new ideas.
    • For academia, a very competitive salary.
    • Location is the DERI institute in Galway, Ireland. Galway is a very picturesque city in the West of Ireland.

    Positions available immediately, salary on application, please write to giovanni.tummarello@deri.org

    Update 1/1/2009: These positions have been filled, thanks to all those who responded.

  • Sindice Awarded by Saltlux and ESTC

    Posted on Oct 24, 2008 with no comments

    We’re happy to report that Sindice and the Sindice Team have recently received awards on two separate occasions.

    Sindice was awarded 1st prize at the European Semantic Technology Conference (ESTC) Business Idea Contest ( http://www.estc2008.com/index.php/contest ). Unfortunately it is not reported on the homepage just yet, but you can find a nice account here.
    Stefan Decker presented on behalf of the team and the pitch was apparently very well received. He received the prize itself during the ESTC gala dinner.

    In somewhat related news, a few weeks earlier, “Sindice.com: A Document-oriented Lookup Index for Open Linked Data”, by E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn and G. Tummarello, received the Saltlux prize for the best 2007 DERI paper.

    As a team, we wish to thank all for such recognitions.

    As should be clear by now, the Semantic Web is anything but an easy thing to get started with.

    We at Sindice have taken our own route and decided to invest a lot of time in getting a quality infrastructure up for everybody’s use. In an academic setting, this is a somewhat bold choice. As we know, academia is mostly based on publications, and therefore actual infrastructures (which cost a LOT of time and sweat, and carry much more uncertainty than, say, a conference submission) are, generally speaking, an investment that few make.

    It is therefore particularly nice, and useful in fact, to receive these recognitions. They do and will help us continue our commitment toward a more and more useful Web of Data.

  • Sindice Beta 1 index

    Posted on Jun 20, 2008 with no comments

    The Sindice Beta 1 index is now online.

    Apart from the exciting geek wizardry (e.g. the crawling, the reasoning and the indexing are now all performed in pipelined Hadoop jobs), there are plenty of interesting new features to play with:

    • Much improved support for free text queries. Great improvements in the ranking of the results.
    • Much improved support for structured queries: ask for documents where a value appears specifically at certain properties. Examples: “self confessed researchers” or “who claims to know Richard”?
    • Mixed filtering operations allow restrictions on datasets or classes (e.g. try Washington as a person).
    • Microformats! Support for plenty of new microformats has been added and previous support improved.
    • Pings are working fine, so anything pinged to us will show up after a short while.

    Please note, however, that Sindice results are not per se “answers”; they are pointers to documents which should contain the answer you seek (direct answers, on the other hand, are the strong point of SWSE).

    How to turn them into answers? 2 ways:

    a) Process the results from your applications. See examples in Java, Ruby and PHP here. This will not only get you the piece of information that you need but will also naturally filter out any non-relevant documents returned (as Sindice query expressivity is necessarily more limited than that of full SPARQL). You can also process Microformats directly in RDF: just access our cached-reasoned-RDF version of the document.

    b) Bug us to improve and provide the functionality you need as readily consumable API such as e.g. our Sindice Sioc API (which nicely returns a JSON object, no further processing needed).

    Sorry that some things on the visualization end are still not working (e.g. visualization of most microformats), but to see the data, all you really have to do is hit that “cache” button and you’ll see the bare triples, which is all you wanted anyway, right? :-) Neat visualization will come.

    Also some datasets are still missing; just watch the number of RDF sources grow as more large datasets offering semantic sitemaps are indexed into the index.

    What to say? We are very much looking forward to helping you implement your cool web of data applications and enhancements. Just drop us a message here and we’ll be right there with you.

  • An Exciting Hard Hat area

    Posted on May 30, 2008 with no comments

    As you might have noticed by the look of the site we’re now in a very exciting transition phase.  Here is a short summary of the main news with more posts about the details of it to follow in the next days:

    The new look: it’s for developers!

    Now we finally don’t look like Google anymore :-) . We’re working on making Sindice a great place for developers to go, discover and experiment with querying the web of data.

    We know perfectly well how noisy and creatively complex the Web of Data can be, and that’s why we’ve prepared a cosy Web of Data forum where you can ask questions and exchange tips.

    But what data is out there? See it all in the Web Data Map. (Alpha!! :-) ). Watch it as we update it nightly with new datasources pinged or picked up by the crawlers. Coming soon, the map will also provide examples of interesting queries that span datasets and entity types.

    Once you know what you’re looking for, use our new query language to find the Semantic Sources you need.

    STATUS: Less than beta. :-) But isn’t this exciting? So keep your hard hat on, and thanks for bearing with both the issues (report!) and our hearty enthusiasm in seeing the next generation of the Web coming to life.

  • Sindice in use

    Posted on Feb 20, 2008 with 3 comments

    Sindice is really meant to be used by your project, and for us it couldn’t be any nicer than seeing this happen more and more :-) .

    Thanks to Sergio Fernández, the SWAML project (which converts mailing list archives to RDF using SIOC and friends) now uses Sindice to find the URIs for email authors, using our IFP lookup on their email addresses. Also, Alexandre Passant developed a Drupal plugin that uses Sindice to find appropriate URIs for each of your post tags.
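
    For readers who have not met the term, an IFP (inverse functional property) is a property whose value identifies its subject, and foaf:mbox is the classic example. The sketch below shows the general idea of such a lookup over a handful of triples. It is not the Sindice API: the function names and the in-memory index are made up for illustration.

```python
from collections import defaultdict

FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"


def build_ifp_index(triples, ifps):
    # Map (property, value) -> set of subject URIs, for IFP properties only.
    index = defaultdict(set)
    for subject, prop, value in triples:
        if prop in ifps:
            index[(prop, value)].add(subject)
    return index


def lookup(index, prop, value):
    # All URIs known to carry this IFP value, in a stable order.
    return sorted(index.get((prop, value), set()))
```

    Two URIs returned for the same mailbox can then be treated as describing the same person, which is what lets a tool like SWAML attach an existing URI to a mailing list author.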

    To interact with Sindice, there is now a small Python library, and even a Sindice module for SWI Prolog. More developer-related information can now be found at http://www.sindice.com/dev, including the new RPC ping API.

  • Sindice @ 20+ Millions and Openings

    Posted on Nov 22, 2007 with 5 comments

    Work on Sindice is proceeding at full speed and so is the indexing of the Semantic Web.

    Sindice now indexes over 20 million Semantic Web documents (21.5 million as I type) and will usually index your submitted RDF in less than 30 minutes. This great result is entirely due to the dedication of the Sindice development team.

    Some of the geeky bits :-) . We now have version 2 of the indexing pipeline up and running (Renaud, Eyal, Michele).

    The Sindice indexing pipeline does a job which is anything but trivial. And does it at an amazing speed.

    Basically, each document is integrated by recursively resolving the URIs of the properties and classes in use, thus calculating a “Web closure” of the explicitly or implicitly imported ontologies. Once this is done, reasoning is applied using RDFS and some OWL (e.g. FunctionalProperty, TransitiveProperty, sameAs, inverseOf, InverseFunctionalProperty, SymmetricProperty). Sindice has done this for each of the 22 million sources independently, in less than 3 weeks (plus the actual indexing and all sorts of other processes) on a relatively small cluster (4-6 Xeon cores). Not bad? :-)
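
    The “Web closure” step can be pictured as a bounded graph traversal. The sketch below is an illustration only, not the pipeline code: `fetch_imports` is a hypothetical callback returning the ontology URIs a given ontology refers to, and the `max_depth` limit is an assumption added to keep the recursion under control.

```python
def ontology_closure(start_uris, fetch_imports, max_depth=3):
    """Collect ontologies reachable from start_uris within max_depth hops."""
    seen = set(start_uris)
    frontier = list(start_uris)
    depth = 0
    # Breadth-first: each pass of the loop follows references one hop further.
    while frontier and depth < max_depth:
        next_frontier = []
        for uri in frontier:
            for ref in fetch_imports(uri):
                # The 'seen' set keeps cyclic imports from looping forever.
                if ref not in seen:
                    seen.add(ref)
                    next_frontier.append(ref)
        frontier = next_frontier
        depth += 1
    return seen
```

    Because many documents start from the same handful of vocabularies, the result of a traversal like this is a natural thing to cache and reuse across documents.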

    Thanks to this processing, we can be as precise and complete as possible in solving tasks such as computing the IFP index, composing human-legible descriptions of documents and powering the forthcoming entity-based APIs as well as possible.

    Notably, all large datasets (e.g. the huge UniProt) are now proudly processed using our brand new Hadoop-based Semantic Sitemap processor, courtesy of Holger Stenzhorn, who joined the team last month.

    Sindice is Hiring!

    In the context of the EU project OKKAM, starting Jan 2008, we are now looking for candidates who are interested in developing highly scalable and innovative Semantic Web infrastructures and applications. Positions include Interns, Masters, Ph.D. students, Postdocs and Scientific Developers.

    While we of course highly value academic brilliance, we’re especially looking for candidates who, like us, believe that it is through clever but hard-core software engineering and development that we can make the difference on the Semantic Web.

    Successful candidates will be rewarded with top salaries and working conditions.
