<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Sindice now supports Efficient Data discovery and Sync</title>
	<atom:link href="http://blog.sindice.com/2010/07/09/sindice-now-supports-efficient-data-discovery-and-sync/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.sindice.com/2010/07/09/sindice-now-supports-efficient-data-discovery-and-sync/</link>
	<description>Just another WordPress weblog</description>
	<lastBuildDate>Thu, 12 Jan 2012 17:51:59 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Giovanni Tummarello</title>
		<link>http://blog.sindice.com/2010/07/09/sindice-now-supports-efficient-data-discovery-and-sync/comment-page-1/#comment-631</link>
		<dc:creator>Giovanni Tummarello</dc:creator>
		<pubDate>Fri, 09 Jul 2010 08:50:18 +0000</pubDate>
		<guid isPermaLink="false">http://blog.sindice.com/?p=264#comment-631</guid>
		<description>I have had a few questions, so I am posting them here as well.
&quot;What, more precisely, is the difference from crawling?&quot;

Crawling (as opposed to fetching a list) implies following links. Here, on the other hand, no A (anchor) links are ever followed. Sindice data synchronization is stateful per site. Here is how it works:

a) A discovery phase, in which a site is analyzed to see whether it supports efficient sync, and at which level (e.g. sitemaps can have an &quot;update frequency&quot;, and it would work anyway).
b) A stateful synchronization process:
   b1) First, the whole site is acquired, with politeness, by fetching the URLs listed in the sitemaps.
   b2) Once everything has been fetched, only the sitemap is fetched every 24h, and only the URLs whose &quot;last modified&quot; date is later than our last fetch are downloaded (plus those marked with a change frequency).
c) RSS and Atom feeds are also used if listed in the sitemap.

This feat is technically demanding for several reasons. For example, keeping the overall processing &quot;sustainable&quot; makes the operations highly parallel, and the variety of forms and sizes sitemaps can take requires handling extreme sizes (hundreds of megabytes of nested gzipped files) in parallel and automatically.

Note that there is some crawling, but it is restricted to discovering new sites (therefore stopping after a few pages) and to crawling in pure RDF style.

&lt;a href=&quot;http://sindice.com/developers/publishing&quot; rel=&quot;nofollow&quot;&gt;The specs&lt;/a&gt; are also a good source of more information on what to expect and how to do things like having your site reindexed from scratch.</description>
		<content:encoded><![CDATA[<p>I have had a few questions, so I am posting them here as well.<br />
&#8220;What, more precisely, is the difference from crawling?&#8221;</p>
<p>Crawling (as opposed to fetching a list) implies following links. Here, on the other hand, no A (anchor) links are ever followed. Sindice data synchronization is stateful per site. Here is how it works:</p>
<p>a) A discovery phase, in which a site is analyzed to see whether it supports efficient sync, and at which level (e.g. sitemaps can have an &#8220;update frequency&#8221;, and it would work anyway).<br />
b) A stateful synchronization process:<br />
   b1) First, the whole site is acquired, with politeness, by fetching the URLs listed in the sitemaps.<br />
   b2) Once everything has been fetched, only the sitemap is fetched every 24h, and only the URLs whose &#8220;last modified&#8221; date is later than our last fetch are downloaded (plus those marked with a change frequency).<br />
c) RSS and Atom feeds are also used if listed in the sitemap.</p>
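<p>As a rough illustration of step b2, here is a minimal Python sketch. It is not Sindice's actual code. It assumes a standard sitemaps.org sitemap; the function names <code>parse_lastmod</code> and <code>urls_to_refetch</code>, the sample URLs, and the literal reading that any entry carrying a change frequency is always refetched are assumptions made only for this example.</p>

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace used by sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/a</loc><lastmod>2010-07-09T08:00:00+00:00</lastmod></url>
  <url><loc>http://example.org/b</loc><lastmod>2010-07-01</lastmod></url>
  <url><loc>http://example.org/c</loc><changefreq>daily</changefreq></url>
  <url><loc>http://example.org/d</loc></url>
</urlset>"""


def parse_lastmod(text):
    """Parse a lastmod value into a timezone-aware datetime.

    Values without a timezone (e.g. a bare date) are assumed to be UTC.
    """
    parsed = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def urls_to_refetch(sitemap_xml, last_fetch):
    """Return the URLs to download again after re-reading the sitemap.

    A URL is selected if its lastmod is later than last_fetch, or if it
    carries a changefreq (assumption: such entries are always refetched).
    """
    root = ET.fromstring(sitemap_xml)
    selected = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue  # skip malformed entries without a location
        lastmod = entry.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        changefreq = entry.findtext("sm:changefreq", namespaces=SITEMAP_NS)
        modified_since = bool(lastmod) and parse_lastmod(lastmod) > last_fetch
        if modified_since or changefreq:
            selected.append(loc)
    return selected


if __name__ == "__main__":
    last_fetch = datetime(2010, 7, 8, tzinfo=timezone.utc)
    for url in urls_to_refetch(SAMPLE_SITEMAP, last_fetch):
        print(url)
```

<p>With the sample sitemap and a last fetch on 2010-07-08 (UTC), only the first entry (modified since then) and the third (change frequency set) are selected. The other two are left alone, which is what keeps the daily pass cheap.</p>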
<p>This feat is technically demanding for several reasons. For example, keeping the overall processing &#8220;sustainable&#8221; makes the operations highly parallel, and the variety of forms and sizes sitemaps can take requires handling extreme sizes (hundreds of megabytes of nested gzipped files) in parallel and automatically.</p>
<p>Note that there is some crawling, but it is restricted to discovering new sites (therefore stopping after a few pages) and to crawling in pure RDF style.</p>
<p><a href="http://sindice.com/developers/publishing" rel="nofollow">The specs</a> are also a good source of more information on what to expect and how to do things like having your site reindexed from scratch.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

