Sindice Blog

  • Sig.ma Enterprise Edition (EE) available

    Posted on Aug 24, 2011 with no comments

    The original Sig.ma

    The http://sig.ma service was created as a demonstration of live, on-the-fly Web of Data mashups. Provide a query and Sig.ma will show how the Web of Data is likely to contain surprising structured information about it (from pages that embed RDF, RDFa, Microdata, or Microformats).

    By using the Sindice search engine Sig.ma allows a person to get a (live) view of what’s on the “Web of Data” about a given topic. For more information see our blog post. For academic use please cite Sig.ma as in [1].

    Introducing Sig.ma Enterprise Edition (EE)

    We’re happy today to introduce Sig.ma EE, a standalone, deployable, customisable version of Sig.ma.

    Sig.ma EE is deployed as a web application and performs on-the-fly data integration from both local data sources and remote services (including Sindice.com): just like Sig.ma, but mixing your chosen public and private sources.

    Sig.ma EE currently supports the following data providers:

    • SPARQL endpoints (Virtuoso, 4store)
    • YBoss + the Web (uses YBoss to search, then Any23 to get the data)
    • Sindice (optionally using the Sindice Cache API for fast parallel data collection)

    Sig.ma EE is easy to customise visually, and new custom data providers (e.g. for your relational or CMS data) can be implemented by following the documentation provided with Sig.ma EE.
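
    To make the idea of pluggable data providers a bit more concrete, here is a minimal sketch of an on-the-fly mashup over several sources. It is an illustration only: the DataProvider interface, the StaticProvider class and the sample data are all hypothetical and are not the actual Sig.ma EE API (which is described in the documentation shipped with Sig.ma EE).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Protocol, Set


class DataProvider(Protocol):
    """Hypothetical provider interface; NOT the real Sig.ma EE API."""

    name: str

    def lookup(self, query: str) -> Dict[str, Set[str]]:
        """Return a mapping of property name -> values found for the query."""
        ...


class StaticProvider:
    """Toy provider backed by an in-memory dict.

    It stands in for a SPARQL endpoint, a CMS or a relational database.
    """

    def __init__(self, name: str, data: Dict[str, Dict[str, Set[str]]]):
        self.name = name
        self._data = data

    def lookup(self, query: str) -> Dict[str, Set[str]]:
        return self._data.get(query.lower(), {})


def mashup(query: str, providers: List[DataProvider]) -> Dict[str, Set[str]]:
    """Ask all providers in parallel and merge the values per property."""
    merged: Dict[str, Set[str]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(providers))) as pool:
        for result in pool.map(lambda p: p.lookup(query), providers):
            for prop, values in result.items():
                merged.setdefault(prop, set()).update(values)
    return merged


# Made-up sample data: one "public" and one "private" source.
public = StaticProvider("public", {"sig.ma": {"homepage": {"http://sig.ma"}}})
private = StaticProvider(
    "private",
    {"sig.ma": {"homepage": {"http://sig.ma"}, "note": {"internal"}}},
)
view = mashup("Sig.ma", [public, private])
```

    Duplicate values from different sources collapse because each property holds a set; a real provider would of course call out to a remote service instead of a dict.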

    Want to give it a quick try? Sig.ma EE now also powers the service on the http://sig.ma homepage, so you can add a custom data source (e.g. your publicly available SPARQL endpoint) directly from the “Options” menu that appears after you search for something. New data sources are added by implementing the provided interfaces, following the examples.

    Sig.ma EE is open source and available for download. The standard license under which Sig.ma EE is distributed is the GNU Affero General Public License, version 3. Sig.ma EE is also available under a commercial licence. Please contact us to discuss further.

    Links

    If you would like to share any feedback with our team, please use the sindice-dev Google group for now.

    [1] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Renaud Delbru, Stefan Decker “Sig.ma: Live views on the Web of Data”, Journal of Web Semantics: Science, Services and Agents on the World Wide Web – Volume 8, Issue 4, November 2010, Pages 355-364

  • Sindice, its startup company and 12 billion+ live triples SPARQL endpoint

    Posted on Jun 14, 2011 with 3 comments

    Today we’re happy to share two big pieces of news:

    1) Today we’re launching our startup company, set to drive commercialization of Sindice services. See the press release below. Hurray!

    2) We’re making public our main Sindice SPARQL endpoint, containing the entire Sindice dataset. It currently indexes over 12 billion triples and is updated live as data comes in. This is the result of over 12 months of engineering. See the technical post (below the press release) :)

    ——————————————

    Sindice.com: Leading Semantic Data Technology Providers Establish Sindice Ltd.

    Sindice launches database-like access to the Web of Data for developers and businesses

    Galway, Ireland – 14th June 2011 – For immediate release

    An ever growing number of Web sites contain rich meta-data about the page contents, e.g. product prices and features in shopping sites, opening hour information for restaurants, or artist details for concerts. This data is also set to explode as search engines recently agreed on metadata terminology with the schema.org initiative. The resulting  “Web of Data” allows for new mobile applications, browser functionality, or search engines. A critical bottleneck for business solutions, however, is a consolidated and customizable way of accessing all of this data at Web scale.

    This business need is the target for Sindice.com, the main service provided by Sindice LTD. With the rich meta-data from more than 260 million pages in its index, Sindice.com currently offers the world’s largest repository of continuously updated Web of Data content, and it’s making it available to current and new generation horizontal and vertical market applications. Key service features include: a focused crawler which allows quick but sustainable data indexing and delta detection, and proprietary data indexes which are updated in realtime.

    Today we also announce the latest addition to the Sindice APIs: a Web Scale SPARQL access point. A long awaited feature for data oriented Web developers, this functionality allows third parties to  develop novel applications for desktop, mobile, browser, or browser hosted extensions that leverage structured data from a vast collection of Web sites in parallel, eliminating the need to crawl, extract, transform, cleanse, and load on a per Web application basis.

    Sindice LTD Solidly Backed by Leading Semantic Data Technology Providers

    Sindice LTD is a newly established enterprise with its headquarters in Galway, Ireland. It is a joint venture by the founders of the Sindice.com team (DERI research institute at the National University of Ireland Galway and the FBK research institute from Trento, Italy), led by Dr. Giovanni Tummarello and Dr. Renaud Delbru; OpenLink Software Inc., the makers of the Virtuoso Database for InterWeb scale Linked Data and other leading-edge data integration technology; and Hepp Research GmbH, leading experts in RDFa technology for e-commerce applications, already supported by most major search engines. Sindice LTD will have privileged access to relevant expertise and intellectual property from all of its members.

    “Sindice.com turns any existing Web site – offering semantic markup – into a table in a giant Web-scale database”, says Dr. Giovanni Tummarello, the CEO of Sindice LTD. “Based on this, Sindice.com offers advanced search functionality, data cleansing and data consolidation services and private dataspaces. Sindice.com is also focusing on quite practical products set to immediately reward those who have and are investing in semantic markup already e.g. adopting GoodRelations for ecommerce pages, Opengraph or the new Schema.org.”, he adds.

    Professor Martin Hepp, the inventor and lead developer of the GoodRelations standard for e-commerce, stresses the potential for targeted shopping, location-based services, and B2B scenarios: “The potential of semantic markup for e-commerce is now firmly recognized by all three major search companies on the planet – after Yahoo (2008) and Google (2010), Bing has recently also announced plans to support the GoodRelations standard. Today’s search engines, however, harvest only the tip of the iceberg of this data solely for a better rendering of the search results. Sindice’s technology allows much more sophisticated novel commerce applications”. Prof. Hepp is backing Sindice LTD via his Hepp Research GmbH, a semantic-data consulting firm.

    “Highly scalable and semantically rich linked data spaces has been a pragmatic pursuit at OpenLink Software for a long time”, says Kingsley Idehen, its Founder and CEO.  “Tuning and delivering linked data spaces as big, data-oriented cloud services is extremely challenging. With Sindice, we deliver live Data as a Service (DaaS), in a way that will provide tremendous strategic advantages to the use of Semantic technologies at InterWeb and Enterprise scales”, he explains. “Web-scale semantic dataspaces have been discussed for ages”, he continues, “but tuning the technology and allowing cloud scalability is still extremely difficult. Sindice LTD will provide effortlessly scalable live semantic dataspaces to enable a new wave of data intelligence in applications and within enterprises”

    Contact Details

    Sindice LTD
    DERI IDA Business Park Lower Dangan, Galway Ireland

    http://sindice.com

    info@sindice.com

    OpenLink Software Inc.
    10 Burlington Mall Road, Suite 265, Burlington, MA 01803, U.S.A.

    http://virtuoso.openlinksw.com/

    general.information@openlinksw.com

    Hepp Research GmbH
    Karlstrasse 8, 88212 Ravensburg, Germany
    contact@heppresearch.com

    http://www.heppresearch.com/

     

    ———————

    OK, that said, let’s get to the technical details.

    Our SPARQL endpoint sparql.sindice.com is now open in public beta.

    Let’s take it out for a spin!

    For example, let’s take this page, http://www.deri.ie/about/team/, which happens to be marked up with RDFa (check it). What are the most common names of people working in DERI?

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT  distinct ?name, count(?name) as ?conte
    WHERE {
      GRAPH <http://www.deri.ie/about/team/> {
        ?person a foaf:Person ;
                foaf:givenName ?name ;
                foaf:phone ?phone .
      }
    } ORDER BY DESC(?conte) ?name

    http://sparql.sindice.com/sparql?default-graph-uri=&query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0ASELECT++distinct+%3Fname%2C+count%28%3Fname%29+as+%3Fconte%0D%0AWHERE+{%0D%0A++GRAPH+%3Chttp%3A%2F%2Fwww.deri.ie%2Fabout%2Fteam%2F%3E+{%0D%0A++++%3Fperson%0D%0A++++a+foaf%3APerson+%3B%0D%0A+++++foaf%3AgivenName+%3Fname%3B%0D%0A++++++++foaf%3Aphone+%3Fphone+.%0D%0A++}+%0D%0A}+ORDER+BY++DESC%28%3Fconte%29+%3Fname%0D%0A&format=text%2Fhtml&debug=on
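
    If you would rather build such a link from code than by hand, the following minimal sketch shows one way to do it. The parameter names (default-graph-uri, query, format, debug) are simply the ones visible in the link above; the helper names are made up for illustration, and nothing is sent over the network here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://sparql.sindice.com/sparql"

# The query from the post (Virtuoso-flavoured SPARQL, kept as published).
QUERY = """PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT  distinct ?name, count(?name) as ?conte
WHERE {
  GRAPH <http://www.deri.ie/about/team/> {
    ?person a foaf:Person ;
            foaf:givenName ?name ;
            foaf:phone ?phone .
  }
} ORDER BY DESC(?conte) ?name
"""


def build_query_url(query: str, fmt: str = "text/html") -> str:
    """Percent-encode the query and attach the parameters seen in the link."""
    params = {
        "default-graph-uri": "",
        "query": query,
        "format": fmt,
        "debug": "on",
    }
    return ENDPOINT + "?" + urlencode(params)


def extract_query(url: str) -> str:
    """Inverse helper: recover the SPARQL text from such a link."""
    parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return parsed["query"][0]


url = build_query_url(QUERY)
```

    Here extract_query(url) gives back the original query text, which is handy when you want to read what one of these long links actually asks.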

    Or why don’t we look up the phone numbers of everybody named “Brian”?

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT  ?person ?phone
    WHERE {
      GRAPH <http://www.deri.ie/about/team/> {
        ?person a foaf:Person ;
                foaf:givenName "Brian" ;
                foaf:phone ?phone .
      }
    }

    http://sparql.sindice.com/sparql?default-graph-uri=&query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0ASELECT++%3Fperson+%3Fphone%0D%0AWHERE+{%0D%0AGRAPH+%3Chttp%3A%2F%2Fwww.deri.ie%2Fabout%2Fteam%2F%3E+{%0D%0A%3Fperson+a+foaf%3APerson+%3B%0D%0Afoaf%3AgivenName+%22Brian%22+%3B%0D%0Afoaf%3Aphone+%3Fphone+.%0D%0A}%0D%0A}&format=text%2Fhtml&debug=on

    But the cool stuff obviously comes when you start joining other datasets, e.g. papers we wrote, places attended etc.

    E.g. just a few months ago the Linked Data workshop took place. Its homepage, http://events.linkeddata.org, is also marked up with RDFa.

    So we can ask Sindice: is there anyone from the current DERI team page who presented a paper at that workshop and, if so, what is the title of the paper and their current DERI profile picture?

    which in naive SPARQL looks something like this:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT  ?name, ?workshop_paper_title, ?deri_picture
    WHERE {
      GRAPH <http://events.linkeddata.org/ldow2011/> {
        ?ldowPerson a foaf:Person ;
                    foaf:name ?name .
        ?paper dc:creator ?ldowPerson ;
               dc:title ?workshop_paper_title .
      }
      GRAPH <http://www.deri.ie/about/team/> {
        ?deriPerson a foaf:Person ;
                    foaf:givenName ?givenName ;
                    foaf:familyName ?familyName ;
                    foaf:img ?deri_picture .
      }

      FILTER (regex(?name, ?givenName))
      FILTER (regex(?name, ?familyName))
    }

    http://sparql.sindice.com/sparql?default-graph-uri=&query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0APREFIX+dc%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2F%3E%0D%0ASELECT++%3Fname%2C+%3Fworkshop_paper_title%2C+%3Fderi_picture%0D%0AWHERE+{%0D%0A++++GRAPH+%3Chttp%3A%2F%2Fevents.linkeddata.org%2Fldow2011%2F%3E+{%0D%0A++++++%3FldowPerson+a+foaf%3APerson%3B%0D%0A+++++++++++++++++++++foaf%3Aname+%3Fname.%0D%0A+++++++%3Fpaper+dc%3Acreator+%3FldowPerson%3B%0D%0A++++++++++++++++++dc%3Atitle+%3Fworkshop_paper_title.%0D%0A%0D%0A+++++}%0D%0A++++GRAPH+%3Chttp%3A%2F%2Fwww.deri.ie%2Fabout%2Fteam%2F%3E+{%0D%0A+++++++++%3FderiPerson+a+foaf%3APerson%3B%0D%0A++++++++++++++++foaf%3AgivenName+%3FgivenName%3B%0D%0A+++++++++++++++foaf%3AfamilyName+%3FfamilyName%3B%0D%0A+++++++++++++++foaf%3Aimg+%3Fderi_picture.%0D%0A+++++}%0D%0A+++++FILTER+%28regex%28%3Fname%2C+%3FgivenName%29%29%0D%0A+++++FILTER+%28regex%28%3Fname%2C+%3FfamilyName%29%29%0D%0A}&format=text%2Fhtml&debug=on
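
    Why “naive”? The two regex filters join people across the two pages by checking that the given name and the family name each occur somewhere in the full name. The sketch below mimics that logic on a handful of made-up rows (none of this is real data from either page) and contrasts it with a stricter comparison, to show where partial matches can sneak in.

```python
import re

# Made-up rows standing in for the two graphs; not real data.
ldow_authors = [
    {"name": "Ann Example", "title": "A Paper About Links"},
    {"name": "John Leeson", "title": "Another Paper"},
]
deri_team = [
    {"givenName": "Ann", "familyName": "Example", "img": "http://example.org/ann.png"},
    {"givenName": "Jo", "familyName": "Lee", "img": "http://example.org/jo.png"},
]


def naive_join(authors, team):
    """Mimic FILTER(regex(?name, ?givenName)) plus FILTER(regex(?name, ?familyName))."""
    rows = []
    for author in authors:
        for member in team:
            if re.search(member["givenName"], author["name"]) and re.search(
                member["familyName"], author["name"]
            ):
                rows.append((author["name"], author["title"], member["img"]))
    return rows


def strict_join(authors, team):
    """Stricter variant: the full name must equal 'givenName familyName'."""
    rows = []
    for author in authors:
        for member in team:
            full_name = member["givenName"] + " " + member["familyName"]
            if author["name"] == full_name:
                rows.append((author["name"], author["title"], member["img"]))
    return rows


naive_rows = naive_join(ldow_authors, deri_team)
strict_rows = strict_join(ldow_authors, deri_team)
```

    With this sample, the naive join also pairs “John Leeson” with the team member “Jo Lee”, because both patterns match as substrings, while the strict join only keeps “Ann Example”.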

    But best of all: all the data is LIVE, updated every few minutes, currently holding and refreshing more than 12 billion triples. By the way, this is fewer triples than Siren (the engine behind our frontend) currently holds, because we also feed Siren with materialized reasoning.

    Imagine the implications: any website compliant with Sindice’s effective data acquisition practices can become a data source for you to use and interlink. You can therefore join linked data sources, content management systems, ecommerce websites, sites offering microformats etc. at will, and pretty much in real time as new data appears or is updated on the web.

    Of course this is a beta, so there will certainly be issues to report; we look forward to hearing your feedback. Rest assured we’re constantly working to improve the service, and we are on our way to a pretty exciting roadmap.

    Gio on behalf of the team.

  • Sindice reindexed: find your datasets (much faster)

    Posted on May 31, 2011 with no comments

    Having streamlined several procedures inside Sindice, rebuilding the Sindice index from scratch now takes just a few hours.

    Over the weekend, we built a new Sindice index based on the latest updates of Siren and improvements to the pipelines. This is now in production and sports the following enhancements:

    Ranking

    • no more big docs first (sorry guys, this was an issue in the last weeks)
    • properties are weighted differently

    Preprocessing

    • Improved support of encoded URIs: encoded characters in URIs are decoded prior to indexing, and both versions of the URI, the decoded and the encoded one, are indexed. As a consequence, if you look for a URI (or keyword) with any special characters (e.g., Knud_Möller), this will also match encoded URIs (e.g., http://dblp.l3s.de/d2r/resource/authors/Knud_M%C3%B6ller).
    • Improved tokenisation of URI local names: previously a URI local name was tokenised, but the non-tokenised version of the local name was not indexed. For example, given the URI http://rdf.data-vocabulary.org/#startDate, we now index as part of the local name: start, date, and startDate (the latter was not possible previously).
    • Improved support of mailto URIs: previously the URI mailto:test@test.com was improperly tokenised and indexed. Now you can search either for mailto:test@test.com or test@test.com. Both will match the mailto URI.
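
    The three preprocessing steps above can be sketched roughly as follows. This is only an illustration of the behaviour described in the list, written with the Python standard library; it is not SIREn’s actual analyser code, and the function names are made up.

```python
import re
from urllib.parse import unquote


def uri_variants(uri):
    """Return the encoded URI plus its decoded form (if they differ)."""
    decoded = unquote(uri)
    return [uri] if decoded == uri else [uri, decoded]


def localname_tokens(uri):
    """Split a camelCase local name into tokens and keep the whole name too."""
    local = re.split(r"[#/]", uri)[-1]
    parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", local)
    tokens = [part.lower() for part in parts]
    if local and local not in tokens:
        tokens.append(local)  # the untokenised version is indexed as well
    return tokens


def mailto_terms(uri):
    """Index a mailto URI under both the full URI and the bare address."""
    if uri.startswith("mailto:"):
        return [uri, uri[len("mailto:"):]]
    return [uri]


variants = uri_variants("http://dblp.l3s.de/d2r/resource/authors/Knud_M%C3%B6ller")
tokens = localname_tokens("http://rdf.data-vocabulary.org/#startDate")
mail = mailto_terms("mailto:test@test.com")
```

    Indexing every returned term for the same document is what lets either spelling of a query find it.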

    Query Language and processing

    • Improved support of special character in URI. E.g. previously, a tilde ‘~’ in a URI was invalidating the query.
    • Various bug fixes as reported by users.
    • Improved query processing for Ntriple Query. Huge performance benefits in certain situations.

    Index Data Structure

    • Improved index compression.

    Those are the technical details.

    In practice, the thing I love the most is that my favorite queries (the group-by-dataset ones, i.e. open web dataset finding queries) now easily run 10 times faster :) .

    As always, follow the development of our core open source Semantic Information Retrieval Engine SIREn on github: https://github.com/rdelbru/SIREn.

  • Searching infinite amounts of Web Data: The new Sindice Index and Frontend

    Posted on May 15, 2011 with 3 comments
    Several goals have kept the Sindice Team constantly busy in the past year or so. Luckily we’re now getting close to their deployment, and today we’re happy to begin by introducing SIREn, the new Sindice core index, along with its new supporting frontend and API.

    SIREn: Sindice’s own semantic search engine

    SIREn (Semantic Information Retrieval Engine) is the new engine serving queries on our frontend and API. The difference between SIREn and any of the well-known RDF databases is that it uses the same principles found in web search engines rather than techniques derived from the database world; that is, it is an Information Retrieval based engine.

    Typically, information retrieval is associated with full-text queries, and SIREn supports these too. On top of this, however, SIREn adds support for certain structured queries, allowing you, for example, to look for documents that contain specific triple patterns or text within only given properties.
    So, while on the one hand SIREn does not implement full SPARQL, and likely never will, on the other you get benefits which are highly appealing for a search engine:

    • Theoretically Infinite scalability in a cluster setting – very low hardware demand per amounts of data indexed (as compared to RDF databases)
    • Fast, incremental indexing. No slowdowns as data size grows.
    • Advanced ranking of query results, top K

    On top of this, SIREn is not programmed from scratch; rather, it is built as a plugin for Apache Solr, thus inheriting all the enterprise-strength features it provides.
    To give you an idea of the hardware requirements: an index in excess of 20b triples fits on 1 master server machine, which we then split in 2 for query throughput and duplicated in a master-slave configuration for service smoothness during updates.

    Keeping in mind that we’re talking about very different query capabilities, we can report that in terms of performance, SIREn can index 20b triples in 1 day on 2 machines. As a term of comparison, without getting into precise details, let us assure you that this is currently 4 to 6 times faster than what we have seen achieved in practice with the best DB-related technologies at hand, running on twice as much hardware while indexing half the data (10b).
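
    To put the “20b triples in 1 day on 2 machines” figure into perspective, here is the back-of-the-envelope arithmetic. It only assumes that the load is spread evenly over the day and over the two machines, which is a simplification.

```python
TRIPLES = 20_000_000_000        # 20b triples, as quoted above
MACHINES = 2
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Sustained indexing rate, assuming an even load over the whole day.
total_rate = TRIPLES / SECONDS_PER_DAY
per_machine_rate = total_rate / MACHINES

print(f"{total_rate:,.0f} triples/s overall, {per_machine_rate:,.0f} per machine")
```

    That works out to roughly 231 thousand triples per second overall, or about 116 thousand per machine.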

    SIREn is Open Source, released under the Apache licence. For scientific details, please refer to “R. Delbru, S. Campinas, G. Tummarello. Searching Web Data: an Entity Retrieval and High-Performance Indexing Model.” To appear in Journal of Web Semantics, 2011.

    Test driving SIREn: the new Sindice frontend and API

    The best way to see (most of) the features of SIREn in action is to go for a test drive with the new Sindice frontend, serving queries on the live updated Sindice dataset (230M RDF documents at the time of writing), e.g. from the “advanced” view.

    From here you can create complex queries. e.g. combining keyword queries with more advanced queries such as those requiring certain triple patterns (e.g. <s> <p> <o>  → * foaf:knows http://g1o.net/foaf.rdf#me) or having other requisites such as domain name, format, data etc. The full documentation is available online.
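
    To illustrate what a triple-pattern query with wildcards means at the document level, here is a toy in-memory sketch. It scans a few made-up documents linearly, whereas SIREn uses inverted indexes, so take it as a picture of the query semantics only. The example.org documents and the function names are invented for the example.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

# Made-up documents (URL -> triples); the example.org data is invented.
DOCS: Dict[str, List[Triple]] = {
    "http://example.org/alice.rdf": [
        ("http://example.org/alice#me", "foaf:knows", "http://g1o.net/foaf.rdf#me"),
        ("http://example.org/alice#me", "foaf:name", "Alice"),
    ],
    "http://example.org/bob.rdf": [
        ("http://example.org/bob#me", "foaf:name", "Bob"),
    ],
}


def matches(triple: Triple, pattern: Pattern) -> bool:
    """None plays the role of the '*' wildcard."""
    return all(p is None or p == t for t, p in zip(triple, pattern))


def search(docs: Dict[str, List[Triple]], pattern: Pattern,
           keyword: Optional[str] = None) -> List[str]:
    """Return URLs of documents with a matching triple (and keyword, if given)."""
    hits = []
    for url, triples in docs.items():
        if not any(matches(t, pattern) for t in triples):
            continue
        if keyword is not None and not any(
            keyword.lower() in obj.lower() for _, _, obj in triples
        ):
            continue
        hits.append(url)
    return hits


# "* foaf:knows http://g1o.net/foaf.rdf#me"
knows_gio = search(DOCS, (None, "foaf:knows", "http://g1o.net/foaf.rdf#me"))
```

    Note that the answer is a list of documents, not variable bindings: this is the search-engine flavour of structured querying, as opposed to SPARQL.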

    In this release we then also have:

    • Results filters, which show on the left in the result pages, to allow quick query refinement.
    • Group by dataset (this is a first version). It turns document search into a dataset search, showing you how many results there are per second-level domain (often equivalent to a dataset). For example, if you are looking for datasets containing events, you can find out that www.songkick.com provides more than 700K documents containing event-related information.
    • An improved ranking for search results. The main ranking logic is based on the TF-IDF methodology. However, we have added some particular rules under the hood to boost certain results: if your keywords match the URL of the document, its label, or one of the class or property URIs, the document will be ranked higher.
    • Better support for multi-lingual characters (based on UAX#29 Unicode Text Segmentation, thanks to Lucene) and diacritical marks. With respect to diacritical marks, you can now search for either ‘café’ or ‘cafe’. Sindice will return results for documents about either “café” or “cafe”. In the first case (‘café’), results for documents containing ‘café’ will be ranked higher.
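
    Two of the features in the list above are easy to picture with a few lines of code: grouping results per second-level domain, and accent-insensitive matching where the exact form is ranked first. The sketch below is a simplification under stated assumptions: the “last two host labels” rule ignores suffixes such as co.uk, ranking the exact form first for the unaccented query as well is our own choice, and the real ranking is TF-IDF based. The function names and sample data are invented.

```python
import unicodedata
from collections import Counter
from urllib.parse import urlsplit


def fold(text):
    """Lowercase and strip combining marks, so 'café' and 'cafe' compare equal."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def rank(query, docs):
    """Return matching docs; those containing the exact query form come first."""
    q = unicodedata.normalize("NFC", query.lower())
    scored = []
    for doc in docs:
        words = unicodedata.normalize("NFC", doc.lower()).split()
        if q in words:
            scored.append((0, doc))  # exact form, accents included
        elif fold(q) in [fold(w) for w in words]:
            scored.append((1, doc))  # matches only after folding accents
    return [doc for _, doc in sorted(scored, key=lambda pair: pair[0])]


def group_by_domain(urls):
    """Count results per second-level domain (naive: last two host labels)."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname or ""
        counts[".".join(host.split(".")[-2:])] += 1
    return counts


docs = ["a cafe nearby", "the café menu"]
accented = rank("café", docs)
plain = rank("cafe", docs)
groups = group_by_domain([
    "http://www.songkick.com/a",
    "http://www.songkick.com/b",
    "http://sindice.com/x",
])
```

    Both queries return both documents; only the order changes depending on which spelling was typed.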

    To build applications on top of Sindice, the entire capabilities of SIREn are available via a new API (v2).

    Also, if you’re around SemTech this July, you can talk to us directly at our presentations: one about SIREn, and the other where we’ll give a sneak peek at Sindice Site Services, a cool service still in alpha.

  • Sindice migration

    Posted on Nov 26, 2010 with no comments

    This is mainly a test post to verify that the Sindice blog continues to work after migrating it to a new server. But it is also a good opportunity to briefly mention the upgrades we are making to the Sindice infrastructure. I’m happy to report that Sindice has been suffering from some growing pains over the last while. As a result, we are in the process of migrating to an improved infrastructure, increasing the capacity of individual servers and introducing failover support for critical parts of the infrastructure. Combined, these measures should, over the coming months, significantly increase the amount of data we can both fetch from the web of data and, in turn, process, transform and publish back to the semantic web community. Watch this space for more information over the next few months on our expanding capabilities.

  • Sindice now supports Efficient Data discovery and Sync

    Posted on Jul 9, 2010 with 1 comment

    So far semantic web search engines and semantic aggregation services have been inserting datasets by hand or have been based on “random walk” like crawls with no data completeness or freshness guarantees.

    After quite some work, we are happy to announce that Sindice now supports effective large-scale data acquisition with *efficient syncing* capabilities based on already existing standards (a specific use of the sitemap protocol).

    For example, if you publish 300000 products using RDFa or whatever you want to use (microformats, 303s etc.), then by making sure you comply with the proposed method, Sindice will now guarantee

    a) to crawl your dataset completely (might take some time since we do this “politely”)

    b) ...but only crawl you once and then get just the updated URLs on a daily basis! (so a timely data update guarantee)

    So this is not “crawling” anymore, but rather a live “DB-like” connection between remote, diverse datasets, all based on HTTP. In our opinion this is a *very* important step forward for semantic web data aggregation infrastructures.
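
    To give a feel for how sitemap-based syncing can work, here is a minimal sketch that reads a sitemap and keeps only the URLs changed since the last visit. It relies solely on the standard sitemap lastmod element; the exact conventions Sindice expects are described in the specification linked below, and the sample sitemap, dates and function name here are made up.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap for illustration.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/product/1</loc><lastmod>2010-07-01</lastmod></url>
  <url><loc>http://example.org/product/2</loc><lastmod>2010-07-08</lastmod></url>
  <url><loc>http://example.org/product/3</loc></url>
</urlset>"""


def urls_to_fetch(sitemap_xml, last_crawl=None):
    """Return the URLs to (re)fetch.

    First visit (last_crawl is None): everything. Later visits: only URLs
    modified after last_crawl, plus URLs without a lastmod (unknown state).
    """
    root = ET.fromstring(sitemap_xml)
    todo = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if last_crawl is None or lastmod is None:
            todo.append(loc)
        elif date.fromisoformat(lastmod[:10]) > last_crawl:
            todo.append(loc)
    return todo


first_visit = urls_to_fetch(SITEMAP)
daily_update = urls_to_fetch(SITEMAP, last_crawl=date(2010, 7, 5))
```

    After the initial full pass, the daily update only touches what changed, which is what makes the “crawl once, then sync” promise cheap for both sides.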

    The specifications we support (and how to make sure you’re being properly indexed) are published here (pretty simple stuff actually!)

    http://sindice.com/developers/publishing

    and results can be seen from websites which are already implementing these (you might be already doing that indeed without knowing..)

    http://sindice.com/search?q=domain:www.scribd.com+date:last_week&qt=term
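
    A link like the one above can also be assembled programmatically. The minimal sketch below only reproduces the shape of that link (the q and qt=term parameters and the field:value filters it contains); the helper name is made up, and other query types or fields are not covered.

```python
from urllib.parse import urlencode

SEARCH = "http://sindice.com/search"


def term_search_url(**filters):
    """Build a term-search link from field=value filters, e.g. domain, date."""
    q = " ".join(f"{field}:{value}" for field, value in filters.items())
    # Keep ':' unescaped so the link reads like the one in the post.
    return SEARCH + "?" + urlencode({"q": q, "qt": "term"}, safe=":")


url = term_search_url(domain="www.scribd.com", date="last_week")
```

    Swap in your own domain to check what Sindice has picked up from your site over the last week.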

    Why not make sure that your site can be effectively kept in sync today?

    As always, we look forward to comments, suggestions and ideas on how to better serve your data needs (e.g. yes, we’ll also support the OpenLink dataset sync proposal once the specs are finalized). Feel free to ask specific questions about this or any other Sindice-related issue on our dev forum http://sindice.com/main/forum

    Giovanni,

    on behalf of the team http://sindice.com/main/about. Special credits for this to Tamas Benko and Robert Fuller.

    p.s. we’re still interested in hiring selected researchers and developers

  • Sindice planned downtime this weekend

    Posted on Jun 9, 2010 with no comments

    Hi. Due to an expansion of one of our datacentres (and the electrical work that this implies), Sindice and related services such as sig.ma will be down from 1730 GMT+1, 11-Jun-2010 (Friday) to 1730 GMT+1, 12-Jun-2010 (Saturday). This major upgrade will give us increased room to grow the Sindice infrastructure over time. On 27-May-2010 we hit a major milestone in Sindice of having indexed 100 million documents (over 6.5 billion triples) from the semantic web. At current rates of data acquisition, we expect to hit the 200 million document mark before Christmas – so we’ll need that extra room!

    If you have any further queries on this downtime, please leave a comment or contact us via the Sindice Developers group.

    Update: Thanks to the efforts of NUIG’s ISS team, sindice.com is back online ahead of schedule. All services (including sig.ma) should now be operational.

  • Any23 v0.4.0 Released

    Posted on May 27, 2010 with no comments


    Dear All,
    the Sindice FBK team is proud to announce the Any23 0.4.0 release.

    In this new release we paid particular attention to data validation and correction;
    in particular, Any23 can now extract Open Graph Protocol[1] metadata even when it is
    affected by syntactical errors[2].

    We’ve also added full support for the N-Quads[3] format.
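
    For readers new to the format: N-Quads extends N-Triples with a fourth element per line, the graph (or context), which is a natural place to record where a statement came from. The sketch below formats a single statement by hand to show the line layout; it is a simplified illustration (IRIs are not validated, only basic literal escaping is done) and it does not use the Any23 API.

```python
def nquad(subject, predicate, obj, graph, literal=False):
    """Format one statement as an N-Quads line: <s> <p> o <g> ."""
    if literal:
        escaped = (
            obj.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        o = f'"{escaped}"'
    else:
        o = f"<{obj}>"
    return f"<{subject}> <{predicate}> {o} <{graph}> ."


# Made-up statement: the fourth element names the page it was extracted from.
line = nquad(
    "http://example.org/page#me",
    "http://xmlns.com/foaf/0.1/name",
    'Al "Bo"',
    "http://example.org/page",
    literal=True,
)
```

    Dropping the fourth element from each line gives back plain N-Triples.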

    As usual, everybody is invited to adopt this new release and
    report any bugs encountered[4].

    A live demo is running at [5]; please feel free to try it.

    We’re planning the 0.5.0 milestone, so if you are waiting for a particular fix or
    improvement, please submit it to us using our issue tracker[4].

    Below is an extract of the 0.4.0 release notes [6]:
    • The any23-service module has been separated from the any23-core module, the Ant build system has been dropped. [Issue 44]
    • Added support for HTML metadata (RDFa / Microformats) validation and correction (validator). [Issue 77]
    • Added flag to disable the nesting relationship property enrichment. [Issue 67]
    • Improved coverage of Microformat tests. [Issue 65]
    • Improved documentation. [Issue 44]
    • Various code consolidation. [Issues 68, 69, 70, 71, 72, 73, 74, 77]

    Thanks for supporting our work.

    The Any23 Developers Team

  • Any23 v0.3.0 Released

    Posted on Apr 23, 2010 with no comments


    Dear All,

    we’re pleased to announce the Any23 0.3.0 release.

    Please keep in mind this is a beta, so everybody using Any23 in development
    is invited to migrate to this latest version and report any bugs
    in our issue tracker [1].

    As usual we have a live demo running at [2]; please feel free to try it.

    We’re planning the 0.4 milestone, so if you are waiting for the fix of a
    particular issue, please verify that it is open and, if needed, add a comment asking for higher priority.

    Finally, below you’ll find an extract of the 0.3.0 release notes [3]:

    1. Added detection and enrichment of nested microformats. [Issue #61]
    2. Added detection and support of N-Quads as input and output format. [Issue #7]
    3. General Improvements in RDFa extraction. [Issue #12, Issue #14]
    4. Added support of Turtle embedded in HTML script tag. [Issue #62]
    5. Improvement in encoding support. [Issue #43]
    6. Improvement in Core API. [Issue #27]
    7. Improved support for Species Microformat. [Issue #63]

    Thanks for supporting our work.

    The Any23 Developers Team

    [1] http://code.google.com/p/any23/issues/list
    [2] http://any23.org/
    [3] http://any23.googlecode.com/svn/trunk/any23-core/RELEASE-NOTES.txt

  • Any23 v0.2 Released

    Posted on Feb 19, 2010 with no comments

    We are proud to announce a new release of Any23 (Anything to Triples).

    Any23 is a Java library that parses RDF from a variety of Web document formats.
    The currently supported input formats are RDFa, RDF/XML, Turtle, N3, N-Triples,
    and a number of Microformats.

    Any23 is an Open Source project that originated from code created within the Sindice project
    and is now used both inside Sindice and in related projects, e.g. Sig.ma.

    Any23 comes with a handy command-line tool for parsing RDF and converting between formats.
    We have also set up a demo service where you can try Any23 online and use a REST API to convert
    between different RDF formats, similar in spirit to triplr.org.

    The major new features in this release are:

    • Redesigned Java API
      - Input from string, stream, file, or URI
      - Allow choosing which extractors to use
      - Report origin of triples (document/extractor) to client processors
      - Various processors/serializers for extracted triples
    • Added flexible command-line tool for easy testing
    • Vastly improved website and documentation
    • Media type and encoding detection via Apache Tika
    • Switched RDF library from Jena to Sesame
    • Added Maven build
    • Better RDF extraction from Microformats
    • Extractors come with example file to document typical in- and output
    • Major refactoring
    • Lots and lots of bugfixes

    The following people have contributed to this release:
    Michele Mostarda and Davide Palmisano (FBK, Trento, Italy, Web of Data Unit (WED));
    Richard Cyganiak and Jurgen Umbrich (DERI, NUI Galway, Ireland);
    Michele Catasta (EPFL, Lausanne, Switzerland), Giovanni Tummarello.

    This release is the first result of the joint effort between Fondazione Bruno Kessler and DERI,
    which recently started working together on Sindice. We strongly believe that Any23 could benefit from the wider
    Open Source community, especially considering the license under which it has been released.
    We think that the new Any23 v0.2, now integrated in the Sindice ingestion pipeline,
    will have an impact on the quality of the indexed data.

    This release joins the other Open Source components of Sindice, notably the Semantic Information Retrieval Index (SIREN) available at http://siren.sindice.com .