Fedora Data Server

Using Fedora as a Semantic Data Server

The second Fedora extension developed by the project, in addition to the Easy Archiving Tools, is a set of Java libraries that extend some of the semantic web functionality of the Fedora Digital Repository. Specifically, the libraries allow not only the metadata stored in Fedora to be added to the Mulgara triplestore, but also any semantic-web-ready RDF/XML data stored in the archive. Users of web applications can then not only retrieve data sets (in Excel, SPSS or other formats) but also explore them interactively. This requirement emerged from several project research settings in which teachers and students wanted to search not just for resources, but across diverse data sets.

This has been achieved without modifying the Fedora distribution itself, so users can run a standard Fedora installation alongside the extension Java libraries. The extension does, however, require that this out-of-the-box Fedora is combined with a standalone Mulgara triplestore installation rather than the embedded version shipped with the repository.

Starting Point: the Fedora Resource Index

Fedora provides several data indexing mechanisms, including Dublin Core metadata indexing and full-text indexing with search engines such as Apache Lucene or Solr (which is based on Lucene). We are particularly interested here in the Resource Index (RI). The RI uses the Mulgara triplestore as its index database and allows indexes to be created, in the form of RDF graphs, from various types of information in the repository: Dublin Core annotations, object relationship information and even full text.

The Fedora Resource Interface

This module makes it possible to express object relationships, based on the Fedora Object Model, in a machine-readable way so that the relationship information can be indexed in the triplestore. This information builds on RDF standards and vocabularies developed within the Semantic Web community, such as RDF Schema (RDFS). It also includes Fedora's own relationships ontology, developed as an extension of the RDFS ontology, which can express more Fedora-specific collection/resource relationships such as ‘isMemberOf’, ‘hasMember’ and ‘isAnnotationOf’.
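As an illustration of what such a relationship looks like once it reaches the triplestore, the following minimal sketch serialises one relationship as an N-Triples statement. The PIDs (`demo:dataset1`, `demo:collection1`) and the helper names are hypothetical; the namespace is assumed to be the one Fedora 3.x uses for its external relationships ontology.

```python
# Namespace assumed for Fedora's relationships ontology (Fedora 3.x).
FEDORA_RELS = "info:fedora/fedora-system:def/relations-external#"


def object_uri(pid: str) -> str:
    """Turn a Fedora PID such as 'demo:dataset1' into its info:fedora URI."""
    return f"info:fedora/{pid}"


def relationship_triple(subject_pid: str, relation: str, object_pid: str) -> str:
    """Serialise one object-to-object relationship as an N-Triples line."""
    return (
        f"<{object_uri(subject_pid)}> "
        f"<{FEDORA_RELS}{relation}> "
        f"<{object_uri(object_pid)}> ."
    )


if __name__ == "__main__":
    # Hypothetical PIDs, used only for illustration.
    print(relationship_triple("demo:dataset1", "isMemberOf", "demo:collection1"))
```

Each statement of this kind becomes one subject-predicate-object row in the index, which is what makes collection membership queryable.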

Fedora includes a search interface that uses the RI functionality, through which the user can query the triplestore to obtain information from the repository in various output formats such as N3, SPO (subject, predicate, object) or RDF. However, any application built on the Resource Index can only read metadata, not the actual data, even if the data are already in semantic-web-ready RDF.
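A query against this search interface is an ordinary HTTP request. The sketch below only builds the request URL, so it runs without a repository. The host, port and `/fedora/risearch` path are assumptions about a default local Fedora 3.x installation, and the parameter names (`type`, `lang`, `format`, `query`) are given as we understand that interface; check them against your own deployment.

```python
from urllib.parse import urlencode

# Assumed location of the RI search service on a default local install.
RISEARCH_URL = "http://localhost:8080/fedora/risearch"


def build_risearch_url(query: str, lang: str = "sparql", fmt: str = "CSV",
                       base_url: str = RISEARCH_URL) -> str:
    """Build a GET URL that asks the RI search service for tuple results."""
    params = {"type": "tuples", "lang": lang, "format": fmt, "query": query}
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical query: list the members of one (made-up) collection.
    sparql = (
        "SELECT ?member WHERE { ?member "
        "<info:fedora/fedora-system:def/relations-external#isMemberOf> "
        "<info:fedora/demo:collection1> }"
    )
    print(build_risearch_url(sparql))
```

Whatever the query language, the results describe objects and their metadata only; the contents of the objects are not reachable this way.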

The ‘Double Loop’ Extension

The main limitation of the Fedora RI is that it can only index Dublin Core metadata records and object relationship information. We have addressed this with a Fedora extension library that indexes these sets of information and can also add other types of data from the repository, such as metadata annotations from other standards (e.g. DCMI Terms) or even complete datasets available in suitable formats such as RDF.

The approach keeps the significant functionality of the Fedora RI but addresses its current limitations, such as the handling of semantic-ready data, by aggregating that data into the triplestore. By using a standalone instance of Mulgara we are able to aggregate new datasets in the triplestore and to run queries across datasets from the repository combined with data coming from other sources. We can also use inferencing mechanisms to add new statements to the datastore, using inference engines such as those provided by Mulgara or third-party ones such as the Jena framework.
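To make "adding new statements by inference" concrete, here is a toy forward-chaining sketch in plain Python. It is not how Mulgara or Jena implement inference; it only applies a single RDFS-style rule (an instance of a class is also an instance of its superclasses) to an in-memory set of triples until nothing new appears. All resource names are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(triples: Set[Triple]) -> Set[Triple]:
    """Return the input triples plus every rdf:type statement implied by
    rdfs:subClassOf, repeating the rule until no new statement is found."""
    inferred = set(triples)
    while True:
        subclass_pairs = [(s, o) for s, p, o in inferred if p == SUBCLASS_OF]
        new = {
            (s, RDF_TYPE, parent)
            for s, p, cls in inferred if p == RDF_TYPE
            for child, parent in subclass_pairs if child == cls
        }
        if new <= inferred:  # nothing new: fixed point reached
            return inferred
        inferred |= new


if __name__ == "__main__":
    facts = {
        ("ex:survey1", RDF_TYPE, "ex:Survey"),
        ("ex:Survey", SUBCLASS_OF, "ex:Dataset"),
    }
    for triple in sorted(infer_types(facts)):
        print(triple)
```

A production inference engine covers many more rules and scales far better, but the principle is the same: derived statements are stored alongside the asserted ones and become queryable.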

We refer to this extension as the ‘Double Loop’ – first, metadata are incorporated into the Triplestore, and, on the basis of what these say about formats of the data they describe, a second ‘loop’ brings data in as well.  Metadata and data can then both be exposed to web applications or via SPARQL endpoints.

In Pictures!

Fedora basic configuration – no triplestore

1: Default Fedora Configuration

By default, Fedora provides two different interfaces to query and access the repository. The first of these uses a relational database that indexes both the custom metadata records and the Dublin Core included in the Metadata section of every object stored in the repository. The main idea of this initial configuration is that just the metadata annotations are indexed and the search is performed based on that information. The second interface uses the OAI-PMH protocol to provide live feeds of the contents of the repository by exposing only the information of the Dublin Core metadata records. This way the metadata annotations can be publicly visible to other repositories or query services.
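For readers unfamiliar with what OAI-PMH exposes, the sketch below parses a Dublin Core record in the standard `oai_dc` format, as a harvester might. The sample record is invented for illustration; only the two namespace URIs come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented record, shaped like the oai_dc payload of an OAI-PMH response.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example survey data set</dc:title>
  <dc:identifier>demo:dataset1</dc:identifier>
  <dc:format>application/rdf+xml</dc:format>
</oai_dc:dc>
"""


def dublin_core_fields(xml_text: str) -> Dict[str, List[str]]:
    """Collect Dublin Core element values, keyed by element name."""
    root = ET.fromstring(xml_text)
    fields: Dict[str, List[str]] = {}
    for element in root:
        if element.tag.startswith("{" + DC_NS + "}"):
            name = element.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields


if __name__ == "__main__":
    print(dublin_core_fields(SAMPLE_RECORD))
```

Note that the record can say a resource is in an RDF format, but the RDF itself is not part of what is exposed.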

Even though there will be metadata and possibly semantic-web-ready data (the ‘Inline RDF’ in the illustration), the full benefit of these may not be realised.

Fedora with Mulgara and RI search

2: Fedora configuration with Mulgara and Resource Index (RI)

Fedora uses an ontology to represent the different relationships between objects. This makes it possible both to categorise the datasets stored in the repository (and to build collections of resources) and to access the metadata through semantic-web-oriented interfaces.

However, this functionality is optional and it can be accessed by enabling the “Resource Index” module and configuring the triplestore instance (Mulgara by default) that will store the metadata annotations and object-relationships information as RDF statements.

At this point, a new interface is exposed (RI search), which queries an internal Mulgara triplestore using one of several query languages: SPARQL, SPO or iTQL. Now we have some semantic web functionality, but it extends only to some of the available metadata, and our RDF data is still unexploited.

Fedora using Extension Libraries

3: Fedora configuration with “double loop” extension

The ‘double loop’ library extension keeps the significant functionalities of the previous configurations but enables us to search any data in RDF format in addition to the metadata and relationship information already stored in the RI Mulgara graph. The only requirement is to store these data in the repository in a suitable format: RDF serialised in XML.
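A simple way to test whether stored content meets this requirement is to check that it parses as XML and that its root element is `rdf:RDF`. The sketch below does only that; it is an assumption about how such a check could be made, not a description of the extension's own logic, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the rdf:RDF root element.
RDF_ROOT_TAG = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"

# Invented example of a small RDF/XML document.
SAMPLE_RDF = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/survey1">
    <dc:title>Example survey</dc:title>
  </rdf:Description>
</rdf:RDF>
"""


def is_rdf_xml(content: str) -> bool:
    """Return True if the content is well-formed XML rooted at rdf:RDF."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        return False
    return root.tag == RDF_ROOT_TAG


if __name__ == "__main__":
    print(is_rdf_xml(SAMPLE_RDF))
    print(is_rdf_xml("<html><body>not RDF</body></html>"))
```

This is a shallow check: it does not validate the RDF graph itself, which a full RDF parser would do when the data are loaded.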

The extension library queries the Fedora RI graph of the metadata (the ‘first loop’) to extract the identifiers of those objects stored in the repository whose contents are in semantic-ready formats. Once these objects have been identified, the search results are parsed to construct REST queries to the repository (using the API-A search interface) that retrieve the contents of these objects. Finally, the RDF contents of each object are uploaded into a new Mulgara data structure through the HTTP SPARQL endpoint provided by the Mulgara triplestore. The result is a triplestore with a SPARQL endpoint that applications (such as those built with SIMILE Exhibit) can access to present users not just with a semantic search of the metadata, but with the data themselves.
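The flow described above can be sketched as follows. This is a simplified Python illustration, not the project's Java library: the three steps (finding RDF datastreams, fetching content, uploading to the triplestore) are passed in as functions so the sketch runs without a repository. The base URL, the `/objects/{pid}/datastreams/{dsid}/content` path, the `info:fedora/{pid}/{dsid}` URI form and every function name are assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple

# Assumed base URL of the Fedora web application; adjust to your deployment.
FEDORA_BASE = "http://localhost:8080/fedora"


def split_datastream_uri(uri: str) -> Tuple[str, str]:
    """Split 'info:fedora/<pid>/<dsid>' into (pid, dsid)."""
    prefix = "info:fedora/"
    if not uri.startswith(prefix):
        raise ValueError(f"not an info:fedora URI: {uri}")
    pid, dsid = uri[len(prefix):].split("/", 1)
    return pid, dsid


def datastream_url(pid: str, dsid: str, base: str = FEDORA_BASE) -> str:
    """Build the (assumed) REST URL for one datastream's content."""
    return f"{base}/objects/{pid}/datastreams/{dsid}/content"


def double_loop(find_rdf_datastreams: Callable[[], Iterable[str]],
                fetch: Callable[[str], str],
                upload: Callable[[str], None]) -> int:
    """Run both loops once and return the number of datastreams loaded."""
    loaded = 0
    # First loop: ask the metadata index which datastreams hold RDF.
    for uri in find_rdf_datastreams():
        pid, dsid = split_datastream_uri(uri)
        # Second loop: fetch the data and push them to the triplestore.
        upload(fetch(datastream_url(pid, dsid)))
        loaded += 1
    return loaded


if __name__ == "__main__":
    # Stand-ins for the real RI query, HTTP fetch and triplestore upload.
    count = double_loop(
        find_rdf_datastreams=lambda: ["info:fedora/demo:dataset1/DATA"],
        fetch=lambda url: f"<!-- RDF/XML fetched from {url} -->",
        upload=print,
    )
    print(f"{count} datastream(s) loaded")
```

Keeping the three steps separate mirrors the description: the index query, the repository access and the triplestore load are independent services, so each can be replaced or tested on its own.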