Wf4Ever Research Object Ontologies and Vocabularies Primer

This document provides an accessible introduction to the Wf4Ever RO Model, so that readers can understand "what" the RO Model provides and "how" the RO Ontologies and Vocabularies can be used to describe, in a structured format, an aggregation object that represents scientific experiments.

Overview of the RO Model

This section provides an overview of the RO Model. It does not cover all the details of the model. The RO Model Specification [[RO-MODEL]] provides precise definitions to be used.

Generally speaking the Wf4Ever RO model can describe the following three aspects of information:

Basic aggregation structure of an RO.
Workflow-centric Research Objects.
Annotations to an RO and its components.

Describe Basic Aggregation Structure of an RO

A Research Object is simply an aggregation of resources and annotations about them. The figure below provides an overview of the Wf4Ever RO Model, which includes the following constructs:

ResearchObject, represents an aggregation of resources. It acts as an entry point to the research object.
Resource, represents a resource that can be aggregated within a research object. As shown in the figure below, a Resource can be a Dataset, Paper, Software or Annotation. Typically, a ResearchObject aggregates multiple Resources.
Annotation, used for describing research objects, their aggregated resources, as well as the relationship between resources.

The description of an RO, such as its structure and annotations, is provided in a manifest file. This manifest file can be aggregated as part of the RO. Examples of describing a basic RO in a manifest file can be found in Section 3.1.

Describe Workflow Centric Research Objects

A special class of research objects, and the primary interest of our specification, is the workflow-centric research object: a research object that aggregates workflows or, more specifically, workflow templates.

A workflow template is a network in which the nodes are processes and the edges represent data links that connect the output of a given process to the input of another process, specifying that the artifacts produced by the former are used to feed the latter.
A process is used to describe a class of actions that when enacted give rise to process runs. Processes specify the software component (e.g., web service) responsible for undertaking those actions.
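The two definitions above can be sketched in Turtle using the wfdesc vocabulary. The fragment below is only an illustration and is not taken from the RO Model Specification: the example.org namespace and the resource names (:exampleWorkflow, :fetch, :align, :link1 and the parameters) are hypothetical, and the data link terms (wfdesc:hasDataLink, wfdesc:hasSource, wfdesc:hasSink) should be checked against the wfdesc ontology before use.

```turtle
@prefix wfdesc: <http://purl.org/wf4ever/wfdesc#> .
@prefix :       <http://example.org/workflow#> .   # hypothetical namespace

# A template with two processes (the nodes of the network).
:exampleWorkflow a wfdesc:Workflow ;
    wfdesc:hasSubProcess :fetch, :align ;
    wfdesc:hasDataLink   :link1 .

:fetch a wfdesc:Process ;
    wfdesc:hasOutput :fetchOut .

:align a wfdesc:Process ;
    wfdesc:hasInput :alignIn .

:fetchOut a wfdesc:Output .
:alignIn  a wfdesc:Input .

# The data link (an edge): the output of :fetch feeds the input of :align.
:link1 a wfdesc:DataLink ;
    wfdesc:hasSource :fetchOut ;
    wfdesc:hasSink   :alignIn .
```

Note that the template only describes what could be executed; the process runs that result from enacting it are described separately with the wfprov vocabulary.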

A workflow is often executed following a template which describes each step involved in the whole execution. Templates can be designed by scientists (users) with the purpose of being able to execute the same workflow many times with different inputs for their tests, as "live-tutorials" of how some data infrastructure can be more efficiently used, etc. There are two types of templates:

Abstract workflow templates, which have some of the steps of the workflow not bound to a specific tool.
Concrete workflow templates, which have all the steps specified.

In the RO model, we are able to describe these templates plus their relationships with the executions by using the wfdesc vocabulary.

In addition to workflow templates, workflow-centric research objects contain information about workflow runs, which are obtained by enacting workflow templates, and about the provenance of the results obtained from those runs. Examples of describing a workflow research object can be found in Sections 3.2 and 3.5.

Describe Annotations to an RO and its Components

In the RO model the Annotation Ontology is used as a generic vocabulary to allow annotations to research objects, their resources, and their relationships. Three kinds of elements are used to specify annotations:

Annotation, represents the annotation itself.
Target, used to specify the resource or research object subject to annotation.
Body, comprises a description of the target in the form of a set of RDF statements, which can specify, for example, the date of creation of the target or its relationship with other resources or research objects.

Annotations may be provided primarily for human consumption (e.g. a description of a hypothesis that is tested by a workflow-based experiment), or for machine consumption (e.g. a structured description of the provenance of results generated by a workflow run). Both kinds of annotations are accommodated using Annotation Ontology structures. Examples of expressing annotations to an RO and its components can be found in Sections 3.3 and 3.4.

Examples

The Wf4Ever RO model is implemented as a suite of three ontologies, which include:

The ro ontology: which provides basic structure for the description of aggregated resources and the annotations that are made on those resources.

The wfdesc ontology: which allows describing workflows. It is targeted at providing an abstraction that can be mapped to different particular workflow systems.

The wfprov ontology: which provides terms for describing provenance information about actual executions of workflows.

These ontologies were built upon existing vocabularies as much as possible, including OAI ORE (Object Exchange and Reuse) [[ORE]] and the Annotation Ontology [[AO]].

The following sections show how this suite of RO ontologies can be used to describe a basic workflow-centric RO. A "Hello World" RO is used as the running example to demonstrate how each part of the RO model can be used to describe this RO, its components, and the annotations made on the RO as a whole and on each of its components.

The following namespaces are used in the examples:

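@prefix ro:     <http://purl.org/wf4ever/ro#> .
@prefix wfdesc: <http://purl.org/wf4ever/wfdesc#> .
@prefix wfprov: <http://purl.org/wf4ever/wfprov#> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix ore:    <http://www.openarchives.org/ore/terms/> .
@prefix ao:     <http://purl.org/ao/> .
# Standard namespaces also used by the examples below
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@base           <http://purl.org/wf4ever/ro-primer#> .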

A Basic Research Object

The "Hello World" example RO, as shown in the following example RDF, aggregates a collection of resources, including the simple Hello World workflow template, its intermediate and final data results resulting from the run of this workflow, as well as annotations to this RO and its components.

<> a ro:ResearchObject, ore:Aggregation ;
    ore:aggregates <helloworld.t2flow> ;
    ore:aggregates <artifact/hello> ;
    ore:aggregates :ann1 ;
    dct:created "2011-12-02T15:01:10Z"^^xsd:dateTime ;
    dct:creator [ a foaf:Person; foaf:name "Stian Soiland-Reyes" ] .

<helloworld.t2flow>    rdf:type ro:Resource .
<artifact/hello>       rdf:type ro:Resource .

A Basic Workflow Research Object

The Wf4Ever RO model is designed to be a workflow-centric RO model. It provides specific constructs to describe information about workflows and their runs. For example, the previous example can be revised and expressed more precisely using the wfdesc and wfprov vocabularies.

<> a ro:WorkflowResearchObject ;    
    ore:aggregates <helloworld.t2flow> ;
    ore:aggregates <artifact/hello> ;
    ore:aggregates :ann1 ;
    dct:created "2011-12-02T15:01:10Z"^^xsd:dateTime ;
    dct:creator [ a foaf:Person; foaf:name "Stian Soiland-Reyes" ] .
    
<helloworld.t2flow>    rdf:type   wfdesc:Workflow .
<artifact/hello>       rdf:type   wfprov:Artifact .

Provide Annotations to an RO

Structured annotations can be added to research objects and their aggregated resources. The Wf4Ever RO model uses the Annotation Ontology 2.0 to express these annotations. They are declared as separate RDF resources, which can independently be aggregated. AO can be used to describe the link between these RDF resources (the annotation body) and the resources they annotate (annotated resource). It also provides an anchor point for describing who made the annotation and when.

Multiple annotation bodies (ao:body) can annotate the same resource (ao:annotatesResource), and a single annotation (ao:Annotation) can annotate multiple resources. Each annotation should have a dct:creator and a dct:created to specify who created the annotation and when. Details about how to use the Annotation Ontology can be found in the Annotation Ontology documentation [[AO]].
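The first case described above, several annotation bodies about one resource, can be sketched as follows. This is a minimal illustration under stated assumptions: the example.org URIs, the annotation names (:annA, :annB) and the two body files are hypothetical, and the timestamp uses dct:created from DCMI Metadata Terms.

```turtle
@prefix ro:   <http://purl.org/wf4ever/ro#> .
@prefix ao:   <http://purl.org/ao/> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix :     <http://example.org/ro1/annotations#> .   # hypothetical namespace

# Two annotations, each with its own body, about the same resource.
:annA a ro:Annotation, ao:Annotation ;
    ao:annotatesResource <http://example.org/ro1/data.csv> ;
    ao:body              <http://example.org/ro1/dataTitle.ttl> ;
    dct:creator          _:alice ;
    dct:created          "2011-07-14T15:02:14Z"^^xsd:dateTime .

:annB a ro:Annotation, ao:Annotation ;
    ao:annotatesResource <http://example.org/ro1/data.csv> ;
    ao:body              <http://example.org/ro1/dataQuality.ttl> ;
    dct:creator          _:alice ;
    dct:created          "2011-07-15T09:30:00Z"^^xsd:dateTime .

# Hypothetical creator of both annotations.
_:alice a foaf:Person ;
    foaf:name "Alice Example" .
```

Keeping one body per annotation means each body can carry its own creator and creation date, which a single merged body could not express.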

Annotate an RO as a Whole

Annotations on a research object can apply to the aggregation as a whole. This can for instance be its title, description, general classifications or relations to other research objects. For example, the following RDF shows how we can provide additional descriptions about the HelloWorld research object. We keep these annotations in a separate resource, and we also provide provenance information about these annotations.

# http://www.example.com/ro1/manifest
@prefix ro: <http://purl.org/wf4ever/ro#> .
@prefix ore: <http://www.openarchives.org/ore/terms/> .
@prefix ao: <http://purl.org/ao/> .
@prefix dct: <http://purl.org/dc/terms/> .

<>  a ro:ResearchObject, ore:Aggregation ;
    ore:aggregates      <helloworld.t2flow>, <stiansComments.ttl>, :ann1  ;
    ore:isDescribedBy   <manifest> .
    
:ann1 a ro:Annotation, ao:GraphAnnotation ; 
      ao:annotatesResource <> ; # The RO
      ao:body              <stiansComments.ttl> ;
      dct:creator          _:stian ;
      dct:created          "2011-07-14T15:02:14Z"^^xsd:dateTime .
    
<helloworld.t2flow>    a ro:Resource, ore:AggregatedResource .    
<stiansComments.ttl> a ore:AggregatedResource .

The actual annotation body is stored in a separate resource, and can be made accessible via a URL.

# http://www.example.com/ro1/stiansComments.ttl
<>   dct:title "A lovely research object"@en ; 
     dct:description "An example of how I think ROs should look like"@en .

Annotations, such as those about a workflow, should be kept in a separate resource. Because the annotations live in their own resource, they can be aggregated by an RO, and their provenance (for example, who created them and when) can be recorded. This is an important pattern that should be followed when creating RO annotations. Annotations can also be reused as long as they remain true outside the context in which they were initially created; for example, annotations about a workflow that hold no matter which RO aggregates it.

Annotate a Workflow

Annotations are useful for typing and describing the resources that form a research object. An annotation on a component of an RO, such as the workflow in this case, can be either globally true (it holds independently of any specific RO) or locally true (it holds only within the context of a particular RO). The two examples below show how each case can be expressed.

# http://www.example.com/ro1/manifest
<> a ro:ResearchObject, ore:Aggregation ;
    ore:aggregates     <helloworld.t2flow>, <workflowMetadata.ttl>, :ann2  ;
    ore:isDescribedBy  <manifest> .
    
:ann2 a ro:Annotation, ao:GraphAnnotation ;
      ao:annotatesResource  <helloworld.t2flow> ;
      ao:body               <workflowMetadata.ttl> ;
      dct:creator           _:stian ;
      dct:created           "2011-07-14T15:31:14Z"^^xsd:dateTime .
    
<helloworld.t2flow>         a ro:Resource, ore:AggregatedResource, ro:Workflow .
<workflowMetadata.ttl>    a ore:AggregatedResource .

As in the previous example, the actual annotation body is stored in a separate resource, such as http://www.example.com/ro1/workflowMetadata.ttl, which can provide the following annotations about the workflow.

# http://www.example.com/ro1/workflowMetadata.ttl
<helloworld.t2flow> a wfdesc:Workflow ;
                    dct:description "Workflow for sequence analysis"@en .

This example both establishes that <helloworld.t2flow> is typed as a workflow (as asserted by Stian) and gives a general description of the workflow. This information could, for instance, come from where the workflow was originally retrieved, such as the social workflow-sharing web site myExperiment, or from the workflow definition itself.

A second annotation describes the workflow with respect to this HelloWorld research object, by annotating the workflow proxy instead of <helloworld.t2flow>. In OAI-ORE a Proxy allows users to make statements about an aggregated resource in the context of a particular aggregation (such as an ro:ResearchObject). By adopting this construct it is possible to say, for instance, why the workflow exists within this research object. Such an annotation would not necessarily apply to the workflow when it is aggregated in a different research object, although it may be retrieved anyway.

# http://www.example.com/ro1/manifest
# ...
<> ore:aggregates    <helloworld.t2flow>, <whyThisWorkflow.ttl>, :ann3  ;
    ore:isDescribedBy <manifest> .
    
:workflowProxy a ore:Proxy ;
            ore:proxyFor <helloworld.t2flow> ;
            ore:proxyIn <> .
    
:ann3 a ro:Annotation, ao:GraphAnnotation ;
      ao:annotatesResource  :workflowProxy ;
      ao:body               <whyThisWorkflow.ttl> ;
      dct:creator           _:stian ;
      dct:created           "2011-07-14T16:21:14Z"^^xsd:dateTime .

# http://www.example.com/ro1/whyThisWorkflow.ttl
:workflowProxy dct:description "Best workflow I could find for now"@en .

Note: It would be valid, but not required, to also include :ann3 ao:annotatesResource <helloworld.t2flow> in the manifest, because the annotation body here is still "somewhat about" <helloworld.t2flow>, even though that URI does not appear directly within the graph.
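The optional extra target mentioned in the note could be written as in the sketch below. The prefix bound to : is a hypothetical namespace standing in for wherever :ann3 and :workflowProxy are defined in the manifest; only the second ao:annotatesResource statement is new.

```turtle
@prefix ro: <http://purl.org/wf4ever/ro#> .
@prefix ao: <http://purl.org/ao/> .
@prefix :   <http://www.example.com/ro1/manifest#> .   # hypothetical namespace

:ann3 a ro:Annotation, ao:GraphAnnotation ;
      # RO-specific target: the proxy for the workflow within this RO
      ao:annotatesResource :workflowProxy ;
      # Optional additional target: the workflow itself
      ao:annotatesResource <http://www.example.com/ro1/helloworld.t2flow> ;
      ao:body              <http://www.example.com/ro1/whyThisWorkflow.ttl> .
```

Listing both targets makes the annotation discoverable from the workflow URI while the body's statements remain about the proxy.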

Annotate Other Types of Components of an RO

Relationships between the resources that constitute an RO are encoded using annotations: an ro:Annotation has multiple ao:annotatesResource properties, and a single annotation body relates the annotated resources to each other. The wfprov ontology, for instance, allows you to describe that a particular artifact was the output of a workflow run. It can also be used to link the output with the input of the run, although no inputs were used in the simple running example, the "Hello World" workflow.

### such as output of a workflow run
<output>    ore:aggregatedBy   <> .

<>  a ro:ResearchObject ;
    ore:aggregates <output>, <run_helloworld>, :ann2 .
      
:ann2    a ao:GraphAnnotation ;
         ao:annotatesResource <output> ; ## link to a resource
         ao:annotatesResource <run_helloworld> ; ## with another resource
         ao:body              <output_annotations.ttl> .
        
### annotation to an output data
### <output_annotations.ttl> contains:

<output> dct:title             "Output from an example workflow"@en;
         foaf:topic            go:ProteinSequence ;
         wfprov:wasOutputFrom  <run_helloworld> .

Describe Provenance of an RO and its Components

Various types of provenance information can be associated with an RO and its components, depending on the granularity we are dealing with: the provenance of the RO itself (e.g., its evolution, updates and modifications made to the main structure of the RO), the provenance of each of its components (e.g., the workflow template, papers or other sources aggregated in the RO, their evolution, etc.), or the provenance of the workflow results.

Provenance of an RO

This type of provenance can describe either simple attribution information about an RO or refer to the changes and evolution "suffered" by the RO from its creation to its current state. Attribution metadata of the RO, such as creator, or date of release, can be captured through annotations, as shown in Section 3.3.

During its evolution an RO can have several versions, each refining the previous one, and these versions may be in different states (live, archived, published, etc.). This information about the versioning of ROs and their evolution is currently being captured in a separate roevo ontology.

Provenance of a Workflow

Workflows are among the most important items in an RO aggregation. As with ROs, workflows can evolve, be updated, and be reused in different ROs. In the RO model, this kind of provenance is captured by annotations, as shown in Section 3.1. In particular, we could extend the example shown in the aforementioned section to provide additional provenance metadata about the workflow <helloworld.t2flow> (the new triples are the four that follow dct:description in the example):

# http://www.example.com/ro1/workflowMetadata.ttl
<helloworld.t2flow> a wfdesc:Workflow ;
     dct:description "Workflow for sequence analysis"@en ;
     dct:created "2011-12-02T15:01:10Z"^^xsd:dateTime ;
     dct:creator  _:stian ;
     dct:version   "2";
     dct:replaces <helloworld0.1.t2flow> .

Users can also employ additional provenance-related vocabularies, such as the PAV vocabulary or the PROV Ontology, which is being standardised by the W3C Provenance Working Group.
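As a rough sketch of what that could look like, the fragment below restates the workflow's versioning and attribution using PAV and PROV-O terms. The choice of properties is an assumption made for illustration and is not prescribed by the RO model; the @base and resource names follow the running example.

```turtle
@base <http://www.example.com/ro1/> .
@prefix pav:  <http://purl.org/pav/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<helloworld.t2flow>
    # PAV: authoring and versioning
    pav:version          "2" ;
    pav:createdBy        _:stian ;
    pav:createdOn        "2011-12-02T15:01:10Z"^^xsd:dateTime ;
    pav:previousVersion  <helloworld0.1.t2flow> ;
    # PROV-O: the same facts expressed as general provenance
    prov:wasRevisionOf   <helloworld0.1.t2flow> ;
    prov:wasAttributedTo _:stian ;
    prov:generatedAtTime "2011-12-02T15:01:10Z"^^xsd:dateTime .

_:stian a foaf:Person, prov:Agent ;
    foaf:name "Stian Soiland-Reyes" .
```

In practice one vocabulary is usually enough; both are shown here only to compare how the same statements are expressed.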

Provenance of a Workflow Result

Provenance of workflow results refers to the record of some or all of the actions that occurred during a workflow execution and led to the existence of these results. The provenance can describe the inputs and intermediate results from which the final outputs were derived, when the derivation started and ended, etc. This is modeled by the wfprov vocabulary.

The next example shows how an output of an experiment (<output>) was produced by a process run (<processrun-workflow>) that used a dataset (<input>) as input. <processrun-workflow> is the only process run taking place in the workflow execution <workflowrun-workflow>.

### annotation to an output data
### <output_annotations.ttl> contains:

<output>  rdf:type wfprov:Artifact;
             wfprov:wasOutputFrom  <processrun-workflow> .
             
<input> rdf:type wfprov:Artifact, wf4ever:Dataset.

<processrun-workflow>   
                rdf:type    wfprov:ProcessRun ;
                wfprov:usedInput <input>;
                wfprov:wasPartOfWorkflowRun <workflowrun-workflow> .
                
<workflowrun-workflow> 
                rdf:type        wfprov:WorkflowRun .

Workflow Description

With its focus on describing workflows, the Wf4Ever Workflow RO model is also able to describe workflow-specific concepts, such as workflow templates and their relationships with executions. This is supported by the wfdesc vocabulary.

Following the example presented in the previous section, we now want to link the executed process to the corresponding step in the template (<templProcess>), since the template may contain additional information about how to proceed in order to reproduce the results. In the next example, the triples added to make this connection are the wfprov:describedByProcess statement and the description of <templProcess>.

### annotation to a process that generated the output data
### contained in <output_annotations.ttl>:
<processrun-workflow>   
                rdf:type                    wfprov:ProcessRun ;
                wfprov:usedInput            <input>;
                wfprov:wasPartOfWorkflowRun <workflowrun-workflow> ;
                wfprov:describedByProcess  <templProcess>.
                
<templProcess> a wfdesc:Process;
                  wfdesc:hasInput <i1>;
                  wfdesc:hasOutput <o1>.

Now we also link the artifacts to their corresponding descriptions:

                                
### annotation to an output data
### contained in <output_annotations.ttl>:

<output>  rdf:type wfprov:Artifact;
             wfprov:wasOutputFrom  <processrun-workflow>;
             wfprov:describedByParameter <o1>.

<input> rdf:type wfprov:Artifact, wf4ever:Dataset;
            wfprov:describedByParameter <i1>.
            
<i1> a wfdesc:Parameter, wfdesc:Input;
       wfdesc:hasArtifact <templArtifactI>.

<o1> a wfdesc:Parameter, wfdesc:Output ;
       wfdesc:hasArtifact <templArtifactO>.
       
<templArtifactI> a wfdesc:Artifact.
<templArtifactO> a wfdesc:Artifact.

The wfdesc ontology can be used to describe the template of a workflow. The example below illustrates how the "Hello World" workflow can be described.

<helloworld.t2flow> a wfdesc:WorkflowTemplate ;
  wfdesc:hasSubProcess :hello .

:hello a wfdesc:Process ;
  wfdesc:hasOutput :greeting .
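The template description above can be tied to a run of the workflow by combining wfdesc and wfprov terms. The fragment below is a sketch under assumptions: the run identifiers (<workflowrun-helloworld>, <processrun-hello>) and the namespace bound to : are hypothetical, and the workflow is typed as wfdesc:Workflow, as in the earlier examples.

```turtle
@base <http://www.example.com/ro1/> .
@prefix wfdesc: <http://purl.org/wf4ever/wfdesc#> .
@prefix wfprov: <http://purl.org/wf4ever/wfprov#> .
@prefix :       <http://www.example.com/ro1/helloworld#> .   # hypothetical namespace

# Template: one process with one output parameter.
<helloworld.t2flow> a wfdesc:Workflow ;
    wfdesc:hasSubProcess :hello .

:hello a wfdesc:Process ;
    wfdesc:hasOutput :greeting .

:greeting a wfdesc:Output .

# Run: each provenance resource points back to the template element it enacted.
<workflowrun-helloworld> a wfprov:WorkflowRun ;
    wfprov:describedByWorkflow <helloworld.t2flow> .

<processrun-hello> a wfprov:ProcessRun ;
    wfprov:describedByProcess   :hello ;
    wfprov:wasPartOfWorkflowRun <workflowrun-helloworld> .

<artifact/hello> a wfprov:Artifact ;
    wfprov:wasOutputFrom        <processrun-hello> ;
    wfprov:describedByParameter :greeting .
```

With these links in place, a consumer can start from the result <artifact/hello> and follow it back to the process and workflow that describe how it was produced.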

Introduction