RDF schema and path finder

Rationale

Biological entities, such as genes and proteins, are inherently connected and can form complex networks when data from multiple sources are aggregated. Data providers are increasingly moving towards the Resource Description Framework (RDF) and a Linked Data approach for publishing data. RDF provides a mechanism to describe schemas that capture the nature of the relations that hold between entities. Effectively documenting an RDF schema is crucial for users who want to explore or query a linked dataset; however, adequate tooling to support both the visualization and exploration of datasets is currently lacking.

This project has two phases. During the Google Summer of Code period, we aim to finish phase 1, “Schema Visualization”, and advance as far as possible into phase 2, “Path Finder”.

Schema Visualization: Different approaches exist for visualizing RDF (see this overview). However, not much has been done regarding the visualization of schemas. This is the first problem we want to address.

Path Finder: Knowing a model schema is a good starting point, but it is not enough when multiple datasets are involved. When traversing multiple datasets, direct links or short paths become convenient because they make searching and data integration easier. RelFinder is useful for finding paths between two entities in the same dataset, but it does not yet work across multiple datasets. A more complex issue arises when only the starting point is known; for instance, starting from a chemical, what kinds of entities in protein or gene datasets can I reach? That is the second problem we want to address.

Approach

Schema Visualization: Initially, we need to extract the high-level entities used in an RDF dataset. For instance, multiple Gene Ontology (GO) terms might be in use, but in the end all of them belong to the same broad type, “GO term”; thus, in a schema we only want to see a single link to “GO terms”. Once the schema parser is in place, we can move on to the visualization part.

For instance, a Gene Expression Atlas schema looks like:

Fig. 1: Gene Expression Atlas RDF schema
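
The extraction step described above can be sketched roughly as follows. This is only an illustration of the idea, not the project's parser: it assumes the dataset is already available as plain (subject, predicate, object) tuples, and every identifier in it (`ex:Gene`, `ex:annotatedWith`, the `go:` names, the `broad_type` mapping and the `extract_schema` function) is hypothetical. Python is used here only to keep the sketch short; the same logic carries over to JavaScript.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def extract_schema(triples, broad_type=None):
    """Summarise instance-level triples into type-level schema edges.

    triples:    iterable of (subject, predicate, object) strings
    broad_type: optional mapping from a specific type to the broad type
                that should be shown in the schema instead
    Returns a Counter mapping (subject type, predicate, object type)
    to the number of instance-level triples behind that edge.
    """
    broad_type = broad_type or {}
    triples = list(triples)

    # Pass 1: record the (possibly collapsed) types of every typed resource.
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(broad_type.get(o, o))

    # Pass 2: lift every other triple to the type level.
    edges = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for s_type in types.get(s, ()):
            for o_type in types.get(o, ()):
                edges[(s_type, p, o_type)] += 1
    return edges


# Hypothetical toy data: two genes annotated with two different GO classes.
triples = [
    ("ex:gene1", RDF_TYPE, "ex:Gene"),
    ("ex:gene2", RDF_TYPE, "ex:Gene"),
    ("go:0001", RDF_TYPE, "go:BiologicalProcess"),
    ("go:0002", RDF_TYPE, "go:MolecularFunction"),
    ("ex:gene1", "ex:annotatedWith", "go:0001"),
    ("ex:gene2", "ex:annotatedWith", "go:0002"),
]
broad = {
    "go:BiologicalProcess": "GO term",
    "go:MolecularFunction": "GO term",
}

schema = extract_schema(triples, broad)
for (s_type, pred, o_type), count in schema.items():
    print(s_type, pred, o_type, count)
```

With the toy data, both annotations collapse into a single schema edge from `ex:Gene` to “GO term”, which is the behaviour the paragraph above asks for. Untyped resources and literals are simply skipped in this sketch; a real parser would need a policy for them.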

Path Finder: We need to create direct links or short paths across multiple datasets in an automatic or semi-automatic way. You can find more info here.
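
One possible way to approach the "only the starting point is known" question is a breadth-first search over type-level schema edges merged from several datasets. The sketch below is an assumption about how this could work, not a description of an existing tool: the edge list, the predicate names and the `reachable_types` function are all hypothetical, and edges are followed in their stated direction only.

```python
from collections import deque


def reachable_types(schema_edges, start, max_hops=3):
    """Breadth-first search over type-level schema edges.

    schema_edges: iterable of (source type, predicate, target type)
    start:        the entity type the user starts from
    max_hops:     maximum path length to explore
    Returns a dict mapping each reachable type to one shortest path,
    given as a list of (source type, predicate, target type) steps.
    """
    adjacency = {}
    for src, pred, dst in schema_edges:
        adjacency.setdefault(src, []).append((pred, dst))

    paths = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if len(paths[node]) >= max_hops:
            continue  # do not expand beyond the hop limit
        for pred, nxt in adjacency.get(node, []):
            if nxt not in paths:  # first visit is a shortest path in BFS
                paths[nxt] = paths[node] + [(node, pred, nxt)]
                queue.append(nxt)

    del paths[start]
    return paths


# Hypothetical schema edges merged from chemical, protein and gene datasets.
edges = [
    ("Chemical", "ex:targets", "Protein"),
    ("Protein", "ex:encodedBy", "Gene"),
    ("Gene", "ex:annotatedWith", "GO term"),
]

paths = reachable_types(edges, "Chemical", max_hops=2)
for target, steps in paths.items():
    print(target, "via", " -> ".join(pred for _, pred, _ in steps))
```

Starting from “Chemical” with a two-hop limit, the search reaches “Protein” and “Gene” but stops before “GO term”. Limiting the number of hops keeps the result to the short paths this phase is interested in; linking equivalent entities across datasets, which such a search depends on, is not covered by the sketch.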

Challenges

Schema Visualization
  • Handling large datasets in RDF schemas
  • Finding a common way to extract RDF schemas from multiple heterogeneous datasets
  • Avoiding collisions and overlaps in the visualization
Path Finder
  • Handling large datasets in RDF schemas
  • Creating direct links or short paths across multiple datasets

Involved toolkits or projects

  • Projects: BioJS, EBI-RDF, UniProt-RDF
  • Toolkits: Gulp, Browserify, jQuery, D3, Mustache and any other JS library that the student finds convenient for addressing the problem.

Needed skills

  • Good understanding of JavaScript development
  • Basic understanding of RDF and Linked Data
  • Some understanding of bioinformatics data would be an advantage but is not required

Mentors

Leyla García (UniProt, EMBL-EBI) and Simon Jupp (Samples, Phenotypes and Ontologies, EMBL-EBI)