This repository contains the demonstrator of the vocabulary hub created for the deployEMDS project.
It provides the following interfaces:
- Data Portal: Loads DCAT-AP feeds and the datasets they contain, and enables their mapping into RDF using YARRRML.
- RDF Portal: Displays the dataset distributions available as RDF, their linked profiles, and the option to export them according to a certain profile using the alignment pipelines.
- Alignment Pipelines: Displays the current alignment pipelines based on SPARQL Construct queries available in the system, as well as the option to add additional pipelines by providing a SPARQL Construct query to the system.
- Dataset Profile Registry: Provides an overview of the loaded dataset profiles, and their connected datasets and pipelines in the system.
Prior to running the demo, we need to set up some supporting services via Docker.
To run the local YARRRML mapping service, run the following Docker Compose file as `docker-compose.yml`:

```yaml
services:
  yarrrml-map:
    image: ghcr.io/dexagod/yarrrml-to-rml-service-docker:latest
    ports:
      - "3000:3000"
    environment:
      - PORT=3000
      - DEFAULT_SERIALIZATION=nquads
```

To run the local Oxigraph service, run the following Docker Compose file as `docker-compose.yml`:
```yaml
services:
  oxigraph:
    image: oxigraph/oxigraph:latest
    container_name: oxigraph
    ports:
      - "7878:7878"
    command: ["serve", "--bind", "0.0.0.0:7878", "--location", "/tmp/oxigraph", "--cors"]
    restart: "no"
```

To add an (LDES) DCAT-AP feed to the system:
- Navigate to the Data Portal page
- Click the "Add Feed" button
- Add the URL of the (LDES) DCAT-AP feed
- The system will automatically discover and list available datasets
For the demo, the feed at https://pod.rubendedecker.be/scholar/projects/deployEMDS/feeds/results-feed was added, for which the appropriate mappings have been pre-filled in the input fields.
Notes:
For the demo, added feeds and pipelines are stored internally in the webpage, and will need to be re-loaded when re-launching the application. When adding a feed, please select "Traffic Counting DCAT-AP Feed" as the target feed, and reload the webpage after doing the mapping; there is a small loading issue that I still intend to fix.
Once feeds are added, you can:
- Search datasets by title, description, or publisher
- Filter the datasets on keywords
- Select datasets for mapping
The loaded datasets that are not yet published in an RDF format can now be mapped to RDF using a YARRRML -> RML -> RDF pipeline in the Dataset RML Mapping component.
- Select the dataset(s) to map in the Dataset Browser component.
- Add a YARRRML mapping
- Select a mapping service to perform the YARRRML -> RML -> RDF conversion for your chosen dataset(s)
- Add a mapping target location, where the mapped RDF dataset should be POSTed.
- Select the feed to which this new distribution of the dataset should be added.
- Either select a profile, or create a new profile, under which the resulting RDF document is published
- After updating the feed, it is refreshed to load the new dataset.
Notes:
The current implementation treats every input resource as the source "data.json". This is for demo purposes; a solution still needs to be found for automatically mapping loaded resources to the sources defined in available mappings.
The mapping service can be found at https://github.com/Dexagod/yarrrml-to-rml-service-docker. You can run this locally and use the default URL.
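To illustrate, a minimal YARRRML mapping for a JSON source named "data.json" could look like the following sketch; the prefixes, iterator, and field names (`id`, `count`, `location`) are invented for this example and must be adapted to the actual dataset:

```yaml
prefixes:
  ex: "http://example.org/"
  sosa: "http://www.w3.org/ns/sosa/"

mappings:
  observation:
    sources:
      # iterate over every item in the (hypothetical) observations array
      - ['data.json~jsonpath', '$.observations[*]']
    s: ex:observation/$(id)
    po:
      - [a, sosa:Observation]
      - [sosa:hasSimpleResult, $(count)]
      - [ex:location, $(location)]
```

The mapping service translates such a YARRRML document into RML rules and executes them over the selected dataset(s).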
Similarly to the Data Portal page, datasets can be filtered using:
- Search on title, description, or publisher
- The feed component, to filter by feed
- The used keywords
- Selection of the resulting datasets in the dataset browser component
Exporting the chosen datasets is done with the Export Datasets component. Here, the selected datasets are exported by:
- loading the selected datasets
- (optional) mapping the resulting datasets into the target profile using the available pipelines
- loading the resulting datasets and mappings into the target graph store (the default URL is set up for a local Oxigraph service hosted in Docker)
- the target named graph in which the resulting datasets should be loaded can be changed, or left on the default graph
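The loading step can be sketched with plain HTTP against the local Oxigraph service, assuming Oxigraph's default endpoints (the Graph Store Protocol at `/store` and SPARQL queries at `/query`); the file name and graph IRI below are placeholders:

```shell
# Load a Turtle file into a named graph via the Graph Store Protocol
curl -X POST 'http://localhost:7878/store?graph=http://example.org/graphs/exported' \
  -H 'Content-Type: text/turtle' \
  --data-binary @exported-dataset.ttl

# Verify the load by counting the triples in that named graph
curl -X POST http://localhost:7878/query \
  -H 'Content-Type: application/sparql-query' \
  -H 'Accept: application/sparql-results+json' \
  --data 'SELECT (COUNT(*) AS ?triples) WHERE { GRAPH <http://example.org/graphs/exported> { ?s ?p ?o } }'
```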
To add a new alignment pipeline:
- Select the "Add pipeline" button in the Pipeline Sources component
- Enter a name for the new pipeline
- Select a source and target profile between which the pipeline performs an alignment
- Select the pipelines feed to which the new pipeline should be added
- Add relevant keywords
- Add the SPARQL Construct query that performs the profile alignment
- Select "Add pipeline" to finish the process
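As an illustration, a minimal alignment query could look as follows; the mapping from `schema:name` to `dct:title` is an invented example, not one of the demo's actual pipelines:

```sparql
PREFIX schema: <http://schema.org/>
PREFIX dct: <http://purl.org/dc/terms/>

# Align a source profile that uses schema:name
# to a target profile that expects dct:title
CONSTRUCT {
  ?dataset dct:title ?title .
}
WHERE {
  ?dataset schema:name ?title .
}
```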
The Pipeline Browser component enables browsing the available pipelines in the vocabulary hub feeds. This component shows the source and target profiles of each pipeline, as well as the SPARQL Construct query used to perform the alignment.
Notes:
Since the alignment is performed through a Docker container at the client or a data space service, methods other than SPARQL Construct can also be employed for this alignment.
This page keeps track of the dataset profiles used in the published datasets and alignment pipelines. The concept of a Dataset Profile is used to provide a comprehensive description of a dataset, based on the availability of both the ontologies used and the SHACL shapes assigned to the contents of a dataset.
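For example, a dataset profile could combine an ontology reference with a SHACL shape along the following lines; the shape below is a hypothetical sketch (the class and property choices are invented):

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix sosa: <http://www.w3.org/ns/sosa/> .
@prefix ex: <http://example.org/> .

# Every observation in the dataset must carry exactly one result value
ex:ObservationShape
    a sh:NodeShape ;
    sh:targetClass sosa:Observation ;
    sh:property [
        sh:path sosa:hasSimpleResult ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
```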
The Vocabulary Hub operates at a point between the client and the server. The server component of the Vocabulary Hub keeps track of a set of "feeds" that are persisted, maintained and updated on the server. This includes the tracked DCAT-AP feeds, dataset profile alignment pipeline feeds, and any other data that should be persisted at ecosystem level.
Based on the availability of semantic data, dataset profile metadata and alignment pipelines, alignments can be performed both at the edge by the client, and by distributed services available in the data ecosystems.
The resulting resources of RML mapping or Semantic Alignment mapping processes can be re-published to the data space as alternative distributions of the same datasets using DCAT.
The role of the Vocabulary Hub is to work in tandem with the existing data space actors to facilitate the publishing and integration of semantically rich data in the data space. This demonstrator centralizes different parts of this process into a single Web interface, which can be separated into different components in the data space.
- Data portal: The data portal interface loading the DCAT-AP feeds in the ecosystem represents the role of the data catalog in the data spaces ecosystem. Here, datasets are added, shared and published. The mapping service represents a data publisher (or an automated service in the data catalog) performing a semantic mapping of a published dataset and publishing this mapped semantic dataset as a DCAT distribution of the original dataset, while including information about the semantic mapping as a dataset profile. This profile can either be pushed directly to a vocabulary hub component, or pulled indirectly by the vocabulary hub from the catalog publishing this metadata.
- RDF portal: The RDF portal interface provides an overview of the semantically enriched datasets available in the vocabulary hub, linking their used dataset profiles, and allowing the export of the available datasets based on a target profile description and the available alignment pipelines. This represents the combined role of multiple components: the data catalog storing the dataset metadata from which distributions of relevant datasets in an RDF format are retrieved, the vocabulary hub where the dataset profiles and alignments (in this case SPARQL Construct queries) between these profiles are stored, and the data consumer that imports the datasets and executes the alignments according to their data requirements.
- Alignment pipelines: The alignment pipelines interface provides an overview of the alignment information registered in the vocabulary hub. The execution of these pipelines takes the form of the data consumer retrieving the SPARQL Construct queries used to convert from a source to a target profile, executing them over the source inputs from the data catalog according to the pipeline's source and target profiles, and inserting the resulting RDF, in the aligned profile, into their local graph store.
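Under the assumption of a local Oxigraph store with its default endpoints (`/query` for SPARQL, `/store` for the Graph Store Protocol), this execution flow could be sketched as two HTTP calls; the query file and graph IRI are placeholders:

```shell
# 1. Run the pipeline's CONSTRUCT query over the loaded source data,
#    requesting the result as Turtle
curl -X POST http://localhost:7878/query \
  -H 'Content-Type: application/sparql-query' \
  -H 'Accept: text/turtle' \
  --data-binary @pipeline-alignment.rq \
  -o aligned.ttl

# 2. Insert the aligned RDF into the consumer's target named graph
curl -X POST 'http://localhost:7878/store?graph=http://example.org/graphs/aligned' \
  -H 'Content-Type: text/turtle' \
  --data-binary @aligned.ttl
```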
- Dataset profile registry: The dataset profile registry gives an overview of the profiles used in the dataset metadata and pipelines available in the data space. These can be persisted in the vocabulary hub service, or discovered ad hoc by processing the dataset and alignment metadata available in the vocabulary hub and data catalog.