Table of Contents
Welcome to the SoilWise Technical Documentation!
SoilWise Technical Documentation currently consists of the following sections:
- Technical Components
- APIs
- Infrastructure
- Governance
- Glossary
- Printable version - all sections composed on one page, which can easily be printed using the web browser's print options
Release notes
Date | Action |
---|---|
30. 4. 2024 | v1.0 Released: For D1.3 Architecture Repository purposes |
27. 3. 2024 | Technical Components restructured according to the architecture from Brugges Technical Meeting |
27. 3. 2024 | v0.1 Released: Technical documentation based on the Consolidated architecture |
10. 2. 2024 | Technical Documentation was initialized |
Technical Components
Introduction
The SoilWise Repository (SWR) architecture aims at efficient facilitation of soil data management. It seamlessly gathers, processes, stores, and disseminates data from diverse sources. The system prioritizes high-quality data dissemination, knowledge extraction and interoperability, while user management and monitoring tools ensure secure access and system health. Note that SWR primarily serves to power Decision Support Systems (DSS) rather than being a DSS itself.
The presented architecture represents an outlook and a framework for ongoing SoilWise development. As such, the implementation will follow intrinsic (within the SoilWise project) and extrinsic (e.g. EUSO development, Mission Soil Projects) opportunities and limitations. The presented architecture is the first release out of two planned. Modifications during the implementation will be incorporated into the final version of the SoilWise architecture due in M42.
This section lists technical components for building the SoilWise Repository as foreseen in the architecture design. As for now, the following components are foreseen:
- Harvester
- Repository storage
- Meta-data Catalogue
- Metadata validation
- Transformation and Harmonisation Components
- Interlinker
- NLP & Large Language Model
- Map Server
- User interface: Dashboard
- User Management and Access Control
- Monitoring
A full version of architecture diagram is available at: https://soilwise-he.github.io/soilwise-architecture/.
Harvester
Important Links
Project: Metadata ingestion
The Harvester component is dedicated to automatically harvesting sources to populate the SWR with metadata on datasets and knowledge sources.
Automated metadata harvesting concept
Metadata harvesting is the process of ingesting metadata, i.e. evidence on data and knowledge, from remote sources and storing it locally in the catalogue for fast searching. It is a scheduled process, so the local copy and the remote metadata are kept aligned. Various components exist which are able to harvest metadata from various (standardised) APIs. SoilWise aims to use existing components where available.
The harvesting mechanism relies on the concept of a universally unique identifier (UUID) or uniform resource identifier (URI) that is commonly assigned by the metadata creator or publisher. Another important concept behind harvesting is the last change date. Every time a metadata record is changed, the last change date is updated. Storing this parameter and comparing it with a new one allows any system to find out whether the metadata record has been modified since the last update. An exception is when metadata is removed remotely; the SoilWise Repository can only derive that fact by harvesting the full remote content. Discussion is needed to understand whether SWR should keep a copy of the remote source anyway, for archiving purposes. All metadata with an update date newer than the last identified successful harvester run are extracted from the remote location.
Resource Types
Metadata for the following resource types are foreseen to be harvested:
- Data & Knowledge Resources (Articles/Datasets/Videos/Software/Services)
- Projects/LTE/Living labs
- Funding schemes (Mission-soil)
- Organisations
- Repositories/Catalogues
These entities relate to each other as:
flowchart LR
people -->|memberOf| o[organisations]
o -->|partnerIn| p[projects]
p -->|produce| d[data & knowledge resources]
o -->|publish| d
d -->|describedIn| c[catalogues]
p -->|part-of| fs[Fundingscheme]
Origin of harvested resources
Datasets
Datasets are primarily imported from ZENODO, the INSPIRE Geoportal, BonaRes and Cordis. In later iterations, SoilWise aims to also include other projects and portals, such as national or thematic portals. These repositories contain a huge number of datasets; selecting the key datasets within the SoilWise scope is a matter of know-how to be developed within SoilWise.
Knowledge sources
With respect to harvesting, it is important to note that knowledge assets are heterogeneous, and that (compared to data), metadata standards and particularly access / harvesting protocols are not generally adopted. Available metadata might be implemented using a proprietary schema, and basic assumptions for harvesting, e.g. providing a "date of last change" might not be offered. This will, in some cases, make it necessary to develop customized harvesting and metadata extraction processes. It also means that informed decisions need to be made on which resources to include, based on priority, required efforts and available capacity.
The SoilWise project team is still exploring which knowledge resources to include. An important cluster of knowledge sources are academic articles and report deliverables from Mission Soil Horizon Europe projects. These resources are accessible from Cordis, Zenodo and OpenAire. Extracting content from Cordis, OpenAire and Zenodo can be achieved using a harvesting task (using the Cordis schema, extended with post-processing). SoilWise aims to achieve this goal in the first iteration. In future iterations, new knowledge sources may become relevant; at that moment we will investigate the best approach to harvest them.
Catalogue APIs and models
Catalogues typically offer standardised APIs as well as tailored APIs to extract resources. Typically, the tailored APIs offer extra capabilities which may be relevant to SoilWise. However, in general we should adopt the standardised interfaces, because they allow us to use off-the-shelf components with a high TRL.
Standardised APIs are available for harvesting records from:
- Catalogue Service for the Web (OGC:CSW)
- Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
- SPARQL
- Sitemap.xml
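As an illustration of harvesting over one of these standardised interfaces, the sketch below pulls recently modified records from a CSW endpoint. It is a minimal sketch assuming the Python OWSLib package; the endpoint URL, queryable name and date are placeholders, not the actual SoilWise configuration.

```python
# Minimal sketch: incremental harvest of metadata records from a CSW endpoint.
# Assumes the OWSLib package; endpoint URL, queryable and date are examples only.
from owslib.csw import CatalogueServiceWeb
from owslib.fes import PropertyIsGreaterThanOrEqualTo

csw = CatalogueServiceWeb("https://example.com/csw")  # hypothetical endpoint

# Only fetch records modified since the last successful harvester run.
modified_since = PropertyIsGreaterThanOrEqualTo(
    propertyname="Modified", literal="2024-04-01"
)

csw.getrecords2(constraints=[modified_since], esn="full", maxrecords=50)
for identifier, record in csw.records.items():
    # Each record exposes Dublin Core style fields such as title and abstract.
    print(identifier, record.title)
```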
Semantic Web specifications for metadata
This section briefly reviews the specifications issued by the World Wide Web Consortium (W3C) for the encoding of metadata. The most relevant of these is DCAT; however, as is common practice in the Semantic Web, this ontology is meant to be used together with other specifications.
The web ontologies reviewed here not only complement each other but in many cases even overlap. Such is the nature of the Semantic Web: a single instance can simultaneously be declared as a `foaf:Person`, a `vcard:Individual` and a `prov:Agent`, thereby creating semantic links to multiple ontologies and greatly boosting its meaning and effectiveness.
Dublin Core
The Dublin Core Metadata Element Set (DCMES), better known simply as Dublin Core, was the first metadata infrastructure produced within the Semantic Web. It owes its name to a city in Ohio, where its foundations were laid in 1995. Dublin Core was formalised as ISO-15836 in 2003 and is maintained by the Dublin Core Metadata Initiative (DCMI), a branch of the Association for Information Science and Technology (ASIS&T). A revision was published in 2017 (ISO-15836-1:2017).
The first complete release of Dublin Core dates back to 2000, comprising fifteen metadata terms meant to describe physical and digital resources, independently of context. In its first iterations, these terms were loosely defined, without specification on their application to resources. In 2012 an RDF model was released, thereafter known as DCMI Metadata Terms. Still, it kept its flexibility, not imposing constraints on the resources with which the terms can be used.
The DCMI Metadata Terms are organised within four modules, summarised below:
- Elements: the original set of elements published in 2000, specified as RDF properties. Among other concepts, it includes `dc:contributor`, `dc:creator`, `dc:date`, `dc:identifier`, `dc:language`, `dc:publisher`, `dc:rights` and `dc:title`.
- Terms: includes the original fifteen elements but adds classes and restrictions on their use. This module also specifies relations between its elements, meant for a more formal application of the standard. Of note are the classes `dcterms:BibliographicResource`, `dcterms:LicenseDocument`, `dcterms:Location` and `dcterms:PeriodOfTime`. Among the predicates, `dcterms:license` and `dcterms:provenance` can be highlighted.
- DCMI Type: defines a further set of resource classes that may be described with Dublin Core metadata terms. This set includes classes such as `Collection`, `Dataset`, `Image`, `PhysicalObject`, `Service`, `Software`, `Sound` and `Text`.
- Abstract Model: meant to document metadata themselves and generally not expected to be applied by end users.
FOAF
Friend of a Friend (FOAF) was the first web ontology expressing personal relationships in OWL. It specifies axioms describing persons, how they relate to each other and to resources on the internet. From a personal profile described with FOAF, it is possible to automatically derive information such as the set of people known to two different individuals. As an early metadata specification, FOAF has been popular to relate and describe people associated with web resources. The ActivityPub specification, the basis of the Fediverse, was influenced by FOAF.
Among the concepts specified by FOAF feature `Person`, `Agent`, `Organization`, `Group`, `Document`, `PersonalProfileDocument`, `Image`, `OnlineAccount` and `Project`. These are related by a comprehensive collection of data and object properties whose meaning is mostly straightforward to understand.
VCard
In 2014, the W3C developed an ontology mapping elements of the vCard business card standard to OWL, abstracting persons, organisations and contacts. The vCard web ontology specifies a set of classes and properties, but without limiting ranges and domains on the latter. vCard is meant to be used together with other metadata ontologies, particularly Friend of a Friend (FOAF).
The main classes in vCard representing contactable entities are `Individual`, `Organisation` and `Group`. Among the contact-means classes are `Address`, `EMail`, `Location` and `Phone` (the latter specialised in various sub-classes). A collection of object properties relates these two kinds of classes together, with a further set of data-type properties providing the concrete definition of each contact instance.
DCAT
The Data Catalog Vocabulary (DCAT) is the de facto Semantic Web standard for metadata, maintained by the W3C. Its main purpose is to catalogue and identify data resources, re-using various concepts from other ontologies. In particular, terms from Dublin Core and classes from FOAF and VCard are part of the specification. DCAT is not restricted to representing metadata of knowledge graphs; it even encompasses the concept of multiple representations for the same data. Among the most relevant classes specified by DCAT are:
- `Resource`: any concrete thing on the Web, in principle identifiable by a URI. `Dataset`, `DataService` and `Catalog` are sub-classes of `Resource`.
- `Dataset`: a collection of data, published or curated by a single entity. In general it represents a knowledge graph that may be encoded and/or presented in different ways, and even be available from different locations.
- `DataService`: an operation providing data access and/or data processing. Expected to correspond to a service location on the internet (i.e. an endpoint).
- `Distribution`: a particular representation of a `Dataset` instance. More than one distribution may exist for the same dataset (e.g. Turtle and XML for the same knowledge graph).
- `Catalog`: a collection of metadata on related resources, e.g. available at the same location, or published by the same entity. A catalogue should represent a single location on the Web.
- `CatalogRecord`: a document or internet resource providing metadata for a single dataset (or other type of resource). It corresponds to the registration of a dataset with a catalogue.
- `Relationship`: specifies the association between two resources. It is a sub-class of the `EntityInfluence` class in the PROV ontology.
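To illustrate how these DCAT classes fit together, the sketch below builds a minimal dataset description with one distribution using the Python rdflib library; all URIs, titles and values are hypothetical.

```python
# Minimal sketch of a DCAT record built with rdflib; URIs and values are examples only.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
dataset = URIRef("https://example.com/data/soil-moisture")       # hypothetical PID
distribution = URIRef("https://example.com/data/soil-moisture/csv")

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Example soil moisture dataset", lang="en")))
g.add((dataset, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))
g.add((dataset, DCAT.distribution, distribution))

g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.mediaType, URIRef("https://www.iana.org/assignments/media-types/text/csv")))

print(g.serialize(format="turtle"))
```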
PROV
PROV defines a core domain model for provenance, to build representations of the entities, people, and processes involved in producing a piece of data or thing in the world. This specification is meant to express provenance records, containing descriptions of the entities and activities involved in producing, delivering or otherwise influencing a given object. Provenance can be used for many purposes, such as understanding how data was collected so it can be meaningfully used, determining ownership and rights over an object, making judgements about information to determine whether to trust it, verifying that the process and steps used to obtain a result comply with given requirements, and reproducing how something was generated.
PROV defines classes at a high level of abstraction. In most cases, these classes must be specialised to a specific level in order to be useful. The most relevant classes are:
- `Entity`: physical, digital, conceptual, or other kind of thing. Examples of such entities are a web page, a chart, or a spellchecker. Provenance records can describe the provenance of entities.
- `Activity`: explains how entities come into existence and how their attributes change to become new entities. It is a dynamic aspect of the world, such as actions, processes, etc. Activities often create new entities.
- `Agent`: takes a role in an activity such that the agent can be assigned some degree of responsibility for the activity taking place. An agent can be a person, a piece of software, an inanimate object, an organisation, or other entities that may be ascribed responsibility.
- `Role`: a description of the function or the part that an entity played in an activity. It specifies the relationship between an entity and an activity, i.e. how the activity used or generated the entity. Roles also specify how agents are involved in an activity, qualifying their participation in the activity or specifying for which aspect each agent was responsible.
Other domain models for metadata
Schema.org
Schema.org is an ontology developed by the main search engines to enrich websites with structured content about the topics described on a page (microdata). Schema.org annotations are typically added using an embedded JSON-LD document but can also be added as RDFa.
The relevant entities in the schema.org ontology are `DataCatalog` and `Dataset`.
Schema.org is used in repositories such as dataone.org and Google Dataset Search.
Datacite
DataCite is a list of core metadata properties chosen for the accurate and consistent identification of a resource for citation and retrieval purposes, along with recommended use instructions. The DataCite schema is common in academic tools such as DataCite, Dataverse, Zenodo, and OSF.
ISO19115
ISO/TC 211 developed the initial version of ISO19115 in 2003 and a follow-up in 2014. A working group is currently preparing a new version. It is a metadata model to describe spatial resources, such as datasets, services and features. Part of and related to this work are the models for data quality (ISO19157), services (ISO19119) and data models (ISO19110).
An XML serialisation of the models is available in ISO19139:2007. Although withdrawn, ISO19139:2007 is still the de-facto metadata standard in the geospatial domain in Europe, being used by the INSPIRE Directive as a harmonisation means for all geospatial environmental evidence.
Ecological Metadata Language (EML)
EML defines a comprehensive vocabulary and a readable XML markup syntax for documenting research data. It is in widespread use in the earth and environmental sciences, and increasingly in other research disciplines as well. EML is a community-maintained specification and evolves to meet the data documentation needs of researchers who want to openly document, preserve, and share data and outputs. EML includes modules for identifying and citing data packages, for describing the spatial, temporal, taxonomic, and thematic extent of data, for describing research methods and protocols, for describing the structure and content of data within sometimes complex packages of data, and for precisely annotating data with semantic vocabularies. EML includes metadata fields to fully detail data papers that are published in journals specializing in scientific data sharing and preservation.
Thesauri
Keywords referenced from metadata preferably originate from common thesauri. The section below provides a listing of relevant thesauri in the soil domain.
INSPIRE registry
The INSPIRE registry provides a central access point to a number of centrally managed INSPIRE registers. The content of these registers are based on the INSPIRE Directive, Implementing Rules and Technical Guidelines.
GEneral Multilingual Environmental Thesaurus (GEMET)
GEMET is a source of common and relevant terminology used under the ever-growing environmental agenda.
Agrovoc
AGROVOC Multilingual Thesaurus, including definitions from the World Reference Base on Soil description.
GLOSIS web ontology
The GLOSIS codelists are a community initiative originating from the GSP GloSIS initiative, including soil properties, soil description codelists, and soil analysis procedures.
GBIF
GBIF maintains thesauri for ecological phenomena such as species.
Persistent identification
The Uniform Resource Identifier (URI) is one of the earliest and most consequential specifications of the Semantic Web. Originally meant to identify web resources, it became a central piece of the Web of Data concept with the Resource Description Framework (RDF). In time, researchers understood not only its relevance in providing unique and universal identifiers to data, but also the importance of their longevity, past the lifetime of projects, organisations or institutions. Thus emerged the concept of the Persistent Unique Identifier (PI or PID, i.e. a URI valid "forever") and its recognition as the foundation of the Semantic Web and the FAIR initiative.
Within the SoilWise project, the persistent identification of metadata records (and eventually the resources that metadata describe) is therefore a fundamental aspect. The process or technology responsible for issuing and assigning PIDs in SoilWise has heretofore been known as Persistent Identifier Mint. In principle, it will rely on a third party for the allocation and resolution of PIDs, which will then be redirected to the SoilWise Repository. These PIDs can be used internally to identify metadata records, knowledge graphs in quad-stores, and even as alternative identifiers of external resources. The paragraphs below enumerate various options in this regard.
ePIC
Persistent Identifiers for eResearch (ePIC) was founded in 2009 by a consortium of European partners in order to provide PID services for the European Research Community, based on the Handle system. Consortium members signed a Memorandum of Understanding aiming to provide the necessary resources for the long term reliability of its PID services (allocation, resolution, long-term validity). ePIC has since expanded into an international consortium, open to partners from the research community worldwide.
ARK Alliance
The ARK Alliance is a global, open community supporting the ARK infrastructure on behalf of research and scholarship. This institution provides Archival Resource Keys (ARKs), which serve as persistent identifiers, or stable, trusted references for information objects, whether digital, physical or abstract. ARKs are meant to provide researchers (and other users) with long-term access to global scientific and cultural records. Since 2001, some 8.2 billion ARKs have been created by over 1000 organisations: libraries, data centres, archives, museums, publishers, government agencies and vendors. The ARK Alliance strives for seamless access to its PID services, in an open, non-paywalled and decentralised paradigm.
DataCite
DataCite was founded in 2009 on the principle of being an open stakeholder-governed community that is open to participation from organisations worldwide. This initiative was formed with the aim of safeguarding common standards worldwide to support research, thereby facilitating compliance with the rules of good scientific practice. DataCite maintains open infrastructure services to ensure that research outputs and resources comply with the FAIR principles. DataCite’s services are foundational components of the scholarly ecosystem. Among these services are the creation and management of PIDs.
The FREYA project
The FREYA project was funded by the European Commission under the Horizon 2020 programme and was active between 2017 and 2020. It aimed to build the infrastructure for persistent identifiers as a core component of open science, in the EU and globally. FREYA worked to improve discovery, navigation, retrieval, and access to research resources. New provenance services enabled researchers to better evaluate data and make the scientific record more complete, reliable, and traceable. The FREYA Knowledge Hub was designed to help users understand what persistent identifiers are, why they exist, and how to use them for research. It includes comprehensive guides and webinars to help start working with PIDs.
Architecture
Three options for a harvesting infrastructure are described below. The main difference between them is the scalability of the solution, which mainly depends on the frequency and volume of harvesting.
Traditional approach
Traditionally, a harvesting script is triggered by a cron job.
flowchart LR
HC(Harvest configuration) --> AID
AID(Harvest component)
RW[RDFwriter] --> MC[(Triple Store)]
AID --> RS[(Remote sources)]
AID --> RW
RS --> AID
Containerised approach
In this approach, each harvester runs in a dedicated container. The results of the harvester are ingested into a (temporary) storage, where follow-up processes pick up the results. Typically, these processes use existing containerised workflows such as Git CI/CD, Google Cloud Run, etc.
flowchart LR
c[CI-CD] -->|task| q[/Queue\]
r[Runner] --> q
r -->|deploys| hc[Harvest container]
hc -->|harvests| db[(temporary storage)]
hc -->|data cleaning| db[(temporary storage)]
Microservices approach
The microservices approach uses a dedicated message queue from which dedicated runners pick up harvesting, validation and cleaning tasks as soon as they are scheduled. Runners write their results back to the message queue, so that subsequent tasks are picked up by other runners.
flowchart LR
HC(Harvest configuration) -->|trigger| MQ[/MessageQueue\]
MQ -->|task| AID
AID --> MQ
MQ -->|task| DC
DC --> MQ
MQ -->|write| RW[RDFwriter]
AID(Harvest component)
RW --> MC[(Triple Store)]
AID --> RS[(Remote sources)]
At the beginning of the SoilWise development process, SoilWise will focus on the second (containerised) approach.
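To illustrate how a runner might pick up harvesting tasks from the queue shown in the diagrams above, a minimal sketch is given below. It assumes the RabbitMQ broker listed under Technology options and the Python pika client; the broker address, queue name and task format are placeholders.

```python
# Minimal sketch of a runner consuming harvest tasks from RabbitMQ.
# Assumes the pika client; broker address, queue name and task format are examples only.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="harvest-tasks", durable=True)

def handle_task(ch, method, properties, body):
    task = json.loads(body)
    # A real runner would call the harvest component here and publish results
    # back to the queue for validation/cleaning runners to pick up.
    print("harvesting", task.get("endpoint"))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="harvest-tasks", on_message_callback=handle_task)
channel.start_consuming()
```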
Foreseen functionality
A harvesting task typically extracts records with an update date later than the last identified successful harvester run.
Harvested content is (by default) not editable for the following reasons:
- The harvesting is periodic, so any local change to harvested metadata will be lost during the next run.
- The change date may be used to keep track of changes, so if the metadata gets changed locally, the harvesting mechanism may be compromised.
If inconsistencies in imported metadata are identified, a statement about the inconsistency can be added to the graph. The author can also be notified so they can fix the inconsistency on their side.
It is still to be discussed whether harvested content is removed as soon as a harvester configuration is removed, or when records are removed from the remote endpoint. The risk of removing content is that relations within the graph are broken. Instead, a record can be flagged as archived by the provider.
Typical functionalities of a harvester:
- Define a harvester job
- Schedule (on request, weekly, daily, hourly)
- Endpoint / Endpoint type (example.com/csw -> OGC:CSW)
- Apply a filter (only records with keyword='soil-mission')
- Understand success of a harvest job
- overview of harvested content (120 records)
- which runs failed, why? (today failed -> log, yesterday successful -> log)
- Monitor running harvesters (20% done -> cancel)
- Define behaviours on harvested content
- skip records with low quality (if test xxx fails)
- mint identifier if missing ( https://example.com/data/{uuid} )
- a model transformation before ingestion ( example-transform.xsl / do-something.py )
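As an illustration of such a harvester job definition, the sketch below models the configuration as a small Python data structure; all field names and values are hypothetical and simply mirror the functionalities listed above.

```python
# Hypothetical harvester job definition mirroring the functionalities above;
# field names, endpoint and filter values are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HarvesterJob:
    name: str
    endpoint: str
    endpoint_type: str                      # e.g. "OGC:CSW", "OAI-PMH", "SPARQL", "sitemap.xml"
    schedule: str = "weekly"                # on request, weekly, daily, hourly
    record_filter: Optional[str] = None     # e.g. "keyword='soil-mission'"
    skip_low_quality: bool = True           # skip records that fail quality tests
    mint_identifier_template: Optional[str] = None
    pre_ingest_transform: Optional[str] = None

job = HarvesterJob(
    name="mission-soil-datasets",
    endpoint="https://example.com/csw",
    endpoint_type="OGC:CSW",
    schedule="daily",
    record_filter="keyword='soil-mission'",
    mint_identifier_template="https://example.com/data/{uuid}",
    pre_ingest_transform="example-transform.xsl",
)
```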
Duplicates / Conflicts
A resource can be described in multiple catalogues, identified by a common identifier. Each of the harvested instances may contain duplicate, alternative or conflicting statements about the resource. The SoilWise Repository aims to persist a copy of the harvested content (also to identify if the remote source has changed). The Harvester component itself will not evaluate duplicates or conflicts between records; this will be resolved by the Interlinker component.
An aim of this exercise is also to understand in which repositories a certain resource is advertised.
Technology options
geodatacrawler, written in python, extracts metadata from various sources:
- Local file repository (metadata and various data formats)
- CSV of metadata records (each column represents a metadata property)
- remote identifiers (DOI, CSW)
- remote endpoints (CSW)
Google Cloud Run is a cloud environment to run scheduled tasks in containers on the Google platform; the results of tasks are captured in logs
Git CI/CD can run harvests and provides options to review CI/CD logs to check for errors
RabbitMQ is a commonly used message queue
Integration opportunities
The automatic metadata harvesting component will show its full potential when tightly connected within the SWR to (1) the SWR Catalogue, (2) data download and upload pipelines and (3) ETS/ATS, i.e. test suites.
Repository Storage
Important Links
Project: Storage
The SoilWise Repository is expected to fulfil the following functions:
Technology
Various storage options exist; dedicated usage scenarios usually have an optimal storage option. Maintenance effort will also be considered as part of the choice.
- Relational databases provide performant filtering and aggregation options that facilitate the performance of data APIs. Relational databases have a fixed data model.
- Search engines, such as SOLR/Elasticsearch, provide even higher performance and introduce faceted search (aggregations) and ranking customisation.
- File (& bucket) repositories, which are slow and non-queryable but very flexible in the data model, scalable and persistent.
- Graph and triple stores, which are well suited to storing relations between arbitrary entities and can reason over data in multiple domain models.
- Versioning systems (such as git), which are very slow and not queryable but ultimately persistent/traceable. Less optimal for binary files.
Storage of artefacts
Data model
‘To which data model shall I align?’ is the central question of data harmonisation efforts and data interoperability in general. SoilWise is aware of the fragmentation of soil data and the lack of harmonisation. As such, the SWR will, in the first project iteration cycle, focus on two major pan-European/global data modelling efforts within the soil domain.
- GloSIS (Global Soil Information System) is the name of both the system and the soil data model, also named the GloSIS domain model. The GloSIS domain model published as a UML class diagram is not publicly available, residing in the FAO repositories under a CC license. Nevertheless, the GloSIS web ontology is a publicly available implementation in the Web Ontology Language (OWL). The GloSIS web ontology employs a host of Semantic Web standards (SOSA, SKOS, GeoSPARQL, QUDT); GloSIS lays out not only a soil data ontology but also an extensive set of ready-to-use code lists for soil description and physico-chemical analysis. Various examples are provided on the provision and use of GloSIS-compliant linked data, showcasing the contribution of this ontology to the discovery, exploration, integration and access of soil data.
- INSPIRE (INfrastructure for SPatial InfoRmation in Europe) aims to create a spatial environmental data infrastructure for the European Union. A detailed data specification for the soil theme was published by the European Commission in 2013, supported by a detailed domain model documented as a UML class diagram.
Other (potentially) relevant data models are:
- World Reference Base (WRB) maintains the code lists which are the source of the GLOSIS codelists, but the WRB online presence is currently limited.
- Land use
- Land management practices
- Monitoring facilities
- Land cover
Open issues
Many data models are used for data harmonisation and interoperability within the soil domain. The following data models may also be potentially relevant for the SWR:
- SOTER: the Global and National Soils and Terrain Digital Databases (SOTER) was chronologically the first global soil spatial data harmonisation/interoperability initiative of the International Society of Soil Science (ISSS), in cooperation with the United Nations Environment Programme, the International Soil Reference and Information Centre (ISRIC) and the FAO. Albeit lacking an abstract formalisation (SOTER pre-dates both UML and OWL), the early SOTER databases remained a reference for developing subsequent soil information models.
- ISO 28258, “Soil quality — Digital exchange of soil-related data”, is one of the key achievements of the GS Soil project. This standard provides a general framework for exchanging soil data, recognising a need to combine soil data with other kinds of data. ISO 28258 is documented with a UML domain model, applying the O&M framework to the soil domain. An XML exchange schema is derived from this domain model, adopting the Geography Markup Language (GML) to encode geospatial information. The standard was conceived as an empty container, lacking any kind of controlled content. It is meant to be further specialised for actual use (possibly at a regional or national scale).
- ANZSoilML, the Australian and New Zealand Soil Mark-up Language (ANZSoilML), results from a joint effort by CSIRO in Australia and New Zealand’s Manaaki Whenua to support the exchange of soil and landscape data. Its domain model was possibly the first application of O&M to the soil domain, targeting the soil properties and related landscape features specified by the institutional soil survey handbooks used in Australia and New Zealand. ANZSoilML is formalised as a UML domain model from which an XML schema is obtained, relying on the ComplexFeature abstraction that underlies the SOAP/XML web services specified by the OGC. A set of controlled vocabularies was developed for ANZSoilML, providing values for categorical soil properties and laboratory analysis methods. More recently, these vocabularies were transformed into RDF resources to be managed with modern Semantic Web technologies.
Moreover, the GloSIS and INSPIRE data models fully support only vector data. GloSIS has not yet developed a data model for gridded data, and several issues have been reported for the INSPIRE data model for gridded data.
GloSIS and the INSPIRE soil model are oriented towards the OGC Observations and Measurements (O&M) standard; the new version of O&M, now named Observations, Measurements & Samples (OMS), introduces sample objects. SoilWise can probably contribute to the migration of the soil models to the new OMS version.
Soil health vocabulary
It needs to be understood whether soil health codelists, as developed in the Envasso and Landmark projects, can be adopted by the online soil community, for example as part of the GloSIS ontology, the INSPIRE registry or EUSO. Research is needed to evaluate whether a legislative body is available to confirm the definitions of the terms.
Storage of metadata
- Metadata is best stored on a git versioning system to trace its history and facilitate community contributions.
- Metadata is best stored in a graph database or triple store to validate interlinkage and facilitate harmonisation.
- Metadata is best queried from a database or search engine. Search engines, by default, offer ranking and faceting capabilities, which are hard to reproduce on databases, but search engines come at a high cost in terms of maintenance and memory use.
- All collected metadata will be archived once per year.
- Besides raw metadata, the results of the metadata validation process will be stored along with override values.
Storage of knowledge
- Storage (or non-storage) of knowledge is highly dependent on the type of knowledge, how it is to be used and the available resources for storage.
- As a minimum SWR stores metadata describing knowledge assets (unstructured content) – see section storage of metadata.
- Knowledge that expresses links between data and knowledge assets is best stored in a graph DB or an RDF DB, depending also on the application requirements.
- Knowledge that expresses semantics is best stored as RDF in an RDF DB, to be able to reason over semantic relationships.
- When knowledge needs to be reasoned over using LLMs, it is preferably processed and stored in a vector DB, potentially linked to relevant text fragments (for explainable AI).
- Querying knowledge is best done from an indexed DB or search engine (see section metadata) or from a vector DB (through chatbot / LLM applications).
Knowledge graph - Triple Store
The knowledge graph is meant to add a formal semantics layer to the metadata collected at the SWR. It mirrors the XML-based metadata harvested into the Catalogue Server but uses Semantic Web standards such as DCAT, Dublin Core, VCard or PROV. This metadata is augmented with links to domain web ontologies, in particular GloSIS. This semantically augmented metadata is the main pillar of knowledge extraction activities and components.
Besides metadata on knowledge assets, the knowledge graph is also expected to host the results of knowledge extraction activities. This assumes knowledge to be semantically loaded, i.e. linking to relevant domain ontologies. The identification of appropriate ontologies and ontology mappings thus becomes an essential aspect of this project, bridging together various activities and assets.
It is important to recognise that the knowledge graph is an immaterial asset that cannot exist by itself. In order to be usable, the knowledge graph must be stored in a triple store, which highlights the role of that component in the architecture. In turn, the triple store provides another important architectural component, the SPARQL endpoint. That will be the main access gateway to the knowledge graph, particularly for other technological components and software.
The Large Language Model foreseen in this project will be trained on the knowledge graph, thus forming the basis for the Chatbot component of the user interface. The knowledge graph will further feed the facilities for machine-based access to the SWR: a knowledge extraction API and a SPARQL end-point.
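To illustrate machine-based access through the SPARQL end-point, the sketch below retrieves dataset titles from the knowledge graph; it assumes the Python SPARQLWrapper package and a hypothetical endpoint URL.

```python
# Minimal sketch of querying the SWR knowledge graph via its SPARQL endpoint.
# Assumes the SPARQLWrapper package; the endpoint URL is hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.com/sparql")
sparql.setQuery("""
    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    PREFIX dct:  <http://purl.org/dc/terms/>
    SELECT ?dataset ?title WHERE {
        ?dataset a dcat:Dataset ;
                 dct:title ?title .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["dataset"]["value"], "-", binding["title"]["value"])
```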
Technology
- DCAT, Dublin Core, VCard, PROV, GloSIS, see chapter Semantic Web specifications for metadata and Data model
Storage of data
Processed data
- Data that changes often (due to continuously ingested data feeds) is best stored in a database.
- Snapshots of data feeds or data processing results are best stored as files on a repository or bucket, and the file location (in combination with an identification proxy, like DOI) provides a unique identification of the dataset.
- API access to larger datasets best uses a scalable database or files in a cloud-native (scalable) format. Data is exported to such formats before exposure via APIs (from git, triple stores, files, etc.). In some cases, a search engine is the most relevant API backend.
High-value data
- Full dataset download or Single band data (access by bbox, not by property) is best stored as files on a scalable file infrastructure using cloud native formats, where the file location provides the identification.
- Data that is frequently filtered or aggregated on attribute value is best stored on a relational database or search engine.
Temporary store for uploaded data
Temporary data storage may be necessary as a caching mechanism to achieve acceptable performance (e.g. response time and throughput), e.g. for derived and harmonised datasets. For any data that is stored temporarily, there shall be a flag that indicates how long it remains valid before it shall be cleaned up. The monitoring system shall check whether any such flags are present that should already have been cleaned up.
Technology
- PostgreSQL is a common open-source database platform with spatial support. A database dump of a Postgres database, as a backup or to publish FAIR versions at intervals, is not very user-friendly; a conversion to SQLite/GeoPackage (using GDAL/Hale) facilitates this case.
- The most popular search engine is Elasticsearch (also used by JRC in INSPIRE), but it has some licensing challenges. An alternative is SOLR.
- File repositories range from Amazon/Google to a local NFS with WebDAV access.
- Graph database: Neo4J. Triple stores: Jena Fuseki (Java) and Virtuoso (C), both of which have spatial support.
- Git is the most used versioning system these days, with the option to go for SaaS (GitHub, Bitbucket) or on-premise (GitLab). GitHub seems the most suitable option, as other groups such as OGC and INSPIRE are already there, which means users already have an account, and we can cross-link issues between projects.
Backup and versioning
For any data, there shall be at least two levels of backups. Volume snapshots shall be the preferred mode of backups. These volume snapshots should be stored in a different location and should enable fast recovery (i.e. in less than 4 hours during business hours) even if the location where the SWR is operated is entirely unavailable. These volume snapshots should be configured in such a way that at no point in time more than 1 hour of new or changed data would be lost. Volume backups should be retained for 30 days.
A second level of backups can be more granular, e.g., storing all data and metadata assets, as well as configuration and system data as encrypted files in an object store such as AWS S3. This type of backup allows for a more specific or partial recovery for cases where data integrity was damaged, where there was a partial data loss or another incident which does not necessitate restoring the system. This could also include explicit backups (dumps) of the database systems that are part of the SWR. It is tolerable for these backups to be updated once per day.
If there is data that requires full versioning or historisation, it is recommended to store it in a version control system.
Finally, there should be a restore exercise at least once per year, where a fresh system is set up from both types of backups.
Metadata Catalogue
The metadata catalogue is a central piece of the architecture, collecting and giving access to individual metadata records. In the geo-spatial domain, effective metadata catalogues are developed around the standards issued by the OGC: the Catalogue Service for the Web (CSW) and OGC API - Records.
Besides this essential compliance with international standards, metadata catalogues usually provide other important management functionalities: (i) metadata record editing, (ii) access control, (iii) records search, (iv) resource preview, (v) records harvesting, etc. More sophisticated metadata catalogues approach the functionalities of a Content Management System (CMS). The remainder of this section reviews two popular open-source geo-spatial metadata catalogues: GeoNetwork and pycsw.
GeoNetwork
This web-based software is centred on metadata management, providing rich editing forms. The editor supports the ISO19115/119/110 standards used for spatial resources, as well as Dublin Core. Users can upload data, graphics, documents, PDF files and any other content type to augment metadata records. Among others, GeoNetwork supports:
- multilingual metadata record editing,
- validation system,
- automated suggestions for quality improvement,
- publication of geo-spatial layers to software compliant with OGC services (e.g. GeoServer).
GeoNetwork implements the following protocols:
- OGC CSW
- OAI-PMH
- OpenSearch
- Z39.50
The metadata harvesting feature is quite broad, able to interact with the following resources:
- OGC-CSW 2.0.2 ISO Profile
- OAI-PMH
- Z39.50 protocols
- Thredds
- Webdav
- Web Accessible Folders
- ESRI GeoPortal
- Other GeoNetwork node
Besides the core metadata management functions, GeoNetwork also provides useful monitoring and reporting tools. It is able to easily synthesise the content of the catalogue with statistics and graphics. A system status is also available to the system administrator.
Use cases
The GeoNetwork project started out in 2001 as a Spatial Data Catalogue System for the Food and Agriculture Organization of the United Nations (FAO), the United Nations World Food Programme (WFP) and the United Nations Environment Programme (UNEP). Other relevant projects and institutions using GeoNetwork include:
- European Marine Observation and Data Network (EMODnet)
- Federal Geographic Data Committee of the United States (FGDC)
- Scotland’s catalogue of spatial data
- National Centers for Coastal Ocean Science of the United States (NCCOS)
pycsw
pycsw is a catalogue component offering an HTML frontend and query interface using various standardised catalogue APIs to serve multiple communities. Pycsw, written in python, allows for the publishing and discovery of geospatial metadata via numerous APIs (CSW 2/CSW 3, OpenSearch, OAI-PMH, SRU), providing a standards-based metadata and catalogue component of spatial data infrastructures. pycsw is Open Source, released under an MIT license, and runs on all major platforms (Windows, Linux, Mac OS X).
- Technology: python
- License: MIT
- OSGeo project
Functionality
From the SoilWise perspective, the functionality of pycsw is identical to that of GeoNetwork:
- query metadata
- M: filter by (configurable set of) properties (AND/OR/NOT, FullTextSearch, by geography)
- M: Sorting and pagination
- S: aggregate results (faceted search)
- W: customise ranking of the results
- OGC:CSW, OGCAPI:Records, OAI-PMH
- Search engine discoverability / Schema.org
- Link to data download / data preview
Use cases
pycsw is a core component of GeoNode and is the core of the CKAN spatial extension, used for example by FAO. pycsw is used in various projects:
In preparation:
- Soils for Africa
Metadata validation
Important Links
Project: Metadata validation
In terms of metadata, the SoilWise Repository aims to harvest and register as much as possible (see more information under the Harvester component). Catalogues which capture metadata authored by data custodians typically show a wide range of metadata completeness and accuracy. Therefore, the SoilWise Repository aims to employ metadata validation mechanisms that provide additional information about metadata completeness, conformance and integrity. Information resulting from the validation process is intended to be stored together with each metadata record in a relational database and updated after a new metadata version is registered. After metadata processing and extension (see the Interlinker component), this validation process can be repeated to understand the value added by SWR.
The metadata validation component comprises the following functions:
In the next iterations, SoilWise will explore the utilization of on-demand metadata validation, which would generate reports for user-uploaded metadata.
Metadata structure validation
The initial steps of metadata validation, foreseen so far, comprise:
- Markup (Syntax) Check: Verifying that the metadata adheres to the specified syntax rules. This includes checking for allowed tags, correct data types, character encoding, and adherence to naming conventions.
- Schema (DTD) Validation: Ensuring that the metadata conforms to the defined schema or metadata model. This involves verifying that all required elements are present, and relationships between different metadata components are correctly established.
Metadata completeness validation
Completeness of records can be evaluated by checking whether a record:
- contains the required and/or advised metadata elements of SWR
- contains the required elements endorsed by the adopted metadata standard itself
Completeness according to SWR and completeness according to the adopted model result in quality indicators for a resource description.
Methodology
Various technologies use dedicated mechanisms to validate inputs on type matching and completeness:
- XML (Dublin core, iso19115, Datacite) validation - XSD schema, potentially extended with Schematron rules
- json (OGC API - Records/STAC) - json schema
- RDF (schema.org, dcat) - SHACL
We will explore the applicability of ISO19157 Geographic Information – Data quality (i.e. the standard intended for data validation) for metadata-based validation reports.
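For the RDF/SHACL case above, a minimal validation sketch is shown below; it assumes the Python rdflib and pyshacl packages, and the file names for the metadata record and the SWR shapes are hypothetical.

```python
# Minimal sketch of SHACL validation of an RDF metadata record using pyshacl.
# Assumes the rdflib and pyshacl packages; file names are hypothetical.
from rdflib import Graph
from pyshacl import validate

data_graph = Graph().parse("record.ttl")        # harvested metadata record as RDF
shapes_graph = Graph().parse("swr-shapes.ttl")  # SWR completeness/conformance shapes

conforms, _, results_text = validate(data_graph, shacl_graph=shapes_graph, inference="rdfs")

print("conforms:", conforms)
print(results_text)                             # human-readable validation report
```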
Metadata ETS/ATS checking
Abstract Test Suites (ATS) define a set of abstract test cases or scenarios that describe the expected behaviour of metadata without specifying the implementation details. These test suites focus on the logical aspects of metadata validation and provide a high-level view of metadata validation requirements, enabling stakeholders to understand validation objectives and constraints without getting bogged down in technical details. They serve as a valuable communication and documentation tool, facilitating collaboration between metadata producers, consumers, and validators. ATS are often documented using natural language descriptions, diagrams, or formal specifications. They outline the expected inputs, outputs, and behaviours of the metadata under various conditions.
Example: INSPIRE ATS for Soil (see Annex A)
Executable Test Suites (ETS) are sets of tests designed according to ATS to perform the metadata validation. These tests are typically automated and can be run repeatedly to ensure consistent validation results. Executable test suites consist of scripts, programs, or software tools that perform various validation checks on metadata. These checks can include:
- Data Integrity: Checking for inconsistencies or errors within the metadata. This includes identifying missing values, conflicting information, or data that does not align with predefined constraints.
- Standard Compliance: Assessing whether the metadata complies with relevant industry standards, such as Dublin Core, MARC, or specific domain standards like those for scientific data or library cataloguing.
- Interoperability: Evaluating the metadata's ability to interoperate with other systems or datasets. This involves ensuring that metadata elements are mapped correctly to facilitate data exchange and integration across different platforms.
- Versioning and Evolution: Considering the evolution of metadata over time and ensuring that the validation process accommodates versioning requirements. This may involve tracking changes, backward compatibility, and migration strategies.
- Quality Assurance: Assessing the overall quality of the metadata, including its accuracy, consistency, completeness, and relevance to the underlying data or information resources.
- Documentation: Documenting the validation process itself, including any errors encountered, corrective actions taken, and recommendations for improving metadata quality in the future.
Example: INSPIRE ETS validator
Open issues
- ETS for GloSIS do not yet exist and need to be configured
SHACL
SHACL is in general intended for Semantic Web related validations; however, its exact scope will be determined during the SoilWise developments.
Technology
Transformation and Harmonisation Components
These components make sure that data is interoperable, i.e. provided in agreed-upon formats, structures and semantics. They are used to ingest data and transform it into a common standard, e.g. the central SWR format for soil health.
The specific requirements these components have to fulfil are:
- The services shall be able to work with data that is described explicitly or implicitly with a schema. The services shall be able to load schemas expressed as XML Schemas, GML Application Schemas, RDF-S and JSON Schema.
- The services shall support GML, GeoPackage, GeoJSON, CSV, RDF and XSL formats for data sources.
- The services shall be able to connect with external download services such as WFS or OGC API, Features.
- The services shall be able to write out data in GML, GeoPackage, GeoJSON, CSV, RDF and XSL formats.
- There shall be an option to read and write data from relational databases.
- The services should be exposed as OGC API Processes
- Transformation processes shall include the following capabilities:
- Rename types & attributes
- Convert between units of measurement
- Restructure data, e.g. through joining, merging, splitting
- Map codelists and other coded values
- Harmonise observations as if they were measured using a common procedure, using pedotransfer functions (PTF).
- Reproject data
- Change data from one format to another
- There should be an interactive editor to create the specific transformation processes required for the SWR.
- It should be possible to share transformation processes.
- Transformation processes should be fully documented or self-documented.
Implementation Technologies
We plan to deploy the needed capabilities to the SWR using two technologies:
- GDAL is a very robust conversion library used in most FOSS and commercial GIS software. It provides a wealth of format conversions and can handle reprojection. In cases where no structural or semantic transformation is needed, a GDAL-based conversion service would make sense.
- hale studio is a proven ETL tool optimised for working with complex structured data, such as XML, relational databases, or a wide range of tabular formats. It supports all required procedures for semantic and structural transformation. It can also handle reprojection. While Hale Studio exists as a multi-platform interactive application, its capabilities can be provided through a web service with an OpenAPI.
In some cases, the two services may be chained in a single workflow.
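As an example of the GDAL-based conversion case, the sketch below converts a GeoJSON source to GeoPackage with reprojection; it assumes the GDAL Python bindings, and the file names and target CRS are placeholders.

```python
# Minimal sketch of a GDAL-based format conversion with reprojection.
# Assumes the GDAL Python bindings; file names and target CRS are examples only.
from osgeo import gdal

gdal.UseExceptions()

gdal.VectorTranslate(
    "soil_observations.gpkg",        # destination (GeoPackage)
    "soil_observations.geojson",     # source (GeoJSON)
    format="GPKG",
    dstSRS="EPSG:4326",
    reproject=True,
)
```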
Interlinker
Important Links
Project: Metadata ingestion
The Interlinker component comprises the following functions:
Automatic metadata interlinking
To be able to provide interlinked data and knowledge assets (e.g. a dataset, the project in which it was generated and the operating procedure used), links between metadata must be identified and registered, ideally as part of the SWR Triple Store.
- Explicit links can be directly derived from the data and/or metadata. E.g. projects in CORDIS are explicitly linked to documents and datasets. For those linkages, the harvesting process needs to be extended, calling this component to store the relation in the knowledge graph. It should accommodate "vice versa linkage" (if resource A links to B, a vice versa link can be added to B).
- Implicit links can not be directly derived from the (meta)data. They may be derived by spatial or temporal extent, keyword usage, or shared author/publisher. In this case, AI/ML can support the discovery of potential links, including some kind of probability indicator.
Duplicates identification
In the context of Persistent Identifiers (PIDs), duplication refers to the occurrence of multiple identifiers pointing to the same digital object or resource. As SWR will be ingesting datafiles from multiple data sources, this is an aspect that has to be taken into account.
We have no knowledge of existing technologies we can integrate as a component of the platform. This functionality will be set up within the platform. The methodology applied to identify duplicates will be to compare multiple (meta)data attributes like File Name, File Size, File Type, Owner, Description, Date Created/Modified. Natural Language Processing techniques like bag-of-words or word/sentence embedding algorithms can be used to convert textual attributes into vectors, capturing semantic similarity and relationships between words. Each datafile will be characterised by its attributes and represented in a continuous vector space together with the other datafiles. Similarity algorithms (e.g. cosine similarity, Euclidean distance, etc.) are then applied to identify datafiles with a similarity above a certain threshold, which are then considered duplicates. If necessary, a business rule will be integrated, taking the "completeness" of the datafile into account so as to be able to determine which PID and datafile to keep and which to discard.
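A minimal sketch of this attribute-based similarity approach is shown below. It uses TF-IDF vectors and cosine similarity via scikit-learn as a simple stand-in for the embedding algorithms mentioned above; the records and threshold are illustrative only.

```python
# Minimal sketch of duplicate detection on textual metadata attributes.
# TF-IDF + cosine similarity via scikit-learn as a stand-in for word/sentence embeddings;
# records and threshold are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    "Soil moisture observations Flanders 2020 CSV soil-mission",
    "Flanders soil moisture dataset 2020 (CSV), soil mission project",
    "Land cover map of Austria 2018 GeoTIFF",
]

vectors = TfidfVectorizer().fit_transform(records)
similarity = cosine_similarity(vectors)

THRESHOLD = 0.6  # illustrative value; to be tuned on real metadata
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if similarity[i, j] >= THRESHOLD:
            print(f"records {i} and {j} are likely duplicates ({similarity[i, j]:.2f})")
```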
Technology
This process can be automated in the platform using automated (Python) scripts running within the platform's data processing environment. A second approach is to use data processing functionalities and AI algorithms integrated into a database, e.g. the Neo4J Graph Database and Neo4J Graph Data Science Similarity algorithms (Node Similarity, K-Nearest Neighbours, ...). This requires the data to exist in the graph database as linked data, either importing from the SWR knowledge graphs or using such a graph database technology (e.g. Neo4J) as the SWR knowledge graph technology.
- two-level inspection (coarse = dataset level, fine = objects/attributes? level)
- read existing data in terms of size and identical identifiers (data and metadata level)
- identify duplicate values
Link liveliness assessment
Persistent content is considered to be stored in a trustworthy, persistent repository. We expect those repositories to store the asset in compliance with the applicable legally and scientifically required terms and periods for storage of the content, and to use a DOI or another persistent URI for persistent identification. These can be safely referred to from the SoilWise catalogue. For long-term preservation and availability of data and knowledge assets, SWR relies on the repository holders and their responsibility to keep them available.
Non-persistent data and knowledge are the ones that are not guaranteed to persist by the repository or data and knowledge holder and/or do not guarantee a persistent URI for reference for at least 10 years. In practice, many non-persistent knowledge sources and assets exist that could be relevant for SWR, e.g. on project websites, in online databases, on the computers of researchers, etc. Due to their heterogeneity in structure and underlying implementing technologies, etc., it is not possible nor desirable to store those in the SWR, with the exception of high-value data/knowledge assets.
Foreseen functionality
- Assess if resources use proper identifiers to reference external items. Metadata (and data and knowledge sources) tend to contain links which, over time, degrade and result in `File not found` experiences.
- By running availability checks on links mentioned in (meta)data, an availability indicator (available, requires authentication, intermittent, unavailable) can be calculated for each link.
- Alternatively, an availability check can be performed at the moment a user tries to open a resource.
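A minimal sketch of such an availability check is shown below, assuming the Python requests package; the classification rules and example URLs are illustrative only.

```python
# Minimal sketch of a link liveliness check for URLs found in (meta)data.
# Assumes the requests package; classification rules and URLs are illustrative.
import requests

def check_link(url: str) -> str:
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        if response.status_code == 200:
            return "available"
        if response.status_code in (401, 403):
            return "requires authentication"
        return "unavailable"
    except requests.RequestException:
        return "unavailable"

for url in ["https://example.com/dataset.csv", "https://example.com/missing"]:
    print(url, "->", check_link(url))
```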
Technology
Providers of Identifiers:
- ePIC ePIC API providing a software stack for a PID service
- DOI
- w3id.org persistent identification at namespace/domain level
- R3gistry Germany for namespaces, codelists and identifiers. Similar registries exist for Austria, Italy, Spain, Slovakia and the Netherlands
Liveliness checks:
- GeoHealthCheck, a library which checks at intervals the availability of OGC APIs up to collection/item level; it should be extended to drill down from a CSW endpoint to record level and check links in individual records
- Geocat has developed a linkage checker for iso19139:2007 metadata for INSPIRE geoportal, available at icat, which includes link checks in OWS capabilities.
- Python link checker checks (broken) links in html
- ...
NLP and Large Language Models
LLMs (and less complex Natural Language Processing (NLP) approaches) can be used to perform metadata optimisation tasks (e.g. identify similarities, resolve conflicts, populate gaps, classify or summarize resources).
An LLM can also power a chatbot interface in which a user asks the bot what type of resources they are looking for, and the bot suggests options that can lead to improved search results (finding more relevant resources).
Precondition
- Prompt engineering and Retrieval-Augmented Generation (RAG) are approaches for preparing text to be used as input (prompt) for a generative AI model. These techniques help to tune the usually very generic foundational LLMs to generate more specific responses with less change of halucinations. RAG, in particular, should run post harvest, but pre inclusion into the knowledge graph (to prevent the full knowledge graph is analysed at every insert).
- Embeddings are numerical (vector) representations of words, phrases, or larger text fragments (or even images) and have become a key component of text analysis. Small embeddings can be calculated on the fly, but larger ones (capturing more complex semantic or linguistic characteristics), as used in RAG, take time to compute and are thus best stored. Vector databases are being developed specifically for this purpose, allowing fast processing and comparison of embeddings. No preferred vector database can be selected currently, as they are under active development; we will experiment with a number of them and select the best suited.
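A minimal sketch of the embedding-based similarity comparison described above, assuming the sentence-transformers package; the model choice and example texts are illustrative only, not a SoilWise decision:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # small, general-purpose model

texts = [
    "Topsoil organic carbon content for EU member states, 2015 survey.",
    "Soil organic carbon in the topsoil of EU countries, LUCAS 2015.",
]
emb = model.encode(texts, normalize_embeddings=True)

similarity = float(np.dot(emb[0], emb[1]))   # cosine similarity (vectors are normalised)
print(f"cosine similarity: {similarity:.2f}")
# A very high score suggests the two records may describe the same resource;
# a moderately high score could justify a "similar resource" link instead.
```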
Metadata optimization
A component which uses NLP/LLM to improve metadata:
- identify similarities
- very high similarity: an indication that the record (despite the different identifier) is likely the same resource
- high similarity: suggest it as a "similar" resource (linkage)
- resolve conflicts
- if two records contain conflicting statements about a resource, try to derive from context which statement is correct
- populate gaps
- if important properties are not populated (contact, title), try to derive it from context (with e.g. Named-Entity Recognition)
- classify resources (add thematic keywords/tags)
- Based on context, understand which thematic keywords/tags are relevant (soil threats, soil functions, soil health indicators). Keywords/tags should be related to a provided codelist, or can be suggested as potential new entries to be added.
- summarize resources
- If a record lacks an abstract or has a too short abstract, ask an LLM to derive an abstract from the resource itself (see the sketch after this list)
- derive spatial or temporal extent from content
- if no spatio-temporal extent is given, derive it from the resource itself or from context if possible
For each AI-derived property, indicate that it has been derived by AI. (It still needs to be discussed how this can be indicated, e.g. with attributes/relations in the knowledge graph.)
- Translate the Title, Abstract elements into English, French and German
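The abstract-derivation task above can be sketched as follows, assuming the openai Python client; the model name, prompt and record structure are illustrative assumptions, not a SoilWise decision, and any value written back should be flagged as AI-derived:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def derive_abstract(resource_text: str) -> str:
    """Ask an LLM to propose a metadata abstract for a resource lacking one."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[
            {"role": "system",
             "content": "Summarise the following soil data resource in 2-3 sentences "
                        "suitable as a metadata abstract."},
            {"role": "user", "content": resource_text[:8000]},
        ],
    )
    return response.choices[0].message.content

# abstract = derive_abstract(full_text_of_resource)
# record["identification"]["abstract"] = abstract   # and mark the value as AI-derived
```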
Empower a chatbot for user support in defining (and answering) a relevant catalogue question
A chatbot is a natural language user interface to engage users in identifying what they are looking for and even provide a suggestion for an answer. Advanced LLMs provide improved text processing capabilities that can serve more usable human-machine interfaces.
Map Server
MapServer is an Open Source platform for publishing spatial data to the web using standardised APIs defined by Open Geospatial Consortium, such as WMS, WFS, WCS, OGC API-Features. Originally developed in the mid-1990s at the University of Minnesota, MapServer is released under an MIT-style license and runs on all major platforms (Windows, Linux, Mac OS X). MapServer is not a full-featured GIS system, nor does it aspire to be.
Technology
A Docker image for MapServer is maintained by Camptocamp. An important aspect here is that the image uses a minimal build of GDAL, which determines the source formats consumable by MapServer (in line with the section Transformation and Harmonisation Components). If formats such as GeoParquet or GeoZarr are relevant, building a tailored image may be necessary.
The configuration of the MapServer is managed via a config file. The config files reference metadata, data and styling rules. Various tools exist to create MapServer config files:
- GeoCat Bridge is a QGIS plugin to create mapfiles from QGIS projects
- mappyfile is a Python library to generate mapfiles by code (see the sketch below this list)
- MapServer Studio is a SaaS solution to edit mapfiles
- mapscript is a Python library to interact with the MapServer binary
- pygeodatacrawler is a tool by ISRIC generating mapfiles from various resources
- VS Code mapfile plugin
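A minimal sketch of generating and editing a mapfile by code with mappyfile; the layer name, data paths and extent are hypothetical and not a SoilWise configuration:

```python
import mappyfile

mapfile = mappyfile.loads("""
MAP
  NAME "swr-example"
  EXTENT -10 35 30 70
  WEB
    METADATA
      "wms_title" "SoilWise example WMS"
      "wms_enable_request" "*"
    END
  END
  LAYER
    NAME "soil_regions"
    TYPE POLYGON
    STATUS ON
    DATA "data/soil_regions.shp"
  END
END
""")

# tweak a layer programmatically, e.g. point it at a different (hypothetical) source
layer = mappyfile.find(mapfile["layers"], "name", "soil_regions")
layer["data"] = "data/soil_regions_2024.shp"

print(mappyfile.dumps(mapfile))   # serialise back to mapfile syntax
```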
Read more about MapServer at EJPSoil wiki.
Alternatives to Mapserver are:
- Geoserver
- QGIS server
- pygeoapi (uses MapServer internally to provide map rendering)
User Interface: Dashboard
Important Links
Project: Dashboard
The term dashboard is used with various meanings, in the scope of Soilwise the following uses are relevant:
- A search interface on metadata, search results are typically displayed in a paginated set of web pages. But alternatives, such as a map or chatbot, could be interesting.
- A set of diagrams showing an overview of the contents and usage of the catalogue; for example, a diagram of the percentage of records per topic category, or the number of visits per EU Member State.
Other parts of the dashboard are:
- Metadata authoring and harvest configuration components
- Data download & export options
The SoilWise Dashboard is intended to support the implementation of User stories, deliver useful and usable apps for various stakeholders, provide an interface for user testing, and present data and knowledge in a usable way.
Search interface on metadata
A typical example of a catalogue search interface is the current ESDAC catalogue.
Ranking, relations, full-text search, and filtering
Optimal search capabilities are provided by the catalogue backend; this component leverages them to give users an effective interface for making full use of those capabilities.
The EJPSoil prototype, foreseen to be re-used within SoilWise, uses a tailored frontend, focusing on:
- Paginated search results, sorted alphabetically or by date
- Minimalistic User Interface, to prevent a technical feel
- Option to filter by facets
- Option to provide feedback to publisher/author
- Readable link in the browser bar, to facilitate link sharing
- Preview of the dataset (if an OGC service is available), otherwise display of its spatial extent
What can be improved:
- Show relations to other records
- Better distinguish link types; service/api, download, records, documentation, etc.
- Indication of remote link availability/conformance
- If a record originates from (multiple) catalogues, add a link to the origin
- Ranking (backend)
Technology
- Jinja2 templates (HTML, CSS) as a tailored skin on pycsw/pygeoapi, or a dedicated frontend (Vue.js, React); see the sketch below
- Leaflet/OpenLayers/MapLibre
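A minimal sketch of such a Jinja2-based results page; the record fields and markup are illustrative, not the EJP Soil frontend:

```python
from jinja2 import Template

template = Template("""
<ul class="results">
{% for rec in records %}
  <li>
    <a href="/records/{{ rec.identifier }}">{{ rec.title }}</a>
    <p>{{ rec.abstract | truncate(160) }}</p>
  </li>
{% endfor %}
</ul>
<p>Page {{ page }} of {{ pages }}</p>
""")

# hypothetical search response from the catalogue backend
records = [
    {"identifier": "abc-123", "title": "Example soil dataset",
     "abstract": "A short description of the dataset ..."},
]
print(template.render(records=records, page=1, pages=1))
```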
Chatbot
Large Language Models (LLMs) enriched with SoilWise content can offer an alternative interface to assist the user in finding and accessing the relevant knowledge or data source. Users interact with the chatbot to iteratively refine the relevant question and have it answered. The LLM will provide an answer, but also references to the sources on which the answer was based, from which the user can extend the search. The LLM can also support the user in gaining access to the source, for example by suggesting which software to use.
Map viewer
A light-weight client map viewer component will be employed:
- as a frontend of Map Server component to visualize provided WMS, WFS, WCS layers
- as an integrated part of the Catalogue to visualize primarily the geographical extent of data described in the metadata record and a snapshot visualization of the data
- to provide the full preview of data, when a link to web service or data browse graphics (preview image) is available
A dedicated map viewer, such as TerriaJS, can support users in accessing relevant data which has been prepared for optimal consumption in a typical GIS desktop environment, for example maps of the spatial distribution of soil properties or health indicators over Europe. A typical example is Soilgrids.
An interesting aspect of a community like EUSO is the ability to prepare and share a map with stakeholders to trigger some discussion on phenomena at a location.
The need to view novel formats such as vector tiles, COG, GeoZarr and GeoParquet directly on a map background should be examined. The benefit of these formats is that no (OGC) service is required to facilitate data visualisation.
Technology
TerriaJS is an environment to share maps (React + Leaflet + Cesium), but also to create maps and share them with stakeholders.
Overview of catalogue content
Traditional dashboards
The INSPIRE Geoportal increased its usage with its new dashboard-like interface: for each EU Member State, the number of published datasets per topic is shown upfront in the application. Dashboards on catalogue content provide mechanisms to generate overviews of that content to provide such insight.
Technology
The EJP Soil Catalogue Dashboard is based on Apache Superset, with direct access to the PostgreSQL database containing the catalogue records. GeoNode recently implemented Superset, enabling users to define their diagrams on relevant data from the catalogue (as an alternative to a map viewer).
Dashboarding functionality is available in GeoNetwork, using the Elasticsearch/Kibana dashboarding. Whether similar functionality can be provided for pycsw still needs to be investigated and verified.
The source data for the dashboards is very likely enriched with AI-generated indicators. LLMs also seem able to provide overviews of sets of content.
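As an illustration of the kind of overview such a dashboard presents, a minimal sketch computing the share of records per topic category, assuming a PostgreSQL catalogue backend; the connection string, table and column names are hypothetical:

```python
import psycopg2

conn = psycopg2.connect("dbname=catalogue user=swr")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT topic_category, COUNT(*) AS n
        FROM records
        GROUP BY topic_category
        ORDER BY n DESC
    """)
    rows = cur.fetchall()

total = sum(n for _, n in rows) or 1
for topic, n in rows:
    print(f"{topic}: {100 * n / total:.1f}%")   # percentage of records per topic category
```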
Manual data & metadata authoring
Important Links
Project: Metadata authoring
The SWR provides two major ways of data & metadata authoring
- in an automatized manner, as described in the Harvester component;
- in a manual mode, as described within this Manual data & metadata authoring component.
Note that option (1) is the preferred one from the SWR point of view, as it allows tackling at scale the metadata and knowledge of remotely available resources, including catalogue servers of Mission Soil projects, Zenodo, Cordis, the INSPIRE Geoportal and others.
The manual mode comprises four levels of data and metadata upload (note that options 1 and 3 below work for both data and metadata, while options 2 and 4 work solely for metadata):
- Manual upload of one data file and/or metadata record as a file: a user selects a file from the local drive or types in a publicly available URL and defines the (meta)data structure (for more details, see Data model component) and optionally other relevant information like target user group, open/restrict publication of this new data/metadata record etc. The Manual data & metadata upload component imports the file, assigns a UUID if needed, and stores the data in Zenodo, while metadata are stored in the Storage component.
- Manual upload of one metadata record as source code: the functionality of the Manual data & metadata upload component is almost identical to the previous option (file upload). The only difference is that the user pastes the record's source code rather than uploading a file.
- Manual batch upload: this option allows a user to import a set of data files and/or metadata records in the form of e.g. XML or JSON files. The following parameters need to be defined:
- directory, full path on the server’s file system of the directory to scan and try to import all the available files;
- file type, e.g. XML, JSON (note that the full list of supported file types still needs to be elaborated);
- Manual connection to a Web service and semi-automatic extraction of available metadata: a user types in a publicly available URL pointing to a service metadata document, e.g. GetCapabilities response of OGC Web service. The Manual data & metadata upload component extracts metadata that can be copied in line with the desired metadata structure, i.e. a metadata profile. The Manual data & metadata upload component assigns a UUID if needed, and stores the metadata in the Storage component. Most likely, the metadata extracted from a service metadata document (e.g. GetCapabilities) would not be sufficient to address all the mandatory metadata elements defined by the desired metadata structure, i.e. a metadata profile. In such a case, a manual fill in would be needed
The diagram below provides an overview of the metadata authoring workflow:
```mermaid
flowchart LR
    G[fa:fa-code-compare Git] -->|mcf| CI{{pygeometa}}
    CI -->|iso19139| DB[(fa:fa-database Database)]
    DB --> C(Catalogue)
    C --> G
```
The authoring workflow uses a Git backend; additions to the catalogue are entered by members of the Git repository directly or via pull request (review). Records are stored in iso19139:2007 XML or MCF. MCF is a subset of iso19139:2007 in a YAML encoding, defined by the pygeometa community. This library is used to convert the MCF to any requested metadata format.
A web-based form is available for users uncomfortable with editing an MCF file directly.
With every change to the Git repository, an export of the metadata is made to a PostgreSQL database (or the triple store).
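The conversion step in this workflow can be illustrated with pygeometa; a minimal sketch (the file names are placeholders) rendering an MCF record to iso19139 XML:

```python
from pygeometa.core import read_mcf
from pygeometa.schemas.iso19139 import ISO19139OutputSchema

mcf_dict = read_mcf("records/example-dataset.yml")   # MCF is YAML

# render the MCF to an ISO 19139 XML document, ready for export to the database/catalogue
iso_xml = ISO19139OutputSchema().write(mcf_dict)

with open("example-dataset.xml", "w") as f:
    f.write(iso_xml)
```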
Foreseen functionality
- GUI and backend for online form
- validation of inserted values
- storing inserted metadata record
Technology
- pycsw and GeoNetwork include capabilities to harvest remote sources, but do not include a dashboard to create and monitor harvesters
- A Git based participatory metadata workflow as introduced in EJP Soil and foreseen for SoilWise as a follow-up
- Users should be encouraged to register their metadata via known repositories, such as Zenodo, CORDIS, INSPIRE, ...; at most, the identifier (DOI, CSW) of the record is registered at EUSO, and metadata will be mirrored from those locations at intervals
- Data can be maintained in a Git repository, such as the EJP Soil repository, preferably using a readable serialisation such as YAML
- In EJP Soil we experiment with the metadata control file format (MCF), a subset of iso19139
- A web editor for MCF is available at osgeo.github.io
- Users can also submit metadata using a CSV (Excel) format, which is converted to MCF in a CI/CD pipeline
- A web based frontend can be developed to guide the users in submitting records (generate a pull request in their name)
- Validation of inserted values
- A CI-CD script which runs as part of a pull request triggers a validation and rejects (or optimises) a record if it does not match our quality criteria
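A minimal sketch of such a pull-request validation step; the required fields and file name are assumptions for illustration (pygeometa's own validation could be used instead):

```python
import sys
import yaml

# hypothetical minimal quality criteria: these (section, field) pairs must be populated
REQUIRED = [("metadata", "identifier"), ("identification", "title"),
            ("identification", "abstract"), ("contact", "pointOfContact")]

with open("record.yml") as f:          # the MCF record changed in the pull request
    mcf = yaml.safe_load(f)

missing = [".".join(path) for path in REQUIRED
           if not mcf.get(path[0], {}).get(path[1])]

if missing:
    print(f"Record rejected, missing fields: {', '.join(missing)}")
    sys.exit(1)   # non-zero exit fails the CI job and thus the pull request check
print("Record passes the minimal quality check")
```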
Integration opportunities
The Manual data & metadata upload component will show its full potential when tightly connected within the SWR to (1) the SWR Catalogue, (2) metadata validation, and (3) authentication and authorisation.
Open issues
The Manual data & metadata upload component shall be technologically aligned with the SWR Catalogue and Harvester. Both considered software solutions, i.e. GeoNetwork and pycsw, support the core desired functionality of all three SWR components.
The above-described mechanisms showed the "as is" manual metadata upload. Nevertheless, it is foreseen that the SWR will also support "on-the-fly transformation towards interoperability". Such functionality is currently under discussion. The desired functionality aims to assist data producers and publishers with a pipeline that allows them to map their source data & metadata structures into target interoperable data & metadata structures. An example is the upload of interpreted soil data and its on-the-fly transformation into a structure defined by the INSPIRE Directive, or the application schema from the data specification on Soil, respectively. A data & metadata upload pipeline supporting "on-the-fly transformation towards interoperability" will be described in greater detail later, in line with SoilWise developments.
Data download & export
A UI component could be made available as part of the SWR Catalogue application which facilitates access to subsets of data from a data download or API. A user selects the relevant feature type/band, defines a spatial or attribute filter and selects an output format (or harmonised model). The component will process the data and notify the user when the result is available for download. The API-based data publication is described as part of APIs.
User Management and Access Control
User and organisation management, authorisation and authentication are complex, cross-cutting aspects of a system such as the SoilWise repository. Back-end and front-end components need to perform access control for authenticated users. Many organisations already have infrastructures in place, such as an Active Directory or a Single Sign On based on OAuth.
The general model we apply is that:
- a user shall be a member of at least one organisation.
- a user may have at least one role in every organisation that they are a member of.
- a user always acts in the context of one of their roles in one organisation (similar to Github contexts).
- organisations can be hierarchical, and user roles may be inherited from an organisation that is higher up in the hierarchy.
The basic requirements for the SWR authentication mechanisms are:
- User authentication, and thus, provision of authentication tokens, shall be distributed ("Identity Brokering") and may happen through existing services. Authentication mechanisms that are to be supported include OAuth, SAML 2.0 and Active Directory.
- An authoritative Identity Provider, such as an eIDAS-based one, should be integrated as well.
- There shall be a central service that performs role and organisation mapping for authenticated users. This service also provides the ability to configure roles and set up organisations and users. This central service can also provide simple, direct user authentication (username/password-based) for those users who do not have their own authentication infrastructure.
- There may be different levels of trust establishment based on the specific authentication service used. Higher levels of trust may be required to access critical data or infrastructure.
- SWR services shall use Keycloak or JSON Web Tokens for authorization.
- To access SWR APIs, the same rules apply as to access the SWR through the UI.
In later iterations, the authentication and authorisation mechanisms should also be used to facilitate connector-based access to data space resources.
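As an illustration of the token-based authorisation above, a minimal sketch assuming PyJWT and a Keycloak-style OpenID Connect realm; the realm URL, audience and claim layout are placeholders, not the SWR configuration:

```python
import jwt
from jwt import PyJWKClient

# hypothetical Keycloak realm; the JWKS endpoint publishes the signing keys
JWKS_URL = "https://auth.example.org/realms/soilwise/protocol/openid-connect/certs"

def verify_token(token: str) -> dict:
    """Validate signature, expiry and audience; return the token claims."""
    signing_key = PyJWKClient(JWKS_URL).get_signing_key_from_jwt(token)
    return jwt.decode(token, signing_key.key, algorithms=["RS256"],
                      audience="soilwise-repository")

# claims = verify_token(bearer_token)
# roles  = claims.get("realm_access", {}).get("roles", [])   # Keycloak-style role claim
```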
Sign-up
For every registered user of SWR components, an account is needed. This account can be created in one of three ways:
- Automatically, by providing an authentication token that was created by a trusted authentication service and that contains the necessary information on the organisation of the user and the intended role (this can e.g. be implemented through using a DAPS)
- Manually, through self-registration (may only be available for users from certain domains and/or for certain roles)
- Through superuser registration; in this case the user gets issued an activation link and has to set the password to complete registration
Authentication
Certain functionalities of the SWR will be available to anonymous users, but functions that edit any state of the system (data, configuration, metadata) require an authenticated user. The easiest form of authentication is to use the login provided by the SWR itself. This login is username/password based. A second factor, e.g. through an authenticator app, may be added after the first iteration.
Other forms of authentication include using an existing token.
Authorisation
Every component has to check whether an authenticated user may invoke a desired action based on that user's roles in their organisations. To ensure that the User Interface does not offer actions that a given user may not invoke, the user interface shall also perform authorisation.
Roles are generally defined using privileges: a certain role may, for example, `read` certain resources, `edit` them, or even `delete` them. Here is an example of such a definition:
A standard `user` may only `read` and `edit` their own `User` profile, and read the information from their organisation. Once a user has been given the role `dataManager`, they may perform any CRUD operation on any `Data` that is in the scope of their `organisation`. They are also granted `read` access to publication `Theme` configurations in their own and in any parent organisations.
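A minimal sketch of such a role-to-privilege mapping; the role names, resource types and helper are illustrative assumptions, not the actual SWR access-control model:

```python
# role -> resource type -> allowed actions
ROLE_PRIVILEGES = {
    "user": {
        "User": {"read", "edit"},          # own profile only (ownership checked separately)
        "Organisation": {"read"},
    },
    "dataManager": {
        "Data": {"create", "read", "update", "delete"},
        "Theme": {"read"},                 # incl. parent organisations
    },
}

def is_allowed(roles: list[str], resource_type: str, action: str) -> bool:
    """Return True if any of the user's roles grants the action on the resource type."""
    return any(action in ROLE_PRIVILEGES.get(role, {}).get(resource_type, set())
               for role in roles)

assert is_allowed(["user", "dataManager"], "Data", "delete")
assert not is_allowed(["user"], "Data", "read")
```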
Further implementation hints and Technologies
The public cloud hale connect user service can be used for central user management.
Monitoring
All components and services of the SWR are monitored at different levels to ensure robust operations and security of the system. There will be a central monitoring service for all components that are part of the SWR.
In particular, monitoring needs to fulfill the following requirements:
- For each node, its general state and resource utilisation (RAM, CPU, Volumes) shall be monitored.
- For each container, its general state, e.g. resource consumption (RAM, CPU, Volumes, Transfer, Uptime) shall be monitored.
- For each service, there shall be a health check that can be used to test if the service is responsive and functional, e.g. after a restart.
- If issues that cannot be recovered from automatically occur or which lead to a longer-term degradation of services, messages shall be sent to the operators via channels such as Slack, PagerDuty, or Jira.
- The monitoring system shall provide availability statistics.
- The monitoring system should provide usage statistics.
- The monitoring system may provide a UI element that can be embedded into other components to make usage transparent.
- The monitoring system should provide a dashboard to help system operators with understanding the state of the SWR and to debug incidents, including possible security incidents.
- The monitoring system shall collect warning and error logs to provide guidance for system administrators.
- The monitoring system shall offer the possibility to filter logged interactions based on the HTTP status code, e.g. to identify 404s or 500s.
System context and implementation hints for monitoring
- Monitoring connects with: all modules
- Technologies used: Grafana, Portainer, Prometheus
- External integrations: Jira, Slack, PagerDuty, ...
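A minimal sketch of a service health check that Prometheus/Grafana could scrape, assuming the prometheus_client package; the port, metric name and checked service URL are illustrative assumptions:

```python
import time
import requests
from prometheus_client import Gauge, start_http_server

service_up = Gauge("swr_service_up", "1 if the service health check succeeds", ["service"])

def check(service: str, url: str) -> None:
    """Probe a service health endpoint and record the result as a gauge."""
    try:
        ok = requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        ok = False
    service_up.labels(service=service).set(1 if ok else 0)

if __name__ == "__main__":
    start_http_server(9100)   # metrics endpoint scraped by Prometheus
    while True:
        check("catalogue", "https://catalogue.example.org/health")
        time.sleep(60)
```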
Ended: Technical Components
APIs ↵
Introduction
Important Links
Project: APIs
SoilWise will use the following APIs:
- Open API
- GraphQL
- SPARQL
- OGC webservices (preferably OGC API generation based on Open API)
API for knowledge extraction
Important Links
Project: APIs
An API providing machine-based access to the SWR knowledge graph. This API is most likely to conform to an existing metadata standard, such as OGC API Records, OpenAPI or GraphQL. However, its responses are RDF documents, for instance encoded with JSON-LD syntax. Other components of the SWR performing knowledge extraction and/or augmentation may also use this API to interact with the knowledge graph.
A point of discussion is whether the SPARQL engine offers enough performance to facilitate basic discovery actions (by the catalogue frontend); as an alternative, it was suggested to introduce an Elasticsearch or PostgreSQL component which caches the content retrieved with a SPARQL query. These aspects will be validated in the upcoming iteration.
Technology
Various technology options exist; we will validate these in the upcoming iteration:
- pycsw and pygeoapi could be extended to support SPARQL as a backend, which would enable OGC API Records on the SPARQL backend
- The GRLC tool enables an Open Rest API on any SPARQL endpoint
- Various tools exist offering Graphql interface to wrap a SPARQL endpoint, to facilitate the growing GraphQL community
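A minimal sketch of a client retrieving a record as JSON-LD from an OGC API Records style endpoint, as described above; the endpoint URL and record identifier are hypothetical:

```python
import requests

BASE = "https://repository.example.org/api/collections/metadata:main"

resp = requests.get(f"{BASE}/items/abc-123",
                    headers={"Accept": "application/ld+json"}, timeout=10)
resp.raise_for_status()

record = resp.json()   # an RDF document in JSON-LD syntax
print(record.get("@type"), record.get("@id"))
```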
Data preview & download APIs
Important Links
Project: APIs
The API-based (soil) data publication has been chosen as a key channel of the SWR to satisfy user needs in terms of data download & export. So far, the following APIs were selected to be verified in terms of their implementation and usability for SoilWise stakeholders.
Foreseen functionality
- Dataset download, no matter its format or data model.
- Export of a subset of a dataset in various forms, including, e.g. feature collection, tiles or a zone in a gridded coverage.
- Export a collection of datasets or their parts, most typically a mosaic combining several satellite images.
Data preview APIs
Support the discovery and query operations of an API that provides access to electronic maps in a manner independent of the underlying data store:
- Non geographical resources
- Resources such as PDF, SQLite, Excel and CSV would benefit from an API which reads the remote file and returns a section of it in a common format, for the user interface to display (or a summary provided by an LLM).
- Geographical resources (benefitting from a map view)
- The OGC Web Map Service (WMS, also ISO 19128) supports requests for map images (and other formats) generated from geographical data,
- The OGC API - Maps supports a REST API that can serve spatially referenced electronic maps, whether static or dynamically rendered, independently of the underlying data store. Note that this standard has not been approved yet; further development may still occur.
- The OGC Web Map Tile Service (WMTS) supports serving map tiles of spatially referenced data using tile images with predefined content, extent, and resolution.
- The OGC API - Tiles defines, in the form of a REST API, building blocks for creating Web APIs that support retrieving geospatial information as tiles. Different forms of geospatial information are supported, such as tiles of vector features ("vector tiles"), coverages, maps (or imagery) and other types of geospatial information.
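A minimal sketch of requesting a preview image from one of the map APIs listed above (WMS), assuming OWSLib; the endpoint, layer name and extent are placeholders:

```python
from owslib.wms import WebMapService

wms = WebMapService("https://maps.example.org/wms", version="1.3.0")

img = wms.getmap(layers=["soil_regions"],          # hypothetical layer name
                 srs="EPSG:4326",
                 bbox=(-10.0, 35.0, 30.0, 70.0),    # rough lon/lat extent of Europe
                 size=(600, 450),
                 format="image/png")

with open("preview.png", "wb") as f:
    f.write(img.read())
```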
Data download APIs
In order to monitor the usage of datasets downloaded from federated sources, it could be relevant to introduce a reverse proxy for those remote sources. This proxy would count the download and then forward the user to that resource.
Also see the section on Knowledge extraction
In some cases, it is relevant not to guide the user to the remote source but to let Soilwise do some preprocessing (filtering, reformatting, reprojection) and provide a more tailored answer to the user question. Similar needs may exist for resources hosted within the SWR.
Various tools exist which provide standardised APIs on various data sources. The following APIs offer the functionality described above.
- For non geographical data
- GraphQL
- SPARQL
- OpenAPI
- For geographical data
- For vector data
- the OGC Web Feature Service (WFS aka ISO 19142) supports requests for geographical feature data (with vector geometry and attributes),
- the OGC API – Features provides, in the form of a REST API, the capability to create, modify, and query spatial data on the Web, and specifies requirements and recommendations for APIs that want to follow a standard way of sharing feature data.
- Sensorthings API is a good fit for harmonised soil data.
- For raster data
- the OGC Web Coverage Service (WCS) supports requests for coverage data (rasters),
- the OGC API – Coverages supports the download of coverages represented by some binary or ASCII serialisation, specified by some data (encoding) format. Arguably, the most popular type of coverage is a gridded one. Satellite imagery is typically modelled as a gridded coverage, for example. Note that this standard has not been approved yet; further development may still occur.
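A minimal sketch of downloading a filtered subset of features via OGC API - Features, one of the options listed above; the endpoint, collection name and bounding box are hypothetical:

```python
import requests

BASE = "https://data.example.org/api"
params = {"bbox": "4.0,50.0,7.0,54.0",   # spatial filter (lon/lat)
          "limit": 100,
          "f": "json"}

resp = requests.get(f"{BASE}/collections/soil_profiles/items", params=params, timeout=30)
resp.raise_for_status()

features = resp.json()["features"]
print(f"Downloaded {len(features)} features")
```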
Open issues
The persistent identification of records within a dataset is not guaranteed on (remote) sources that are disseminated using various APIs, such as OGC OWS services, GraphQL, and OpenAPI. Novel formats such as GeoParquet and COG allow range (subset) requests to a single endpoint and could combine FAIR identification and subset requests. Exploration of their potential for the SWR data download and export remains an open question.
Technology
Non geographic
A reverse proxy can probably be set up at the ingress/firewall level.
Research is needed to understand available technology to provide preview options on various types of resource types.
Geographic data
As described within the Data & Knowledge publication component, MapServer is intended for data publication in the SWR. MapServer is an open-source platform for publishing spatial data to the web using standardised APIs defined by the Open Geospatial Consortium, such as WMS, WFS, WCS, and OGC API-Features.
MapServer is not an optimal solution for providing rich data in a hierarchical structure. For that type of data, the SensorThings API (FROST Server), WFS (deegree), GraphQL (PostGraphile) and SPARQL (Virtuoso) are more relevant.
Integration opportunities
SWR fully stands behind the FAIR principles, including persistent identification for the data download, which would result in a full download of the data/knowledge resource.
Discovery APIs
Important Links
Project: APIs
In order to enable resource discovery from SoilWise for a variety of communities, SWR aims to evaluate a wide range of standardised or de-facto discovery APIs:
- For GraphQL and SPARQL, read the section on knowledge extraction
- OAI-PMH + Datacite
- Sitemap.xml + schema.org
- Catalog service for the Web and OGC API records
OAI-PMH + Datacite
With this endpoint, we aim to enable academia and the open data community to interact with the SWR. Records from the SWR can be harvested by CKAN and Dataverse instances using this interface.
Sitemap.xml + schema.org
The search engine community typically uses the Sitemap and schema.org annotations to crawl the SWR content
Catalog service for the Web and OGC API records
The spatial community typically use CSW and OGC API Records to interact with catalogues. An example scenario is a catalogue search from within QGIS, using the Metasearch panel.
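A minimal sketch of the kind of CSW search a client such as QGIS MetaSearch performs, assuming OWSLib; the endpoint URL and search term are placeholders:

```python
from owslib.csw import CatalogueServiceWeb
from owslib.fes import PropertyIsLike

csw = CatalogueServiceWeb("https://catalogue.example.org/csw")

query = PropertyIsLike("csw:AnyText", "%soil erosion%")
csw.getrecords2(constraints=[query], maxrecords=10, esn="summary")

for identifier, record in csw.records.items():
    print(identifier, "-", record.title)
```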
Processing APIs
Important Links
Project: APIs
A typical scenario for a processing API is the validation of a metadata record against a certain schema, or a data transformation between schemas. The OGC Web Processing Service and its follow-up, OGC API - Processes, are the two major candidates that will be further investigated in terms of their potential for the SWR.
SPARQL API
Important Links
Project: APIs
This is the primary access point to the knowledge graph, both for humans, as well as for machines. Many applications and end users will instead interact with specialised assets that use the SPARQL end-point, such as the Chatbot or the API. However, the SPARQL end-point is the main source for the development of further knowledge applications and provides bespoke search to humans.
Consider that this component relates to the knowledge extraction component, which describes alternative mechanisms to access selected parts of the knowledge graph.
See also SWR data model for further details.
Rules and reasoning
Since we are importing resources from various data and knowledge repositories, we expect many duplicates, blank nodes and conflicting statements. The implementation of rules should be permissive: not preventing inclusion, only flagging potential inconsistencies.
Technology
A number of proven open source triple store implementations exist. In the next iteration we foresee using the Virtuoso software as a starting point.
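A minimal sketch of querying the knowledge graph over SPARQL, assuming SPARQLWrapper; the endpoint URL and the DCAT-style properties are placeholders:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://sparql.example.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    PREFIX dct:  <http://purl.org/dc/terms/>
    SELECT ?dataset ?title WHERE {
        ?dataset a dcat:Dataset ;
                 dct:title ?title .
    } LIMIT 10
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], "-", row["title"]["value"])
```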
Ended: APIs
Infrastructure ↵
Introduction
This section describes the general hardware infrastructure and deployment pipelines used for the SWR. As of the delivery of this initial version of the technical documentation, a prototype pipeline and hardware environment shall be continuously improved as required to fit the needs of the project.
During the development of the first project iteration cycle, the assumptions are:
- There is no production environment.
- There is a central staging environment.
- The central staging environment is deployed on a single hardware node of sufficient capacity. This hardware node will be provided by project partner weTransform.
- In addition to the hardware node, the staging environment also includes an offsite backup capacity, such as a storage box, that is operated in a different physical location.
- There is no central dev/test environment. Each organisation is responsible for its own dev/test environments.
- The deployment and orchestration configuration for this iteration should be stored as YAML in a GitHub repository.
- Deployments to the central staging environment are done through GitHub Actions (or through a Jenkins or GitLab instance provided by weTransform or other partners). This still has to be decided.
- For each component, there shall be separate build processes managed by the responsible partners that result in the built images being made accessible through a hub (e.g. dockerhub)
After completion of the First project iteration cycle, a production environment will be added. It will be dimensioned according to expected loads.
After completion of the Second project iteration cycle, the staging and production environments may switch to a kubernetes-based orchestration mode if it is deemed necessary and advantageous at that point in time. Kubernetes does add significant complexity and requires substantial experience and maintenance to render benefits.
Containerization
The general assumption of operating and developing the SWR is that we work with a containerised environment. This means that each software component, be it a database or other storage, or a service of some kind, is compiled into a container image. These images are made available in a hub or repository, so that they can be deployed automatically whenever needed, including to fresh hardware.
GIT versioning system
All aspects of the SoilWise repository can be managed through the SoilWise GIT repository. This allows all members of the Mission Soil and EUSO community to provide feedback or contribute to any of the aspects.
Documentation
Documentation is maintained in the Markdown format using MkDocs and deployed as HTML or PDF using GitHub Pages.
An interactive preview of architecture diagrams is also maintained and published using GitHub Pages.
Source code
Software libraries tailored or developed in the scope of SoilWise are maintained through the GIT repository.
Container build scripts/deployments
SoilWise is based on an orchestrated set of container deployments. Both the definitions of the containers as well as the orchestration of those containers are maintained through Git.
Harvester definitions
The configuration of the endpoint to be harvested, filters to apply and the interval is stored in a Git repository. If the process runs as a CI-CD pipeline, then the logs of each run are also available in Git.
Authored and harvested metadata
Metadata created in SWR, as well as metadata imported from external sources, are stored in Git, so a full history of each record is available, and users can suggest changes to existing metadata.
Validation rules
Rules (ATS/ETS) applied to metadata (and data) validation are stored in a git repository.
ETL configuration
Alignments to be applied to the source to be standardised and/or harmonised are stored on a git repository, so users can try the alignment locally or contribute to its development.
Backlog / discussions
Roadmap discussion, backlog and issue management are part of the Git repository. Users can flag issues on existing components, documentation or data, which can then be followed up by the participants.
Release management
Releases of the components and infrastructure are managed from a Git repository, so users understand the status of a version and can download the packages. The release process is managed in an automated way through CI-CD pipelines.
Ended: Infrastructure
Glossary
- Acceptance Criteria
- Acceptance Criteria can be used to judge if the resulting software satisfies the user's needs. A single user story/requirement can have multiple acceptance criteria.
- API
- Application programming interface (API) is a way for two or more computer programs to communicate with each other (source wikipedia)
- Application profile
- Application profile is a specification for data exchange for applications that fulfil a certain use case. In addition to shared semantics, it also allows for the imposition of additional restrictions, such as the definition of cardinalities or the use of certain code lists (source: purl.eu).
- Artificial Intelligence
- Artificial Intelligence (AI) is a field of study that develops and studies intelligent machines. It includes the fields rule based reasoning, machine learning and natural language processing (NLP). (source: wikipedia)
- Assimilation
- Assimilation is a term indicating the processes involved in combining multiple datasets of different origin into a common dataset. The term is used somewhat similarly in psychology as "incorporation of new concepts into existing schemes" (source: wikipedia), but this is not well aligned with its usage in the data science community: "updating a numerical model with observed data" (source: wikipedia)
- ATOM
- ATOM is a standardised interface to exchange news feeds over the internet. It has been adopted by INSPIRE as a basic alternative to download services via WFS or WCS.
- Catalogue
- Catalogue or metadata registry is a central location in an organization where metadata definitions are stored and maintained (source: wikipedia)
- Code list
- Code list an enumeration of terms in order to constrain input and avoid errors (source: UN).
- Conceptual model
- Conceptual model or domain model represents concepts (entities) and relationships between them (source: wikipedia)
- Content negotiation
- Content negotiation refers to mechanisms that make it possible to serve different representations of a resource at the same URI (source: wikipedia)
- Controlled vocabulary
- Controlled vocabulary provides a way to organize knowledge for subsequent retrieval. A carefully selected list of words and phrases, which are used to tag units of information so that they are more easily retrieved by a search (source: Semwebtech). Vocabulary, unlike the dictionary and thesaurus, offers an in-depth analysis of a word and its usage in different contexts (source: learn grammar)
- Cordis
- Cordis is the primary source of results from EU-funded projects since 1990
- CSW
- CSW Catalogue Service for the Web
- Dataverse
- Dataverse is open source research data repository software
- Datacite
- Datacite is a non-profit organisation that provides persistent identifiers (DOIs) for research data.
- Datacite metadata scheme
- Datacite metadata schema a datamodel for metadata for scientific resources
- Digital exchange of soil-related data
- Digital exchange of soil-related data (ISO 28258:2013) presents a conceptual model of a common understanding of what soil profile data are
- Digital soil mapping
- Digital soil mapping is the creation and population of geographically referenced soil databases generated at a given resolution by using field and laboratory observation methods coupled with environmental data through quantitative relationships (source: wikipedia)
- Discovery service
- Discovery service is a concept from INSPIRE indicating a service type which enables discovery of resources (search and find). Typically implemented as CSW.
- Download service
- Download service is a concept from INSPIRE indicating a service type which enables download of a (subset of a) dataset. Typically implemented as WFS, WCS, SOS or Atom.
- DOI
- DOI a digital identifier of an object, any object — physical, digital, or abstract
- Encoding
- Encoding is the format used to serialise a resource to a file, common encodings are xml, json, turtle
- ESDAC
- ESDAC thematic centre for soil related data in Europe
- EUSO
- EUSO European Soil Observatory
- GDAL OGR
- GDAL and OGR are software packages widely used to interact with a variety of spatial data formats
- GML
- Geography Markup Language (GML) is an xml based standardised encoding for spatial data.
- GeoPackage
- GeoPackage a set of conventions for storing spatial data in a SQLite database
- Geoserver
- Geoserver java based software package providing access to remote data through OGC services
- Global Soil Information System
- Global Soil Information System (GLOSIS) is an activity of FAO Global Soil Partnership enabling a federation of soil information systems and interoperable data sets
- GLOSIS domain model
- GLOSIS domain model is an abstract, architectural component that defines how data are organised; it embodies a common understanding of what soil profile data are.
- GLOSIS Web Ontology
- GLOSIS Web Ontology is an implementation of the GLOSIS domain model using semantic technology
- GLOSIS Codelists
- GLOSIS Codelists is a series of codelists supporting the GLOSIS web ontology, including the codelists as published in the FAO Guidelines for Soil Description (v2007), soil properties as collected by FAO GfSD and procedures as initially collected by Johan Leenaars.
- Glosolan
- Glosolan network to strengthen the capacity of laboratories in soil analysis and to respond to the need for harmonizing soil analytical data
- HALE
- Humboldt Alignment Editor (HALE) java based desktop software to compose and apply a data transformation to data
- Harmonization
- Harmonization is the process of transforming two datasets to a common model, a common projection, usage of common domain values and align their geometries
- Iteration
- An iteration is each development cycle (three foreseen within the SoilWise project) in the project. Each iteration can have phases. There are four phases per iteration focussing on co-design, development, integration and validation, demonstration.
- JRC
- JRC is the Joint Research Centre, a Directorate-General of the European Commission. The JRC provides independent, evidence-based science and knowledge, supporting EU policies to positively impact society. Relevant policy areas within JRC are JRC Soil and JRC INSPIRE
- Mapserver
- Mapserver C based software package providing access to remote data through OGC services
- Observations and Measurements
- A conceptual model for Observations and Measurements (O&M), also known as ISO19156
- OGC API
- OGC API building blocks that can be used to assemble novel APIs for web access to geospatial content
- Ontology
- Ontology a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject. (source: wikipedia)
- Product backlog
- Product backlog is the document where user stories/requirements are gathered with their acceptance criteria, and prioritized.
- QGIS
- QGIS desktop software package to create spatial visualisations of various types of data
- REA
- REA is the European Research Executive Agency; its mandate is to manage several EU programmes and support services.
- Relational model
- Relational model an approach to managing data using a structure and language consistent with first-order predicate logic (source: wikipedia)
- RDF
- Resource Description Framework (RDF) a standard model for data interchange on the Web
- Representational state transfer
- Representational state transfer (REST) a set of guidelines for creating stateless, reliable web APIs (source: wikipedia)
- Requirements
- Requirements are the capabilities of an envisioned component of the repository which are classified as ‘must have’, or ‘nice to have’.
- Rolling plan
- Rolling plan is a methodology for considering the internal and external developments that may generate changes to the SoilWise Repository design and development. It keeps track of any developments and changes on a technical, stakeholder group level or at EUSO/JRC.
- SensorThings API
- SensorThingsAPI (STA) is a formalised protocol to exchange sensor data and tasks between IoT devices, maintained at Open Geospatial Consortium.
- Sensor Observation Service
- Sensor Observation Service (SOS) is a formalised protocol to exchange sensor data between entities, maintained at Open Geospatial Consortium.
- Sprint
- Sprint is a small timeframe during which tasks have been defined.
- Sprint backlog
- Sprint backlog is composed of the set of product backlog elements chosen for the sprint, and an action plan for achieving them.
- Soil classification
- Soil classification deals with the systematic categorization of soils based on distinguishing characteristics as well as criteria that dictate choices in use (source: wikipedia)
- Soilgrids
- Soilgrids a system for global digital soil mapping that uses many profile data and machine learning methods to predict the spatial distribution of soil properties across the globe
- SoilWise Use cases
- The SoilWise use cases are described in the Grant Agreement to understand the needs from the stakeholder groups (users). Each use case provides user stories epics.
- Task
- Task is the smallest segment of work that must be done to complete a user story/requirement.
- UML
- Unified Modeling Language (UML) a general-purpose modeling language that is intended to provide a standard way to visualize the design of a system (source: wikipedia)
- Usage scenarios
- Usage scenarios describe how (groups of) users might use the software product. These usage scenarios can originate or be updated from the SoilWise use cases, user story epic or user stories/requirements.
- User story
- A User story is a statement, written from the point of view of the user, that describes the functionality needed by the user from the SoilWise Repository.
- User story epic
- A User story epic is a narrative of stakeholders needs that can be narrowed down into smaller specific needs (user stories/requirements).
- Validation framework
- Validation framework is a framework allowing good communication between users and developers, validation of developed products by users, and flexibility on the developer’s side to take change requests into account as soon as possible. The validation framework needs a description of the functionalities to be developed (user stories/requirements), the criteria that enable to verify that the developed component corresponds to the user needs (acceptance criteria), the definition of tasks for the developers (backlog) and the workflow.
- View service
- View service is a concept from INSPIRE indicating a service type which presents a (pre)view of a dataset. Typically implemented as WMS or WMTS.
- Web service
- Web service a service offered by a device to another device, communicating with each other via the Internet (source: wikipedia)
- WOSIS
- WOSIS is a global dataset, maintained at ISRIC, aiming to serve the user with a selection of standardised and ultimately harmonised soil profile data
- WMS
- Web Map service (WMS) is a formalised protocol to exchange geospatial data represented as images
- WFS
- Web Feature Service (WFS) is a formalised protocol to exchange geospatial vector data
- WCS
- Web Coverage Service (WCS) is a formalised protocol to exchange geospatial grid data
- XSD
- XML Schema Definition (XSD) recommendation how to formally describe the elements in an Extensible Markup Language (XML) document (source: wikipedia)