Data Ingestion

Importing data into the SMS platform can be done either manually or automatically, depending on the entity types covered by a dataset, the format and structure of the data, and the data access policy defined for the data to be imported. The latter is important because not all data can be accessed by every user; different levels of accessibility apply, depending on subscriptions and on the permissions granted by the owners of the datasets.

The following questions need to be answered before importing data into SMS:

What types of entities are covered by the dataset?
The answer to this question helps SMS find potential linking points and check whether the conceptual model should be amended to accommodate new entity types.
What is the format and structure of the data to be imported?
The answer to this question helps SMS automate the ingestion process if the data format and structure are based on the standard interfaces supported by SMS.
What are the data access policies?
The answer to this question helps SMS apply restriction rules when accessing the imported dataset.

Linked Data Creation

There are several steps in the lifecycle of linked data to extract the imported data and store it in the SMS triple store. The lifecycle starts with a basic (syntactic) conversion of the data to RDF, without applying any specific vocabularies. This basic conversion is then enriched by applying several linking and enrichment services. Different services and scripts are used to convert unstructured and structured data to RDF. Techniques such as Named Entity Recognition (discussed later in the document) can be employed to extract named entities from textual content. A concrete example is recognizing research institutions and universities in a researcher’s CV (curriculum vitae), using named entity recognition to link the CV to databases with background knowledge such as DBpedia. For structured content, the tool is selected based on the format; for example, OpenRefine can be used to convert spreadsheet data to RDF.
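As a minimal sketch of what the syntactic conversion step does for structured content, the snippet below turns a spreadsheet-like row into N-Triples without applying any specific vocabulary, similar in spirit to what an OpenRefine export produces. The base namespace and the property URIs derived from column names are illustrative placeholders, not the actual SMS vocabulary:

```python
# Minimal sketch of a syntactic row-to-RDF conversion: no specific
# vocabulary is applied yet; column names become placeholder predicates.
from urllib.parse import quote

BASE = "http://example.org/resource/"  # hypothetical base namespace

def row_to_ntriples(row, row_id):
    """Turn one spreadsheet row (a dict of column -> value) into N-Triples lines."""
    subject = f"<{BASE}{quote(str(row_id))}>"
    triples = []
    for column, value in row.items():
        predicate = f"<{BASE}{quote(column)}>"
        # Escape backslashes and quotes as required by N-Triples literals.
        literal = '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'
        triples.append(f"{subject} {predicate} {literal} .")
    return triples

for line in row_to_ntriples({"name": "3M", "country": "US"}, "org-1"):
    print(line)
```

A later enrichment pass would replace these placeholder predicates with terms from established vocabularies and link the subjects to external entities.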

Data Linking and Scientific Lenses

Data linking is the process of creating a relationship between entities that meet preset conditions. If globally unique identifiers for entities are available, linking is straightforward. If not, a variety of techniques can be used, from (fuzzy) string matching to exploiting attributes available in the different databases. In the data linking service that we provide, we emphasise providing contextual information that helps eliminate ambiguity once a relationship between entities is established, and that enables re-use. For instance, the GRID, OrgRef, and EUPRO datasets describe organisation entities across various countries, including both public and private research organisations. All of these datasets refer to the “Minnesota Mining and Manufacturing Company” (3M), a large multinational organisation with a substantial patent portfolio. The GRID dataset distinguishes between 3M (United States) and 3M (Canada), while the OrgRef dataset only refers to the single entity 3M. To study these organisations, they need to be aligned across the datasets whenever they are the same. But what does “the same” mean? Suppose one study aims to compare organisations at a global level, whereas a second compares organisations across countries. In the first setting, all occurrences of ‘3M’ in the datasets are considered the same; in the second, the Canadian and U.S. branches of ‘3M’ are to be considered separately.
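To illustrate the (fuzzy) string matching mentioned above, the sketch below links records from two hypothetical name lists using a character-level similarity ratio; the record values and the 0.8 threshold are assumptions for illustration, not the actual SMS configuration. Note how name variants across languages can still match, whereas a branch label such as “3M (United States)” scores low against plain “3M” — one reason contextual information beyond string similarity is needed:

```python
# Sketch of approximate-string-similarity linking between two
# hypothetical organisation lists. Threshold and data are illustrative.
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

grid = ["3M (United States)", "3M (Canada)", "University of Amsterdam"]
orgref = ["3M", "Universiteit van Amsterdam"]

THRESHOLD = 0.8  # illustrative cut-off for accepting a candidate link
links = [(g, o) for g in grid for o in orgref
         if similarity(g, o) >= THRESHOLD]
```

Here only the Amsterdam pair passes the threshold; the 3M records would need identifiers or contextual attributes (country, patents, geo-location) to be linked correctly.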

In our approach to data linking, we first provide a network of interlinked entities through linksets. These linksets are generated using basic similarity metrics such as exact string similarity, approximate string similarity and geo-similarity. The goal of these linksets is to serve as “lego pieces” that users can easily combine or modify to their liking to answer a particular research question. Combining or modifying linksets is made possible through operations such as UNION, TRANSITIVITY or INTERSECTION. The result of a manipulation over one or more linksets is a lens, which constitutes a user view over the data.
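These operations can be sketched as plain set manipulations over linksets, here modelled as sets of normalised entity-identifier pairs. The dataset-prefixed identifiers are hypothetical, and the closure routine is a naive illustration of TRANSITIVITY rather than the actual SMS implementation:

```python
# Linksets modelled as sets of sorted entity-identifier pairs.
# UNION and INTERSECTION are ordinary set operations; TRANSITIVITY
# is a naive closure: (a, b) and (b, c) imply (a, c).

def transitive_closure(pairs):
    links = {tuple(sorted(p)) for p in pairs}
    changed = True
    while changed:
        changed = False
        for a, b in list(links):
            for c, d in list(links):
                shared = {a, b} & {c, d}
                if shared and {a, b} != {c, d}:
                    new = tuple(sorted(({a, b} | {c, d}) - shared))
                    if new not in links:
                        links.add(new)
                        changed = True
    return links

# Hypothetical linksets produced by different similarity metrics.
exact_links = {("GRID:3M-US", "OrgRef:3M")}
geo_links = {("OrgRef:3M", "EUPRO:3M")}

# A "global" lens: UNION of the linksets, closed under TRANSITIVITY,
# so all 3M records collapse into one group.
lens_global = transitive_closure(exact_links | geo_links)

# A stricter lens: only links supported by both metrics (INTERSECTION).
lens_strict = exact_links & geo_links
```

The two lenses correspond to the two studies discussed earlier: the global lens treats all 3M occurrences as the same entity, while a stricter lens keeps them apart unless multiple sources of evidence agree.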

We propose to enable users to make an informed choice over alignments produced by existing tools. This turns the generic linking problem into one of choosing and modifying: existing tools are reused to generate correspondences as the basis of interlinking.