Importing data into the SMS platform can be done either manually or automatically, depending on the "entity types" covered by a dataset, the "format and structure" of the data, and the "data access policy" defined for the data to be imported. The latter is important because not all data can be accessed by every user; different levels of accessibility apply, depending on subscriptions and on the permissions granted by the owners of the datasets.
The following questions need to be answered before importing data into SMS:
Several steps are followed in the linked data lifecycle to extract the imported data and store it in the SMS triple store. The lifecycle starts with a basic (syntactic) conversion of the data to RDF format, without applying any specific vocabularies. This basic conversion is then enriched by applying several linking and enrichment services. Different services and scripts are used to convert unstructured and structured data to RDF. Techniques such as Named Entity Recognition (discussed later in this document) can be employed to extract named entities from textual content. A concrete example is recognizing research institutions and universities in a researcher's CV (curriculum vitae), using named entity recognition and linking the CV to databases with background knowledge, such as DBpedia. For structured content, the tool is selected based on the format; for example, OpenRefine can be used to convert spreadsheet data to RDF.
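As a minimal sketch of the basic (syntactic) conversion step, the snippet below turns spreadsheet-style CSV rows into N-Triples, with column headers mapped to placeholder predicates rather than any specific vocabulary. The CSV content, the `example.org` namespaces, and the column names are all invented for illustration; a production pipeline would use a tool such as OpenRefine and real vocabularies instead.

```python
import csv
import io

# Hypothetical spreadsheet content (invented for illustration).
CSV_DATA = """id,name,country
org1,3M Company,US
org2,3M Canada,CA
"""

# Placeholder namespaces; no specific vocabulary is applied at this stage.
BASE = "http://example.org/resource/"
PROP = "http://example.org/property/"

def row_to_triples(row):
    """Basic syntactic conversion: each non-id cell becomes one triple,
    with the column header used as a placeholder predicate."""
    subject = f"<{BASE}{row['id']}>"
    return [
        f'{subject} <{PROP}{column}> "{value}" .'
        for column, value in row.items()
        if column != "id"
    ]

def convert(csv_text):
    """Convert every CSV row into a list of N-Triples lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [triple for row in reader for triple in row_to_triples(row)]

for triple in convert(CSV_DATA):
    print(triple)
```

The enrichment services described above would later replace the placeholder predicates with terms from established vocabularies and link the subjects to external resources.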
Data linking is the process of creating a relationship between entities that meet preset conditions. If globally unique identifiers for entities are available, the linking becomes straightforward. If not, a variety of techniques can be used, from (fuzzy) string matching to exploiting attributes available in the different databases. In the data linking service that we provide, we emphasise providing contextual information that helps eliminate ambiguity once a relationship between entities is established, and that enables re-use. For instance, the GRID, OrgRef, and EUPRO datasets describe organisation entities across various countries, including both public and private research organisations. All of these datasets refer to the "Minnesota Mining and Manufacturing Company" (3M), a large multinational organisation with a substantial patent portfolio. The GRID dataset distinguishes between 3M (United States) and 3M (Canada), while the OrgRef dataset refers only to the single entity 3M. To study these organisations, they need to be aligned across these datasets whenever they are the same. But what does "the same" mean? Suppose one study aims to compare organisations at a global level, whereas a second compares organisations across countries. In the first setting, all occurrences of '3M' in the datasets are considered the same. In the second study, the Canadian and U.S. branches of '3M' are considered separately.
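When no shared identifiers exist, approximate string matching over entity labels is one starting point. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity metric; the organisation name lists and the threshold are invented for illustration and do not reflect the real GRID or OrgRef records. Keeping the score with each candidate link preserves the kind of contextual information that later disambiguation needs.

```python
from difflib import SequenceMatcher

# Hypothetical organisation labels from two datasets (invented for
# illustration; real GRID/OrgRef entries differ).
grid_names = ["3M (United States)", "3M (Canada)", "University of Minnesota"]
orgref_names = ["3M", "Univ. of Minnesota"]

def similarity(a, b):
    """Approximate string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link(left, right, threshold=0.6):
    """Emit candidate links whose similarity exceeds the threshold,
    keeping the score as context for later review."""
    return [
        (a, b, round(similarity(a, b), 2))
        for a in left
        for b in right
        if similarity(a, b) >= threshold
    ]

for a, b, score in link(grid_names, orgref_names):
    print(f"{a!r} <-> {b!r}  score={score}")
```

Note that plain character-level similarity misses the 3M pairs here because the country qualifiers dominate the short label; normalising labels (e.g. stripping parentheticals) or comparing additional attributes, as discussed above, is what makes such links recoverable.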
In our approach to data linking, we first provide a network of interlinked entities through linksets. These linksets are generated using basic similarity metrics such as exact string similarity, approximate string similarity and geo-similarity. The goal of these linksets is to serve as "lego pieces" that users can easily combine or modify to their liking to answer a particular research question. Combining or modifying linksets is made possible using operations such as UNION, TRANSITIVITY or INTERSECTION. The result of a manipulation over one or more linksets is a lens, which serves as a user-defined view over the data.
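The linkset operations above can be sketched by modelling each linkset as a set of (source, target) identifier pairs. The entity identifiers and the individual linksets below are invented for illustration; the operations themselves are plain set union, set intersection, and a transitive closure.

```python
# Hypothetical linksets: sets of (source, target) pairs between entity
# identifiers (all identifiers invented for illustration).
approx_links = {("grid:3M_US", "orgref:3M"), ("grid:3M_CA", "orgref:3M")}
exact_links = {("grid:3M_US", "orgref:3M")}
geo_links = {("orgref:3M", "eupro:3M")}

def union(*linksets):
    """UNION: links present in any of the input linksets."""
    result = set()
    for linkset in linksets:
        result |= linkset
    return result

def intersection(*linksets):
    """INTERSECTION: links agreed upon by every input linkset."""
    result = set(linksets[0])
    for linkset in linksets[1:]:
        result &= linkset
    return result

def transitive(linkset):
    """TRANSITIVITY: whenever (a, b) and (b, c) are links, add (a, c)."""
    closure = set(linkset)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure
               if b == c and a != d and (a, d) not in closure}
        if not new:
            return closure
        closure |= new

# A lens for the "global level" study: via the orgref record, both 3M
# branches end up linked to the same (hypothetical) EUPRO entity.
global_lens = transitive(union(approx_links, geo_links))
print(sorted(global_lens))
```

For the "across countries" study, a user would instead build a lens that excludes the transitive step, keeping the U.S. and Canadian branches distinct.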
We propose to enable users to make an informed choice over alignments produced by existing tools. This modifies the generic interlinking problem into one of choosing and modifying. Our proposal is to reuse existing tools for generating correspondences as the basis of interlinking.