Data Model

Events

Events are the vehicle through which data arrives at the broker. Event data is ingested and broken down into Object Events, and eventually into Identifiers, Relationships and Metadata. These entities are then processed and integrated into the existing data to extend the knowledge that the broker service holds.

Events data model

Graph

Identifiers represent references to scholarly entities and follow a specific identifier scheme (e.g. DOI, arXiv, URL). Relationships have a type (e.g. isIdenticalTo, hasVersion, cites) and exist between two Identifiers, the source and the target. These are the building blocks of the broker’s graph model.
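As a minimal sketch of these two building blocks, the following Python models an Identifier and a Relationship as immutable records. The class and field names, as well as the sample values, are assumptions made for illustration and are not taken from the broker's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    value: str   # placeholder values are used below, not real records
    scheme: str  # e.g. "doi", "arxiv", "url"


@dataclass(frozen=True)
class Relationship:
    source: Identifier
    target: Identifier
    relation_type: str  # e.g. "isIdenticalTo", "hasVersion", "cites"


# Hypothetical example: a paper's DOI cites a software project's URL.
paper_doi = Identifier("10.5555/paper-a", "doi")
software_url = Identifier("https://example.org/software-x", "url")
citation = Relationship(paper_doi, software_url, "cites")
```

Making the records frozen keeps them hashable, so they can be stored in sets or used as dictionary keys when building a graph.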

To represent scholarly entities (software, articles, etc.), the concept of Groups is introduced. A Group is a set of Identifiers, formed based on the Relationships between them. For example, one can define that all Identifiers connected by Relationships of type isIdenticalTo form a Group of type Identity and can be considered a single entity.

Identifier group
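One way to form Identity Groups is to treat isIdenticalTo Relationships as edges and collect the connected components with a union-find structure. The sketch below assumes Relationships are given as (source, type, target) tuples of plain strings; the function name and the sample identifiers are hypothetical, and the broker may compute its groups differently.

```python
from collections import defaultdict


def identity_groups(relationships):
    """Partition identifiers into sets linked by isIdenticalTo relationships."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)  # register identifiers on first sight
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for source, relation_type, target in relationships:
        root_source, root_target = find(source), find(target)
        if relation_type == "isIdenticalTo":
            parent[root_source] = root_target  # merge the two sets

    groups = defaultdict(set)
    for identifier in parent:
        groups[find(identifier)].add(identifier)
    return list(groups.values())


# Placeholder identifiers: three references to paper A, one to software X.
relationships = [
    ("doi:A", "isIdenticalTo", "arxiv:A"),
    ("arxiv:A", "isIdenticalTo", "url:A"),
    ("doi:A", "cites", "doi:X"),
]
groups = identity_groups(relationships)
```

Here the three identifiers of paper A end up in one Identity Group, while doi:X forms a group of its own because the cites relationship does not merge groups.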

One can also define Groups of Groups. For example, Identity Groups whose Identifiers have a hasVersion relationship with Identifiers from other Identity Groups can form a Version Group.

Version group
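The same idea can be applied one level up. The sketch below assumes a mapping from each identifier to the label of its Identity Group is already available, and merges Identity Groups that are connected by hasVersion Relationships. The names, labels and the choice to keep unconnected Identity Groups as single-member Version Groups are assumptions for illustration only.

```python
def version_groups(identity_of, relationships):
    """Merge Identity Groups connected by hasVersion into Version Groups.

    identity_of maps each identifier to the label of its Identity Group.
    """
    # Build an undirected adjacency map between Identity Groups.
    neighbours = {group: set() for group in identity_of.values()}
    for source, relation_type, target in relationships:
        if relation_type == "hasVersion":
            a, b = identity_of[source], identity_of[target]
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Collect connected components with a depth-first traversal.
    seen, result = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            group = stack.pop()
            if group in component:
                continue
            component.add(group)
            stack.extend(neighbours[group] - component)
        seen |= component
        result.append(component)
    return result


# Placeholder data: software X has a concept record and two versions.
identity_of = {
    "doi:X": "software-x-concept",
    "doi:X.v1": "software-x-v1",
    "url:X.v1": "software-x-v1",
    "doi:X.v2": "software-x-v2",
    "doi:A": "paper-a",
}
relationships = [
    ("doi:X", "hasVersion", "doi:X.v1"),
    ("doi:X", "hasVersion", "doi:X.v2"),
]
versions = version_groups(identity_of, relationships)
```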

Finally, one can model relationships between scholarly entities (e.g. Paper A cites Software X) by abstracting the low-level Relationships between Identifiers to the Group level, thus forming Group Relationships. For example, one can define that Identity Groups whose Identifiers have Relationships of type cites to Identifiers of other Identity Groups form a cites Group Relationship.

Group relationship
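Lifting Relationships to the Group level can be sketched as a simple projection: each identifier-level Relationship is mapped to the pair of Identity Groups its endpoints belong to, and duplicates collapse. As before, the function name, the group labels and the sample identifiers are hypothetical.

```python
def group_relationships(identity_of, relationships, relation_type="cites"):
    """Project identifier-level relationships onto Identity Groups."""
    lifted = set()
    for source, rel, target in relationships:
        if rel != relation_type:
            continue
        # Several identifier-level links can map to the same group-level link.
        lifted.add((identity_of[source], rel, identity_of[target]))
    return lifted


# Placeholder data: paper A and software X, each known under two identifiers.
identity_of = {
    "doi:A": "paper-a",
    "arxiv:A": "paper-a",
    "doi:X": "software-x",
    "url:X": "software-x",
}
relationships = [
    ("doi:A", "cites", "doi:X"),
    ("arxiv:A", "cites", "url:X"),
    ("doi:A", "isIdenticalTo", "arxiv:A"),
]
lifted = group_relationships(identity_of, relationships)
```

Both cites Relationships collapse into a single Group Relationship from paper-a to software-x, which is the "Paper A cites Software X" statement at the entity level.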

Metadata

Identifiers, Relationships and Groups can form complex graphs. While this is important for discovering connections between them, it is also valuable to be able to retrieve information about the objects they refer to. To make this information available, Group Metadata and Group Relationship Metadata are stored for Groups and Group Relationships respectively.

This metadata can be used, for example, for rendering a proper citation when needed or for filtering.

Metadata data model

Persistence

As described in the previous sections, the broker receives raw events that are then processed to produce a graph. The data goes through a transformation pipeline that at various stages requires persisting its inputs and outputs. This persistence takes place in an RDBMS, like PostgreSQL or SQLite.

We can divide the persisted information into three incremental levels:

Information layers
A) Raw data: the raw event payloads that arrive in the system.
B) Ground truth: the normalized form of the raw data, representing deduplicated facts about Identifiers and Relationships.
C) Processed knowledge: information that is extracted from the ground truth and transformed into structured knowledge.

Each level depends only on the levels before it. If there is an issue on level C, for example, levels A and B are enough to rebuild it. In the same fashion, level B can be rebuilt from level A alone.
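The rebuild property can be illustrated with a small SQLite sketch. The table names, the JSON payload shape and the cited_by_count table (a simple stand-in for level C) are all assumptions made for this example and do not reflect the broker's actual schema.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE raw_event (id INTEGER PRIMARY KEY, payload TEXT NOT NULL);
CREATE TABLE relationship (
    source TEXT NOT NULL,
    relation_type TEXT NOT NULL,
    target TEXT NOT NULL,
    UNIQUE (source, relation_type, target)
);
CREATE TABLE cited_by_count (target TEXT PRIMARY KEY, total INTEGER NOT NULL);
"""


def rebuild_ground_truth(conn):
    """Level B from level A: replay raw payloads, dropping duplicates."""
    conn.execute("DELETE FROM relationship")
    rows = conn.execute("SELECT payload FROM raw_event ORDER BY id").fetchall()
    for (payload,) in rows:
        for rel in json.loads(payload):
            conn.execute(
                "INSERT OR IGNORE INTO relationship VALUES (?, ?, ?)",
                (rel["source"], rel["type"], rel["target"]),
            )


def rebuild_processed(conn):
    """Level C from level B: derive a per-target citation count."""
    conn.execute("DELETE FROM cited_by_count")
    conn.execute(
        "INSERT INTO cited_by_count "
        "SELECT target, COUNT(*) FROM relationship "
        "WHERE relation_type = 'cites' GROUP BY target"
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Two hypothetical events; the first relationship is reported twice.
events = [
    [{"source": "doi:A", "type": "cites", "target": "doi:X"}],
    [
        {"source": "doi:A", "type": "cites", "target": "doi:X"},
        {"source": "doi:B", "type": "cites", "target": "doi:X"},
    ],
]
conn.executemany(
    "INSERT INTO raw_event (payload) VALUES (?)",
    [(json.dumps(event),) for event in events],
)
rebuild_ground_truth(conn)
rebuild_processed(conn)
```

Because each rebuild function reads only from the level below it, clearing the level C table and calling rebuild_processed again restores the same rows without touching levels A or B.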

Note

All of the above models map to actual database schema tables. For the sake of clarity though, intermediary tables that represent many-to-many relationships between these models (e.g. GroupM2M for Group <-> Group relationships) were not included.