Data Model¶

Events¶

Events are the vessel through which data arrives at the broker. Event data is ingested and broken down into Object Events, and eventually into Identifiers, Relationships and Metadata. These entities are then processed and integrated into the existing data to enhance the knowledge that the broker service holds.

Graph¶

Identifiers represent references to scholarly entities and follow a specific identifier scheme (e.g. DOI, arXiv, URL, etc.). Relationships have a type (e.g. isIdenticalTo, hasVersion, cites, etc.) and exist between two Identifiers, the source and the target. These are the building blocks of the broker’s graph model.
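As a minimal sketch of these two building blocks, the following Python dataclasses model an Identifier and a Relationship. The class and field names (`value`, `scheme`, `relation`) and the DOI-like strings are illustrative assumptions, not the broker's actual schema or real records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    """A reference to a scholarly entity under a given scheme (hypothetical model)."""
    value: str
    scheme: str  # e.g. "doi", "arxiv", "url"


@dataclass(frozen=True)
class Relationship:
    """A typed, directed edge from a source Identifier to a target Identifier."""
    source: Identifier
    target: Identifier
    relation: str  # e.g. "isIdenticalTo", "hasVersion", "cites"


# Placeholder values, for illustration only.
paper = Identifier(value="10.1234/example-paper", scheme="doi")
software = Identifier(value="10.1234/example-software", scheme="doi")
citation = Relationship(source=paper, target=software, relation="cites")
```

Making the dataclasses frozen keeps them hashable, so identical facts arriving from different events can be deduplicated by placing them in a set.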

To represent scholarly entities (software, articles, etc.), the concept of Groups is introduced. A Group is a set of Identifiers, formed based on the Relationships between them. For example, one can define that all Identifiers connected by Relationships of type isIdenticalTo form a Group of type Identity and can be considered a single entity.
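One way to picture Identity Groups is as the connected components of the graph restricted to isIdenticalTo edges. The sketch below computes them with a small union-find; the function name, the tuple layout `(source, relation, target)` and the sample identifiers are assumptions made for illustration, and the broker itself may form groups differently.

```python
from collections import defaultdict


def identity_groups(identifiers, relationships):
    """Group identifiers that are connected by isIdenticalTo relationships.

    `relationships` is an iterable of (source, relation, target) tuples whose
    endpoints all appear in `identifiers`.
    """
    parent = {i: i for i in identifiers}

    def find(x):
        # Walk up to the representative, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for source, relation, target in relationships:
        if relation == "isIdenticalTo":
            parent[find(source)] = find(target)

    groups = defaultdict(set)
    for i in identifiers:
        groups[find(i)].add(i)
    return list(groups.values())


ids = ["doi:A", "arxiv:A", "url:A", "doi:B"]
rels = [
    ("doi:A", "isIdenticalTo", "arxiv:A"),
    ("arxiv:A", "isIdenticalTo", "url:A"),
    ("doi:A", "cites", "doi:B"),  # ignored: not an identity relationship
]
groups = identity_groups(ids, rels)
```

Here the three identifiers for entity A end up in one Identity Group, while `doi:B` forms a group of its own because the cites relationship does not merge groups.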

One can also define Groups of Groups. For example, Identity Groups whose Identifiers have a hasVersion Relationship with Identifiers from other Identity Groups can form a Version Group.

Finally, one can model relationships between scholarly entities (e.g. Paper A cites Software X) by abstracting the low-level Relationships between Identifiers to the Group level, thus forming Group Relationships. For example, one can define that Identity Groups whose Identifiers have Relationships of type cites to Identifiers of other Identity Groups form a cites Group Relationship.
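The lifting of identifier-level Relationships to Group Relationships can be sketched as follows. This assumes a precomputed mapping from each Identifier to its Identity Group; the names `group_relationships` and `identity_of`, and the group labels, are hypothetical.

```python
def group_relationships(identity_of, relationships, relation="cites"):
    """Lift identifier-level relationships to the Identity Group level.

    `identity_of` maps each identifier to the id of its Identity Group.
    Several low-level relationships between the same pair of groups
    collapse into a single Group Relationship.
    """
    lifted = set()
    for source, rel, target in relationships:
        if rel != relation:
            continue
        src_group, tgt_group = identity_of[source], identity_of[target]
        if src_group != tgt_group:  # only relationships to *other* groups
            lifted.add((src_group, relation, tgt_group))
    return lifted


identity_of = {
    "doi:A": "paper-A",
    "arxiv:A": "paper-A",
    "doi:X": "software-X",
    "url:X": "software-X",
}
rels = [
    ("doi:A", "cites", "doi:X"),
    ("arxiv:A", "cites", "url:X"),
    ("doi:A", "isIdenticalTo", "arxiv:A"),
]
result = group_relationships(identity_of, rels)
```

Both low-level cites edits point from an Identifier of Paper A to an Identifier of Software X, so they yield one Group Relationship: Paper A cites Software X.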

Metadata¶

Identifiers, Relationships and Groups can form complex graphs. While this is important for discovering connections between them, it is also valuable to be able to retrieve information about the objects they hold references to. To make this information available, Group Metadata and Group Relationship Metadata are stored for Groups and Group Relationships respectively.

This metadata can be used, for example, for rendering a proper citation or for filtering.

Persistence¶

As described in the previous sections, the broker receives raw events that are then processed to produce a graph. The data goes through a transformation pipeline that at various stages requires persisting its inputs and outputs. This persistence takes place in an RDBMS, like PostgreSQL or SQLite.

We can divide the persisted information into three incremental levels:

A) Raw data: The raw event payloads that arrive into the system
B) Ground truth: Normalized form of the raw data, representing deduplicated facts about Identifiers and Relationships
C) Processed knowledge: Information that is extracted from the ground truth and is transformed into structured knowledge

Each level depends on all of its predecessors. This means that if there is an issue on e.g. level C, levels A and B are enough to rebuild it. In the same fashion, level B depends only on level A.
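The dependency between the levels can be illustrated with a small sketch in which each level is a pure function of the one before it. The event payload shape and the "citation count" used as level C are assumptions chosen for brevity; the real pipeline stores each level in database tables.

```python
def build_ground_truth(raw_events):
    """Level B: deduplicated (source, relation, target) facts from level A."""
    facts = set()
    for event in raw_events:
        for rel in event["relationships"]:
            facts.add((rel["source"], rel["relation"], rel["target"]))
    return facts


def build_knowledge(facts):
    """Level C: here simply the number of distinct citations per target."""
    counts = {}
    for _source, relation, target in facts:
        if relation == "cites":
            counts[target] = counts.get(target, 0) + 1
    return counts


# Level A: raw payloads (hypothetical shape), including a repeated fact.
raw_events = [
    {"relationships": [
        {"source": "doi:A", "relation": "cites", "target": "doi:X"},
    ]},
    {"relationships": [
        {"source": "doi:A", "relation": "cites", "target": "doi:X"},
        {"source": "doi:B", "relation": "cites", "target": "doi:X"},
    ]},
]

ground_truth = build_ground_truth(raw_events)
knowledge = build_knowledge(ground_truth)

# If level C is lost or corrupted, it can be rebuilt from level A alone.
rebuilt = build_knowledge(build_ground_truth(raw_events))
```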

Note

All of the above models map to actual database schema tables. For the sake of clarity though, intermediary tables that represent many-to-many relationships between these models (e.g. GroupM2M for Group <-> Group relationships) were not included.

Search¶

Now that we have the above information stored persistently in the system, we need an efficient way to perform queries over it. Querying the database directly might seem like a practical solution for fetching this information, but it is a naive one. Our graph representation spreads over many tables, which means that fetching it would require multiple complex joins. On top of that, our metadata is stored in JSON/blob-like columns, where filtering is slow and inefficient.

The way to tackle this issue is to denormalize our data back into a rich document representation that clients of the service can consume with ease. This can be done with a document-based (NoSQL) store, like Elasticsearch.

We can create and index the documents using the following strategy:

For each Group Relationship in our system:
1. Fetch its Group Relationship Metadata
2. For each of its source and target Groups, fetch the Group Metadata and Identifiers
3. Create a document from the fetched information and index it
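The strategy above can be sketched in Python with plain dictionaries standing in for the database tables and for the search index. All names and the document layout here are hypothetical; an actual implementation would query the RDBMS and send the documents to Elasticsearch.

```python
def build_document(group_rel, rel_metadata, group_metadata, group_identifiers):
    """Denormalize one Group Relationship into a self-contained document."""

    def describe(group_id):
        # Step 2: Group Metadata and Identifiers for one side of the relationship.
        return {
            "metadata": group_metadata.get(group_id, {}),
            "identifiers": sorted(group_identifiers.get(group_id, [])),
        }

    return {
        "relation": group_rel["relation"],
        # Step 1: the Group Relationship Metadata.
        "metadata": rel_metadata.get(group_rel["id"], {}),
        "source": describe(group_rel["source"]),
        "target": describe(group_rel["target"]),
    }


def build_index(group_rels, rel_metadata, group_metadata, group_identifiers):
    """Step 3: create one document per Group Relationship, keyed by its id."""
    return {
        rel["id"]: build_document(rel, rel_metadata, group_metadata, group_identifiers)
        for rel in group_rels
    }


group_rels = [{"id": 1, "relation": "cites", "source": "paper-A", "target": "software-X"}]
rel_metadata = {1: {"first_seen": "unknown"}}
group_metadata = {"paper-A": {"title": "Paper A"}, "software-X": {"title": "Software X"}}
group_identifiers = {"paper-A": ["doi:A", "arxiv:A"], "software-X": ["doi:X"]}

index = build_index(group_rels, rel_metadata, group_metadata, group_identifiers)
```

Each resulting document carries everything a client needs about the relationship and both of its endpoints, so answering a query requires no joins.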

By performing the expensive database queries only once, in order to index the denormalized documents, we get the best of both worlds: a relationally consistent graph (backed by RDBMS constraints) that is easy to perform complex queries over (backed by Elasticsearch).

Consistency¶

A downside to this solution is that the state of our document store is not always in sync with what we have in our graph in the database. This issue originates from the fact that changes in the database are automatically protected via foreign-key and unique constraints that cannot be applied with the same ease in a document-based store.

A solution to this is to periodically rebuild the entire index from scratch. This guarantees that Elasticsearch starts from a blank state, with no “orphan” or stale information lying around. In addition, by using some of Elasticsearch’s features, this index rebuilding process can be carried out without affecting the responsiveness of the service.
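The text does not name the Elasticsearch features involved. One common approach, assumed here for illustration, is to have clients query an alias, build a fresh index next to the live one, and then repoint the alias. The sketch below simulates that flow with an in-memory stand-in rather than a real Elasticsearch client; the class, function and index names are hypothetical.

```python
class FakeSearchCluster:
    """In-memory stand-in for a search cluster with named indices and aliases."""

    def __init__(self):
        self.indices = {}  # index name -> {doc id: document}
        self.aliases = {}  # alias -> index name

    def search(self, alias):
        # Clients only ever address the alias, never a concrete index.
        return self.indices[self.aliases[alias]]


def rebuild(cluster, alias, new_index_name, documents):
    """Rebuild from scratch into a new index, then switch the alias to it."""
    # 1. Build the new index while the old one keeps serving queries.
    cluster.indices[new_index_name] = dict(documents)
    # 2. Repoint the alias to the freshly built index.
    old_index_name = cluster.aliases.get(alias)
    cluster.aliases[alias] = new_index_name
    # 3. Drop the old index together with any stale documents it held.
    if old_index_name is not None and old_index_name != new_index_name:
        del cluster.indices[old_index_name]


cluster = FakeSearchCluster()
cluster.indices["relationships-v1"] = {1: {"relation": "cites"}, 99: {"relation": "stale"}}
cluster.aliases["relationships"] = "relationships-v1"

# Documents regenerated from the database; the orphan document 99 is not among them.
fresh_documents = {1: {"relation": "cites"}}
rebuild(cluster, "relationships", "relationships-v2", fresh_documents)
```

After the rebuild, queries against the `relationships` alias see only documents derived from the current database state, and at no point was the alias left without an index behind it.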
