Data Model
Events
Events are a vessel for data to arrive at the broker. The event data is ingested and broken down into Object Events, and eventually into Identifiers, Relationships and Metadata. These entities are then processed and integrated into the existing data to enhance the knowledge that the broker service possesses.
Graph
Identifiers represent references to scholarly entities and follow a specific identifier scheme (e.g. DOI, arXiv, URL, etc.). Relationships have a type (e.g. isIdenticalTo, hasVersion, cites, etc.) and exist between two Identifiers, the source and the target. These are the building blocks of the broker’s graph model.
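As a minimal sketch of these two building blocks, the classes below model an Identifier as a value plus a scheme, and a Relationship as a typed, directed link between two Identifiers. The class and field names are assumptions made for illustration, not the broker's actual models, and the identifier values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    """A reference to a scholarly entity under a given scheme."""
    value: str
    scheme: str  # e.g. "doi", "arxiv", "url"


@dataclass(frozen=True)
class Relationship:
    """A typed, directed link between a source and a target identifier."""
    source: Identifier
    relation: str  # e.g. "isIdenticalTo", "hasVersion", "cites"
    target: Identifier


# Placeholder values, not real records.
paper = Identifier("10.1234/paper-a", "doi")
preprint = Identifier("arXiv:0000.00000", "arxiv")
software = Identifier("10.1234/software-x", "doi")

relationships = [
    Relationship(paper, "isIdenticalTo", preprint),
    Relationship(paper, "cites", software),
]
```

Making both classes frozen keeps them hashable, so duplicate facts collapse naturally when they are stored in a set.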
To represent scholarly entities (software, articles, etc.), the concept of Groups is introduced. A Group is a set of Identifiers, formed on the basis of the Relationships between them. For example, one can define that all Identifiers connected by Relationships of type isIdenticalTo form a Group of type Identity and can be considered a single entity.
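One way to picture the formation of Identity Groups is as a connected-components problem. The sketch below uses a small union-find over plain strings and (source, relation, target) tuples. The function name, the tuple shape, and the choice to treat isIdenticalTo transitively are all assumptions for illustration. The broker's real grouping logic may differ.

```python
from collections import defaultdict


def identity_groups(identifiers, relationships):
    """Partition identifiers into Identity groups.

    Identifiers linked by "isIdenticalTo" (assumed transitive here) end up
    in the same group; unlinked identifiers form singleton groups.
    """
    parent = {i: i for i in identifiers}

    def find(x):
        # Register unseen identifiers, then walk up to the representative,
        # shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for source, relation, target in relationships:
        if relation == "isIdenticalTo":
            parent[find(source)] = find(target)

    groups = defaultdict(set)
    for i in list(parent):
        groups[find(i)].add(i)
    return [frozenset(members) for members in groups.values()]


# Placeholder identifiers for one paper (three identifiers) and one software.
ids = ["doi:A", "arxiv:A", "url:A", "doi:X"]
rels = [
    ("doi:A", "isIdenticalTo", "arxiv:A"),
    ("arxiv:A", "isIdenticalTo", "url:A"),
    ("doi:A", "cites", "doi:X"),  # ignored: not an identity relationship
]
groups = identity_groups(ids, rels)
```

Here the three identifiers of the paper collapse into one Identity Group, while the software identifier stays in a group of its own.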
One can also define Groups of Groups. For example, Identity Groups whose Identifiers have a hasVersion relationship with Identifiers from other Identity Groups can form a Version Group.
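The same idea can be sketched one level up: Identity Groups become the members, and hasVersion Relationships between their Identifiers decide which of them belong together. The function below is a hypothetical illustration that assumes every identifier in a relationship already belongs to some Identity Group. It does not reflect the broker's actual implementation.

```python
def version_groups(identity_groups, relationships):
    """Merge Identity groups connected by "hasVersion" into Version groups.

    `identity_groups` is a list of frozensets of identifiers and
    `relationships` an iterable of (source, relation, target) tuples.
    Returns a list of frozensets whose members are Identity groups.
    """
    # Map each identifier to the Identity group that contains it.
    group_of = {i: g for g in identity_groups for i in g}
    # Start with every Identity group in its own Version group.
    version_of = {g: {g} for g in identity_groups}

    for source, relation, target in relationships:
        if relation != "hasVersion":
            continue
        a = version_of[group_of[source]]
        b = version_of[group_of[target]]
        if a is not b:
            a |= b  # merge b into a
            for g in b:
                version_of[g] = a  # repoint b's members at the merged set

    # Several keys share one set object; keep each set once.
    unique = {id(s): s for s in version_of.values()}
    return [frozenset(s) for s in unique.values()]


# Placeholder Identity groups: a software concept, two versions, and a paper.
concept = frozenset({"doi:sw"})
v1 = frozenset({"doi:sw-v1"})
v2 = frozenset({"doi:sw-v2", "url:sw-v2"})
paper = frozenset({"doi:paper"})
rels = [
    ("doi:sw", "hasVersion", "doi:sw-v1"),
    ("doi:sw", "hasVersion", "url:sw-v2"),
]
result = version_groups([concept, v1, v2, paper], rels)
```

The concept and its two versions end up in one Version Group, and the unrelated paper remains in a Version Group that contains only itself.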
One can then finally model relationships between scholarly entities (e.g. Paper A cites Software X) by abstracting the low-level Relationships between Identifiers to the Group level, thus forming Group Relationships. For example, one can define that Identity Groups whose Identifiers have Relationships of type cites to Identifiers of other Identity Groups form a cites Group Relationship.
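A minimal sketch of this abstraction step, assuming Identity Groups are frozensets of identifier strings and Relationships are (source, relation, target) tuples. Both shapes and the function name are hypothetical.

```python
def group_relationships(identity_groups, relationships, relation="cites"):
    """Lift identifier-level relationships of one type to the group level."""
    # Map each identifier to the Identity group that contains it.
    group_of = {i: g for g in identity_groups for i in g}
    lifted = set()
    for source, rel, target in relationships:
        if rel != relation:
            continue
        src_group, tgt_group = group_of[source], group_of[target]
        # Only links that cross from one Identity group to another count.
        if src_group != tgt_group:
            lifted.add((src_group, rel, tgt_group))
    return lifted


# Placeholder groups for "Paper A" and "Software X".
paper_a = frozenset({"doi:A", "arxiv:A"})
software_x = frozenset({"doi:X", "url:X"})
rels = [
    ("doi:A", "cites", "doi:X"),
    ("arxiv:A", "cites", "url:X"),  # same fact, stated via other identifiers
    ("doi:A", "isIdenticalTo", "arxiv:A"),
]
result = group_relationships([paper_a, software_x], rels)
```

Both identifier-level citations collapse into a single Group Relationship, which is the "Paper A cites Software X" statement at the entity level.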
Metadata
Identifiers, Relationships and Groups can form complex graphs. While this is important for discovering connections between them, it is also valuable to be able to retrieve information about the objects they hold references to. To make this information available, Group Metadata and Group Relationship Metadata are stored for Groups and Group Relationships respectively. This metadata can be used, for example, for rendering a proper citation when needed or for filtering.
Persistence
As described in the previous sections, the broker receives raw events that are then processed to produce a graph. The data goes through a transformation pipeline that at various stages requires persisting its inputs and outputs. This persistence takes place in an RDBMS, like PostgreSQL or SQLite.
We can divide the persisted information into three incremental levels:
- A) Raw data
  - The raw event payloads that arrive in the system
- B) Ground truth
  - Normalized form of the raw data, representing deduplicated facts about Identifiers and Relationships
- C) Processed knowledge
  - Information that is extracted from the ground truth and transformed into structured knowledge
Each level depends on all of its predecessors. This means that if there is an issue at, for example, level C, levels A and B are enough to rebuild it. In the same fashion, level B depends only on level A.
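To illustrate how level B can be rebuilt from level A alone, here is a small sketch using an in-memory SQLite database. The table names, column names and the payload shape (a JSON list of relationship objects) are assumptions made for this example and do not describe the broker's real schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Level A: raw event payloads, stored as received.
    CREATE TABLE raw_event (id INTEGER PRIMARY KEY, payload TEXT NOT NULL);
    -- Level B: deduplicated facts derived from level A.
    CREATE TABLE relationship (
        source TEXT NOT NULL,
        relation TEXT NOT NULL,
        target TEXT NOT NULL,
        UNIQUE (source, relation, target)
    );
""")


def ingest(payload):
    """Persist a raw event (level A)."""
    conn.execute("INSERT INTO raw_event (payload) VALUES (?)", (json.dumps(payload),))


def rebuild_ground_truth():
    """Recreate level B using nothing but level A."""
    conn.execute("DELETE FROM relationship")
    events = conn.execute("SELECT payload FROM raw_event ORDER BY id").fetchall()
    for (payload,) in events:
        for rel in json.loads(payload):
            # The UNIQUE constraint plus OR IGNORE drops repeated facts.
            conn.execute(
                "INSERT OR IGNORE INTO relationship VALUES (?, ?, ?)",
                (rel["source"], rel["relation"], rel["target"]),
            )


# Two events that partly repeat each other (placeholder identifiers).
ingest([{"source": "doi:A", "relation": "cites", "target": "doi:X"}])
ingest([
    {"source": "doi:A", "relation": "cites", "target": "doi:X"},
    {"source": "doi:A", "relation": "isIdenticalTo", "target": "arxiv:A"},
])
rebuild_ground_truth()
rows = conn.execute("SELECT COUNT(*) FROM relationship").fetchone()[0]
```

Because the rebuild starts by clearing level B and reads only from level A, it can be run again at any time and should arrive at the same set of facts.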
Note
All of the above models map to actual database schema tables. For the sake of clarity though, intermediary tables that represent many-to-many relationships between these models (e.g. GroupM2M for Group <-> Group relationships) were not included.
Search
Now that we have the above information stored persistently in the system, we need an efficient way to perform queries over it. Querying the database directly might seem practical, but it is a naive solution: our graph representation is spread over many tables, so fetching it would require multiple complex joins. On top of that, our metadata is stored in JSON/blob-like columns, where filtering is slow and inefficient.
The way to tackle this issue is to denormalize our data back into a rich document representation that clients of the service can consume with ease. This can be done with a document-based store (aka NoSQL) system, like Elasticsearch.
We can create and index the documents using the following strategy:
- For each Group Relationship in our system:
  - Fetch its Group Relationship Metadata
  - For its source and target groups:
    - Fetch the Group Metadata and Identifiers
  - Create a document from the fetched information and index it
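The strategy above can be sketched as a generator that assembles one denormalized document per Group Relationship. Plain dictionaries stand in for the database lookups, and the document layout, function name and field names are hypothetical. Sending each document to the document store is left out.

```python
def build_documents(group_relationships, relationship_metadata, group_metadata, identifiers):
    """Yield one denormalized document per Group Relationship.

    All arguments are plain containers standing in for database queries:
    - group_relationships: list of (rel_id, source_group_id, relation, target_group_id)
    - relationship_metadata: {rel_id: dict}
    - group_metadata: {group_id: dict}
    - identifiers: {group_id: list of identifier strings}
    """
    def describe(group_id):
        # Fetch the Group Metadata and Identifiers for one side.
        return {
            "metadata": group_metadata.get(group_id, {}),
            "identifiers": identifiers.get(group_id, []),
        }

    for rel_id, source_id, relation, target_id in group_relationships:
        yield {
            "id": rel_id,
            "relation": relation,
            "metadata": relationship_metadata.get(rel_id, {}),
            "source": describe(source_id),
            "target": describe(target_id),
        }


# Placeholder data for a single "Paper A cites Software X" relationship.
docs = list(build_documents(
    group_relationships=[(1, "g-paper", "cites", "g-software")],
    relationship_metadata={1: {"link_provider": "example"}},
    group_metadata={"g-paper": {"title": "Paper A"}, "g-software": {"title": "Software X"}},
    identifiers={"g-paper": ["doi:A", "arxiv:A"], "g-software": ["doi:X"]},
))
```

Each resulting document carries everything a client needs about the relationship and both of its ends, so answering a query should not require any joins.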
By performing the expensive database queries only once, in order to index the denormalized documents, we get the best of both worlds: a relationally consistent graph (backed by RDBMS constraints) that is also easy to run complex queries over (backed by Elasticsearch).
Consistency
A downside to this solution is that the state of our document store is not always in sync with what we have in our graph in the database. This issue originates from the fact that changes in the database are automatically protected via foreign-key and unique constraints, which cannot be applied with the same ease in a document-based store.
A solution to this is to periodically rebuild the entire index from scratch. This guarantees that Elasticsearch starts from a blank state, with no “orphan” or stale information lying around. Also, by using some of Elasticsearch’s features, this index rebuilding process can be achieved without affecting the responsiveness of the service.
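The text does not name the Elasticsearch features involved. One plausible candidate is index aliases, where clients query a stable alias while a fresh index is built in the background and the alias is repointed once it is ready. The sketch below imitates that pattern with an in-memory stand-in. The class, the function and the index names are hypothetical, and no Elasticsearch client is used.

```python
class DocumentStore:
    """In-memory stand-in for a document store with named indices and aliases."""

    def __init__(self):
        self.indices = {}  # index name -> list of documents
        self.aliases = {}  # alias name -> index name

    def search(self, alias):
        """Read through the alias, as clients of the service would."""
        return self.indices[self.aliases[alias]]


def rebuild(store, alias, new_index, documents):
    """Build a fresh index, then repoint the alias and drop the old index."""
    # The new index starts from a blank state and is filled while the
    # old one keeps serving reads.
    store.indices[new_index] = list(documents)
    old_index = store.aliases.get(alias)
    # Readers switch to the new index in a single step.
    store.aliases[alias] = new_index
    if old_index is not None and old_index != new_index:
        # Orphan and stale documents disappear together with the old index.
        del store.indices[old_index]


store = DocumentStore()
rebuild(store, "relationships", "relationships-v1", [{"id": 1}, {"id": 2}])
# A later rebuild in which document 2 no longer exists in the database.
rebuild(store, "relationships", "relationships-v2", [{"id": 1}])
current = store.search("relationships")
```

Readers only ever see a complete index, either the old one or the new one, which is what keeps the service responsive during a rebuild under this assumption.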