Solr - Graph traversal query

In this document I will show you how I am using some distinct technology layers to display graph data from BV FAPESP, stored inside Solr.

FAPESP´s virtual library (BV FAPESP) is the information system, source of the referential data, of funded projects from São Paulo funding agency. These projects holds relation to each other in a three structure, a specialized form of a graph structure.

BV FAPESP data projects structure

In the relational database, the projects are linked by foreign keys, the most simple form to store the three structure relationship. But the main data source of BV FAPESP, is not a relational database, but the search engine Solr.

Solr is a search engine, that uses the inverted index structure which is very fast at query time.

The initial approach

At the early stages of BV FAPESP, the associated projects were indexed with some kind of information from their parent, to allow the inverted search to get all children of a subset of projects.

Everything is just fine if you only need to index the descendants of a specific set of projects.

To make this work, you have to select the subset of projects that you want to build the three; index a specific field to identify the descendents, and it is done. The problem with this approach is that for every filter that you apply to the subset of parent projects, you have to index some new identifier of this new subset on the descendants.

The diagram below is a generic example that shows this indexing approach to filter descendant objects.

For example, if we have a class of cars, and we have cars separated in groups, group A and B; to the group of cars on group A using one filter, we need to index all of its descendants with an identifier “ident_Group_A”. This is the identifier indexed on Solr.

The same applies to the group B where we have filtered the group using three filters and indexed the descendants with the field 1 with the value “ident_Group_B”

Tagging an object (hard indexing)

When the interface, no matters what it is, lookup for any set of descendants on graph data it uses the indexed field to gather all descendents of a group.

On this approach the entire effort to index data is linear, because for any group of objects (i.e facet filters) you need to make a new distinct index for the group. This is not viable if you need to filter the three and get its descendents dynamically.

The great advantage using this approach is the speed of search due to the already explained inverted index structure.

Generic model

Some time after the usage of the solution described above, a new challenge became to be the development of a generic model to display associated projects of the graph structure from the entire BV FAPESP dataset.

Solr´s version 6.0 released a feature called Graph Traversal to navigate on graph data. The Graph Traversal alone is not the solution, you have to use the Streaming Expressions, a stream processing language, which delivers a complete language to build complex applications for statistics analysis, graph data, big data analysis extended to parallel computing.

The tradeoff between the generic model and the inverted index approach is the speed of search and the coverage of the solution.

The generic model applies indistinctly to any group of objects, while the inverted index approach requires that any group of objects to be indexed.

The speed occurs on the opposite direction. While inverted index has a faster response, the generic model follows a direct index. The graph traversal performs a BFS (Breadth First Search).

Solr allows you to speed up the search with parallel computing, which is not the case here, since BV FAPESP works in a small set of data, less than a million documents.

The whole solution

This topic is just for the sake of the architecture that evolves the front end of the system.

Even in a small set of objects, the difference of speed, to get data between inverted indexed structure and graph traversal structure, is profound. BV FAPESP uses indexed data for the whole site, and graph data is used just to show the amount of projects associated, in a table and in a chart.

The main data that feeds the portal is gathered very fast because it is reversed indexed. The graph data is not that fast, so I choose to call it asynchronously to not slow down the whole web page.

Front-end interface

In the picture below, the green arrow shows the link to fire the Ajax asynchronous call. The orange arrow shows filters to facet data, and this is the point; even in any subset of data you can obtain the graph data.

BV FAPESP web interface

Front-end update

Graph data result-set displays in some places of the web page, as table data and as charts.

Both are updated after Ajax success call returns.

Ajax can also be fired from any place, just copy the same link elsewhere. Since the Ajax gets data, the javascript layer can update anything on the front-end. The chart below uses d3js, that consumes the same data of the table below.

BV FAPESP web interface display of graph data

Graph data display on table

Search This Blog

Extensão da Memória