This page introduces the statistics functionality of analytics.

Computing simple statistics, such as the mean and standard deviation of a dataset, is easy when you are dealing with a few isolated instances. But what happens when your knowledge graph becomes so large that it is distributed across many machines? What if the values you are calculating correspond to many different types of things?

Graql analytics can perform the necessary statistics computations. For example, the following query executes a distributed computation to determine the mean age of all of the people in the knowledge graph.

compute mean of age in person;
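To make the idea of a distributed mean concrete, here is a minimal Python sketch. It is not how Graql analytics is implemented; it only assumes that the age values are split across three hypothetical partitions (standing in for machines), that each partition reports a partial sum and count, and that these partials are merged into the overall mean.

```python
from functools import reduce

# Hypothetical age values held on three separate machines (illustrative data only).
partitions = [
    [23, 35, 41],
    [19, 52],
    [30, 30, 28, 44],
]

def partial(values):
    """Summarise one partition as (sum, count) without sharing the raw values."""
    return (sum(values), len(values))

def combine(a, b):
    """Merge two partial summaries; the order of merging does not matter."""
    return (a[0] + b[0], a[1] + b[1])

# Reduce all per-partition summaries into one (sum, count) pair.
total, count = reduce(combine, map(partial, partitions))
mean_age = total / count
print(mean_age)
```

Because each partition contributes only two numbers, the amount of data exchanged stays small no matter how many values a partition holds.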

The Statistics Queries documentation covers the usage of statistics in more detail.

Available Statistics Methods

The following methods are available to perform simple statistics computations, and we aim to add to these as demand dictates. Please get in touch on our discussion page to request any features that are of particular interest to you. A summary of the statistics algorithms is given in the table below.

| Algorithm | Description |
|-----------|-------------|
| count | Count the number of instances. |
| max | Compute the maximum value of an attribute. |
| min | Compute the minimum value of an attribute. |
| mean | Compute the mean value of an attribute. |
| median | Compute the median value of an attribute. |
| std | Compute the standard deviation of an attribute. |
| sum | Compute the sum of an attribute. |

For further information see the individual sections below.

Count

By default, count returns a single value: the number of instances present in the graph. It is also possible to count a subset of the instances by using the subgraph syntax (the in clause) to restrict the computation to particular types.

compute count in person;

Mean

Computes the mean value of a given attribute. This algorithm requires the subgraph syntax to be used. For example,

compute mean of age in person;

would compute the mean value of age across all instances of the type person. It is also possible to provide a set of attributes.

compute mean of attribute-a, attribute-b in person;

which would compute the mean of the union of the instances of the two attributes, provided that the two attribute types have the same data type.
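The "mean of the union" can be illustrated with a short Python sketch. The values are made up, and the two lists merely stand in for the values of attribute-a and attribute-b; how Graql itself weights repeated values is not covered here. The point is that the values are pooled before averaging, which generally differs from averaging the two per-attribute means.

```python
# Made-up values standing in for the instances of the two attribute types.
attribute_a = [10, 20, 30, 40]
attribute_b = [100, 200]

# Mean of the union: pool every value, then take a single mean.
pooled = attribute_a + attribute_b
mean_of_union = sum(pooled) / len(pooled)

# For contrast: averaging the two per-attribute means weights each type equally,
# regardless of how many values each type has.
mean_a = sum(attribute_a) / len(attribute_a)
mean_b = sum(attribute_b) / len(attribute_b)
mean_of_means = (mean_a + mean_b) / 2

print(mean_of_union, mean_of_means)
```

The two results differ whenever the attribute types have different numbers of values, which is why the pooled interpretation is worth keeping in mind when reading the output.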

Median

Computes the median value of a given attribute, similar to mean.

compute median of age in person;

would compute the median of the values stored in instances of the attribute age.

Minimum

Computes the minimum value of a given attribute, similar to mean.

compute min of age in person;

Maximum

Computes the maximum value of a given attribute, similar to mean.

compute max of age in person;

Standard Deviation

Computes the standard deviation of a given attribute, similar to mean.

compute std of age in person;
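As a rough illustration of how a standard deviation can be computed over partitioned data, the following Python sketch merges per-partition summaries of count, sum, and sum of squares. It assumes the population standard deviation (the text above does not say which variant is used), the data is hypothetical, and it is not a description of the Graql implementation.

```python
import math

# Hypothetical age values split across two partitions (illustrative data only).
partitions = [[23, 35, 41], [19, 52, 30]]

def partial(values):
    """Summarise a partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

# Merge the per-partition summaries component-wise.
n, s, ss = (sum(component) for component in zip(*map(partial, partitions)))

mean = s / n
# Population variance as E[x^2] - (E[x])^2. This formula is simple but can lose
# precision for large values; it is used here for clarity only.
std = math.sqrt(ss / n - mean * mean)
print(std)
```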

Sum

Computes the sum of a given attribute, similar to mean.

compute sum of age in person;

When to Use aggregate and When to Use compute

Aggregate queries are computationally light and run single-threaded on a single machine, but are more flexible than the equivalent compute queries described above.

For example, you can use an aggregate query to filter results by attribute. The following aggregate query counts the number of people with a particular name:

match $x has name 'Bob'; aggregate count;

Compute queries are computationally intensive and run in parallel on a cluster, which makes them well suited to large datasets.

compute count in person;

Compute queries can calculate the number of people in the graph very quickly, but you can’t filter the results to determine the number of people with a certain name.
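The difference between the two kinds of count can be sketched in Python. This is an analogy only, using made-up in-memory records rather than a knowledge graph: the per-type count mirrors what the compute query returns, while the filtered count mirrors the aggregate query that matches on a name first.

```python
from collections import Counter

# Made-up instances; each has a type and a name (illustrative data only).
instances = [
    {"type": "person", "name": "Bob"},
    {"type": "person", "name": "Alice"},
    {"type": "person", "name": "Bob"},
    {"type": "company", "name": "Acme"},
]

# Analogue of the compute query: count instances per type, with no attribute filter.
# Per-type counts are easy to parallelise because partial counts simply add up.
counts_by_type = Counter(item["type"] for item in instances)
people = counts_by_type["person"]

# Analogue of the aggregate query: match on an attribute value first, then count.
bobs = sum(1 for item in instances if item["name"] == "Bob")

print(people, bobs)
```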
