GreyCat storage comparison

Introduction

GreyCat is a graph database designed to operate efficiently over massive datasets. To that end, it relies on several techniques and algorithms to compress data effectively. The binary representation of data matters not only for disk space but also for memory usage during computation. Compression can exploit the type information given by developers, or the locality of time series, for instance. Float values, however, are inherently hard to store compactly. Float series can be compressed by algorithms such as Gorilla, but the situation is less favorable for individual values in columns or type fields. Fortunately, developers can often define the maximum precision they are willing to accept in storage and computation. GreyCat v7 leverages this definition through a @precision annotation, for example @precision(0.01), with which developers declare the maximum precision a float value needs to keep.
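To make the intuition concrete, here is a minimal Python sketch of why a declared precision helps. It assumes a generic scheme (scale the value to an integer number of precision steps, then store that integer as a zigzag varint); this is an illustration of the principle only, not GreyCat's actual on-disk encoding, and the helper names are hypothetical.

```python
import struct

def zigzag(n: int) -> int:
    """Map signed integers to unsigned ones so small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1

def varint_len(u: int) -> int:
    """Number of bytes a LEB128-style varint needs for an unsigned integer."""
    length = 1
    while u >= 0x80:
        u >>= 7
        length += 1
    return length

def quantized_size(value: float, precision: float = 0.01) -> int:
    """Bytes needed once the value is stored as an integer count of precision steps."""
    steps = round(value / precision)
    return varint_len(zigzag(steps))

RAW_DOUBLE_SIZE = struct.calcsize("<d")  # an IEEE 754 double always takes 8 bytes

sample = [0.0, -2.22, 85.12, 1026.3, 1.25]  # values taken from the sample rows
sizes = [quantized_size(v) for v in sample]
for v, s in zip(sample, sizes):
    print(f"{v:>8}: {RAW_DOUBLE_SIZE} bytes as a double, {s} bytes quantized")
```

Under these assumptions, every sample value fits in one to three bytes instead of eight, which is the kind of gain a precision declaration makes possible for individual fields.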

Storage comparison

To benchmark the effect of this annotation, we use the following dataset to compare the storage efficiency of GreyCat v6, GreyCat v7 and Postgres. The data is a CSV file containing weather data from here. The dataset comprises approximately 800,000 rows and occupies 50MB of disk space. Here's a sample of the data:

valid,tmpc,dwpc,relh,drct,sknt,mslp,vsby,skyc1,skyc2,skyc3
1950-01-01 00:00:00,0,-2.22,85.12,20,12,1026.3,3,null,null,null
1950-01-01 03:00:00,-1.11,-2.22,92.23,20,8,1027.4,5,null,null,null
1950-01-01 09:00:00,-2.22,-5,81.11,20,8,1029.9,1.5,null,null,null
1950-01-01 12:00:00,0.56,-3.89,71.82,20,8,1031.4,2.5,null,null,null
1950-01-01 15:00:00,2.22,-3.33,66.97,20,6,1030.8,4,null,null,null
1950-01-01 18:00:00,0,-3.89,75,360,4,1031.9,1.25,null,null,null
1950-01-01 21:00:00,-2.22,-3.33,92.16,20,4,1032.8,1.25,null,null,null
1950-01-02 00:00:00,-4.44,-5,95.57,320,2,1032.8,1,null,null,null
1950-01-02 03:00:00,-5.56,-5.56,100,230,6,1032.1,1.5,null,null,null

Postgres

Here we use a plain Postgres table as a reference for storage size in a state-of-the-art database.

We will be using the following schema to store the data in Postgres:

CREATE TABLE weather_data(
   valid timestamp
  ,tmpc  FLOAT
  ,dwpc  FLOAT
  ,relh  FLOAT
  ,drct  INTEGER
  ,sknt  FLOAT
  ,mslp  FLOAT
  ,vsby  FLOAT
  ,skyc1 VARCHAR(4)
  ,skyc2 VARCHAR(4)
  ,skyc3 VARCHAR(4)
);

We then populate the table using the following command:

\copy weather_data from '/weather_data.csv' CSV HEADER NULL 'null'

We print the size of the table using the following query:

SELECT pg_size_pretty( pg_total_relation_size('weather_data') );

We end up with a table that takes 77MB of disk space.
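As a rough sanity check on this figure, the following Python sketch estimates the heap size from PostgreSQL's row layout. It is a back-of-envelope calculation under stated assumptions: default 8kB pages, 8-byte alignment, exactly 800,000 rows, the three skyc columns NULL in every row (as in the sample) and no other NULLs, and no index or TOAST overhead. The real table will deviate wherever these assumptions do not hold.

```python
import math

ROWS = 800_000        # "approximately 800,000 rows", taken as exact here
PAGE_SIZE = 8192      # PostgreSQL default block size
PAGE_HEADER = 24      # bytes of header per page
LINE_POINTER = 4      # per-row item pointer stored in the page
TUPLE_HEADER = 23     # fixed part of the heap tuple header
NULL_BITMAP = math.ceil(11 / 8)  # one bit per column, present when a row has NULLs

def align8(n: int) -> int:
    """Round up to the next multiple of 8 (assumed maximum alignment)."""
    return (n + 7) // 8 * 8

header = align8(TUPLE_HEADER + NULL_BITMAP)
# valid (8) + tmpc, dwpc, relh (3 * 8) + drct (4) + padding (4) + sknt, mslp, vsby (3 * 8)
data = 8 + 3 * 8 + 4 + 4 + 3 * 8
row_bytes = header + data + LINE_POINTER

rows_per_page = (PAGE_SIZE - PAGE_HEADER) // row_bytes
pages = math.ceil(ROWS / rows_per_page)
estimate_mib = pages * PAGE_SIZE / 2**20  # pg_size_pretty reports binary units

print(f"{row_bytes} bytes per row, {pages} pages, about {estimate_mib:.1f} MB")
```

The estimate lands close to the measured size, which suggests that most of the space goes to fixed-width 8-byte floats and per-row overhead rather than to the values themselves.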

GreyCat v6

The same schema in GreyCat would look like this:

type WeatherData {
  tmpc: float?;
  dwpc: float?;
  relh: float?;
  drct: int?;
  sknt: float?;
  mslp: float?;
  vsby: float?;
  skyc1: String?;
  skyc2: String?;
  skyc3: String?;
}

We store the data in a GreyCat nodeTime, which uses the timestamp from the valid column as its index:

var weather_series: nodeTime<WeatherData>;

Final size on the disk is 40MB.

GreyCat v7

GreyCat v7 introduces a new type decorator, @precision, that allows us to specify the precision of floating-point numbers. This can significantly reduce the size of the data stored in the database.

type WeatherData {
  @precision(0.01)
  tmpc: float?;
  @precision(0.01)
  dwpc: float?;
  @precision(0.01)
  relh: float?;
  drct: int?;
  @precision(0.01)
  sknt: float?;
  @precision(0.01)
  mslp: float?;
  @precision(0.01)
  vsby: float?;
  skyc1: String?;
  skyc2: String?;
  skyc3: String?;
}

Final size on the disk is 20MB.
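What a precision of 0.01 costs in accuracy can be sketched in a few lines of Python. The example assumes the simplest possible scheme, rounding to the nearest multiple of the precision; the quantize and dequantize helpers are hypothetical and do not describe GreyCat's internals.

```python
def quantize(value: float, precision: float) -> int:
    """Number of precision steps closest to the value."""
    return round(value / precision)

def dequantize(steps: int, precision: float) -> float:
    """Value reconstructed from a number of precision steps."""
    return steps * precision

PRECISION = 0.01

# tmpc values from the sample rows: they already have at most two decimals.
tmpc_sample = [0.0, -1.11, -2.22, 0.56, 2.22, 0.0, -2.22, -4.44, -5.56]
restored = [dequantize(quantize(v, PRECISION), PRECISION) for v in tmpc_sample]
max_error = max(abs(a - b) for a, b in zip(tmpc_sample, restored))

# A value with more digits than the declared precision is rounded to the nearest step.
finer = 3.14159
finer_restored = dequantize(quantize(finer, PRECISION), PRECISION)

print(f"max error on sample: {max_error:.2e}")
print(f"{finer} is restored as {finer_restored:.2f}")
```

For data like this dataset, where the values already carry at most two decimals, the round trip is lossless up to floating-point noise. Values with finer detail lose at most half a precision step under this rounding scheme.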

Conclusion

The storage efficiency of GreyCat, particularly in version 7 with the precision decorator, is clear in this benchmark: the same data takes 20MB in GreyCat v7, compared to 40MB in GreyCat v6 and 77MB in PostgreSQL.

GreyCat is a compelling choice for applications requiring efficient storage of large datasets with numerous floating-point values.

Of course, both PostgreSQL and GreyCat can be fine-tuned further. The goal here, however, is to raise awareness of the importance of such techniques. This effect, measured here in a few megabytes, can save terabytes on real projects.
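The measured sizes can be turned into ratios and per-row costs with a short Python calculation. It treats 1MB as 10^6 bytes and the row count as exactly 800,000 for simplicity. The 100-billion-row projection is a purely hypothetical illustration that assumes storage scales linearly with the number of rows.

```python
ROWS = 800_000
sizes_mb = {"PostgreSQL": 77, "GreyCat v6": 40, "GreyCat v7": 20}

baseline = sizes_mb["GreyCat v7"]
ratios = {name: size / baseline for name, size in sizes_mb.items()}
bytes_per_row = {name: size * 1_000_000 / ROWS for name, size in sizes_mb.items()}

# Hypothetical project size, assuming linear scaling of the per-row cost.
HYPOTHETICAL_ROWS = 100_000_000_000
saving_tb = {
    name: (bytes_per_row[name] - bytes_per_row["GreyCat v7"]) * HYPOTHETICAL_ROWS / 1e12
    for name in ("PostgreSQL", "GreyCat v6")
}

for name in sizes_mb:
    print(f"{name}: {ratios[name]:.2f}x the v7 size, {bytes_per_row[name]:.2f} bytes per row")
for name, tb in saving_tb.items():
    print(f"v7 saves about {tb:.1f} TB versus {name} at {HYPOTHETICAL_ROWS:,} rows")
```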