Binning functions

One of the most often used method for analyzing data in visual analytics is data binning which is basically a form of quantization. The main use-case for this is creating histograms.

To make this easier, the SQL standard (and PostgreSQL) provides function width_bucket which can be used to generate equi-width histogram where each bucket is the same size.

Example:

# SELECT width_bucket(salary, 3000, 9000, 6), count(empno), min(salary), max(salary) FROM empsalary GROUP BY 1 ORDER BY 1;
 width_bucket | count | min  | max  
--------------+-------+------+------
            0 |     2 | 2200 | 2800
            1 |     4 | 3500 | 3900
            2 |     4 | 4200 | 4800
            3 |     5 | 5000 | 5800
            4 |     2 | 6000 | 6900
            5 |     1 | 7000 | 7000
            6 |     1 | 8200 | 8200
            7 |     1 | 9200 | 9200
(8 rows)

It is however often the case that you don’t want the buckets same size buckets. The way to solve this until now was to use CASE statement which in case of having many buckets will produce long and ugly queries and the performance will not be very good. In PostgreSQL 9.5 this will be much easier and thanks to new variant of the width_bucket function that accepts value and a sorted array of upper bounds for each bucket. Unlike the original width_bucket which works only for floats and numerics this new function will work for any sortable datatype (that includes timestamps for example).

Example:

# SELECT width_bucket(salary, ARRAY[3000, 5000, 7000, 8000, 9000]), count(empno), min(salary), max(salary) FROM empsalary GROUP BY 1 ORDER BY 1;
 width_bucket | count | min  | max  
--------------+-------+------+------
            0 |     2 | 2200 | 2800
            1 |     8 | 3500 | 4800
            2 |     7 | 5000 | 6900
            3 |     1 | 7000 | 7000
            4 |     1 | 8200 | 8200
            5 |     1 | 9200 | 9200
(6 rows)

Important part is that the array with thresholds has to be sorted from lowest to highest value otherwise you will get wrong results.

The performance testing done by developers of Orange visualization and analysis tool shown that the queries using this function compared to their old CASE approach are up to 10 times faster (depending on the number of buckets).

Final example demostrates the width_bucket‘s support for date datatype to count the number of days in each season of the year 2014:

# SELECT
#     CASE width_bucket((d * interval '1 day' + date '2014-01-01')::date, ARRAY['2014-03-01', '2014-06-01', '2014-09-01', '2014-12-01']::date[])
#         WHEN 1 THEN 'Spring'
#         WHEN 2 THEN 'Summer'
#         WHEN 3 THEN 'Autumn'
#         ELSE 'Winter' END season,
#     count(*)
# FROM generate_series(0,364) d GROUP BY 1 ORDER BY 1;
 season | count 
--------+-------
 Autumn |    91
 Spring |    92
 Summer |    92
 Winter |    90
(4 rows)

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633.

Advertisements