Augmenting the PostgreSQL Planner with Machine Learning

by Anthony Kleeroper, the University of Manchester

On 27th May, we gave a talk to the Manchester PostgreSQL Meetup group on the work we have been doing as part of the AXLE project. The PostgreSQL users and enthusiasts in the North West of England gave some of their evening to hear the talk which was organised by 2ndQuadrant.

The talk was well received and an interesting discussion followed mostly around the issues of applying Machine Learning to this challenging problem, especially the existence of edge cases. A really useful suggestion from the audience was that the system could make use of system downtime to run its own training queries, targeting areas of the feature space that were sparsely populated.

The talk started with an introduction to AXLE, describing the overall aim, the work being done by each of the project partners and how the different parts will all fit together. The main body of the talk was focused on the problem of Query Performance Prediction – predicting how long a query will take to run before it actually runs. The past approaches to this problem were considered which required a brief introduction to the ideas of Machine Learning. I then presented an outline of the approach we are using as part of AXLE and the plugin we are producing to implement our solution.

The second part of the talk concentrated on the problem of augmenting the planner and helping it select the best plan more often. This is in part based on the need for the planner to know when to offload some of the execution of a query to GPUs or other processing units. In part, though, it is based on the existence of cases where the planner selects a “bad” plan after rejecting one that would have actually been faster. Robert Haas found that 5% of the complaints on the pgsql-performance mailing list were the result of the planner rejecting a faster plan and selecting a significantly slower one – and that is after excluding cases where the planner cost variables are tuned poorly.

The solution we are proposing as part of AXLE is to augment the planner with cost hooks that allow us to write plugins to predict the execution time of operations with more accuracy than the analytical model used by PostgreSQL at the moment. The planner will then be better placed to select the fastest plan and not reject it because of faulty costing.
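The idea can be illustrated with a simplified Python sketch. The actual AXLE work is a C plugin inside PostgreSQL, so everything below is only a model: the constants mirror PostgreSQL's default `seq_page_cost` and `cpu_tuple_cost` settings, and `learned_cost` is a purely hypothetical stand-in for a trained predictor.

```python
# Simplified sketch of the cost-hook idea (not the actual AXLE plugin,
# which is a C extension inside the PostgreSQL planner).

# Default analytical model, loosely following PostgreSQL's sequential-scan
# costing: cost = pages * seq_page_cost + tuples * cpu_tuple_cost.
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01

def analytical_seqscan_cost(pages, tuples):
    return pages * SEQ_PAGE_COST + tuples * CPU_TUPLE_COST

# The hook: if a plugin registered a predictor, use it; otherwise fall
# back to the built-in analytical model.
cost_hook = None

def estimate_seqscan_cost(pages, tuples):
    if cost_hook is not None:
        return cost_hook(pages, tuples)
    return analytical_seqscan_cost(pages, tuples)

# A hypothetical ML-backed predictor plugged into the hook. In reality
# this would consult a trained regression model; here we just pretend the
# model has learned that cold-cache page reads are 4x more expensive.
def learned_cost(pages, tuples):
    return pages * 4.0 + tuples * CPU_TUPLE_COST

cost_hook = learned_cost
```

With the hook installed, the planner-side estimate for a 100-page, 1000-tuple scan changes from 110 to 410 cost units; the point is that the plugin, not the hard-coded formula, gets the final say.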

A copy of the presentation can be seen here: Augmenting the PostgreSQL Planner with Machine Learning

PostgreSQL Performance

Mark Wong, a performance specialist, presented to a full house in San Francisco on the 10th March 2015. Mark gave a comprehensive update on the performance work being done as part of the AXLE Project. A copy of his presentation can be seen here: Postgresql Performance Presentation-SFPGDay2015

Progress on online upgrade

Over the last couple of months I’ve been working on online upgrade for very large databases as part of the AXLE project, and I would like to share my thoughts on the topic and the progress we have made recently.

Before joining 2ndQuadrant I worked at Skype, where the business would not allow a maintenance window for our databases. This meant no downtime was allowed for deployments, upgrades, etc. That kind of rule changes the way you do things: you keep changes small, you avoid heavy locks, and you have replicas to allow for fast failover. But while you can make your releases small and non-blocking, what happens when you need to do a major version upgrade of the PostgreSQL database?

You might be in a different situation: most companies do have an upgrade window, and so can afford some downtime during an upgrade. This, however, brings two problems. For one, no company actually likes downtime, even when it is allowed. More importantly, once your database grows beyond gigabytes into the range of terabytes or hundreds of terabytes, the downtime can take days or even weeks, and nobody can afford to stop operations for that long. As a result, many companies skip important upgrades, making the next one even more painful, while their developers miss out on new features and performance improvements. Some companies even risk running a PostgreSQL version that is no longer supported and has known data corruption or security problems. In the following paragraphs I will talk a little about my work on making upgrades less time consuming and, as a result, less painful and hopefully more frequent.

Let me start with a little history first. Before PostgreSQL 9.0 the only way to do a major version upgrade was to run pg_dump and restore the dump into an instance running a newer version of PostgreSQL. This method required the structure and all the data to be read from the database and written into a file, then read back from the file and inserted into the new database, after which the indexes had to be rebuilt, and so on.

As you can imagine, this process can take quite some time. Performance improved in 8.4, when pg_restore gained the -j option, which lets you specify how many parallel jobs to run. This makes it possible to restore several tables (indexes, etc.) in parallel, speeding up the restore of custom-format dumps. Version 9.3 added a similar option to pg_dump, improving performance even further. But given how fast data volumes are growing, parallelization alone is not enough to seriously reduce the time required for an upgrade.

Then in PostgreSQL 9.0 a utility called pg_upgrade arrived. pg_upgrade dumps just the structures and restores them into the new cluster, but copies the data files as they are on disk, which is much faster than dumping them into a logical format and reinserting them. This is good enough for small databases because it means downtime in the range of minutes or hours, a time acceptable for many scenarios. There is also the link mode, which just creates hard links (junction points on Windows), making the process even faster. But from my personal point of view it is too dangerous to run such a setup on a production master server. I will briefly explain why: if something goes wrong once you start the new server that was upgraded using link mode, you are suddenly without a production database and have to fail over, or worse, restore from backup. That means you not only failed to upgrade, you caused additional downtime! Good luck getting approval next time.

Now many people who can’t afford long downtimes for upgrades use trigger-based replication solutions like Slony or Londiste to do the upgrade. This is a good approach because you can replicate your data while the original server is running and then switch over with minimal downtime. In practice, however, there are several problems. One is that trigger-based solutions are often clunky to set up, especially if you are doing it only once every couple of years and only to do an upgrade. It is also easy to miss a table, or to add tables in the wrong order, and thus end up without a full copy. I have witnessed this in practice, even though the people doing the upgrade worked with trigger-based replication on a daily basis. Another issue is that trigger-based solutions add considerable load on the source database, sometimes making the upgrade impossible because the database server becomes overloaded once replication is activated. And last but often not least, it can take a very long time for trigger-based replication to actually move the data to the new server. On the last occasion I was involved in an upgrade project, the trigger-based solution took around a month to copy the database and catch up with the changes. Yes, one month.

With PostgreSQL 9.4 arrives the logical decoding feature, which offers a fresh start for designing a new and better solution to the online upgrade problem. What we did, as part of the AXLE project, is create a tool which combines logical decoding with the techniques described above, and which solves most of the problems of the previous approaches. The Uni-Directional Replication PostgreSQL extension (UDR for short) does logical replication using logical decoding of the write-ahead log (WAL). Thanks to this, the impact on the master server is almost on par with physical streaming replication, so the additional load caused by an ongoing upgrade is minimal on the running system. UDR also provides tools to initialize new nodes, using either physical or logical backup; you can even turn an existing physical standby into a UDR node. And because it is a logical replication system, it can be designed in a way that supports cross-version replication.

What all this means is that we can now use UDR in combination with pg_upgrade to do an online upgrade between major PostgreSQL versions with minimal downtime, in a short amount of absolute time, and with minimal impact on the running system.

An example of how this can look in practice:

  • Take a pg_basebackup of the existing instance.
  • Set up UDR replication between the original instance and the one created from the base backup.
  • Run pg_upgrade on the new instance.
  • Let UDR replay the changes that happened in the meantime.
  • Switch the traffic over to the new instance.

For a howto with more detailed instructions, see the UDR Online Upgrade guide on the PostgreSQL wiki. The UDR sources are available in the 2ndquadrant_bdr repository on the PostgreSQL git server (bdr-plugin/next branch).

Finally, since UDR is not just an online upgrade tool but also a replication solution, it can be used for your normal replication needs instead of physical streaming replication. Furthermore, it provides several advantages, such as the ability to create temporary tables, to replicate from multiple OLTP databases into one big data warehouse database, or to replicate just part of the database.

My hope is that this effort will mean that downtime considerations are no longer a problem when it comes to upgrading from PostgreSQL 9.4 and above to a new major version.

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633.

Slovenia General Assembly and IAB Meeting October 2014

The AXLE partners met at the University of Ljubljana, Slovenia for the 4th project general assembly and 2nd industrial advisory board meeting on October 8th 2014.

Partner presentations coffee break




Over the three-day meeting, the partners spent a day presenting progress to the IAB (Industrial Advisory Board), with a particular emphasis on hardware considerations, so that the board could provide us with very valuable feedback which informed the next two days. The rest of the meeting was a chance to work closely with each other on specific topics that have emerged over the last few months.

PV and UL collaborate

The decision to have each partner host at least one GA has proved an excellent approach, as we have seen new cities and experienced excellent hospitality in all of them. We visited a Serbian restaurant on our first night, followed by an Italian pizzeria on our second – not because Slovenian food doesn’t exist, but because traditional fare takes so long to prepare! Our host also took us on a winding walking tour of Ljubljana, showing us the delights of a beautiful city. We were all entranced by the view of the moonlit castle from the restaurant terrace – a true Hollywood backdrop!

Ljubljana at night

And with a General Assembly and 7 deliverables due at the end of October, this has been a busy month for the project. It is exciting to see the progress that has been made over the last 12 months!


Open-source simulator release for emerging memory technologies

We are happy to announce the open source release of the AXLE simulation infrastructure for emerging memory technologies. The platform integrates a complete architectural simulator, with an accurate main memory simulator that allows modeling of 3D stacked and non-volatile memory architectures.

The simulator has been tested extensively with single and multi process database management system workloads as well as the SPEC CPU2006 benchmark suite. The code is publicly available at the project’s github site.

Loading Tables and Creating B-tree and Block Range Indexes

Originally posted on 3rd October 2014, this blog can also be found on the 2ndQuadrant blog site. This is the first of two related blogs.

I have been looking at the new Block Range Indexes (BRIN) being developed for PostgreSQL 9.5. BRIN indexes are designed to provide similar benefits to partitioning, especially for large tables, just without the need to declare partitions. That sounds pretty good but let’s look in greater detail to see if it lives up to the hype.
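Before the measurements, it may help to see the principle behind BRIN in miniature. The toy Python sketch below is not PostgreSQL's implementation (which works on heap pages, with 128 pages per range by default): the index stores only a small (min, max) summary per range of blocks, and a scan skips any range whose summary cannot match the predicate.

```python
# Toy sketch of the block range index idea: summarize each range of
# blocks by (min, max) and prune whole ranges during a range scan.

PAGES_PER_RANGE = 4  # pages per block range (PostgreSQL's default is 128)

def build_brin(table, rows_per_page=2):
    """Summarize the table as (start_row, min, max) per block range."""
    rows_per_range = PAGES_PER_RANGE * rows_per_page
    summaries = []
    for start in range(0, len(table), rows_per_range):
        chunk = table[start:start + rows_per_range]
        summaries.append((start, min(chunk), max(chunk)))
    return summaries

def brin_scan(table, summaries, lo, hi, rows_per_page=2):
    """Return rows with values in [lo, hi], reading only ranges that
    might contain matches according to their (min, max) summary."""
    rows_per_range = PAGES_PER_RANGE * rows_per_page
    result = []
    for start, rmin, rmax in summaries:
        if rmax < lo or rmin > hi:
            continue  # whole block range pruned without reading it
        for value in table[start:start + rows_per_range]:
            if lo <= value <= hi:
                result.append(value)
    return result
```

The pruning only works well when values correlate with physical row order, which is exactly the situation for a date column like i_shipdate in an append-only table, and why the comparison below focuses on that column.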

How large? Here’s one data point: using the dbgen utility provided with the TPC Benchmark(TM) H, we created data for the lineitem table at the 10GB scale factor, which results in a 7.2GB text file.

We’re going to compare a couple of basic tasks. The first is the impact of inserting data into a table using the COPY command. We will run a simple experiment: create a table without any indexes or constraints and time how long it takes to load the lineitem data; then repeat with a B-tree index on one column; and finally repeat again with a BRIN index instead of a B-tree index on the same column.

The bar plot above shows the average times over five measurements. Our baseline of loading the lineitem table without any indexes averaged 5.1 minutes. Once a B-tree index was added to the i_shipdate DATE column, the average load time increased to 9.4 minutes, or by 85%. When the B-tree index was replaced by a BRIN index, the load time only increased to 5.6 minutes, or by 11%.

The next experiment measures the average time to create a B-tree index on a table that is already populated with data, and then repeats that with a BRIN index. This is done on the same i_shipdate DATE column, again over five measurements each.


The B-tree index took 95 seconds to build, while the BRIN index took 18 seconds, an improvement of roughly 80%.

That’s very encouraging. The overhead of a single BRIN index on loading data into a table is only 11%, and switching from a B-tree index to a BRIN index reduced the total load time by 40%. Creating a new BRIN index took only about 20% of the time needed for a new B-tree index. We have more experiments lined up to see where else BRIN indexes may or may not benefit us.

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633 – the AXLE project.

Binning functions

One of the most frequently used methods for analyzing data in visual analytics is data binning, which is basically a form of quantization. The main use case for this is creating histograms.

To make this easier, the SQL standard (and PostgreSQL) provides the width_bucket function, which can be used to generate equi-width histograms, i.e. histograms where every bucket is the same size.
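The semantics of the equi-width variant can be modelled with a few lines of Python. This is a simplified sketch: the real SQL function also handles descending bounds (low greater than high) and NULL inputs, which are omitted here.

```python
import math

def width_bucket(operand, low, high, count):
    """Simplified model of SQL's width_bucket(operand, low, high, count):
    splits [low, high) into `count` equal-width buckets numbered 1..count,
    returning 0 below the range and count + 1 at or above the upper bound."""
    if operand < low:
        return 0          # underflow bucket
    if operand >= high:
        return count + 1  # overflow bucket
    width = (high - low) / count
    return int(math.floor((operand - low) / width)) + 1
```

Matching the query output below: a salary of 2200 falls in bucket 0 (below the 3000 lower bound), 3500 in bucket 1, and 9200 in the overflow bucket 7.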


# SELECT width_bucket(salary, 3000, 9000, 6), count(empno), min(salary), max(salary) FROM empsalary GROUP BY 1 ORDER BY 1;
 width_bucket | count | min  | max  
--------------+-------+------+------
            0 |     2 | 2200 | 2800
            1 |     4 | 3500 | 3900
            2 |     4 | 4200 | 4800
            3 |     5 | 5000 | 5800
            4 |     2 | 6000 | 6900
            5 |     1 | 7000 | 7000
            6 |     1 | 8200 | 8200
            7 |     1 | 9200 | 9200
(8 rows)

It is, however, often the case that you don’t want same-size buckets. Until now, the way to solve this was to use a CASE statement, which with many buckets produces long, ugly queries with poor performance. In PostgreSQL 9.5 this becomes much easier thanks to a new variant of the width_bucket function that accepts a value and a sorted array of upper bounds for the buckets. Unlike the original width_bucket, which works only for floats and numerics, this new function works for any sortable datatype (including timestamps, for example).


# SELECT width_bucket(salary, ARRAY[3000, 5000, 7000, 8000, 9000]), count(empno), min(salary), max(salary) FROM empsalary GROUP BY 1 ORDER BY 1;
 width_bucket | count | min  | max  
--------------+-------+------+------
            0 |     2 | 2200 | 2800
            1 |     8 | 3500 | 4800
            2 |     7 | 5000 | 6900
            3 |     1 | 7000 | 7000
            4 |     1 | 8200 | 8200
            5 |     1 | 9200 | 9200
(6 rows)

An important point is that the array of thresholds has to be sorted from lowest to highest value; otherwise you will get wrong results.
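The semantics of the array variant can be sketched in Python with the standard bisect module: the result is simply the number of thresholds less than or equal to the operand (which is also why the array must be sorted).

```python
from bisect import bisect_right

def width_bucket_array(operand, thresholds):
    """Sketch of the 9.5 array variant of width_bucket: returns how many
    thresholds are <= operand, so values below the first threshold land
    in bucket 0. Assumes `thresholds` is sorted ascending, as required."""
    return bisect_right(thresholds, operand)
```

With the thresholds from the query above, width_bucket_array(5000, [3000, 5000, 7000, 8000, 9000]) returns 2, matching the SQL output, where a salary of 5000 is the minimum of bucket 2.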

Performance testing done by the developers of the Orange visualization and analysis tool showed that queries using this function are up to 10 times faster than their old CASE-based approach (depending on the number of buckets).

The final example demonstrates width_bucket‘s support for the date datatype by counting the number of days in each season of the year 2014:

# SELECT
#     CASE width_bucket((d * interval '1 day' + date '2014-01-01')::date, ARRAY['2014-03-01', '2014-06-01', '2014-09-01', '2014-12-01']::date[])
#         WHEN 1 THEN 'Spring'
#         WHEN 2 THEN 'Summer'
#         WHEN 3 THEN 'Autumn'
#         ELSE 'Winter' END season,
#     count(*)
# FROM generate_series(0,364) d GROUP BY 1 ORDER BY 1;
 season | count 
--------+-------
 Autumn |    91
 Spring |    92
 Summer |    92
 Winter |    90
(4 rows)

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633.