Slovenia General Assembly and IAB Meeting October 2014

The AXLE partners met at the University of Ljubljana, Slovenia for the 4th project general assembly and 2nd industrial advisory board meeting on October 8th 2014.

Partner presentations coffee break

Over the three-day meeting, the partners spent a day presenting progress to the IAB (Industrial Advisory Board), with a particular emphasis on hardware considerations; the board’s very valuable feedback informed the following two days. The rest of the meeting was a chance to work closely with each other on specific topics that have emerged over the last few months.

PV and UL collaborate

The decision to have each partner host at least one GA has proved an excellent approach as we have seen new cities and experienced excellent hospitality in all. We visited a Serbian restaurant on our first night, followed by an Italian Pizzeria on our second – not because Slovenian food doesn’t exist, but because it takes so long to prepare traditional fayre! Our host also took us on a winding walking tour of Ljubljana, showing us the delights of a beautiful city. We were all entranced by the view of the moonlit castle from the restaurant terrace – a true Hollywood backdrop!

Ljubljana at night

And with a General Assembly and 7 deliverables due at the end of October, this has been a busy month for the project. It is exciting to see the progress that has been made over the last 12 months!

Open-source simulator release for emerging memory technologies

We are happy to announce the open-source release of the AXLE simulation infrastructure for emerging memory technologies. The platform integrates a complete architectural simulator with an accurate main memory simulator, allowing modeling of 3D-stacked and non-volatile memory architectures.

The simulator has been tested extensively with single- and multi-process database management system workloads, as well as with the SPEC CPU2006 benchmark suite. The code is publicly available at the project’s GitHub site.

Loading Tables and Creating B-tree and Block Range Indexes

Originally posted on 3rd October 2014, this blog can also be found on the 2ndQuadrant blog site: http://blog.2ndquadrant.com/loading-tables-creating-b-tree-block-range-indexes/. This is the first of two related blogs.

I have been looking at the new Block Range Indexes (BRIN) being developed for PostgreSQL 9.5. BRIN indexes are designed to provide similar benefits to partitioning, especially for large tables, just without the need to declare partitions. That sounds pretty good but let’s look in greater detail to see if it lives up to the hype.
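
For readers unfamiliar with the syntax, here is a minimal sketch of what that looks like in PostgreSQL 9.5 (the index name and query are illustrative, and the column name follows the standard TPC-H lineitem schema):

-- BRIN index on the lineitem ship date column
CREATE INDEX lineitem_shipdate_brin ON lineitem USING brin (l_shipdate);

-- Range predicates like this can then skip whole block ranges,
-- much as partition pruning would skip whole partitions
SELECT count(*)
FROM lineitem
WHERE l_shipdate BETWEEN DATE '1995-01-01' AND DATE '1995-03-31';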

How large? Here’s one data point. Using the dbgen tool provided with the TPC Benchmark(TM) H, we created data for the lineitem table at the 10GB scale factor, which results in a 7.2GB text file.

We’re going to compare a couple of basic tasks. The first look will be at the impact of inserting data into a table using the COPY command. We will do a simple experiment of creating a table without any indexes or constraints on it and time how long it takes to load the lineitem data. Then repeat with a B-tree index on one column. And finally repeat again with a BRIN index instead of a B-tree index on the same column.
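
In outline, the three load scenarios look roughly like this (a sketch only; the file path and index names are illustrative, and the column name follows the standard TPC-H lineitem schema):

-- 1) Baseline: plain table, no indexes or constraints
COPY lineitem FROM '/path/to/lineitem.tbl' WITH (FORMAT text, DELIMITER '|');

-- 2) Same load, but with a B-tree index created on the ship date column first
CREATE INDEX lineitem_shipdate_btree ON lineitem (l_shipdate);
COPY lineitem FROM '/path/to/lineitem.tbl' WITH (FORMAT text, DELIMITER '|');

-- 3) Same load again, with a BRIN index instead of the B-tree
CREATE INDEX lineitem_shipdate_brin ON lineitem USING brin (l_shipdate);
COPY lineitem FROM '/path/to/lineitem.tbl' WITH (FORMAT text, DELIMITER '|');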

axle-load-test

The above bar plot shows the average times over five measurements. Our baseline of loading the lineitem table without any indexes averaged 5.1 minutes. Once a B-tree index was added to the l_shipdate DATE column, the average load time increased to 9.4 minutes, or by 85%. When the B-tree index was replaced by a BRIN index, the load time only increased to 5.6 minutes, or by 11%.

The next experiment measures how long it takes to create a B-tree index on a table that is already populated with data, and then repeats that with a BRIN index. Both indexes are built on the same l_shipdate DATE column, with five measurements each.
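
One way to take such measurements (not necessarily how the timings below were produced) is with psql’s \timing meta-command; the index names are again illustrative:

\timing on

-- B-tree build on the already populated table
CREATE INDEX lineitem_shipdate_btree ON lineitem (l_shipdate);
DROP INDEX lineitem_shipdate_btree;

-- BRIN build on the same column
CREATE INDEX lineitem_shipdate_brin ON lineitem USING brin (l_shipdate);
DROP INDEX lineitem_shipdate_brin;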

axle-index-create-test

The B-tree index took 95 seconds to build, while the BRIN index took 18 seconds, roughly an 80% improvement.

That’s very encouraging. The overhead of loading data into a table with a single BRIN index is only 11%, and compared to having a B-tree index the total load time is reduced by 40%. Creating a new BRIN index takes only about 20% of the time that a new B-tree index would take. We have more experiments lined up to see where else BRIN indexes may or may not benefit us.

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n°318633 – the AXLE project – http://www.axleproject.eu

Binning functions

One of the most frequently used methods for analyzing data in visual analytics is data binning, which is basically a form of quantization. The main use case for this is creating histograms.

To make this easier, the SQL standard (and PostgreSQL) provides the width_bucket function, which can be used to generate an equi-width histogram, where each bucket is the same size.

Example:

# SELECT width_bucket(salary, 3000, 9000, 6), count(empno), min(salary), max(salary) FROM empsalary GROUP BY 1 ORDER BY 1;
 width_bucket | count | min  | max  
--------------+-------+------+------
            0 |     2 | 2200 | 2800
            1 |     4 | 3500 | 3900
            2 |     4 | 4200 | 4800
            3 |     5 | 5000 | 5800
            4 |     2 | 6000 | 6900
            5 |     1 | 7000 | 7000
            6 |     1 | 8200 | 8200
            7 |     1 | 9200 | 9200
(8 rows)

It is, however, often the case that you don’t want all buckets to be the same size. Until now the way to solve this was a CASE statement, which with many buckets produces long, ugly queries with poor performance. In PostgreSQL 9.5 this becomes much easier thanks to a new variant of the width_bucket function that accepts a value and a sorted array of bucket thresholds. Unlike the original width_bucket, which works only for floats and numerics, the new function works for any sortable datatype (including timestamps, for example).

Example:

# SELECT width_bucket(salary, ARRAY[3000, 5000, 7000, 8000, 9000]), count(empno), min(salary), max(salary) FROM empsalary GROUP BY 1 ORDER BY 1;
 width_bucket | count | min  | max  
--------------+-------+------+------
            0 |     2 | 2200 | 2800
            1 |     8 | 3500 | 4800
            2 |     7 | 5000 | 6900
            3 |     1 | 7000 | 7000
            4 |     1 | 8200 | 8200
            5 |     1 | 9200 | 9200
(6 rows)

An important point is that the array of thresholds has to be sorted from lowest to highest value, otherwise you will get wrong results.

Performance testing done by the developers of the Orange visualization and analysis tool showed that queries using this function are up to 10 times faster than their old CASE approach (depending on the number of buckets).
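
For comparison, a CASE-based equivalent of the earlier empsalary example might look roughly like this (a sketch of the general approach, not the Orange project’s actual queries); with many buckets the expression grows linearly and becomes both harder to maintain and slower to evaluate:

SELECT CASE
         WHEN salary < 3000 THEN 0
         WHEN salary < 5000 THEN 1
         WHEN salary < 7000 THEN 2
         WHEN salary < 8000 THEN 3
         WHEN salary < 9000 THEN 4
         ELSE 5
       END AS bucket,
       count(empno), min(salary), max(salary)
FROM empsalary
GROUP BY 1
ORDER BY 1;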

The final example demonstrates width_bucket’s support for the date datatype by counting the number of days in each season of the year 2014:

# SELECT
#     CASE width_bucket((d * interval '1 day' + date '2014-01-01')::date, ARRAY['2014-03-01', '2014-06-01', '2014-09-01', '2014-12-01']::date[])
#         WHEN 1 THEN 'Spring'
#         WHEN 2 THEN 'Summer'
#         WHEN 3 THEN 'Autumn'
#         ELSE 'Winter' END season,
#     count(*)
# FROM generate_series(0,364) d GROUP BY 1 ORDER BY 1;
 season | count 
--------+-------
 Autumn |    91
 Spring |    92
 Summer |    92
 Winter |    90
(4 rows)

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633.

Performance analysis of decision support systems

Adrià Armejach

Decision Support Systems (DSS) help to drive business decisions via databases with large datasets, which may range from a few hundred gigabytes to several petabytes. These systems are becoming pervasive, and many enterprises, including small and medium-scale organizations, are interested in deploying them to improve their decision-making capabilities. The rapid growth in digital data requires a substantial increase in computational performance to process large datasets fast enough that timely decisions can be made.

In this scope, the AXLE project focuses on addressing the full requirements of real datasets, including security as well as performance. One of the objectives is to deliver a database management system (DBMS) that can handle these large volumes of data. For this task the project is working on extending PostgreSQL, an open-source enterprise-grade DBMS. When operating over larger datasets, the response time of the DBMS is likely to increase, since it has to process more data. To tackle this issue, the project also focuses on providing good performance through novel hardware techniques that aim to eliminate the bottlenecks present in current server solutions.

In order to improve performance over existing solutions, it is paramount to know what architectural components of current server-grade solutions need to be optimized. To this end, we present a set of performance analysis measurements using a state-of-the-art DSS benchmark over a large dataset running on PostgreSQL.

Methodology

To perform this analysis we run a popular decision support benchmark, TPC-H, where the queries and the data populating the database have broad industry-wide relevance. We configure the benchmark with a scale factor of 100 (100GB), which yields a database of over 200GB after adding the desired indexes. The following table contains a list of the different components present in our test system. We use a server-grade Xeon processor with a large number of memory channels and available DRAM memory.

Component        Description
Processor        32nm Intel Xeon E5-2670, 2.60GHz, HT and TurboBoost disabled
Cores            8 out-of-order cores, 4-wide issue and retire
L1 cache         32KB 4-way, split I/D, 4-cycle access latency
L2 cache         256KB 8-way per core, 8-cycle access latency
L3 cache (LLC)   20MB 16-way shared, 28-cycle access latency
Memory           256GB DDR3-1600, 4 channels, delivering up to 51.5GB/s

Table 1: Test system configuration

To measure where time is spent in current architectures when executing this kind of workload, we employ a recently proposed top-down performance counter methodology that can determine the true bottlenecks of a general out-of-order processor. Performance counters are hardware counters integrated into the processor that allow measuring a predefined set of events, such as the number of execution cycles or miss events in the memory hierarchy. The top-down approach defines core utilization by classifying core pipeline activity into four basic categories: Retiring, Bad Speculation, Frontend Bound, and Backend Bound. Once a workload is deemed to be bound to a specific category, further exploration is done to determine the component within that category that is causing the bottleneck.

Evaluation Results

Figure 1: Top level breakdown for TPC-H queries

The figure above shows the initial categorization of execution time. The ‘retiring’ category reflects instructions issued that eventually get retired; a 100% retiring component corresponds to hitting the maximum IPC, 4 in this case. Overall we observe that retiring correlates well with IPC (not shown), and is a good cross-validation point with an established metric. Time spent due to incorrect ‘speculation’ is not significant. The ‘frontend’ bound category mainly accounts for time spent waiting on instruction cache and TLB misses, and as can be seen it can be a significant portion of the execution time. However, the overall dominant category is the ‘backend’ bound component, of which the largest contributor is data cache misses.

Figure 2: Memory level breakdown for TPC-H queries

Since the workload is deemed to be backend bound, we further explore this component to determine which part of the memory hierarchy is the bottleneck. The figure above divides the backend bound component into stalls that happen at different levels of the memory hierarchy; the ‘l2_bound’ component accounts for L1 data misses that hit in the L2, and so on. As can be seen, stalls at the L2 and L3 caches are not significant, meaning that these structures see a low hit ratio; the L3 is not performing well because the large memory footprint exceeds its capacity. From the chart we can conclude that the workload is external memory bound, i.e., most of the time is spent waiting on requests to external DRAM.

Figure 3: Off-chip memory characterization breakdown for TPC-H queries

The external memory bound component can be divided further into two categories: bandwidth and latency. By measuring the occupancy of pending memory requests we can establish thresholds that categorize whether the observed external memory bound component is due to bandwidth restrictions or to latency. We define the bandwidth component as the time in which there are at least 14 outstanding memory requests, and attribute the rest of the time to latency. The figure above shows that latency is the dominating factor and that there are no bandwidth limitations. In fact, we conducted an additional study to calculate the average memory bandwidth utilization, which shows that it remains below 6% of the total available bandwidth (~51GB/s) for all queries. This result confirms that bandwidth is over-provisioned and has no negative impact on performance.

Concluding remarks

In order to optimize or improve the performance of a given system it is important to understand its sources of inefficiency. In this study we have shown that when running a well-known DSS benchmark like TPC-H over a large dataset, the memory hierarchy of the system is the bottleneck. In particular, the high latency that has to be paid to access external DRAM can account for more than 50% of the execution time. DRAM access latency is high due to slower bus frequencies with respect to the high frequency of the processing units.

The gap in speed between external memory and processing units, also known as the memory wall, is not likely to diminish with current technologies. However, the use of emerging memory technologies, such as 3D integration of memory and computational logic, can boost the frequency of the integrated memory and shorten the distance between the processing units and large portions of memory. This has the potential to reduce the latency for a large fraction of accesses that would otherwise have to be serviced by external memory. For these reasons, within the AXLE project we are investigating novel 3D stacking architectures that can deliver higher performance when running DSS workloads on large datasets.

AXLE General Assembly 29th April to 1st May 2014

 The AXLE Project’s third General Assembly was hosted by The University of Manchester over three days in April 2014. The meeting was an opportunity for partners to work together on specific topics that have emerged during the AXLE project. Although the partners are very comfortable working remotely, the face-to-face meetings were very productive. As with previous GAs, the agenda covered partner overviews and work package updates, but with a fair amount of time apportioned to smaller working groups.

PV outlining the dataset architecture

Tom, from Portavita, presenting their dataset to the AXLE partners

The social aspect of the GAs is also important; partners can build rapport over evening meals, which enhances their working relationships. And the evening socials are a great way for the host partner to “showcase” their environment. On the first evening, the partners went on a walking tour of the Manchester Ship Canal before experiencing traditional English fayre at a canalside pub. And to round off day two, we enjoyed an array of Indian food at a truly “Bollywood” venue in Manchester – a culinary and visual treat.