Decision Support Systems (DSS) help to drive business decisions by querying databases with large datasets, which may range from a few hundred gigabytes to several petabytes. These systems are becoming pervasive, and many enterprises, including small and medium-sized organizations, are interested in deploying them to improve their decision-making capabilities. The rapid growth in digital data requires a substantial increase in computational performance to process large datasets fast enough that timely decisions can be made.
In this scope, the AXLE project focuses on addressing the full requirements of real datasets, including security as well as performance. One of its objectives is to deliver a database management system (DBMS) that can handle these large volumes of data. For this task the project is extending PostgreSQL, an open-source enterprise-grade DBMS. When operating over larger datasets, the response time of the DBMS is likely to increase, since it has to process more data. To tackle this issue, the project is also focused on delivering good performance through novel hardware techniques that aim to eliminate the bottlenecks present in current server solutions.
In order to improve performance over existing solutions, it is paramount to know what architectural components of current server-grade solutions need to be optimized. To this end, we present a set of performance analysis measurements using a state-of-the-art DSS benchmark over a large dataset running on PostgreSQL.
To perform this analysis we run a popular decision support benchmark, TPC-H, where the queries and the data populating the database have broad industry-wide relevance. We configure the benchmark with a scale factor of 100 (100GB), which yields a database of over 200GB after adding the desired indexes. The following table contains a list of the different components present in our test system. We use a server-grade Xeon processor with a large number of memory channels and available DRAM memory.
| Component | Details |
|---|---|
| Processor | 32nm Intel Xeon E5-2670, 2.60GHz, HT and TurboBoost disabled |
| Cores | 8 out-of-order cores, 4-wide issue and retire |
| L1 cache | 32KB 4-way, split I/D, 4-cycle access latency |
| L2 cache | 256KB 8-way per core, 8-cycle access latency |
| L3 cache (LLC) | 20MB 16-way shared, 28-cycle access latency |
| Memory | 256GB DDR3-1600, 4 channels, delivering up to 51.5GB/s |
Table 1: Test system configuration
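As a sanity check, the peak bandwidth figure in Table 1 follows directly from the memory configuration. The short calculation below (a sketch; the counted quantities are standard DDR3 parameters) multiplies the per-channel transfer rate by the 8-byte bus width and the number of channels:

```python
# Peak theoretical DRAM bandwidth for the configuration in Table 1.
# DDR3-1600 performs 1600 million transfers/s over a 64-bit (8-byte)
# data bus per channel; the platform has 4 channels.
channels = 4
transfers_per_sec = 1600e6   # MT/s for DDR3-1600
bytes_per_transfer = 8       # 64-bit data bus

peak_bw = channels * transfers_per_sec * bytes_per_transfer
print(peak_bw / 1e9)  # 51.2 GB/s, in line with the ~51.5 GB/s quoted above
```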
To measure where time is spent in current architectures when executing this kind of workload, we employ a recently proposed top-down performance counter methodology that can determine the true bottlenecks of a general out-of-order processor. Performance counters are hardware counters integrated into the processor that count occurrences of a predefined set of events, such as execution cycles or miss events in the memory hierarchy. The top-down approach characterizes core utilization by classifying core pipeline activity into four basic categories: Retiring, Bad Speculation, Frontend Bound, and Backend Bound. Once a workload is deemed to be bound to a specific category, further exploration is done to determine which component within that category is causing the bottleneck.
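The top-level classification can be sketched with the standard level-1 formulas of the top-down methodology. The function below is illustrative: the parameter names correspond to common Intel performance events (noted in the docstring), but exact event names and availability vary by microarchitecture, and the sample counter values are made up:

```python
# Sketch of the top-down level-1 breakdown. Each category is a
# fraction of the machine's total issue slots (width * clock cycles).

def top_down_level1(clocks, uops_issued, retired_slots,
                    recovery_cycles, fe_idle_slots, width=4):
    """Fraction of issue slots attributed to each top-level category.

    clocks          -- CPU_CLK_UNHALTED.THREAD
    uops_issued     -- UOPS_ISSUED.ANY
    retired_slots   -- UOPS_RETIRED.RETIRE_SLOTS
    recovery_cycles -- INT_MISC.RECOVERY_CYCLES
    fe_idle_slots   -- IDQ_UOPS_NOT_DELIVERED.CORE
    width           -- issue width of the machine (4 on this Xeon)
    """
    slots = width * clocks
    retiring = retired_slots / slots
    # Slots wasted on uops that were issued but never retired,
    # plus slots lost while recovering from mispredictions.
    bad_speculation = (uops_issued - retired_slots
                       + width * recovery_cycles) / slots
    frontend_bound = fe_idle_slots / slots
    # Whatever remains is attributed to backend stalls.
    backend_bound = 1.0 - retiring - bad_speculation - frontend_bound
    return {"retiring": retiring, "bad_speculation": bad_speculation,
            "frontend_bound": frontend_bound, "backend_bound": backend_bound}

# Illustrative (made-up) counter values:
breakdown = top_down_level1(clocks=1e9, uops_issued=1.4e9,
                            retired_slots=1.2e9, recovery_cycles=2.5e7,
                            fe_idle_slots=8e8)
print(breakdown)
```

By construction the four fractions sum to one, so the breakdown can be read directly as a share of total execution slots.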
Figure 1: Top level breakdown for TPC-H queries
The figure above shows the initial categorization of execution time. The ‘retiring’ category reflects issued instructions that eventually retire. A 100% retiring component corresponds to hitting the maximum IPC (4 in this case). Overall we observe that retiring correlates well with IPC (not shown), which provides a good cross-validation point with an established metric. Time lost to incorrect ‘speculation’ is not significant. The ‘frontend’ bound category mainly accounts for time spent waiting on instruction TLB and instruction cache misses, and as can be seen it can be a significant portion of the execution time. However, the overall dominant category is the ‘backend’ bound component, whose largest contributor is data cache misses.
Figure 2: Memory level breakdown for TPC-H queries
Since the workload is deemed to be backend bound, we further explore this component to determine which part of the memory hierarchy is the bottleneck. The figure above divides the backend bound component into stalls that happen at different levels of the memory hierarchy. The component ‘l2_bound’ accounts for L1 data misses that hit in L2, and so on. As can be seen, stalls at the L2 and L3 caches are not significant, meaning that few requests are actually served by these structures, i.e., their hit ratios are low. The L3 is not performing well because the workload's memory footprint far exceeds its capacity. Looking at the chart we can conclude that the workload is external memory bound, i.e., most of the time is spent waiting on requests to external DRAM.
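A simplified sketch of this memory-level split is shown below. It follows the spirit of the top-down methodology: nested stall-cycle counters are subtracted to isolate each level, and stalls beyond L2 are apportioned between L3 and external memory by the L3 hit fraction. The parameter names correspond to Sandy Bridge-era Intel events (noted in the docstring), the attribution is approximate, and the sample values are made up:

```python
# Approximate fraction of cycles stalled at each memory level.
# STALLS_LDM_PENDING counts cycles with any demand load pending;
# STALLS_L1D_PENDING and STALLS_L2_PENDING count cycles with an
# L1D miss / L2 miss pending, so the differences isolate each level.

def memory_breakdown(clocks, stalls_ldm, stalls_l1d_miss,
                     stalls_l2_miss, l3_hit_fraction):
    """clocks          -- CPU_CLK_UNHALTED.THREAD
    stalls_ldm      -- CYCLE_ACTIVITY.STALLS_LDM_PENDING
    stalls_l1d_miss -- CYCLE_ACTIVITY.STALLS_L1D_PENDING
    stalls_l2_miss  -- CYCLE_ACTIVITY.STALLS_L2_PENDING
    l3_hit_fraction -- share of demand loads past L2 that hit in L3,
                       derived from MEM_LOAD_UOPS_RETIRED.* events
    """
    l1_bound = (stalls_ldm - stalls_l1d_miss) / clocks
    l2_bound = (stalls_l1d_miss - stalls_l2_miss) / clocks
    # Stalls past L2 go to L3 or DRAM, in proportion to the L3 hit rate.
    l3_bound = l3_hit_fraction * stalls_l2_miss / clocks
    ext_mem_bound = (1 - l3_hit_fraction) * stalls_l2_miss / clocks
    return {"l1_bound": l1_bound, "l2_bound": l2_bound,
            "l3_bound": l3_bound, "ext_mem_bound": ext_mem_bound}

# Illustrative (made-up) counter values for a memory-bound query:
mem = memory_breakdown(clocks=1e9, stalls_ldm=6.5e8,
                       stalls_l1d_miss=6e8, stalls_l2_miss=5.5e8,
                       l3_hit_fraction=0.1)
print(mem)
```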
Figure 3: Off-chip memory characterization breakdown for TPC-H queries
The external memory bound component can be divided further into two categories: bandwidth and latency. By measuring the occupancy of pending memory requests we can establish thresholds that determine whether the observed external memory bound component is due to bandwidth restrictions or to latency. We define the bandwidth component as time during which there are at least 14 outstanding memory requests, and attribute the rest of the time to latency. The figure above shows that latency is the dominating factor and that there are no bandwidth limitations. In fact, we conducted an additional study to calculate the average memory bandwidth utilization, which shows that it remains below 6% of the total available bandwidth (~51GB/s) for all queries. This result confirms that bandwidth is over-provisioned and has no negative impact on performance.
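The bandwidth-utilization calculation can be sketched as follows. The idea is to convert memory-controller CAS counts (each CAS corresponds to one 64-byte cache-line transfer; on Intel uncore PMUs these are the UNC_M_CAS_COUNT.RD/WR events, though names vary by platform) into bytes per second and divide by the peak bandwidth. The sample counter values are made up for illustration:

```python
# Average DRAM bandwidth utilization from memory-controller CAS counts.

def bandwidth_utilization(cas_reads, cas_writes, seconds, peak_bw=51.5e9):
    """Fraction of peak bandwidth actually used over the run.

    cas_reads/cas_writes -- CAS counts from the uncore memory controller
                            (one 64-byte line transferred per CAS)
    seconds              -- wall-clock duration of the measured run
    peak_bw              -- theoretical peak, ~51.5 GB/s on this system
    """
    bytes_moved = (cas_reads + cas_writes) * 64
    return (bytes_moved / seconds) / peak_bw

# Illustrative (made-up) values for a 100-second query:
util = bandwidth_utilization(cas_reads=3.5e9, cas_writes=1.2e9, seconds=100)
print(f"{util:.1%}")  # ~5.8%, i.e. well below 6% of peak
```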
In order to optimize or improve the performance of a given system it is important to understand its sources of inefficiency. In this study we have shown that when running a well-known DSS benchmark like TPC-H over a large dataset, the memory hierarchy of the system is the bottleneck. In particular, the high latency of accesses to external DRAM can account for more than 50% of the execution time. DRAM access latency is high because memory bus frequencies are much lower than the frequencies of the processing units.
The gap in speed between external memory and processing units, also known as the memory wall, is not likely to diminish with current technologies. However, emerging memory technologies, such as the 3D integration of memory and computational logic, can boost the frequency of the integrated memory and shorten the distance between the processing units and large portions of memory. This has the potential to reduce the latency of a large fraction of the accesses that would otherwise have to be serviced by external memory. For these reasons, within the AXLE project we are investigating novel 3D stacking architectures that can deliver higher performance when running DSS workloads on large datasets.