TPCH Q16, sort is slow

Hi, I am running TPC-H (scale factor 10) on a single T4 GPU against HeavyDB, and I find that Q16 is very slow.
Q16

select
    p_brand,
    p_type,
    p_size,
    count(distinct ps_suppkey) as supplier_cnt
from
    partsupp,
    part
where
    p_partkey = ps_partkey
    and p_brand <> 'Brand#41'
    and p_type not like 'PROMO ANODIZED%'
    and p_size in (34, 10, 44, 45, 19, 26, 18, 47)
    and ps_suppkey not in (
        select
            s_suppkey
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
    )
group by
    p_brand,
    p_type,
    p_size
order by
    supplier_cnt desc,
    p_brand,
    p_type,
    p_size;

And this is the timing breakdown of the individual execution steps:

6454ms total duration for executeRelAlgQuery
  6454ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:588
    0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:589
    6ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:895
      6ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:1043
        6ms start(0ms) executeCompound RelAlgExecutor.cpp:2207
          6ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3552
            2ms start(0ms) compileWorkUnit NativeCodegen.cpp:2672
              0ms start(2ms) markDeadRuntimeFuncs NativeCodegen.cpp:1860
              0ms start(2ms) optimizeAndCodegenGPU NativeCodegen.cpp:1336
            0ms start(2ms) ExecutionKernel::run ExecutionKernel.cpp:126
            0ms start(2ms) fetchChunks Execute.cpp:2927
            0ms start(2ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
            0ms start(2ms) executePlanWithoutGroupBy Execute.cpp:3314
              0ms start(2ms) launchGpuCode QueryExecutionContext.cpp:225
            0ms start(3ms) collectAllDeviceResults Execute.cpp:2319
              0ms start(3ms) reduceMultiDeviceResults Execute.cpp:1294
                0ms start(3ms) reduceMultiDeviceResultSets Execute.cpp:1340
            1ms start(3ms) compileWorkUnit NativeCodegen.cpp:2672
              0ms start(5ms) markDeadRuntimeFuncs NativeCodegen.cpp:1860
              0ms start(5ms) optimizeAndCodegenGPU NativeCodegen.cpp:1336
            0ms start(5ms) ExecutionKernel::run ExecutionKernel.cpp:126
            0ms start(5ms) fetchChunks Execute.cpp:2927
            0ms start(5ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
            1ms start(5ms) executePlanWithGroupBy Execute.cpp:3536
              1ms start(5ms) launchGpuCode QueryExecutionContext.cpp:225
              0ms start(7ms) getRowSet QueryExecutionContext.cpp:160
                0ms start(7ms) reduceMultiDeviceResults Execute.cpp:1294
                  0ms start(7ms) reduceMultiDeviceResultSets Execute.cpp:1340
            0ms start(7ms) resultsUnion Execute.cpp:1264
    6446ms start(7ms) executeRelAlgSeq RelAlgExecutor.cpp:895
      6446ms start(7ms) executeRelAlgStep RelAlgExecutor.cpp:1043
        6446ms start(7ms) executeSort RelAlgExecutor.cpp:3149
          1864ms start(8ms) executeWorkUnit RelAlgExecutor.cpp:3552
            3ms start(8ms) compileWorkUnit NativeCodegen.cpp:2672
              0ms start(9ms) getInstance HashJoin.cpp:295
                0ms start(9ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(9ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(17)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:727
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:90
                    End thread(17)
              0ms start(10ms) markDeadRuntimeFuncs NativeCodegen.cpp:1860
              0ms start(10ms) optimizeAndCodegenGPU NativeCodegen.cpp:1336
            0ms start(11ms) ExecutionKernel::run ExecutionKernel.cpp:126
            0ms start(11ms) fetchChunks Execute.cpp:2927
            44ms start(11ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
            1815ms start(56ms) executePlanWithGroupBy Execute.cpp:3536
              1814ms start(56ms) launchGpuCode QueryExecutionContext.cpp:225
              0ms start(1870ms) getRowSet QueryExecutionContext.cpp:160
                0ms start(1870ms) reduceMultiDeviceResults Execute.cpp:1294
                  0ms start(1870ms) reduceMultiDeviceResultSets Execute.cpp:1340
            0ms start(1872ms) collectAllDeviceResults Execute.cpp:2319
              0ms start(1872ms) reduceMultiDeviceResults Execute.cpp:1294
                0ms start(1872ms) reduceMultiDeviceResultSets Execute.cpp:1340
          4581ms start(1872ms) sort ResultSet.cpp:771
            3ms start(1872ms) initPermutationBuffer ResultSet.cpp:849
            36ms start(1875ms) createComparator ResultSet.h:838
            4541ms start(1912ms) topPermutation ResultSet.cpp:1211

I have some observations:

  1. The sort operator is there because of the order by clause. The sort is very slow, and it appears to happen on the CPU instead of the GPU. Why is the sort performed on the CPU? Moreover, the query returns only about 27K records; sorting 27K records should not be that slow, even on the CPU.
  2. I observe two executeRelAlgStep calls. It seems the output of the first step is moved off the GPU once that step finishes, and the intermediate result is then copied from CPU back to GPU before the second step executes. Why is the data moved back and forth? Is it because of multi-device reduction, i.e., since the reduction happens on the CPU, all intermediate results have to be moved off the GPU after each step? But I only have one GPU here, and the intermediate results are still moved out. Is there an explanation for that?

Thanks!
Lily

Hi,

Yes, the final sort operation is probably what is limiting your performance. That said, even for an SF of 10, the 1800 ms figure isn't exactly stellar either.

The join and the semi-join should both go through a perfect hash join, and once the hash table is built they shouldn't take that long (on my workstation, with an SF of 100 and 1 GPU, they take around 100 ms).
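
If you want to double-check the plan being generated on your side, heavysql can print the relational algebra produced by Calcite (assuming your build exposes it), e.g.

heavysql> explain calcite select p_brand, p_type, p_size, count(distinct ps_suppkey) from partsupp, part where p_partkey = ps_partkey group by p_brand, p_type, p_size;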
Removing the distinct from count(distinct ps_suppkey) and dropping that measure from the sort, the query takes around 400 ms (with the sort on the remaining three fields taking 329 ms); using the approx_count_distinct function instead, it takes around 1000 ms (the sort is 327 ms).
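
For clarity, these are roughly the two variants I timed; supplier_cnt is dropped from the order by in both, and only the aggregate differs:

select
    p_brand,
    p_type,
    p_size,
    count(ps_suppkey) as supplier_cnt -- or approx_count_distinct(ps_suppkey)
from
    partsupp,
    part
where
    p_partkey = ps_partkey
    and p_brand <> 'Brand#41'
    and p_type not like 'PROMO ANODIZED%'
    and p_size in (34, 10, 44, 45, 19, 26, 18, 47)
    and ps_suppkey not in (
        select s_suppkey from supplier where s_comment like '%Customer%Complaints%'
    )
group by
    p_brand,
    p_type,
    p_size
order by
    p_brand,
    p_type,
    p_size;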

So I guess we have some trouble sorting the results of a reduction. We do have an optimized path when just the top n results are needed, so if you try running

heavysql> select p_brand, p_type, p_size, approx_count_distinct(ps_suppkey) as supplier_cnt from partsupp, part where p_partkey = ps_partkey and p_brand <> 'Brand#41' and p_type not like 'PROMO ANODIZED%' and p_size in (34, 10, 44, 45, 19, 26, 18, 47) and ps_suppkey not in ( select s_suppkey from supplier where s_comment like '%Customer%Complaints%' ) group by p_brand, p_type, p_size order by supplier_cnt desc,p_brand, p_type, p_size limit 10;
p_brand|p_type|p_size|supplier_cnt
Brand#33|MEDIUM BRUSHED TIN|34|611
Brand#11|STANDARD BURNISHED NICKEL|26|591
Brand#55|LARGE BURNISHED STEEL|10|590
Brand#14|SMALL BRUSHED BRASS|45|587
Brand#44|LARGE BRUSHED BRASS|18|578
Brand#44|STANDARD BURNISHED BRASS|18|577
Brand#14|ECONOMY BRUSHED COPPER|19|575
Brand#54|ECONOMY PLATED COPPER|18|575
Brand#24|STANDARD PLATED COPPER|34|573
Brand#25|PROMO BRUSHED COPPER|47|571
10 rows returned.
Execution time: 833 ms, Total time: 836 ms

and in this case the sort is done in parallel, taking just 64 ms.

Without the limit, the query returns the results in more than 3000 ms, with the sort alone taking 2464 ms.

Sorting by the dimensions only, the query takes 1100 ms (sort: 334 ms).
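
That is, keeping the aggregate in the select list but ordering by the group-by keys alone, roughly:

select p_brand, p_type, p_size, approx_count_distinct(ps_suppkey) as supplier_cnt from partsupp, part where p_partkey = ps_partkey and p_brand <> 'Brand#41' and p_type not like 'PROMO ANODIZED%' and p_size in (34, 10, 44, 45, 19, 26, 18, 47) and ps_suppkey not in ( select s_suppkey from supplier where s_comment like '%Customer%Complaints%' ) group by p_brand, p_type, p_size order by p_brand, p_type, p_size;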

Sorting by the count distinct only, i.e. writing the query this way:

select p_brand, p_type, p_size, approx_count_distinct(ps_suppkey) as supplier_cnt from partsupp, part where p_partkey = ps_partkey and p_brand <> 'Brand#41' and p_type not like 'PROMO ANODIZED%' and p_size in (34, 10, 44, 45, 19, 26, 18, 47) and ps_suppkey not in ( select s_suppkey from supplier where s_comment like '%Customer%Complaints%' ) group by p_brand, p_type, p_size order by supplier_cnt desc;

just the sort itself takes 1960 ms, so there is something we are doing badly, but I don't know what the reason might be. I opened an issue about something similar (sort with top n), but it hasn't been prioritized.

Candido