How to find out, what takes most time in query processing

Hello , where can i find out, what takes most time in query proccessing ?

I have following query, but it takes almost 40 sec to proccess on 10GB dataset

*select s_acctbal,s_name,n_name,p_partkey,p_mfgr,s_address,s_phone,s_comment from part,supplier,partsupp,nation,region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = 15 *
*and p_type like ‘%BRASS’ and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = ‘EUROPE’ and ps_supplycost = (select min(ps_supplycost) from partsupp,supplier,nation,region *
where p_partkey = ps_partkey and s_suppkey = ps_suppkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = ‘EUROPE’) order by s_acctbal desc,n_name,s_name,p_partkey fetch next 100 rows only;

Hi @Sipy,

You can start the database by setting the parameter enable-debug-timer to true.
The timings of macro steps of the query, like the execution in CPU/GPU, projection, and the “order by” will be written to the log.
.

We haven’t anything to return the timings of single steps of the query, like group by, joins.

It would be difficult considering the extreme parallelism degree of GPU execution.

I tried query on 100GB, using 1 or 2 GPUs, and the response time is between 4000ms and 1200ms.

heavysql> select s_acctbal,s_name,n_name,p_partkey,p_mfgr,s_address,s_phone,s_comment from part,supplier,partsupp,nation,region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = 15 and p_type like '%BRASS' and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE' and ps_supplycost = (select min(ps_supplycost) from partsupp,supplier,nation,region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE') order by s_acctbal desc,n_name,s_name,p_partkey limit 10;
s_acctbal|s_name|n_name|p_partkey|p_mfgr|s_address|s_phone|s_comment
9999.70|Supplier#000239544|UNITED KINGDOM|6739531|Manufacturer#4|,6oexY7ft68JKJCN1SQ LUO5Xf|33-509-584-9496|ets are. blithely special accounts wake across t
9999.65|Supplier#000143654|FRANCE|2393647|Manufacturer#3|HhHCZ,RosE8He4uYvyIDqsPZe,7cZiJh1y9,|16-166-504-5864|special accounts. never dogged deposits
9999.49|Supplier#000615014|GERMANY|12365001|Manufacturer#3|4C,uOuJDig7z,4e|17-780-902-4027|pinto beans affix along th
9999.28|Supplier#000494480|ROMANIA|17244462|Manufacturer#3|hNB2qcYmnyeqrd em,o29TxzLwwrl|29-756-312-1779|ly. slyly final pains detect furiously qu
9998.87|Supplier#000826281|ROMANIA|14326252|Manufacturer#3|LkGmLSAjbjgKS5ZRepL|29-775-451-4774|slyly silent accounts affix furiously across the final, even ideas; tithe
9998.56|Supplier#000039514|UNITED KINGDOM|7539499|Manufacturer#2|fUN95rpikfiqeGo,okl,27ItSq9fMpr|33-317-828-7758|olites use about the blithely regular warhorses. carefully final instructions a
9997.89|Supplier#000319666|RUSSIA|19319665|Manufacturer#1|SZtdx3rzXnFuiVuJzKathV9|32-972-151-6038|furiously ironic pearls. furiously regular foxes along the furiously
9997.85|Supplier#000718770|ROMANIA|718769|Manufacturer#1|2173woKsCg7zpHCqhiEXFTmYYqpT4XONnRAsUryE|29-663-409-2865|posits sleep across the blithely express requests.
9997.83|Supplier#000348318|FRANCE|9348317|Manufacturer#4|AvfnvjCFAg1aNpj|16-718-707-5676|e pending requests: furiously even deposits boost furiously slyly even requests.
9997.73|Supplier#000329974|RUSSIA|4079969|Manufacturer#3|eKtIadopsE|32-268-488-7178|uickly. furiously ironic requests are slyly above the regular
10 rows returned.
Execution time: 1133 ms, Total time: 1179 ms

Could you enable the debug timer and post the log of your execution?
If the default storage directory is used the log file is /var/lib/heavyai/log/omnisci_server.INFO

the result of my execution is

1107ms total duration for executeRelAlgQuery
  1107ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
    0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
    1106ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
      4ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
        4ms start(0ms) executeCompound RelAlgExecutor.cpp:2116
          4ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388
            1ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589
              0ms start(2ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(2ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(2ms) ExecutionKernel::run ExecutionKernel.cpp:125
            0ms start(2ms) fetchChunks Execute.cpp:2941
            0ms start(2ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            0ms start(2ms) executePlanWithoutGroupBy Execute.cpp:3362
              0ms start(2ms) launchGpuCode QueryExecutionContext.cpp:223
            0ms start(3ms) collectAllDeviceResults Execute.cpp:2333
              0ms start(3ms) reduceMultiDeviceResults Execute.cpp:1309
                0ms start(3ms) reduceMultiDeviceResultSets Execute.cpp:1355
            1ms start(3ms) compileWorkUnit NativeCodegen.cpp:2589
              0ms start(4ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(4ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(4ms) ExecutionKernel::run ExecutionKernel.cpp:125
            0ms start(4ms) fetchChunks Execute.cpp:2941
            0ms start(4ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            0ms start(4ms) executePlanWithGroupBy Execute.cpp:3583
              0ms start(4ms) launchGpuCode QueryExecutionContext.cpp:223
              0ms start(5ms) getRowSet QueryExecutionContext.cpp:158
                0ms start(5ms) reduceMultiDeviceResults Execute.cpp:1309
                  0ms start(5ms) reduceMultiDeviceResultSets Execute.cpp:1355
            0ms start(5ms) resultsUnion Execute.cpp:1279
      156ms start(5ms) executeRelAlgStep RelAlgExecutor.cpp:962
        156ms start(5ms) executeCompound RelAlgExecutor.cpp:2116
          156ms start(6ms) executeWorkUnit RelAlgExecutor.cpp:3388
            2ms start(6ms) compileWorkUnit NativeCodegen.cpp:2589
              0ms start(6ms) getInstance HashJoin.cpp:295
                0ms start(6ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(6ms) getOneColumnFragment ColumnFetcher.cpp:78
                  0ms start(6ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(170)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(170)
                    New thread(171)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(171)
              0ms start(7ms) getInstance HashJoin.cpp:295
                0ms start(7ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(7ms) getOneColumnFragment ColumnFetcher.cpp:78
                  0ms start(7ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(172)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(172)
                    New thread(173)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(173)
              0ms start(7ms) getInstance HashJoin.cpp:295
                0ms start(7ms) synthesize_metadata InputMetadata.cpp:306
                0ms start(7ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(7ms) getOneColumnFragment ColumnFetcher.cpp:78
                    0ms start(7ms) ColumnarResults ColumnarResults.cpp:64
                  0ms start(7ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(174)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(174)
                    New thread(175)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(175)
              0ms start(8ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(8ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(8ms) ExecutionKernel::run ExecutionKernel.cpp:125
            0ms start(8ms) fetchChunks Execute.cpp:2941
                New thread(47)
                  0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                  0ms start(0ms) fetchChunks Execute.cpp:2941
                  0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                  131ms start(0ms) executePlanWithGroupBy Execute.cpp:3583
                    131ms start(0ms) launchGpuCode QueryExecutionContext.cpp:223
                    0ms start(131ms) getRowSet QueryExecutionContext.cpp:158
                      0ms start(131ms) reduceMultiDeviceResults Execute.cpp:1309
                        0ms start(131ms) reduceMultiDeviceResultSets Execute.cpp:1355
                End thread(47)
            0ms start(8ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            130ms start(8ms) executePlanWithGroupBy Execute.cpp:3583
              130ms start(8ms) launchGpuCode QueryExecutionContext.cpp:223
              0ms start(139ms) getRowSet QueryExecutionContext.cpp:158
                0ms start(139ms) reduceMultiDeviceResults Execute.cpp:1309
                  0ms start(139ms) reduceMultiDeviceResultSets Execute.cpp:1355
            21ms start(140ms) collectAllDeviceResults Execute.cpp:2333
              21ms start(140ms) reduceMultiDeviceResults Execute.cpp:1309
                21ms start(140ms) reduceMultiDeviceResultSets Execute.cpp:1355
      4ms start(162ms) executeRelAlgStep RelAlgExecutor.cpp:962
        4ms start(162ms) executeCompound RelAlgExecutor.cpp:2116
          4ms start(162ms) executeWorkUnit RelAlgExecutor.cpp:3388
            1ms start(162ms) compileWorkUnit NativeCodegen.cpp:2589
              0ms start(163ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(163ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(163ms) ExecutionKernel::run ExecutionKernel.cpp:125
            0ms start(163ms) fetchChunks Execute.cpp:2941
            0ms start(163ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            0ms start(163ms) executePlanWithoutGroupBy Execute.cpp:3362
              0ms start(163ms) launchGpuCode QueryExecutionContext.cpp:223
            0ms start(164ms) collectAllDeviceResults Execute.cpp:2333
              0ms start(164ms) reduceMultiDeviceResults Execute.cpp:1309
                0ms start(164ms) reduceMultiDeviceResultSets Execute.cpp:1355
            1ms start(164ms) compileWorkUnit NativeCodegen.cpp:2589
              0ms start(165ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(165ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(165ms) ExecutionKernel::run ExecutionKernel.cpp:125
            0ms start(165ms) fetchChunks Execute.cpp:2941
            0ms start(165ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            0ms start(165ms) executePlanWithGroupBy Execute.cpp:3583
              0ms start(165ms) launchGpuCode QueryExecutionContext.cpp:223
              0ms start(166ms) getRowSet QueryExecutionContext.cpp:158
                0ms start(166ms) reduceMultiDeviceResults Execute.cpp:1309
                  0ms start(166ms) reduceMultiDeviceResultSets Execute.cpp:1355
            0ms start(166ms) resultsUnion Execute.cpp:1279
      5ms start(166ms) executeRelAlgStep RelAlgExecutor.cpp:962
        5ms start(166ms) executeCompound RelAlgExecutor.cpp:2116
          5ms start(166ms) executeWorkUnit RelAlgExecutor.cpp:3388
            1ms start(166ms) compileWorkUnit NativeCodegen.cpp:2589
              0ms start(167ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(167ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(167ms) ExecutionKernel::run ExecutionKernel.cpp:125
            0ms start(167ms) fetchChunks Execute.cpp:2941
            0ms start(167ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            0ms start(168ms) executePlanWithoutGroupBy Execute.cpp:3362
              0ms start(168ms) launchGpuCode QueryExecutionContext.cpp:223
            0ms start(168ms) collectAllDeviceResults Execute.cpp:2333
              0ms start(168ms) reduceMultiDeviceResults Execute.cpp:1309
                0ms start(168ms) reduceMultiDeviceResultSets Execute.cpp:1355
            1ms start(169ms) compileWorkUnit NativeCodegen.cpp:2589
              0ms start(170ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(170ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(170ms) ExecutionKernel::run ExecutionKernel.cpp:125
            0ms start(170ms) fetchChunks Execute.cpp:2941
            0ms start(170ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            1ms start(170ms) executePlanWithGroupBy Execute.cpp:3583
              1ms start(170ms) launchGpuCode QueryExecutionContext.cpp:223
              0ms start(172ms) getRowSet QueryExecutionContext.cpp:158
                0ms start(172ms) reduceMultiDeviceResults Execute.cpp:1309
                  0ms start(172ms) reduceMultiDeviceResultSets Execute.cpp:1355
            0ms start(172ms) resultsUnion Execute.cpp:1279
      893ms start(172ms) executeRelAlgStep RelAlgExecutor.cpp:962
        893ms start(172ms) executeProject RelAlgExecutor.cpp:2166
          892ms start(173ms) executeWorkUnit RelAlgExecutor.cpp:3388
            376ms start(204ms) compileWorkUnit NativeCodegen.cpp:2589
              0ms start(205ms) getInstance HashJoin.cpp:295
                0ms start(205ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(205ms) getOneColumnFragment ColumnFetcher.cpp:78
                  0ms start(205ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(176)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(176)
                    New thread(177)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(177)
              13ms start(206ms) getInstance HashJoin.cpp:295
                4ms start(206ms) synthesize_metadata InputMetadata.cpp:306
                8ms start(210ms) reify PerfectJoinHashTable.cpp:354
                  8ms start(210ms) getOneColumnFragment ColumnFetcher.cpp:78
                    7ms start(210ms) ColumnarResults ColumnarResults.cpp:64
                  0ms start(219ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(178)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(178)
                    New thread(179)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(179)
              0ms start(219ms) getInstance HashJoin.cpp:295
                0ms start(219ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(219ms) getOneColumnFragment ColumnFetcher.cpp:78
                  0ms start(219ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(180)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(180)
                    New thread(181)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(181)
              0ms start(219ms) getInstance HashJoin.cpp:295
                0ms start(219ms) synthesize_metadata InputMetadata.cpp:306
                0ms start(219ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(219ms) getOneColumnFragment ColumnFetcher.cpp:78
                    0ms start(219ms) ColumnarResults ColumnarResults.cpp:64
                  0ms start(220ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(182)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(182)
                    New thread(183)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(183)
              360ms start(220ms) getInstance HashJoin.cpp:295
                235ms start(220ms) synthesize_metadata InputMetadata.cpp:306
                124ms start(456ms) reify PerfectJoinHashTable.cpp:354
                  115ms start(456ms) getOneColumnFragment ColumnFetcher.cpp:78
                    107ms start(456ms) ColumnarResults ColumnarResults.cpp:64
                  6ms start(572ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(184)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(184)
                    New thread(185)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(185)
              0ms start(580ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(580ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(581ms) ExecutionKernel::run ExecutionKernel.cpp:125
            24ms start(581ms) fetchChunks Execute.cpp:2941
                New thread(88)
                  0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                  37ms start(0ms) fetchChunks Execute.cpp:2941
                  0ms start(37ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                  2ms start(37ms) executePlanWithoutGroupBy Execute.cpp:3362
                    2ms start(37ms) launchGpuCode QueryExecutionContext.cpp:223
                End thread(88)
            0ms start(605ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            2ms start(605ms) executePlanWithoutGroupBy Execute.cpp:3362
              2ms start(605ms) launchGpuCode QueryExecutionContext.cpp:223
            0ms start(622ms) collectAllDeviceResults Execute.cpp:2333
              0ms start(622ms) reduceMultiDeviceResults Execute.cpp:1309
                0ms start(622ms) reduceMultiDeviceResultSets Execute.cpp:1355
            394ms start(622ms) compileWorkUnit NativeCodegen.cpp:2589
              1ms start(623ms) getInstance HashJoin.cpp:295
                1ms start(623ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(623ms) getOneColumnFragment ColumnFetcher.cpp:78
                  0ms start(623ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(186)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(186)
                    New thread(187)
                      1ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        1ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(187)
              17ms start(625ms) getInstance HashJoin.cpp:295
                7ms start(625ms) synthesize_metadata InputMetadata.cpp:306
                9ms start(633ms) reify PerfectJoinHashTable.cpp:354
                  8ms start(633ms) getOneColumnFragment ColumnFetcher.cpp:78
                    8ms start(633ms) ColumnarResults ColumnarResults.cpp:64
                  0ms start(642ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(188)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(188)
                    New thread(189)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(189)
              0ms start(642ms) getInstance HashJoin.cpp:295
                0ms start(642ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(643ms) getOneColumnFragment ColumnFetcher.cpp:78
                  0ms start(643ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(190)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(190)
                    New thread(191)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(191)
              0ms start(643ms) getInstance HashJoin.cpp:295
                0ms start(643ms) synthesize_metadata InputMetadata.cpp:306
                0ms start(643ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(643ms) getOneColumnFragment ColumnFetcher.cpp:78
                    0ms start(643ms) ColumnarResults ColumnarResults.cpp:64
                  0ms start(643ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(192)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(192)
                    New thread(193)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(193)
              371ms start(643ms) getInstance HashJoin.cpp:295
                238ms start(643ms) synthesize_metadata InputMetadata.cpp:306
                133ms start(881ms) reify PerfectJoinHashTable.cpp:354
                  124ms start(881ms) getOneColumnFragment ColumnFetcher.cpp:78
                    111ms start(881ms) ColumnarResults ColumnarResults.cpp:64
                  7ms start(1006ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(194)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(194)
                    New thread(195)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(195)
              0ms start(1015ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(1016ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(1016ms) ExecutionKernel::run ExecutionKernel.cpp:125
              New thread(196)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                16ms start(27ms) fetchChunks Execute.cpp:2941
                0ms start(44ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                4ms start(44ms) executePlanWithGroupBy Execute.cpp:3583
                  4ms start(44ms) launchGpuCode QueryExecutionContext.cpp:223
                  0ms start(49ms) getRowSet QueryExecutionContext.cpp:158
                    0ms start(49ms) reduceMultiDeviceResults Execute.cpp:1309
                      0ms start(49ms) reduceMultiDeviceResultSets Execute.cpp:1355
              End thread(196)
              New thread(197)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                23ms start(0ms) fetchChunks Execute.cpp:2941
                0ms start(24ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                6ms start(24ms) executePlanWithGroupBy Execute.cpp:3583
                  6ms start(24ms) launchGpuCode QueryExecutionContext.cpp:223
                  0ms start(30ms) getRowSet QueryExecutionContext.cpp:158
                    0ms start(30ms) reduceMultiDeviceResults Execute.cpp:1309
                      0ms start(30ms) reduceMultiDeviceResultSets Execute.cpp:1355
              End thread(197)
            23ms start(1016ms) fetchChunks Execute.cpp:2941
            0ms start(1040ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            3ms start(1040ms) executePlanWithGroupBy Execute.cpp:3583
              3ms start(1040ms) launchGpuCode QueryExecutionContext.cpp:223
              0ms start(1043ms) getRowSet QueryExecutionContext.cpp:158
                0ms start(1043ms) reduceMultiDeviceResults Execute.cpp:1309
                  0ms start(1043ms) reduceMultiDeviceResultSets Execute.cpp:1355
            0ms start(1065ms) resultsUnion Execute.cpp:1279
      41ms start(1066ms) executeRelAlgStep RelAlgExecutor.cpp:962
        41ms start(1066ms) executeSort RelAlgExecutor.cpp:2985
          37ms start(1066ms) executeWorkUnit RelAlgExecutor.cpp:3388
            1ms start(1066ms) compileWorkUnit NativeCodegen.cpp:2589
              0ms start(1067ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(1067ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(1067ms) ExecutionKernel::run ExecutionKernel.cpp:125
            16ms start(1067ms) fetchChunks Execute.cpp:2941
              15ms start(1067ms) ColumnarResults ColumnarResults.cpp:64
            0ms start(1083ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            0ms start(1083ms) executePlanWithoutGroupBy Execute.cpp:3362
              0ms start(1083ms) launchGpuCode QueryExecutionContext.cpp:223
            0ms start(1084ms) collectAllDeviceResults Execute.cpp:2333
              0ms start(1084ms) reduceMultiDeviceResults Execute.cpp:1309
                0ms start(1084ms) reduceMultiDeviceResultSets Execute.cpp:1355
            1ms start(1084ms) compileWorkUnit NativeCodegen.cpp:2589
              0ms start(1085ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(1085ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(1085ms) ExecutionKernel::run ExecutionKernel.cpp:125
            15ms start(1085ms) fetchChunks Execute.cpp:2941
              14ms start(1085ms) ColumnarResults ColumnarResults.cpp:64
            0ms start(1100ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            2ms start(1101ms) executePlanWithGroupBy Execute.cpp:3583
              2ms start(1101ms) launchGpuCode QueryExecutionContext.cpp:223
              0ms start(1103ms) getRowSet QueryExecutionContext.cpp:158
                0ms start(1103ms) reduceMultiDeviceResults Execute.cpp:1309
                  0ms start(1103ms) reduceMultiDeviceResultSets Execute.cpp:1355
            0ms start(1103ms) resultsUnion Execute.cpp:1279
          3ms start(1103ms) sort ResultSet.cpp:768
            0ms start(1103ms) initPermutationBuffer ResultSet.cpp:846
            0ms start(1104ms) createComparator ResultSet.h:840
            3ms start(1104ms) topPermutation ResultSet.cpp:1208

Quite complex :wink:

Regards,
Candido

1 Like

Hi Candido, I notice ColumnarResults takes quite a few amount of time, can I have a rough understanding about what ColumnarResults is doing? Thanks!

Hi,

Do you mean the 111ms? Probably some intermediate result is materialized and passed to the next query step.

I’m not sure about that

Thanks for the reply! Since HeavyDB do aggressive coalesce (fuse project, groupby and filer), why still are multiple steps? Could you give me a example that a query has to be divided into multiple steps? Thanks!

Hi,

It was an educated guess, but having an MPP database engine that doesn’t use intermediate results is almost impossible.

As an example, take a simple group-by query like that.

select payment_type,passenger_count,trip_distance,tip_amount,avg(tip_amount) 
from yellow_tripdata_p
group by 1,2,3,4

is generating this

59ms total duration for executeRelAlgQuery
  58ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:588
    0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:589
    58ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:895
      58ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:1043
        58ms start(0ms) executeSort RelAlgExecutor.cpp:3181
          58ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3585
            3ms start(0ms) compileWorkUnit NativeCodegen.cpp:2676
              0ms start(3ms) markDeadRuntimeFuncs NativeCodegen.cpp:1864
              0ms start(4ms) optimizeAndCodegenGPU NativeCodegen.cpp:1340
            0ms start(4ms) ExecutionKernel::run ExecutionKernel.cpp:126
            0ms start(4ms) fetchChunks Execute.cpp:2928
            0ms start(4ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:780
            53ms start(4ms) executePlanWithGroupBy Execute.cpp:3537
              51ms start(5ms) launchGpuCode QueryExecutionContext.cpp:225
              0ms start(56ms) getRowSet QueryExecutionContext.cpp:160
                0ms start(56ms) reduceMultiDeviceResults Execute.cpp:1294
                  0ms start(56ms) reduceMultiDeviceResultSets Execute.cpp:1340
            0ms start(58ms) collectAllDeviceResults Execute.cpp:2320
              0ms start(58ms) reduceMultiDeviceResults Execute.cpp:1294
                0ms start(58ms) reduceMultiDeviceResultSets Execute.cpp:1340

then we add a simple HAVING count(*) < 10

126ms total duration for executeRelAlgQuery
  126ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:588
    0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:589
    125ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:895
      47ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:1043
        47ms start(0ms) executeCompound RelAlgExecutor.cpp:2238
          47ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3585
            3ms start(0ms) compileWorkUnit NativeCodegen.cpp:2676
              0ms start(3ms) markDeadRuntimeFuncs NativeCodegen.cpp:1864
              0ms start(3ms) optimizeAndCodegenGPU NativeCodegen.cpp:1340
            0ms start(4ms) ExecutionKernel::run ExecutionKernel.cpp:126
            0ms start(4ms) fetchChunks Execute.cpp:2928
            0ms start(4ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:780
            42ms start(4ms) executePlanWithGroupBy Execute.cpp:3537
              39ms start(4ms) launchGpuCode QueryExecutionContext.cpp:225
              0ms start(44ms) getRowSet QueryExecutionContext.cpp:160
                0ms start(44ms) reduceMultiDeviceResults Execute.cpp:1294
                  0ms start(44ms) reduceMultiDeviceResultSets Execute.cpp:1340
            0ms start(47ms) collectAllDeviceResults Execute.cpp:2320
              0ms start(47ms) reduceMultiDeviceResults Execute.cpp:1294
                0ms start(47ms) reduceMultiDeviceResultSets Execute.cpp:1340
      78ms start(48ms) executeRelAlgStep RelAlgExecutor.cpp:1043
        77ms start(48ms) executeSort RelAlgExecutor.cpp:3181
          77ms start(48ms) executeWorkUnit RelAlgExecutor.cpp:3585
            1ms start(48ms) compileWorkUnit NativeCodegen.cpp:2676
              0ms start(49ms) markDeadRuntimeFuncs NativeCodegen.cpp:1864
              0ms start(50ms) optimizeAndCodegenGPU NativeCodegen.cpp:1340
            35ms start(50ms) synthesize_metadata InputMetadata.cpp:307
            0ms start(85ms) ExecutionKernel::run ExecutionKernel.cpp:126
            37ms start(85ms) fetchChunks Execute.cpp:2928
              36ms start(85ms) ColumnarResults ColumnarResults.cpp:167
            0ms start(122ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:780
            2ms start(123ms) executePlanWithGroupBy Execute.cpp:3537
              2ms start(123ms) launchGpuCode QueryExecutionContext.cpp:225
              0ms start(125ms) getRowSet QueryExecutionContext.cpp:160
                0ms start(125ms) reduceMultiDeviceResults Execute.cpp:1294
                  0ms start(125ms) reduceMultiDeviceResultSets Execute.cpp:1340

And this is adding some other steps, including a ColumnmarResult so intermediate results. Fusing a filter that needs reduced results would be impossible, so the intermediate result.

I’m sorry if this isn’t the best example, but I don’t know every scenario where intermediate results are used.

Regards,
Cadido

Hi thanks for the reply! Should the query still be fusable if there is only one GPU? you can perform filtering directly after groupby, you don’t need reduce results if there is only one GPU?

Hi,

Nope, also with just one GPU, you need to reduce the results.
The GPUs are the hardware that needs to be used in an MPP way, so 1 or more GPUs in a system is going to need a reduction after a scan-join-filter-aggregate, filter on the reduced aggregate

Candido