Analyzing the time of queries sub-steps

Hi @jiashenC,

I created a thread to discuss about hot to interpret the output of the debug timer command

You said:

One thing puzzles me is for some queries, the time reported in debug timer is different from time profiled from the client. Please see the following output. Thread 6 only reports 460 ms, but clients gets ~1300 ms. This is same as what reported by thread 1, but huge difference than thread 6. How should I interpret this?

I guess there is something that’s initializing the first time you run a query after the cold of the database

Try to run another query on another table like a select distinct f1 from table1 limit 10;

I guess most of the overhead will go away.

About the incomplete out, it looks that when you use CTEs a chunk of the debug timer output is omitted, and that seems the reason why aren’t you getting a lòarge chunk of the outìput comprising the memory transfers and the GPU execution.

Removing the CTE part of the query and running a simple join and aggregate query using the TPC-H schema

select extract(month from O_ORDERDATE),
           O_ORDERSTATUS,
           sum(L_EXTENDEDPRICE*L_QUANTITY) 
from    lineitem 
join     orders on L_ORDERKEY=O_ORDERKEY 
where L_COMMITDATE between '1992-03-03' and '1992-12-03' 
group by 1,2;

and that’s the full output

242ms total duration for executeRelAlgQuery
  242ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
    0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
    241ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
      241ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
        241ms start(0ms) executeSort RelAlgExecutor.cpp:2985
          241ms start(1ms) executeWorkUnit RelAlgExecutor.cpp:3388
            131ms start(1ms) compileWorkUnit NativeCodegen.cpp:2589
              11ms start(2ms) getInstance HashJoin.cpp:295
                11ms start(2ms) reify PerfectJoinHashTable.cpp:354
                  10ms start(2ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(9)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(9)
              2ms start(14ms) codegen GpuSharedMemoryUtils.cpp:55
              0ms start(16ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              116ms start(16ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                0ms start(17ms) initializeNVPTXBackend NativeCodegen.cpp:1442
                115ms start(17ms) generateNativeGPUCode NativeCodegen.cpp:1126
                  0ms start(17ms) check_module_requires_libdevice NativeCodegen.cpp:272
                  6ms start(17ms) optimize_ir NativeCodegen.cpp:310
                  13ms start(24ms) generatePTX NativeCodegen.cpp:1405
                  94ms start(38ms) ptx_to_cubin NvidiaKernel.cpp:128
            0ms start(133ms) ExecutionKernel::run ExecutionKernel.cpp:125
            97ms start(133ms) fetchChunks Execute.cpp:2941
            0ms start(231ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            10ms start(231ms) executePlanWithGroupBy Execute.cpp:3583
              5ms start(231ms) launchGpuCode QueryExecutionContext.cpp:223
              4ms start(237ms) getRowSet QueryExecutionContext.cpp:158
                4ms start(237ms) reduceMultiDeviceResults Execute.cpp:1309
                  4ms start(237ms) reduceMultiDeviceResultSets Execute.cpp:1355
                    3ms start(237ms) generateNativeCPUCode NativeCodegen.cpp:443
                      2ms start(237ms) optimize_ir NativeCodegen.cpp:310
                      1ms start(240ms) create_execution_engine NativeCodegen.cpp:419
            0ms start(241ms) collectAllDeviceResults Execute.cpp:2333
              0ms start(241ms) reduceMultiDeviceResults Execute.cpp:1309
                0ms start(241ms) reduceMultiDeviceResultSets Execute.cpp:1355

the runtime is higher than the 242ms because for whatever reason 65ms has been spent in the parser

2022-06-09T14:09:03.923738 I 3835849 0 1 Calcite.cpp:635 Time in Thrift 2 (ms), Time in Java Calcite server 65 (ms)

Analyzing the out we can notice that we spent around 11ms to create a hash-index to manage the hash-join

              11ms start(2ms) getInstance HashJoin.cpp:295
                11ms start(2ms) reify PerfectJoinHashTable.cpp:354
                  10ms start(2ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(9)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(9)

then around 131ms to generate and compile the llvm code needed to run the query

           131ms start(1ms) compileWorkUnit NativeCodegen.cpp:2589
[cut]
              2ms start(14ms) codegen GpuSharedMemoryUtils.cpp:55
              0ms start(16ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              116ms start(16ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                0ms start(17ms) initializeNVPTXBackend NativeCodegen.cpp:1442
                115ms start(17ms) generateNativeGPUCode NativeCodegen.cpp:1126
                  0ms start(17ms) check_module_requires_libdevice NativeCodegen.cpp:272
                  6ms start(17ms) optimize_ir NativeCodegen.cpp:310
                  13ms start(24ms) generatePTX NativeCodegen.cpp:1405
                  94ms start(38ms) ptx_to_cubin NvidiaKernel.cpp:128

another 97ms to fetch data from DISK into CPU, then into GPU memory

            97ms start(133ms) fetchChunks Execute.cpp:2941

and 10ms to execute the kernel, fetch and reduce the results

            10ms start(231ms) executePlanWithGroupBy Execute.cpp:3583
              5ms start(231ms) launchGpuCode QueryExecutionContext.cpp:223
              4ms start(237ms) getRowSet QueryExecutionContext.cpp:158
                4ms start(237ms) reduceMultiDeviceResults Execute.cpp:1309
                  4ms start(237ms) reduceMultiDeviceResultSets Execute.cpp:1355
                    3ms start(237ms) generateNativeCPUCode NativeCodegen.cpp:443
                      2ms start(237ms) optimize_ir NativeCodegen.cpp:310
                      1ms start(240ms) create_execution_engine NativeCodegen.cpp:419
            0ms start(241ms) collectAllDeviceResults Execute.cpp:2333
              0ms start(241ms) reduceMultiDeviceResults Execute.cpp:1309
                0ms start(241ms) reduceMultiDeviceResultSets Execute.cpp:1355

Running the same query another time, changing the filtered values, making the optimizer reuse the same plan, so the faster execution

12ms total duration for executeRelAlgQuery
  12ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
    0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
    12ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
      12ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
        12ms start(0ms) executeSort RelAlgExecutor.cpp:2985
          11ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388
            3ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589
              0ms start(1ms) getInstance HashJoin.cpp:295
                0ms start(1ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(1ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(33)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(33)
              1ms start(2ms) codegen GpuSharedMemoryUtils.cpp:55
              0ms start(4ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(4ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(4ms) ExecutionKernel::run ExecutionKernel.cpp:125
            0ms start(4ms) fetchChunks Execute.cpp:2941
            0ms start(4ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            7ms start(4ms) executePlanWithGroupBy Execute.cpp:3583
              7ms start(4ms) launchGpuCode QueryExecutionContext.cpp:223
              0ms start(11ms) getRowSet QueryExecutionContext.cpp:158
                0ms start(11ms) reduceMultiDeviceResults Execute.cpp:1309
                  0ms start(11ms) reduceMultiDeviceResultSets Execute.cpp:1355
            0ms start(12ms) collectAllDeviceResults Execute.cpp:2333
              0ms start(12ms) reduceMultiDeviceResults Execute.cpp:1309
                0ms start(12ms) reduceMultiDeviceResultSets Execute.cpp:1355
2022-06-09T14:33:46.299971 I 3835849 0 1 DBHandler.cpp:1556 DEBUG_TIMER thread_id(1)
29ms total duration for sql_execute

Also in this query, the calcite is taking a big chunk of query duration (13ms), but it’s not important because with the 6.0 release the Calcite, generation, and compilation of the query’s code are almost parallelized, so when running queries concurrently, this overhead is less important.

Anyway enclosing the query in a select * from (select ... ) limit 0; is returning a cutted down debug timer output

12ms total duration for executeRelAlgQuery
  12ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
    0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
    12ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
      11ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
        11ms start(0ms) executeSort RelAlgExecutor.cpp:2985
          11ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388
            11ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589
              7ms start(1ms) getInstance HashJoin.cpp:295
                7ms start(1ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(1ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(15)
                      0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(15)
              1ms start(10ms) codegen GpuSharedMemoryUtils.cpp:55
              0ms start(11ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(11ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(12ms) collectAllDeviceResults Execute.cpp:2333
              0ms start(12ms) reduceMultiDeviceResults Execute.cpp:1309

while adding an additional limit in the inner query

select * from (
select extract(month from O_ORDERDATE),
           O_ORDERSTATUS,
           sum(L_EXTENDEDPRICE*L_QUANTITY) 
from    lineitem 
join     orders on L_ORDERKEY=O_ORDERKEY 
where L_COMMITDATE between '1992-03-03' and '1992-12-03' 
group by 1,2 
limit 100000)
limit 0;

this way is returning a more complete output, but I would go with the inner query (with a limit?).

19ms total duration for executeRelAlgQuery
  19ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
    0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
    19ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
      16ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
        16ms start(0ms) executeSort RelAlgExecutor.cpp:2985
          16ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388
            11ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589
              8ms start(1ms) getInstance HashJoin.cpp:295
                8ms start(1ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(1ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(18)
                      1ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        1ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(18)
              1ms start(10ms) codegen GpuSharedMemoryUtils.cpp:55
              0ms start(12ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(12ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(12ms) ExecutionKernel::run ExecutionKernel.cpp:125
            0ms start(12ms) fetchChunks Execute.cpp:2941
            0ms start(12ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
            4ms start(12ms) executePlanWithGroupBy Execute.cpp:3583
              3ms start(12ms) launchGpuCode QueryExecutionContext.cpp:223
              0ms start(16ms) getRowSet QueryExecutionContext.cpp:158
                0ms start(16ms) reduceMultiDeviceResults Execute.cpp:1309
                  0ms start(16ms) reduceMultiDeviceResultSets Execute.cpp:1355
            0ms start(17ms) collectAllDeviceResults Execute.cpp:2333
              0ms start(17ms) reduceMultiDeviceResults Execute.cpp:1309
                0ms start(17ms) reduceMultiDeviceResultSets Execute.cpp:1355
      2ms start(17ms) executeRelAlgStep RelAlgExecutor.cpp:962
        2ms start(17ms) executeSort RelAlgExecutor.cpp:2985
          2ms start(17ms) executeWorkUnit RelAlgExecutor.cpp:3388
            1ms start(17ms) compileWorkUnit NativeCodegen.cpp:2589
              0ms start(18ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(18ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(18ms) collectAllDeviceResults Execute.cpp:2333
            1ms start(18ms) compileWorkUnit NativeCodegen.cpp:2589
              0ms start(19ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(19ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(19ms) resultsUnion Execute.cpp:1279

Besides that, I also noticed some discrepancies between the runtime and the total time, but at this time I haven’t if it’s something introduced in the 6.0 release, or something else.

Regards,
Candido

Thanks!

In addition, putting the query inside select * from (...) limit 0 seems to affect the query execution time in debug timer as well. My inner query is just a join between two tables select * from tb1,tb2 where tb1.k=tb2.k. Here, I compare both runs under the system is warm up.

The following log is when I just run the inner query, which indicates ~400ms.

439ms total duration for executeRelAlgQuery
  439ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
    0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
    439ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
      439ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
        439ms start(0ms) executeProject RelAlgExecutor.cpp:2166
          438ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388
            26ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589
              24ms start(2ms) getInstance HashJoin.cpp:295
                24ms start(2ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(2ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(18)
                      24ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        23ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(18)
              0ms start(26ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(26ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(27ms) ExecutionKernel::run ExecutionKernel.cpp:125
            0ms start(27ms) fetchChunks Execute.cpp:2941
            0ms start(27ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
            41ms start(27ms) executePlanWithoutGroupBy Execute.cpp:3362
              41ms start(27ms) launchGpuCode QueryExecutionContext.cpp:223
            0ms start(69ms) collectAllDeviceResults Execute.cpp:2333
              0ms start(69ms) reduceMultiDeviceResults Execute.cpp:1309
                0ms start(69ms) reduceMultiDeviceResultSets Execute.cpp:1355
            25ms start(69ms) compileWorkUnit NativeCodegen.cpp:2589
              24ms start(70ms) getInstance HashJoin.cpp:295
                24ms start(70ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(70ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(19)
                      24ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        23ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(19)
              0ms start(94ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(95ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(95ms) ExecutionKernel::run ExecutionKernel.cpp:125
            0ms start(95ms) fetchChunks Execute.cpp:2941
            0ms start(95ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
              New thread(20)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                0ms start(17ms) fetchChunks Execute.cpp:2941
                0ms start(17ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
                42ms start(17ms) executePlanWithGroupBy Execute.cpp:3583
                  42ms start(17ms) launchGpuCode QueryExecutionContext.cpp:223
                  0ms start(59ms) getRowSet QueryExecutionContext.cpp:158
                    0ms start(59ms) reduceMultiDeviceResults Execute.cpp:1309
                      0ms start(59ms) reduceMultiDeviceResultSets Execute.cpp:1355
              End thread(20)
              New thread(21)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                0ms start(59ms) fetchChunks Execute.cpp:2941
                0ms start(59ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
                38ms start(59ms) executePlanWithGroupBy Execute.cpp:3583
                  38ms start(59ms) launchGpuCode QueryExecutionContext.cpp:223
                  0ms start(97ms) getRowSet QueryExecutionContext.cpp:158
                    0ms start(97ms) reduceMultiDeviceResults Execute.cpp:1309
                      0ms start(97ms) reduceMultiDeviceResultSets Execute.cpp:1355
              End thread(21)
            16ms start(95ms) executePlanWithGroupBy Execute.cpp:3583
                New thread(22)
                  0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                  0ms start(303ms) fetchChunks Execute.cpp:2941
                  0ms start(303ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
                  39ms start(304ms) executePlanWithGroupBy Execute.cpp:3583
                    39ms start(304ms) launchGpuCode QueryExecutionContext.cpp:223
                    0ms start(343ms) getRowSet QueryExecutionContext.cpp:158
                      0ms start(343ms) reduceMultiDeviceResults Execute.cpp:1309
                        0ms start(343ms) reduceMultiDeviceResultSets Execute.cpp:1355
                End thread(22)
                New thread(23)
                  0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                  0ms start(179ms) fetchChunks Execute.cpp:2941
                  0ms start(179ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
                  43ms start(179ms) executePlanWithGroupBy Execute.cpp:3583
                    43ms start(179ms) launchGpuCode QueryExecutionContext.cpp:223
                    0ms start(222ms) getRowSet QueryExecutionContext.cpp:158
                      0ms start(222ms) reduceMultiDeviceResults Execute.cpp:1309
                        0ms start(222ms) reduceMultiDeviceResultSets Execute.cpp:1355
                End thread(23)
                New thread(24)
                  0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                  0ms start(97ms) fetchChunks Execute.cpp:2941
                  0ms start(97ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
                  39ms start(97ms) executePlanWithGroupBy Execute.cpp:3583
                    39ms start(97ms) launchGpuCode QueryExecutionContext.cpp:223
                    0ms start(136ms) getRowSet QueryExecutionContext.cpp:158
                      0ms start(136ms) reduceMultiDeviceResults Execute.cpp:1309
                        0ms start(136ms) reduceMultiDeviceResultSets Execute.cpp:1355
                End thread(24)
                New thread(25)
                  0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                  0ms start(137ms) fetchChunks Execute.cpp:2941
                  0ms start(137ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
                  42ms start(137ms) executePlanWithGroupBy Execute.cpp:3583
                    42ms start(137ms) launchGpuCode QueryExecutionContext.cpp:223
                    0ms start(179ms) getRowSet QueryExecutionContext.cpp:158
                      0ms start(179ms) reduceMultiDeviceResults Execute.cpp:1309
                        0ms start(179ms) reduceMultiDeviceResultSets Execute.cpp:1355
                End thread(25)
                New thread(26)
                  0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                  0ms start(264ms) fetchChunks Execute.cpp:2941
                  0ms start(264ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
                  39ms start(264ms) executePlanWithGroupBy Execute.cpp:3583
                    39ms start(264ms) launchGpuCode QueryExecutionContext.cpp:223
                    0ms start(303ms) getRowSet QueryExecutionContext.cpp:158
                      0ms start(303ms) reduceMultiDeviceResults Execute.cpp:1309
                        0ms start(303ms) reduceMultiDeviceResultSets Execute.cpp:1355
                End thread(26)
                New thread(27)
                  0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                  0ms start(222ms) fetchChunks Execute.cpp:2941
                  0ms start(222ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
                  41ms start(222ms) executePlanWithGroupBy Execute.cpp:3583
                    41ms start(222ms) launchGpuCode QueryExecutionContext.cpp:223
                    0ms start(264ms) getRowSet QueryExecutionContext.cpp:158
                      0ms start(264ms) reduceMultiDeviceResults Execute.cpp:1309
                        0ms start(264ms) reduceMultiDeviceResultSets Execute.cpp:1355
                End thread(27)
              16ms start(95ms) launchGpuCode QueryExecutionContext.cpp:223
              0ms start(112ms) getRowSet QueryExecutionContext.cpp:158
                0ms start(112ms) reduceMultiDeviceResults Execute.cpp:1309
                  0ms start(112ms) reduceMultiDeviceResultSets Execute.cpp:1355
            0ms start(439ms) resultsUnion Execute.cpp:1279

But, if I add the limit statement, the execution time becomes much less (~50ms)

52ms total duration for executeRelAlgQuery
  52ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
    0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
    52ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
      52ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
        52ms start(0ms) executeSort RelAlgExecutor.cpp:2985
          52ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388
            25ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589
              24ms start(1ms) getInstance HashJoin.cpp:295
                24ms start(1ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(1ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(13)
                      24ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        23ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(13)
              0ms start(25ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(25ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(26ms) collectAllDeviceResults Execute.cpp:2333
            26ms start(26ms) compileWorkUnit NativeCodegen.cpp:2589
              24ms start(27ms) getInstance HashJoin.cpp:295
                24ms start(27ms) reify PerfectJoinHashTable.cpp:354
                  0ms start(27ms) getOneColumnFragment ColumnFetcher.cpp:78
                    New thread(14)
                      23ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                        23ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                    End thread(14)
              0ms start(52ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
              0ms start(52ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
            0ms start(52ms) resultsUnion Execute.cpp:1279

I wonder which represents the closest query execution time?

Also, on the client side, the \timing command reports execution and total time. Do those include Calcite server start and parsing time?

Hi,

Probably the time difference is the time spent on projection.

In the earlier release we had some timings on that, but lately.they are disappeared.

Yes in the client everything in the server is included. The client is calling a thrifth endpoint, and ita reporting the time of a execution in the server reporter by the thrifth call and a Total time spent ti return the result back to the user.

1 Like