Hi @Eilowangfang,
To your question, if there is a time that indicates the end-to-end etc., the answer is YES. to activate it, you have to start the server with the parameter enable-debug-timer turned to true. If you look at docs is listed, and it’s supposed to
Enable fine-grained query execution timers for debug. For debugging, logs verbose timing information for query execution (time to load data, time to compile code, and so on).
the Infos will be written into the logs at INFO level so that they will be available at [/mapd_storage]/data/mapd_logs/omnisci.INFO
running the TPCHQ10 query after a server cold start the output is this one
2020-12-18T16:26:00.744844 I 4773 1 measure.h:74 Timer start getExecutor getExecutor: 158
2020-12-18T16:26:00.744860 I 4773 1 measure.h:81 Timer end getExecutor getExecutor: 158 elapsed 0 ms
2020-12-18T16:26:00.744900 I 4773 1 QueryDispatchQueue.h:61 Dispatching query with 0 queries in the queue.
2020-12-18T16:26:00.744953 I 4773 7 QueryDispatchQueue.h:92 Worker 1 running the query and returning control. There are now 0 queries in the queue.
2020-12-18T16:26:00.744998 I 4773 7 measure.h:74 Timer start getExecutor getExecutor: 158
2020-12-18T16:26:00.745018 I 4773 7 measure.h:81 Timer end getExecutor getExecutor: 158 elapsed 0 ms
2020-12-18T16:26:00.745647 I 4773 7 measure.h:74 Timer start executeRelAlgQuery executeRelAlgQuery: 222
2020-12-18T16:26:00.745668 I 4773 7 measure.h:74 Timer start executeRelAlgQueryNoRetry executeRelAlgQueryNoRetry: 253
2020-12-18T16:26:00.943161 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_28 was 197ms
2020-12-18T16:26:01.140584 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_30 was 197ms
2020-12-18T16:26:01.313512 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_26 was 172ms
2020-12-18T16:26:01.505229 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_27 was 191ms
2020-12-18T16:26:01.505494 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_22 was 0ms
2020-12-18T16:26:01.505611 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_31 was 0ms
2020-12-18T16:26:01.505698 I 4773 7 measure.h:74 Timer start executeRelAlgSeq executeRelAlgSeq: 598
2020-12-18T16:26:01.505714 I 4773 7 measure.h:74 Timer start executeRelAlgStep executeRelAlgStep: 677
2020-12-18T16:26:01.505870 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(lineitem, ["L_ORDERKEY", "L_PARTKEY", "L_SUPPKEY", "L_LINENUMBER", "L_QUANTITY", "L_EXTENDEDPRICE", "L_DISCOUNT", "L_TAX", "L_RETURNFLAG", "L_LINESTATUS", "L_SHIPDATE", "L_COMMITDATE", "L_RECEIPTDATE", "L_SHIPINSTRUCT", "L_SHIPMODE", "L_COMMENT", "rowid"]) to nest level 0
2020-12-18T16:26:01.505891 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(orders, ["O_ORDERKEY", "O_CUSTKEY", "O_ORDERSTATUS", "O_TOTALPRICE", "O_ORDERDATE", "O_ORDERPRIORITY", "O_CLERK", "O_SHIPPRIORITY", "O_COMMENT", "rowid"]) to nest level 1
2020-12-18T16:26:01.505898 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(customer, ["C_CUSTKEY", "C_NAME", "C_ADDRESS", "C_NATIONKEY", "C_PHONE", "C_ACCTBAL", "C_MKTSEGMENT", "C_COMMENT", "rowid"]) to nest level 2
2020-12-18T16:26:01.505905 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(nation, ["N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_COMMENT", "rowid"]) to nest level 3
2020-12-18T16:26:01.505929 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(lineitem, ["L_ORDERKEY", "L_PARTKEY", "L_SUPPKEY", "L_LINENUMBER", "L_QUANTITY", "L_EXTENDEDPRICE", "L_DISCOUNT", "L_TAX", "L_RETURNFLAG", "L_LINESTATUS", "L_SHIPDATE", "L_COMMITDATE", "L_RECEIPTDATE", "L_SHIPINSTRUCT", "L_SHIPMODE", "L_COMMENT", "rowid"]) to nest level 0
2020-12-18T16:26:01.505947 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(orders, ["O_ORDERKEY", "O_CUSTKEY", "O_ORDERSTATUS", "O_TOTALPRICE", "O_ORDERDATE", "O_ORDERPRIORITY", "O_CLERK", "O_SHIPPRIORITY", "O_COMMENT", "rowid"]) to nest level 1
2020-12-18T16:26:01.505953 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(customer, ["C_CUSTKEY", "C_NAME", "C_ADDRESS", "C_NATIONKEY", "C_PHONE", "C_ACCTBAL", "C_MKTSEGMENT", "C_COMMENT", "rowid"]) to nest level 2
2020-12-18T16:26:01.505959 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(nation, ["N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_COMMENT", "rowid"]) to nest level 3
2020-12-18T16:26:01.506037 I 4773 7 measure.h:74 Timer start executeWorkUnit executeWorkUnit: 2769
2020-12-18T16:26:01.506046 I 4773 7 measure.h:74 Timer start get_table_infos get_table_infos: 263
2020-12-18T16:26:01.506053 I 4773 7 measure.h:81 Timer end get_table_infos get_table_infos: 263 elapsed 0 ms
2020-12-18T16:26:01.506113 I 4773 7 measure.h:74 Timer start Exec_executeWorkUnit executeWorkUnitImpl: 1405
2020-12-18T16:26:01.506138 I 4773 7 measure.h:74 Timer start query_step_compilation executeWorkUnitImpl: 1425
2020-12-18T16:26:01.506247 I 4773 7 measure.h:81 Timer end query_step_compilation executeWorkUnitImpl: 1425 elapsed 0 ms
2020-12-18T16:26:01.506255 I 4773 7 measure.h:81 Timer end Exec_executeWorkUnit executeWorkUnitImpl: 1405 elapsed 0 ms
2020-12-18T16:26:01.506268 I 4773 7 measure.h:74 Timer start get_table_infos get_table_infos: 263
2020-12-18T16:26:01.506276 I 4773 7 measure.h:81 Timer end get_table_infos get_table_infos: 263 elapsed 0 ms
2020-12-18T16:26:01.506280 I 4773 7 measure.h:74 Timer start Exec_executeWorkUnit executeWorkUnitImpl: 1405
2020-12-18T16:26:01.506289 I 4773 7 measure.h:74 Timer start query_step_compilation executeWorkUnitImpl: 1425
2020-12-18T16:26:01.507720 I 4773 7 measure.h:74 Timer start buildJoinLoops buildJoinLoops: 266
2020-12-18T16:26:01.510955 I 4773 7 BufferMgr.cpp:304 ALLOCATION slab of 8388608 pages (4294967296B) created in 3 ms GPU_MGR:0
2020-12-18T16:26:01.510986 I 4773 7 BufferMgr.cpp:304 ALLOCATION slab of 8388608 pages (4294967296B) created in 0 ms CPU_MGR:0
2020-12-18T16:26:01.584934 I 4773 7 BufferMgr.cpp:304 ALLOCATION slab of 8388608 pages (4294967296B) created in 3 ms GPU_MGR:1
2020-12-18T16:26:01.996789 I 4773 7 measure.h:81 Timer end buildJoinLoops buildJoinLoops: 266 elapsed 489 ms
2020-12-18T16:26:01.998070 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
2020-12-18T16:26:01.998105 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
2020-12-18T16:26:02.083164 I 4773 7 measure.h:81 Timer end query_step_compilation executeWorkUnitImpl: 1425 elapsed 576 ms
2020-12-18T16:26:02.083338 I 4773 14 measure.h:74 Timer start kernel_run run: 92
2020-12-18T16:26:02.083397 I 4773 14 measure.h:74 Timer start fetchChunks fetchChunks: 2342
2020-12-18T16:26:02.083890 I 4773 15 measure.h:74 Timer start kernel_run run: 92
2020-12-18T16:26:02.083909 I 4773 15 measure.h:74 Timer start fetchChunks fetchChunks: 2342
2020-12-18T16:26:02.634163 I 4773 14 measure.h:81 Timer end fetchChunks fetchChunks: 2342 elapsed 550 ms
2020-12-18T16:26:02.634241 I 4773 14 measure.h:74 Timer start executePlanWithoutGroupBy executePlanWithoutGroupBy: 2693
2020-12-18T16:26:02.634276 I 4773 14 measure.h:74 Timer start lauchGpuCode launchGpuCode: 200
2020-12-18T16:26:02.727827 I 4773 15 measure.h:81 Timer end fetchChunks fetchChunks: 2342 elapsed 643 ms
2020-12-18T16:26:02.727893 I 4773 15 measure.h:74 Timer start executePlanWithoutGroupBy executePlanWithoutGroupBy: 2693
2020-12-18T16:26:02.727910 I 4773 15 measure.h:74 Timer start lauchGpuCode launchGpuCode: 200
2020-12-18T16:26:02.771668 I 4773 15 measure.h:81 Timer end lauchGpuCode launchGpuCode: 200 elapsed 43 ms
2020-12-18T16:26:02.771713 I 4773 15 measure.h:81 Timer end executePlanWithoutGroupBy executePlanWithoutGroupBy: 2693 elapsed 43 ms
2020-12-18T16:26:02.771734 I 4773 15 measure.h:81 Timer end kernel_run run: 92 elapsed 687 ms
2020-12-18T16:26:02.775700 I 4773 14 measure.h:81 Timer end lauchGpuCode launchGpuCode: 200 elapsed 141 ms
2020-12-18T16:26:02.775716 I 4773 14 measure.h:81 Timer end executePlanWithoutGroupBy executePlanWithoutGroupBy: 2693 elapsed 141 ms
2020-12-18T16:26:02.775736 I 4773 14 measure.h:81 Timer end kernel_run run: 92 elapsed 692 ms
2020-12-18T16:26:02.779040 I 4773 7 measure.h:81 Timer end Exec_executeWorkUnit executeWorkUnitImpl: 1405 elapsed 1272 ms
2020-12-18T16:26:02.779554 I 4773 7 measure.h:74 Timer start Exec_executeWorkUnit executeWorkUnitImpl: 1405
2020-12-18T16:26:02.779595 I 4773 7 measure.h:74 Timer start query_step_compilation executeWorkUnitImpl: 1425
2020-12-18T16:26:02.781268 I 4773 7 measure.h:74 Timer start buildJoinLoops buildJoinLoops: 266
2020-12-18T16:26:02.790142 I 4773 7 measure.h:81 Timer end buildJoinLoops buildJoinLoops: 266 elapsed 8 ms
2020-12-18T16:26:02.791306 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
2020-12-18T16:26:02.791374 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
2020-12-18T16:26:02.879975 I 4773 7 measure.h:81 Timer end query_step_compilation executeWorkUnitImpl: 1425 elapsed 100 ms
2020-12-18T16:26:02.880165 I 4773 23 measure.h:74 Timer start kernel_run run: 92
2020-12-18T16:26:02.880199 I 4773 23 measure.h:74 Timer start fetchChunks fetchChunks: 2342
2020-12-18T16:26:02.880148 I 4773 22 measure.h:74 Timer start kernel_run run: 92
2020-12-18T16:26:02.880376 I 4773 22 measure.h:74 Timer start fetchChunks fetchChunks: 2342
2020-12-18T16:26:02.956186 I 4773 23 measure.h:81 Timer end fetchChunks fetchChunks: 2342 elapsed 76 ms
2020-12-18T16:26:02.956326 I 4773 23 measure.h:74 Timer start executePlanWithGroupBy executePlanWithGroupBy: 2896
2020-12-18T16:26:02.956351 I 4773 23 measure.h:74 Timer start lauchGpuCode launchGpuCode: 200
2020-12-18T16:26:03.024190 I 4773 22 measure.h:81 Timer end fetchChunks fetchChunks: 2342 elapsed 143 ms
2020-12-18T16:26:03.024304 I 4773 22 measure.h:74 Timer start executePlanWithGroupBy executePlanWithGroupBy: 2896
2020-12-18T16:26:03.024330 I 4773 22 measure.h:74 Timer start lauchGpuCode launchGpuCode: 200
2020-12-18T16:26:03.115051 I 4773 23 measure.h:81 Timer end lauchGpuCode launchGpuCode: 200 elapsed 158 ms
2020-12-18T16:26:03.116177 I 4773 22 measure.h:81 Timer end lauchGpuCode launchGpuCode: 200 elapsed 91 ms
2020-12-18T16:26:03.119799 I 4773 23 measure.h:81 Timer end executePlanWithGroupBy executePlanWithGroupBy: 2896 elapsed 163 ms
2020-12-18T16:26:03.119840 I 4773 23 measure.h:81 Timer end kernel_run run: 92 elapsed 239 ms
2020-12-18T16:26:03.119846 I 4773 22 measure.h:81 Timer end executePlanWithGroupBy executePlanWithGroupBy: 2896 elapsed 95 ms
2020-12-18T16:26:03.119875 I 4773 22 measure.h:81 Timer end kernel_run run: 92 elapsed 239 ms
2020-12-18T16:26:03.217164 I 4773 7 measure.h:81 Timer end Exec_executeWorkUnit executeWorkUnitImpl: 1405 elapsed 437 ms
2020-12-18T16:26:03.217273 I 4773 7 measure.h:81 Timer end executeWorkUnit executeWorkUnit: 2769 elapsed 1711 ms
2020-12-18T16:26:03.359704 I 4773 7 measure.h:81 Timer end executeRelAlgStep executeRelAlgStep: 677 elapsed 1854 ms
2020-12-18T16:26:03.359741 I 4773 7 measure.h:81 Timer end executeRelAlgSeq executeRelAlgSeq: 598 elapsed 1854 ms
2020-12-18T16:26:03.359769 I 4773 7 measure.h:81 Timer end executeRelAlgQueryNoRetry executeRelAlgQueryNoRetry: 253 elapsed 2614 ms
2020-12-18T16:26:03.359778 I 4773 7 measure.h:81 Timer end executeRelAlgQuery executeRelAlgQuery: 222 elapsed 2614 ms
the most relevant info are:
2020-12-18T16:26:02.083164 I 4773 7 measure.h:81 Timer end query_step_compilation executeWorkUnitImpl: 1425 elapsed 576 ms
The system didn’t know the query, so it has to parse, optimize and generate the LLVM code for each step of the query (You can get the LLVM code with the explain [your_statement] command).
To build the code, the system took 576ms (if you run the same query with different parameters, the old code is reused)
then to load the data from disk to system RAM and GPU ram, you have to look to fetch chunks
2020-12-18T16:26:02.634163 I 4773 14 measure.h:81 Timer end fetchChunks fetchChunks: 2342 elapsed **550** ms
2020-12-18T16:26:02.727827 I 4773 15 measure.h:81 Timer end fetchChunks fetchChunks: 2342 elapsed **643** ms
Those threads run in parallel, so the entire operation likely lasted 643ms. I have a disk subsystem capable of 10GB/sec, so the reads from disks are really fast, and the bottleneck is the CPU power and the PCI-ex bus. You get other data like the time the query spent on Gpu or in Cpu depending on the executor chooses for your query.
You will also get a summarization in a more readable format at the end of the log. Something like that
2614ms total duration for executeRelAlgQuery
2614ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:254
760ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:255
1854ms start(760ms) executeRelAlgSeq RelAlgExecutor.cpp:599
1853ms start(760ms) executeRelAlgStep RelAlgExecutor.cpp:678
1853ms start(760ms) executeSort RelAlgExecutor.cpp:2446
1711ms start(760ms) executeWorkUnit RelAlgExecutor.cpp:2770
0ms start(760ms) compileWorkUnit NativeCodegen.cpp:2409
576ms start(760ms) compileWorkUnit NativeCodegen.cpp:2409
474ms start(762ms) getInstance HashJoin.cpp:480
474ms start(762ms) reify JoinHashTable.cpp:487
New thread(8)
214ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(8)
New thread(9)
374ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(9)
14ms start(1236ms) getInstance HashJoin.cpp:480
14ms start(1236ms) reify JoinHashTable.cpp:487
New thread(10)
2ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(10)
New thread(11)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(11)
0ms start(1250ms) getInstance HashJoin.cpp:480
0ms start(1250ms) reify JoinHashTable.cpp:487
New thread(12)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(12)
New thread(13)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(13)
New thread(14)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
550ms start(0ms) fetchChunks Execute.cpp:2341
0ms start(550ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
141ms start(550ms) executePlanWithoutGroupBy Execute.cpp:2694
141ms start(550ms) launchGpuCode QueryExecutionContext.cpp:199
End thread(14)
New thread(15)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
643ms start(0ms) fetchChunks Execute.cpp:2341
0ms start(644ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
43ms start(644ms) executePlanWithoutGroupBy Execute.cpp:2694
43ms start(644ms) launchGpuCode QueryExecutionContext.cpp:199
End thread(15)
3ms start(2030ms) collectAllDeviceResults Execute.cpp:1783
3ms start(2030ms) reduceMultiDeviceResults Execute.cpp:887
100ms start(2033ms) compileWorkUnit NativeCodegen.cpp:2409
7ms start(2035ms) getInstance HashJoin.cpp:480
7ms start(2035ms) reify JoinHashTable.cpp:487
New thread(16)
6ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(16)
New thread(17)
6ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(17)
0ms start(2043ms) getInstance HashJoin.cpp:480
0ms start(2043ms) reify JoinHashTable.cpp:487
New thread(18)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(18)
New thread(19)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(19)
0ms start(2044ms) getInstance HashJoin.cpp:480
0ms start(2044ms) reify JoinHashTable.cpp:487
New thread(20)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(20)
New thread(21)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(21)
New thread(22)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
143ms start(0ms) fetchChunks Execute.cpp:2341
0ms start(144ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
95ms start(144ms) executePlanWithGroupBy Execute.cpp:2895
91ms start(144ms) launchGpuCode QueryExecutionContext.cpp:199
3ms start(236ms) getRowSet QueryExecutionContext.cpp:136
3ms start(236ms) reduceMultiDeviceResults Execute.cpp:887
3ms start(236ms) reduceMultiDeviceResultSets Execute.cpp:933
End thread(22)
New thread(23)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
76ms start(0ms) fetchChunks Execute.cpp:2341
0ms start(76ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
163ms start(76ms) executePlanWithGroupBy Execute.cpp:2895
158ms start(76ms) launchGpuCode QueryExecutionContext.cpp:199
4ms start(234ms) getRowSet QueryExecutionContext.cpp:136
4ms start(234ms) reduceMultiDeviceResults Execute.cpp:887
4ms start(234ms) reduceMultiDeviceResultSets Execute.cpp:933
End thread(23)
97ms start(2374ms) collectAllDeviceResults Execute.cpp:1783
97ms start(2374ms) reduceMultiDeviceResults Execute.cpp:887
97ms start(2374ms) reduceMultiDeviceResultSets Execute.cpp:933
142ms start(2471ms) sort ResultSet.cpp:484
142ms start(2471ms) baselineSort ResultSet.cpp:544
0ms start(2613ms) createComparator ResultSet.h:638
0ms start(2614ms) topPermutation ResultSet.cpp:846
Re-running the query changing the year from 1993 to 1994, the total time is 441 ms.
Clearing the GPU memory cache with the command \clear_gpu and re-running the query, the total time went up to 874 ms, so it’s likely to transfer data and populate the GPU caches, the system is taking over 400ms.
So your guesses were mostly right, except we don’t load anything automatically when the Omnisci server start (there is a parameter that can be used to warm-up some caches), and the data resides in the GPU if needed, but for a complete GPU execution using the default parameters you need that all the data fit in GPU memory; if there isn’t enough GPU ram the query will be executed on CPU if there isn’t enough CPU memory, it’s likely to fail.
The GPU memory is allocated when needed, and you can limit using the GPU-buffer-mem-bytes that, as default, is 0 and will allocate all available GPU mem in your system (if needed).
I think I replied to your question, but If I forgot something, pleas tell me.
Best Regards,
Candido 