Additional details on num-reader-threads configuration parameter

Does the num-reader-threads parameter control the number of threads used when loading data into a table, or when loading table data into memory?


Hi @Ray1, it controls the latter, the number of threads used when reading persisted table data into memory. It defaults to the number of hardware threads if not specified.

Hopefully this helps!
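For reference, a minimal sketch of how you might cap it at server start (the binary name, data path, and exact flag spelling here are assumptions based on this thread — check `--help` on your version):

```shell
# Start the server capping reader threads at 16 instead of all hardware threads
./omnisci_server --data /var/lib/omnisci/data --num-reader-threads 16
```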


Hi @Ray1,

To add to what @todd said, you can also control the number of CPU threads used when loading data with the COPY command via its optional `threads` parameter.

That way you can achieve a nice balance of CPU resources used for loading, reading, and processing data.
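For example (the table and file names are just placeholders):

```sql
-- Load a CSV using only 4 CPU threads, leaving the rest free for queries
COPY flights_sk FROM '/data/flights.csv' WITH (threads = 4);
```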

Thanks, these parameters will be useful.

I’m trying to understand how queries perform in a multi-user environment.

If a query triggers a data load, will other queries be blocked waiting on the load if num-reader-threads is less than the total number of threads?

Will other queries be blocked on a \copy command if the thread parameter value is less than the total?

Finally, does use of pymapd load_table_columnar have a significant impact on query performance?


Nope, you can query the tables while data is loading, without being blocked.

Naturally the queries will be slower, because each time you run the query the new data has to be read from disk and put into memory before being processed.

If the sum of the COPY threads and num-reader-threads is higher than the number of cores, the OS will manage the resource allocation of each process, but you won't be blocked at all.

About pymapd: load_table_columnar uses a Thrift endpoint on the server, probably similar to the one used by the COPY command, so it's not going to affect performance in a significant way.
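As a sketch, one common pattern is to feed load_table_columnar in batches so each Thrift payload stays modest. The batching helper below is plain, runnable Python of my own; the pymapd connection details in the commented part are placeholders for your own server:

```python
def batches(rows, size):
    """Yield successive slices of `rows`, so each load call stays modest."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Standalone illustration: 10 rows in batches of 4 -> slices of 4, 4 and 2
chunks = list(batches(list(range(10)), 4))

# With a live server it would look roughly like this (placeholders, untested here):
# import pymapd
# con = pymapd.connect(user="admin", password="...", host="localhost", dbname="omnisci")
# for chunk_df in dataframe_batches:          # pandas DataFrame slices
#     con.load_table_columnar("flights_sk", chunk_df)
```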

Thanks, Candido,

I have found that if I run two queries at about the same time, the second query has to wait for the first to complete. If the first query requires loading data into memory and runs longer, the second query also takes that much longer. Can num-reader-threads be used to speed up the second query? My guess is that num-reader-threads should be set high so the first query runs as quickly as possible.

Are there disadvantages to using all available threads for num-reader-threads?

Are there any other settings I can use to improve query performance in a multi-user environment?

Thanks again,

Hi @Ray1,

Right now, queries cannot run fully concurrently, though some stages of different queries can overlap, as @anon45925161 explained to us in this post.

At the time of writing, the process of populating the caches doesn't run concurrently with the other stages of a query, so if it takes a lot of time, it is going to slow down everything.

Anyway, I have to confess that I've never benchmarked num-reader-threads this way, but I think it makes sense; it heavily depends on how fast your disk subsystem is.

With a very fast disk subsystem (3 NVMe disks, four PCIe lanes each, in RAID 0) with a potential throughput of 12 GB per second, using 6 threads I got a throughput of 3.8 GB/sec.

omnisql> select count(*) from flights_sk;
1 rows returned.
Execution time: 1235 ms, Total time: 1236 ms
omnisql> select extract(hour from dep_timestamp),avg(arrdelay),avg(depdelay),avg(distance) from flights_sk
..> where dep_timestamp between '2004-01-01 00:00:00' and '2006-12-31 00:00:00'
..> group by 1;
24 rows returned.
Execution time: 1242 ms, Total time: 1243 ms
omnisql> \memory_summary
OmniSci Server CPU Memory Summary:
            MAX            USE      ALLOCATED           FREE
    51433.93 MB     2746.58 MB     4096.00 MB     1349.42 MB

OmniSci Server GPU Memory Summary:
[GPU]            MAX            USE      ALLOCATED           FREE
  [0]     9681.76 MB     1525.88 MB     2048.00 MB      522.12 MB
  [1]     9680.45 MB     1220.70 MB     2048.00 MB      827.30 MB

1243ms total duration for sql_execute
  1167ms start(76ms) executeRelAlgQuery RelAlgExecutor.cpp:157
    1167ms start(76ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:181
      0ms start(76ms) Query pre-execution steps RelAlgExecutor.cpp:182
      1166ms start(76ms) executeRelAlgSeq RelAlgExecutor.cpp:448
        1166ms start(76ms) executeRelAlgStep RelAlgExecutor.cpp:527
          1166ms start(76ms) executeCompound RelAlgExecutor.cpp:1515
            1166ms start(76ms) executeWorkUnit RelAlgExecutor.cpp:2627
              166ms start(76ms) compileWorkUnit NativeCodegen.cpp:1809
                New thread(3)
                  726ms start(0ms) fetchChunks Execute.cpp:2207
                  2ms start(726ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:705
                  270ms start(728ms) executePlanWithGroupBy Execute.cpp:2749
                    270ms start(728ms) launchGpuCode QueryExecutionContext.cpp:197
                    0ms start(998ms) getRowSet QueryExecutionContext.cpp:134
                      0ms start(998ms) reduceMultiDeviceResults Execute.cpp:875
                        0ms start(998ms) reduceMultiDeviceResultSets Execute.cpp:899
                End thread(3)
                New thread(4)
                  631ms start(0ms) fetchChunks Execute.cpp:2207
                  3ms start(632ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:705
                  199ms start(635ms) executePlanWithGroupBy Execute.cpp:2749
                    194ms start(635ms) launchGpuCode QueryExecutionContext.cpp:197
                    5ms start(829ms) getRowSet QueryExecutionContext.cpp:134
                      4ms start(829ms) reduceMultiDeviceResults Execute.cpp:875
                        4ms start(829ms) reduceMultiDeviceResultSets Execute.cpp:899
                End thread(4)
              0ms start(1242ms) collectAllDeviceResults Execute.cpp:1690
                0ms start(1242ms) reduceMultiDeviceResults Execute.cpp:875
                  0ms start(1242ms) reduceMultiDeviceResultSets Execute.cpp:899

Just take a look at the fetchChunks section of sql_execute and divide the memory used by the query by the time taken by the slowest thread. It's not entirely read from disk, because the data has also been transferred into GPU memory (that transfer should be around 200 ms, a value I got from a subsequent query), so the disk-to-CPU-memory throughput is about 5.2 GB per second.
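To make the arithmetic explicit, here it is with the numbers from the \memory_summary and timer output above (treating MB/ms as roughly GB/s):

```python
# Back-of-envelope throughput from the figures above
mem_used_mb = 2746.58     # CPU memory USE from \memory_summary
fetch_ms = 726            # slowest fetchChunks (thread 3)
gpu_transfer_ms = 200     # estimated CPU->GPU copy, from a subsequent run

naive_gbps = mem_used_mb / fetch_ms                     # includes the GPU copy
disk_gbps = mem_used_mb / (fetch_ms - gpu_transfer_ms)  # disk-to-CPU-memory only

print(round(naive_gbps, 1))  # ~3.8 GB/s
print(round(disk_gbps, 1))   # ~5.2 GB/s
```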

Trying with other values, I got these results.

Number of Threads | Throughput (GB/sec) | Total Time (ms) | Time to Populate GPU Caches (ms)
1 or 2            | 3.4                 | 1024            | 239
6                 | 4.8                 | 772             | 217
48                | 5.2                 | 726             | 212
Default           | 5.0                 | 728             | 193

So with speedy storage on a decent CPU the parameter is important, but changing it from the default isn't going to change the time needed to populate the CPU caches by much.

I will post more numbers, with different storage options ASAP.
Of course, it would be nice to get a number from you too.

Best Regards