parallelize the COPY phase #2426

tvondra · 2025-10-28T13:18:16Z

Hi,

I've been doing some testing with OSM data, and I noticed that a significant part of the load is taken by COPY, which happens without parallelism. Are there any plans to parallelize this, either by loading multiple tables concurrently, or splitting the data into smaller chunks and loading them through multiple connections?

I'm not very familiar with the data structure, so maybe there are dependencies that make this impossible / inefficient. But it's a bit sad to not be able to better utilize available hardware resources.

joto · 2025-10-29T07:46:36Z

Maintainer

In the usual configuration there are two threads doing COPYs: one for the "middle" tables (in slim mode only) and one for the output tables. Data is collected in chunks and then sent via a queue to those threads for the actual COPY operation. We could use a thread pool instead of those two threads for the actual COPY, but we never thought that this would improve the situation much. In the end the bottleneck is probably the I/O, isn't it? And doing more of this in parallel means more contention on the WAL and, if we are writing to the same table in multiple COPYs at once, more contention on that table. So it is unclear to me why having more parallelism would help significantly. Doing anything with multithreading in C++ code is always a pain, so keeping this code as simple as possible is also important.

But maybe we are wrong there and didn't take some issue into account. And if somebody wanted to try this, that would be great; we'd get actual data.
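
To make the design under discussion concrete, here is a minimal sketch of the "chunks go through a queue to N COPY threads" idea. It is written in Python for brevity and is not osm2pgsql's actual C++ code. The names `run_copy_workers` and `copy_chunk` are hypothetical; `copy_chunk` stands in for "send this buffer to the server with COPY on this worker's own connection".

```python
import queue
import threading


def run_copy_workers(chunks, copy_chunk, num_workers=2, max_pending=8):
    """Feed chunks through a bounded queue to num_workers threads.

    copy_chunk(worker_id, chunk) is a hypothetical callback that would
    perform the actual COPY on a connection owned by that worker.
    """
    # Bounded queue: the producer blocks when the workers fall behind.
    pending = queue.Queue(maxsize=max_pending)
    errors = []

    def worker(worker_id):
        while True:
            chunk = pending.get()
            if chunk is None:  # sentinel: no more work for this worker
                return
            try:
                copy_chunk(worker_id, chunk)
            except Exception as exc:  # remember the failure, keep draining
                errors.append(exc)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for chunk in chunks:
        pending.put(chunk)
    for _ in threads:  # one sentinel per worker
        pending.put(None)
    for t in threads:
        t.join()
    if errors:
        raise errors[0]
```

With `num_workers=2` this roughly mirrors the current two-thread setup; raising it is the thread-pool variant. Note that chunks may reach the server in a different order than they were produced.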

3 replies

tvondra Oct 29, 2025
Author

I don't have great data; it's mostly based on watching "top" while osm2pgsql is running. Most of the time there's just a single backend doing COPY and consuming 100% of a CPU core. Like this:

    PID  %CPU  %MEM     TIME+ COMMAND
  41997 123.5  10.7  19:15.48 osm2pgsql --drop -c --verbose --log-level debug -k -H localhost -d osm planet-251020.osm.pbf
  42007 100.0   0.1   7:09.87 postgres: azureuser osm ::1(41384) COPY
      1   0.0   0.0   0:03.42 /lib/systemd/systemd --system --deserialize=27
      2   0.0   0.0   0:00.02 [kthreadd]

I'm sure there are periods when it really is I/O bound, but this is clearly CPU bound. Processing COPY is not exactly free, and most of a perf profile is related to parsing the input, forming tuples, etc. That should parallelize pretty well, I think.

I don't think WAL contention, or contention on the relation, would be a problem. Loading through multiple concurrent COPYs is a strategy we often use when generating large amounts of data for testing, and it works great. Of course, it assumes the load does not get I/O bound (particularly on WAL). Sure, if the storage can't handle that, you won't get an improvement. But parallelism is meant to help "good" systems that don't have this bottleneck. (I'm testing this on a VM with 400GB of RAM and 6 NVMe drives in RAID0. It really is not I/O bound.)

Also, these are bulk WAL writes - large sequential writes, with very few fsyncs. So the system won't wait for the WAL all that much anyway. This is what strace tells me for the COPY backend:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.13    0.998593           4    239801       215 recvfrom
  1.78    0.018331           1     13106           pwrite64
  0.37    0.003829           0      5774           lseek
  0.37    0.003827          11       327           pwritev
  0.28    0.002916           1      1919           fallocate
  0.02    0.000219           2        95        66 openat
  0.01    0.000144           0       218           sendto
  0.01    0.000108           0       144           pread64
  0.01    0.000071           0       212           epoll_wait
  0.00    0.000049           1        28           close
  0.00    0.000014           4         3           rt_sigreturn
  0.00    0.000012           4         3           getpid
  0.00    0.000007           2         3           setitimer
  0.00    0.000000           0         1           kill
------ ----------- ----------- --------- --------- ----------------
100.00    1.028120           3    261634       281 total

There's not a single fsync; it's all about reading data from the connection and writing pages to disk.
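
For anyone who wants to repeat this check on their own system, the summary can be tallied with a few lines of Python. This is only a sketch: the rows are copied from the table above, and the helper names `parse_strace_summary` and `share_of_time` are made up for illustration.

```python
STRACE_SUMMARY = """\
 97.13    0.998593           4    239801       215 recvfrom
  1.78    0.018331           1     13106           pwrite64
  0.37    0.003829           0      5774           lseek
  0.37    0.003827          11       327           pwritev
  0.28    0.002916           1      1919           fallocate
  0.02    0.000219           2        95        66 openat
  0.01    0.000144           0       218           sendto
  0.01    0.000108           0       144           pread64
  0.01    0.000071           0       212           epoll_wait
  0.00    0.000049           1        28           close
  0.00    0.000014           4         3           rt_sigreturn
  0.00    0.000012           4         3           getpid
  0.00    0.000007           2         3           setitimer
  0.00    0.000000           0         1           kill
"""


def parse_strace_summary(text):
    """Return {syscall: (seconds, calls)} from `strace -c` data rows."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows: % time, seconds, usecs/call, calls, [errors], syscall
        if len(fields) not in (5, 6):
            continue
        try:
            seconds, calls = float(fields[1]), int(fields[3])
        except ValueError:  # header or separator line
            continue
        stats[fields[-1]] = (seconds, calls)
    return stats


def share_of_time(stats, names):
    """Percentage of the total syscall time spent in the given syscalls."""
    total = sum(seconds for seconds, _ in stats.values())
    return 100.0 * sum(stats[n][0] for n in names if n in stats) / total


stats = parse_strace_summary(STRACE_SUMMARY)
has_fsync = any(name in stats for name in ("fsync", "fdatasync"))
recv_share = share_of_time(stats, ["recvfrom"])
write_share = share_of_time(stats, ["pwrite64", "pwritev"])
```

Here `has_fsync` comes out false, and `recv_share` dwarfs `write_share`. Keep in mind that these figures cover only time spent in system calls, not the CPU time the backend spends parsing input and forming tuples.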

Still, I may be wrong. I know a thing or two about Postgres, but I'm not all that familiar with OSM or the osm2pgsql code. I only use it to evaluate Postgres improvements, etc. I won't be able to improve osm2pgsql myself (say, by adjusting the code to use a thread pool), but I'll be able to test / evaluate a patch if someone prepares one.

pnorman Oct 30, 2025
Maintainer

> In the end the bottleneck is probably the I/O, isn't it? And doing more of this in parallel means more contention on the WAL and, if we are writing to the same table in multiple COPYs at once, more contention on that table.

On the Postgres side, multiple connections doing COPY at the same time will generally be faster. WAL contention doesn't come into play at all for the output tables because they're UNLOGGED at that stage and no WAL exists for them. Even for the slim tables, we have synchronous_commit off, so very few fsyncs are issued.

If it is limited by I/O throughput (MB/s or IOPS), you want it very parallel. Modern NVMe SSDs scale with queue depth. Manufacturer spec sheets use queue depths of 128+ for random workloads. Sequential seems to be lower, with one source quoting 32. Even back in 2015 Intel was showing the best performance was at queue depths >100.
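
A back-of-the-envelope way to see why queue depth matters is Little's law: the number of requests in flight equals throughput times latency. The sketch below assumes a fixed per-request latency and ignores device saturation, so it gives an upper bound rather than a prediction. The function name and the 100 µs latency are illustrative assumptions, not measurements.

```python
def iops_upper_bound(queue_depth, latency_us):
    """Little's law: throughput = requests in flight / time per request.

    latency_us is the assumed per-request latency in microseconds.
    """
    return queue_depth * 1_000_000 / latency_us


# Hypothetical device with 100 microseconds per request:
single = iops_upper_bound(1, 100)   # one request in flight at a time
deep = iops_upper_bound(32, 100)    # 32 requests in flight
```

Under these assumptions a single synchronous writer cannot exceed `single` requests per second no matter how fast the device is, which is the argument for issuing I/O from several connections at once.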

tvondra Oct 30, 2025
Author

I don't think the I/O depth matters very much for osm2pgsql, at least on the machine I'm using. With 400GB RAM everything is in memory, so there's literally no read I/O at all. It might matter for writes, but those generally happen through page cache, so it's up to the kernel to figure this out. And it's happening in the background (e.g. in checkpointer), not in the backend.

joto · 2025-10-30T08:34:52Z

Maintainer

@tvondra When COPYing into the same table from multiple threads, do you see a possible issue with data ordering? What I mean is that in one case (the middle tables in slim mode), those tables will be written in the order of their primary key id. I assume this to be a good thing; at least building the index should be faster. If we write from multiple threads, the table will not be as ordered. Do you foresee any issues there? (I wouldn't expect this to be a large issue, since with changes afterwards the table will get unsorted anyway, but I just thought I'd check.)

3 replies

pnorman Oct 30, 2025
Maintainer

I can check some index details before and after a cluster by ID. I don't expect it will make a big difference because the slow index is the one on way nodes.

tvondra Oct 30, 2025
Author

I think small imperfections in data ordering are fine. Data locality helps if the workload reads keys in this order, or when loading data into a preexisting index. With multiple threads inserting data, the data will still be mostly ordered, with small "localized" differences. But small differences make no measurable difference: the cache hit ratio remains very high, etc. So I don't think this would be a problem in practice.

Also, aren't some of the tables clustered? In that case the insert ordering doesn't matter at all.
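
The "mostly ordered" point can be illustrated with a toy simulation. The model is an assumption rather than measured behaviour: IDs arrive sorted, are cut into chunks, and each batch of `num_workers` consecutive chunks may land in the table in any order. All names here are hypothetical.

```python
import random


def simulate_parallel_load(num_rows, chunk_size, num_workers, seed=0):
    """Return the physical row order after a simplified parallel load."""
    rng = random.Random(seed)
    chunks = [list(range(start, min(start + chunk_size, num_rows)))
              for start in range(0, num_rows, chunk_size)]
    table = []
    for i in range(0, len(chunks), num_workers):
        batch = chunks[i:i + num_workers]
        rng.shuffle(batch)  # workers finish their chunks in arbitrary order
        for chunk in batch:
            table.extend(chunk)
    return table


def max_displacement(table):
    """Largest distance between a row's position and its sorted position."""
    return max(abs(pos - row_id) for pos, row_id in enumerate(table))


table = simulate_parallel_load(num_rows=1000, chunk_size=10, num_workers=4)
worst = max_displacement(table)
```

In this model no row ends up further than `num_workers * chunk_size` positions from where a single-threaded load would have put it, so the disorder stays local regardless of table size. Real concurrent COPYs interleave differently, so treat this as intuition only.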

joto Oct 30, 2025
Maintainer

Yeah, the "output" tables will be clustered, so for them it doesn't matter. It only possibly matters for the "middle" tables. But it looks like I might have been too cautious there. Thanks for your input.
