How much RAM to import planet.osm? (Feb 2018)

How much RAM are people allocating to import planet.osm.pbf with the foot profile and contraction hierarchies? I have been unsuccessful even on a DigitalOcean droplet with 128 GB of RAM using the latest master from GitHub (09b630f).

My latest run used this config.properties:

graph.flag_encoders=foot
prepare.ch.weightings=fastest
prepare.min_network_size=200
prepare.min_one_way_network_size=200
routing.non_ch.max_waypoint_distance=1000000
graph.dataaccess=RAM_STORE
datareader.instructions=false
prepare.ch.log_messages=100

And these flags:

export JAVA_OPTS="-server -Xmx125g -Xms125g"
./graphhopper.sh import planet-180129.osm.pbf

(Using the OSM snapshot from https://aws.amazon.com/public-datasets/osm/)

The job fails with:

2018-02-06 01:49:53,815 [main] INFO  com.graphhopper.routing.subnetwork.PrepareRoutingSubnetworks - optimize to remove subnetworks (5445145), unvisited-dead-end-nodes (4567937), maxEdges/node (42)
2018-02-06 05:20:47,513 [main] INFO  com.graphhopper.reader.osm.GraphHopperOSM - edges: 233521276, nodes 172301086, there were 5445145 subnetworks. removed them => 5444028 less nodes
2018-02-06 05:26:05,645 [main] INFO  com.graphhopper.storage.index.LocationIndexTree - location index created in 318.1291s, size:199 832 521, leafs:40 774 621, precision:300, depth:6, checksum:172301086, entries:[64, 64, 64, 64, 16, 4], entriesPerLeaf:4.9009047
2018-02-06 05:26:06,979 [main] INFO  com.graphhopper.routing.ch.CHAlgoFactoryDecorator - 1/1 calling CH prepare.doWork for fastest|foot ... (totalMB:126548, usedMB:96053)
./graphhopper.sh: line 300:  2449 Killed                  "$JAVA" $JAVA_OPTS -cp "$JAR" $GH_CLASS config=$CONFIG $GH_IMPORT_OPTS graph.location="$GRAPH" datareader.file="$OSM_FILE"

Any help would be appreciated, thanks!

I am using an older version from a few months ago and a slightly modified encoder and weighting. Importing planet.osm requires just slightly over 64 GB on my machine.

Thanks @Gerben - is your encoding more similar to foot or car? My understanding is that car only uses about 2/3 of the RAM that foot does. Also, how much physical RAM do you have, and how much of that do you give to Xmx/Xms?

Does it work without CH? Also, I cannot see an OutOfMemoryException; the process was just killed for some reason. I would expect that it should be possible to import the FootFlagEncoder with 128 GB of RAM.

The encoding is hike, which is a subclass of foot. I have 128 GB of memory, just like you, and I use JAVA_OPTS="-Xmx100g -Xms100g -server"

This happened to me when I set Xms too high. The Java process got killed because the rest of the system needed a little more memory than the room I had left for it.
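
If the same thing is happening here, the kernel's OOM killer usually leaves a note in the system log, so it is easy to confirm. A quick check (assuming a typical Linux setup; the exact log file varies by distribution):

dmesg | grep -i "killed process"
# or, on Ubuntu:
grep -i "out of memory" /var/log/syslog

If the java PID shows up there, the fix is simply to leave more headroom between Xmx/Xms and the physical RAM.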

Could it be that you have too little CPU power on this machine? The optimization in PrepareRoutingSubnetworks takes 1.5h in my case, not 3.5h. I’m using multiple profiles and it tells me:

CH prepare.doWork for fastest|hike … (totalMB:129140, usedMB:60741)

So the GC does not seem to do its work properly, resulting in much higher RAM usage. Which OS and JVM are you using?

OK, so my problem was definitely that I set Xms/Xmx too close to my physical RAM limit (thanks @Gerben). I switched to version 0.9 and am running the foot/fastest import on a Linode “memory-optimized” VPS with 100 GB of RAM and JAVA_OPTS="-server -Xconcurrentio -Xmx80g -Xms80g". It took 10 hours to get to “1/1 calling CH prepare.doWork for fastest|foot”, and that step has been running for about 20 hours.

@karussell this is the JVM/OS I’m using:

$ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

As an experiment I also tried running on a 60 GB machine with Xmx/Xms set to 45g (also Ubuntu, but using Oracle JDK 9 and G1GC). That also got to preparing the CH successfully, although it’s slightly slower.
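
For reference, the options on that second box were essentially just the smaller heap (G1 is already the default collector on JDK 9, so the explicit flag below is redundant but harmless):

export JAVA_OPTS="-server -Xmx45g -Xms45g -XX:+UseG1GC"
./graphhopper.sh import planet-180129.osm.pbf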

A couple of questions:

  • Are these import times expected? What is considered “normal” for a planet extract and the subsequent contraction steps?
  • @Gerben @karussell are you using a hosting provider or running on your own hardware? Once it gets to the “optimize” step in PrepareRoutingSubnetworks and the subsequent CH generation, it only uses a single CPU, so I don’t think more CPUs would help?
  • I’d eventually like to generate output for car, bike, and foot. Would memory requirements triple with 3 encodings (i.e. are all of the data structures duplicated?), or do they share the same base graph and just generate their own weights and CH, so the peak RAM requirement would be less than 3x?

Are these import times expected? What is considered “normal” for a planet extract and the subsequent contraction steps?

I’ve just looked into the logs: with several outdoor profiles we are at ~6h for the import, and the logs say ~65 GB is used (but the real figure could be lower, as this is not measured right after a GC). Keep in mind that we have elevation activated, and the first time this takes much longer.

Then CH takes an additional ~13h, plus LM in less than 3h. (We do many profiles for CH+LM in parallel, see below.)

(Motor vehicle profiles are much faster to prepare, as there are fewer edges in the graph, IMO.)
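
For reference, elevation is enabled via config entries roughly like these (option names as in the config-example.properties of that era, so double-check them against your version):

graph.elevation.provider=srtm
# directory where the downloaded elevation tiles are cached
graph.elevation.cache_dir=./srtmprovider/

The first run is slower mainly because these tiles have to be downloaded and cached.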

Are you using a hosting provider or running on your own hardware?

using bare metal

Once it gets to the “optimize” step in PrepareRoutingSubnetworks and the subsequent CH generation, it only uses a single CPU, so I don’t think more CPUs would help?

The import should use at least 3 CPUs: 2 threads for parsing and creating the graph, and another for GC. When doing CH you mostly use 1 CPU if you have one profile. If you have more profiles you can utilize more threads, but this requires substantially more RAM.

I’d eventually like to generate output for car, bike, and foot. Would memory requirements triple with 3 encodings (i.e. are all of the data structures duplicated?), or do they share the same base graph and just generate their own weights and CH, so the peak RAM requirement would be less than 3x?

Yes, they would share the base graph and each add their own CH profile.
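
So a combined setup is just a matter of listing all encoders, roughly like this; per profile, only the CH shortcuts come on top of the shared base graph (check config-example.properties for the key that controls the number of preparation threads in your version):

graph.flag_encoders=car,bike,foot
# one CH preparation per encoder; each adds its own shortcuts
prepare.ch.weightings=fastest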

Thanks for the help. Both planet imports (Xmx=45g and Xmx=80g) completed successfully in about 35 hours. I think the performance discrepancy is just due to slow CPUs on Linode/DigitalOcean. The output was ~23 GB.

I tested importing a smaller 1 GB osm.pbf extract on 3 VPSes to compare performance:

  • DigitalOcean “optimized” droplet (8 GB RAM, 4 dedicated vCPUs) finished in 34m
  • DigitalOcean “regular” droplet (8 GB RAM, 4 shared vCPUs) finished in 58m
  • Linode memory-optimized instance finished in 64m

So I think if I ran the planet import on an optimized droplet, the import time would be closer to the ~20 hours you get on bare metal. I’d be curious if there’s a way to parallelize work on PrepareContractionHierarchies.contractNodes() to take advantage of more CPUs though.

Just to clarify before I burn too much more money running tests…

  1. If the foot profile finished with 45 GB of RAM, how much more would you expect 3 profiles to use, with parallel vs. sequential contraction hierarchy generation?
  2. What part of the import process do datareader.dataaccess=MMAP and graph.dataaccess reduce memory requirements for? Does the entire dataset need to get loaded into RAM at some point regardless of MMAP/RAM_STORE?

BTW: I forgot to mention that we also use an SSD.

I’d be curious if there’s a way to parallelize work on PrepareContractionHierarchies.contractNodes() to take advantage of more CPUs though.

no

if the foot profile finished with 45 GB of RAM, would you expect multiple profiles to fit in roughly the same RAM if I don’t parallelize them?

no, as the shortcuts require more memory

What part of the import process do datareader.dataaccess=MMAP and graph.dataaccess reduce memory requirements for? Does the entire dataset need to get loaded into RAM at some point regardless of MMAP/RAM_STORE?

Only the base graph itself can be offloaded via MMAP. But this will make the import much slower, and even unusable without an SSD.
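
So relative to the config at the top of this thread, a lower-memory variant would only swap the data access settings (and assumes an SSD):

graph.dataaccess=MMAP
datareader.dataaccess=MMAP

Everything besides the base graph still lives on the Java heap, so this reduces, but does not remove, the peak requirement.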

Thanks for the help here. For anyone who stumbles on this in the future, I did end up getting foot/fastest to import in 12 hours (4 to import, 8 to contract) on a DigitalOcean 64 GB CPU-optimized droplet (Xmx/Xms=54g). CPU-optimized is a bit more expensive per hour but runs about 3x faster, so it is worth it.
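
Pieced together from this thread, the combination that worked for me looked roughly like this (64 GB droplet, leaving ~10 GB of headroom so the kernel doesn’t kill the JVM):

graph.flag_encoders=foot
prepare.ch.weightings=fastest
graph.dataaccess=RAM_STORE

export JAVA_OPTS="-server -Xmx54g -Xms54g"
./graphhopper.sh import planet-180129.osm.pbf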