I tried the default garbage collector, ZGC and Shenandoah
I tried -Xmx with a value close to my graph folder size + 2 GB, and I also tried clearly bigger values
I checked all similar discussions here and on GitHub…
Each time, same result after a moment (depends on number of calls):
2022-10-25 10:00:02.140 [dw-285 - GET /route?point=49.0404143,2.031273&point=49.036322,2.078656&type=json&locale=en&elevation=false&profile=pt&pt.earliest_departure_time=2020-01-01T07%3A00%3A00.000Z&points_encoded=false&instructions=false&calc_points=false&alternative_route.max_paths=1] ERROR i.d.j.errors.LoggingExceptionMapper - Error handling a request: 87ff99237564283c
java.lang.OutOfMemoryError: Java heap space
at com.graphhopper.gtfs.PtGraph$$Lambda$503/0x00000008404d5c40.iterator(Unknown Source)
at com.graphhopper.gtfs.GraphExplorer$$Lambda$502/0x00000008404d6c40.iterator(Unknown Source)
at com.graphhopper.gtfs.GraphExplorer$$Lambda$490/0x00000008404cf040.iterator(Unknown Source)
at jdk.internal.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$$Lambda$398/0x000000084044fc40.invoke(Unknown Source)
I don’t think I have the same behaviour for other countries than France (I have one GH instance per country) and I can’t find the reason it always crashes with an OutOfMemoryError after a while…
It also seems to be related to PT route requests: I hosted the same config on 2 different machines, one dedicated to car/bike/foot routes and the other dedicated to PT. The car/bike/foot machine showed no memory exception, but the PT machine got the exception after a few hours.
FYI, I launch the process with this cmd: nohup /usr/bin/java -Xmx20g -jar /home/graphhoppermatrixsplit/graphhopper-web-6.0-matrix.jar server /home/graphhoppermatrixsplit/config_fr.yml &> nohupfr.out &
(-Xmx20g for the moment as an example because I tried different sizes…)
I get around 1000 calls per hour on the /route endpoint for the PT profile.
Very interested to know if I failed somewhere in the config or even the way to launch it or if there is a memory leak somewhere (if it’s the case, it’s a very specific case because I don’t have the same situation with other countries than France).
To me this looks like the search does not end for some reason and explores more and more nodes of the routing graph until the memory is exceeded. Is France the biggest country (in terms of nodes) you use? That could be the reason the error does not occur for the other countries. Do you have (some) very long running queries? And can you reproduce the problem for a single request or only when there are many parallel requests? If I were to debug this I would log the visited_nodes.sum parameter (should be available in the response of the routing requests): graphhopper/PtRouterImpl.java at 37380c01ac115b2fd11f4293b964eea3fde64c79 · graphhopper/graphhopper · GitHub and check if there are some requests where this value is very large. If you find one, the next question of course would be why this happens. It could be a bug or some rare data constellation that is not handled properly.
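A minimal sketch of that check, assuming the response body is JSON and carries the counter as a numeric field named visited_nodes.sum (the exact response layout is an assumption here, so adjust the pattern to what your version returns). The class name, method names and threshold are hypothetical, and it uses a regex instead of a JSON library only to stay self-contained:

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VisitedNodesCheck {
    // Matches e.g. "visited_nodes.sum":483372 anywhere in the raw JSON text (assumed field name).
    private static final Pattern SUM =
            Pattern.compile("\"visited_nodes\\.sum\"\\s*:\\s*(\\d+)");

    // Returns the counter if the field is present, otherwise an empty value.
    static OptionalLong visitedNodesSum(String responseJson) {
        Matcher m = SUM.matcher(responseJson);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
    }

    // True when the request visited at least `threshold` nodes and deserves a closer look.
    static boolean isSuspicious(String responseJson, long threshold) {
        OptionalLong sum = visitedNodesSum(responseJson);
        return sum.isPresent() && sum.getAsLong() >= threshold;
    }

    public static void main(String[] args) {
        // Hand-written sample body, not a real server response.
        String sample = "{\"hints\":{\"visited_nodes.sum\":483372},\"paths\":[]}";
        System.out.println(visitedNodesSum(sample) + " suspicious=" + isSuspicious(sample, 200_000));
    }
}
```

Logging the request URL together with this value for every call should make the outliers easy to spot.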
Indeed, France is the biggest (the 2 other main countries are Belgium and Luxembourg, clearly smaller!) but we only do calls for points at maximum 25 km (“as the crow flies”), no far far away points everywhere in the country…
I started to monitor the visited_nodes.sum and I already found different stats: most of the calls are under 100 ms but I can find different calls with long or very long time like these ones
Ok, but the ‘as the crow flies’ distance isn’t so important here. If (for some reason) the search algorithm does not find the target it may happen that it simply explores lots and lots of nodes, even ones that are very far from both the start and destination points.
I would try to provoke the OutOfMemoryError by sending the request that visited 1000000 nodes (this is the maximum) many times in parallel. If that works this is possibly the cause of the problem. You could also try to set the max visited nodes to an even higher number to see if even more / all (?) nodes are visited. The next question would then be why this happens for this specific request.
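A rough sketch of such a parallel replay, using only the JDK (java.net.http needs Java 11+). The class name, the 50 copies and the 10 threads are arbitrary placeholders, and the URL is whatever /route request you want to repeat, passed as the first argument:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelReplay {
    // Runs the same request `copies` times on a pool of `threads` workers and collects the results.
    static <T> List<T> replay(Callable<T> request, int copies, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<T>> tasks = new ArrayList<>();
            for (int i = 0; i < copies; i++) tasks.add(request);
            List<T> results = new ArrayList<>();
            for (Future<T> f : pool.invokeAll(tasks)) results.add(f.get());
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("usage: ParallelReplay <route-url>");
            return;
        }
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder(URI.create(args[0])).GET().build();
        // Each task sends the request once and returns the HTTP status code.
        Callable<Integer> call =
                () -> client.send(req, HttpResponse.BodyHandlers.ofString()).statusCode();
        System.out.println(replay(call, 50, 10));
    }
}
```

Watching the server's heap usage and log while this runs should show whether a handful of these requests in flight at once is enough to trigger the error.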
And yes if all this is true you could reduce the max visited nodes to make the memory error less likely.
Oh ok, good to know about the “as the crow flies” distance versus the search algorithm’s behavior!
To check with a heavy request but not the worst one, I called a request I had found with visited_nodes.sum = 483372. Every 2 or 3 calls on this one (I didn’t try in parallel, just simple manual calls from the browser), the memory consumption grows by ~300 MB! Then I tried the big one with visited_nodes.sum = 1000000, and the memory grew by 400 MB with just one request.
I guess I will limit my max visited nodes and monitor if it’s stable after that, thanks
Ok, it still might be interesting to see if any route can be found at all for the request with visited_nodes.sum=1000000 when you set the maximum to infinity (or anything larger than the absolute number of nodes, like maybe 100 million for France). But also, just the fact that some requests finish with a very high visited_nodes.sum, like 483372, probably means that there are some edges with very large weight that have to be explored to find the route, but are only explored after all the others because their weight is so high. And that could be something that needs to be, or at least can be, fixed.
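To illustrate the effect with a toy example: the sketch below is a plain textbook Dijkstra with a visited-node cap, not GraphHopper's PT search, and the graph, class and method names are made up for illustration. The target hangs off the start by a single very heavy edge, next to a long chain of cheap edges, so the search settles the whole chain before it reaches the target.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class CappedSearch {
    static final class Result {
        final boolean found;
        final int visited;
        Result(boolean found, int visited) { this.found = found; this.visited = visited; }
    }

    // edges.get(u) holds {v, weight} pairs; the search stops after maxVisitedNodes settled nodes.
    static Result dijkstra(List<List<int[]>> edges, int source, int target, int maxVisitedNodes) {
        long[] dist = new long[edges.size()];
        Arrays.fill(dist, Long.MAX_VALUE);
        dist[source] = 0;
        PriorityQueue<long[]> queue = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        queue.add(new long[]{0, source});
        int visited = 0;
        while (!queue.isEmpty()) {
            long[] top = queue.poll();
            int u = (int) top[1];
            if (top[0] > dist[u]) continue; // stale queue entry, node already settled cheaper
            visited++;
            if (u == target) return new Result(true, visited);
            if (visited >= maxVisitedNodes) return new Result(false, visited); // cap hit, give up
            for (int[] e : edges.get(u)) {
                long nd = top[0] + e[1];
                if (nd < dist[e[0]]) {
                    dist[e[0]] = nd;
                    queue.add(new long[]{nd, e[0]});
                }
            }
        }
        return new Result(false, visited);
    }

    // Node 0 = start, node 1 = target behind one heavy edge, nodes 2.. = chain of weight-1 edges.
    // chainLength must be at least 1.
    static List<List<int[]>> toyGraph(int chainLength, int heavyWeight) {
        int n = chainLength + 2;
        List<List<int[]>> edges = new ArrayList<>();
        for (int i = 0; i < n; i++) edges.add(new ArrayList<>());
        edges.get(0).add(new int[]{1, heavyWeight});
        edges.get(0).add(new int[]{2, 1});
        for (int i = 2; i < n - 1; i++) edges.get(i).add(new int[]{i + 1, 1});
        return edges;
    }

    public static void main(String[] args) {
        List<List<int[]>> g = toyGraph(1000, 1_000_000);
        Result full = dijkstra(g, 0, 1, Integer.MAX_VALUE);
        Result capped = dijkstra(g, 0, 1, 500);
        System.out.println("uncapped: found=" + full.found + " visited=" + full.visited);
        System.out.println("capped:   found=" + capped.found + " visited=" + capped.visited);
    }
}
```

Without the cap the target is found only after every cheaper node has been settled; with a cap below the chain length the search gives up early and reports no route. That is the trade-off of lowering the maximum: less memory per request, but some routes that do exist are no longer found.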
You can check if the config syntax changed by comparing with the config-example.yml file of the version you are using.
However, in this case I think the problem is that unfortunately routing.max_visited_nodes does not affect the max visited nodes for the pt routing. If I’m not mistaken the value is even hard-coded at the moment and cannot be changed at all without re-compiling the code:
Did you also try setting the max visited nodes to something like 100 million (larger than the number of nodes for all of France)? It would be interesting to see if the request that stopped at 1 million before explores even more nodes, or even leads to an OutOfMemoryError.
(example: this very easy situation which is working fine for car profile but not for foot:
I just checked this. The reason there is no route for foot is that the road seems to be tagged like highway=other. You can see this when you choose ‘Local MVT’ from the layers menu and zoom close to the red marker. When you hover the road it says ‘road_class: other’ which is not routeable. The reason it works for car is that in this case the red marker snaps to the nearby primary road instead.
So the question here is why it snaps to the non-routeable road for the foot profile and this could actually be a bug. Can you share the exact GH version and map file you used for this?
I put prepare.min_network_size: 0 in my config.yml a few weeks ago because GH was not able to find routes on a small island in France, so I am a little bit lost on the correct parameter to use so that “simple” cases like this one, in a city like Paris, do not break
When you set min_network_size: 0 you will be at risk that small isolated road network ‘islands’ will be kept. To prevent cases where no route can be found you need to use a larger value.
I took time to optimize some things on our side (adding caching on calls to GH, etc., to reduce the number of calls) and I also logged all the requests made before the issue happens (the main symptom is all cores suddenly becoming very busy). I found some calls on /route which were taking ~37 seconds, and here are the hints:
And this is happening with a version built with private int maxVisitedNodes = 300_000;
Do we have another place to put a max value? It seems this value is correctly used for most of the requests because almost all of them are under 300,000 or exactly on this number except a few of them which are running crazy