Performance Issue after some time

I have an issue similar to this one; however, I use CH with the latest master (0.12) in my test case.

On a test cloud VPS (8GB RAM, 4 cores), a pretty complex test route takes <200ms to calculate, which is good. There is no activity on the server other than my own.

However, after about a day the timing suddenly jumps to roughly 1.5-2 secs or more. I disabled swap, but that does not seem to help. RAM does not seem to be an issue; here is a snapshot taken while experiencing the problem:

~$ free
              total        used        free      shared  buff/cache   available
Mem:        8167876     3202564     4825148         564      140164     4758020
Swap:             0           0           0

The top command shows almost 0% CPU usage when idle, and it goes up as expected during calculation.

I know that on a VPS there can be other reasons why CPU performance decreases; however, if I kill and restart GraphHopper, the timing immediately goes back to <200ms, so the issue does not seem related to the CPU available in the VPS…

Any clue where I could look? I tend to believe this is related to my VPS rather than to GraphHopper itself, but perhaps someone has experienced similar behavior.

Thanks!

How large is your graph-cache folder (i.e. usage on disk) and how much RAM do you give the service via the -Xmx (and -Xms) settings? Did you have a look at the deployment guide?

The top command shows almost 0% CPU usage when idle, and it goes up as expected during calculation.

Which means there is no GC problem. Strange.

Can you post the config.yml that you are using? Are you using the stable 0.11 release or some branch :slight_smile: ?

I tend to believe this is related to my VPS

In those cases the only “problem” should be slower response times, not such strange behaviour.

Oh my, I thought I was using 0.11 but actually I’m using the latest master (no branches) - I updated my first post. I’m using 4GB for both Xmx and Xms.

Config.yml is:

graphhopper:

  graph.flag_encoders: car,notollcar
  prepare.ch.weightings: fastest
  prepare.min_network_size: 200
  prepare.min_one_way_network_size: 200
  routing.non_ch.max_waypoint_distance: 1000000
  graph.dataaccess: RAM_STORE

server:
  applicationConnectors:
  - type: http
    port: 8989
    bindHost: localhost
  requestLog:
      appenders: []
  adminConnectors:
  - type: http
    port: 8990
    bindHost: localhost

notollcar is just like car but without… you can imagine. I experience the issue with both car and notollcar.

I’ll have a deeper look at the deployment guide.

Thanks!

And how large is the graph-cache? If the graph-cache is close to the Xmx size, then increase the Xmx setting. If Xmx is close to the machine’s RAM, then it is crucial to decrease it, e.g. to 3GB or even less, as the OS and even the JVM itself need additional RAM for certain tasks.

If the graph cache is the disk space consumed by the *-gh folder, it’s about 1.5GB. Machine RAM is 8GB, so using 4GB for Xmx and Xms seems fine… I’m trying some JVM options to see if the behavior changes - I’ll keep you posted.

Strange. Are you using the latest jdk12 or 8?

I thought I was using 0.11 but actually I’m using the latest master (no branches)

Also try whether you can reproduce the strange behaviour with 0.11.

I’ve got the latest JDK 8. Let’s wait 1-2 days to see if the Java opts tweaks have any effect; I’ll try JDK 12 and 0.11 later on if the problem persists. Thanks

Playing with some JAVA_OPTS and moving to the latest JDK 11 did not help. However, in the meantime I discovered that when performance is bad, if I continue to stress the system, performance starts improving and goes back to the original (i.e. I don’t have to stop/restart GraphHopper). I’m wondering if this is related to my VPS hosting reducing VM resources when they are idle for some time - which I’m sure happens, since I’m the sole user. I installed a simple benchmark tool (sysbench) but I can’t seem to find real proof of what I’m saying - although you can find several complaints around VPS slowness after being idle for a while.

While continuing my investigation on the VPS itself, here are GH logs for the very same route.

This took about 0.2 secs:

INFO  [2018-11-08 07:47:37,622] com.graphhopper.resources.RouteResource: point=44.071800%2C10.371090&point=44.933700%2C11.733400&point=45.151050%2C10.832520&point=45.752730%2C9.799300&point=45.244370%2C8.723300&point=44.762340%2C8.833010&point=44.402390%2C9.646000&elevation=false&locale=en-US&use_miles=false&weighting=fastest&vehicle=car& 127.0.0.1 it_IT Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0 [44.0718,10.37109, 44.9337,11.7334, 45.15105,10.83252, 45.75273,9.7993, 45.24437,8.7233, 44.76234,8.83301, 44.40239,9.646], took:0.029609261, , fastest, car, alternatives: 1, distance0: 836413.6882393945, weight0: 37016.722285222844, time0: 617min, points0: 6659, debugInfo: idLookup:0.001535803s; , algoInit:1.15249E-4s, dijkstrabi|ch-routing:0.002127511s, extract time:5.15389E-4;visited nodes sum: 387;, algoInit:2.39574E-4s, dijkstrabi|ch-routing:0.001386596s, extract time:3.4599E-4;visited nodes sum: 657;, algoInit:8.567E-5s, dijkstrabi|ch-routing:0.001539026s, extract time:1.79684E-4;visited nodes sum: 987;, algoInit:8.0259E-5s, dijkstrabi|ch-routing:0.001627627s, extract time:3.40914E-4;visited nodes sum: 1309;, algoInit:8.1002E-5s, dijkstrabi|ch-routing:9.26019E-4s, extract time:1.82918E-4;visited nodes sum: 1503;, algoInit:7.6826E-5s, dijkstrabi|ch-routing:0.001187071s, extract time:3.34244E-4;visited nodes sum: 1809;, algoInit:1.15249E-4s, dijkstrabi|ch-routing:0.002127511s, extract time:5.15389E-4;visited nodes sum: 387;, algoInit:2.39574E-4s, dijkstrabi|ch-routing:0.001386596s, extract time:3.4599E-4;visited nodes sum: 657;, algoInit:8.567E-5s, dijkstrabi|ch-routing:0.001539026s, extract time:1.79684E-4;visited nodes sum: 987;, algoInit:8.0259E-5s, dijkstrabi|ch-routing:0.001627627s, extract time:3.40914E-4;visited nodes sum: 1309;, algoInit:8.1002E-5s, dijkstrabi|ch-routing:9.26019E-4s, extract time:1.82918E-4;visited nodes sum: 1503;, algoInit:7.6826E-5s, dijkstrabi|ch-routing:0.001187071s, extract time:3.34244E-4;visited nodes sum: 1809, simplify (9097->9097)

The very same route a few hours later took 1.3 secs:

INFO  [2018-11-08 17:09:27,862] com.graphhopper.resources.RouteResource: point=44.071800%2C10.371090&point=44.933700%2C11.733400&point=45.151050%2C10.832520&point=45.752730%2C9.799300&point=45.244370%2C8.723300&point=44.762340%2C8.833010&point=44.402390%2C9.646000&elevation=false&locale=en-US&use_miles=false&weighting=fastest&vehicle=car& 127.0.0.1 en_US Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0 [44.0718,10.37109, 44.9337,11.7334, 45.15105,10.83252, 45.75273,9.7993, 45.24437,8.7233, 44.76234,8.83301, 44.40239,9.646], took:1.0017631, , fastest, car, alternatives: 1, distance0: 836413.6882393945, weight0: 37016.722285222844, time0: 617min, points0: 6659, debugInfo: idLookup:0.0682984s; , algoInit:0.009989078s, dijkstrabi|ch-routing:0.028843064s, extract time:0.019230144;visited nodes sum: 387;, algoInit:0.012513436s, dijkstrabi|ch-routing:0.02119674s, extract time:0.016239831;visited nodes sum: 657;, algoInit:0.01274486s, dijkstrabi|ch-routing:0.014257008s, extract time:0.00810288;visited nodes sum: 987;, algoInit:0.020222537s, dijkstrabi|ch-routing:0.012229018s, extract time:0.008115922;visited nodes sum: 1309;, algoInit:0.008919194s, dijkstrabi|ch-routing:0.017351024s, extract time:0.010871838;visited nodes sum: 1503;, algoInit:0.00816546s, dijkstrabi|ch-routing:0.020060647s, extract time:0.016663527;visited nodes sum: 1809;, algoInit:0.009989078s, dijkstrabi|ch-routing:0.028843064s, extract time:0.019230144;visited nodes sum: 387;, algoInit:0.012513436s, dijkstrabi|ch-routing:0.02119674s, extract time:0.016239831;visited nodes sum: 657;, algoInit:0.01274486s, dijkstrabi|ch-routing:0.014257008s, extract time:0.00810288;visited nodes sum: 987;, algoInit:0.020222537s, dijkstrabi|ch-routing:0.012229018s, extract time:0.008115922;visited nodes sum: 1309;, algoInit:0.008919194s, dijkstrabi|ch-routing:0.017351024s, extract time:0.010871838;visited nodes sum: 1503;, algoInit:0.00816546s, dijkstrabi|ch-routing:0.020060647s, extract time:0.016663527;visited nodes sum: 1809, simplify (9097->9097)

It really seems each step is much slower… however, sysbench CPU/memory tests report similar figures - but maybe I’m not benchmarking the right thing here (any suggestion on which benchmark to run is appreciated). Note that a few minutes after the second attempt (and then several others), performance went back to the expected level.

How many cores (or better, threads) are allowed on this server? Maybe the problem is the garbage collector working in parallel with the actual route request?

4 cores and (cat /proc/sys/kernel/threads-max) 63368 threads

We may be getting closer, thanks for the suggestion to look at the garbage collector: JDK 11 has an experimental option to completely disable GC. It’s not a solution, but it will help to understand whether GC is the root cause - I need to test more, but it seems to have an effect. One effect for sure is on the startup time: 7 secs vs. more than 100!

Let me also test with -XX:+UseG1GC, as suggested in the deployment guide… I should have done that before.

When is the startup time more than 100 seconds? You could try something really experimental like GraalVM native compilation, which should reduce the startup time massively.

Instead of investing too much time in other issues: can you just try a non-VPS system? (Again, we have seen slower response times on a VPS, but only in the 20% range.)

UseG1GC

It’s unlikely that this improves anything, as JDK 11 has this as the default option now.

Other GCs you can try are Shenandoah GC or, for Linux systems, ZGC.

Right, I found out that G1GC is the default now (as well as -Xconcurrentio, from what I understood).

I also replicated the problem with GC completely disabled (“-XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC”), so at least we know GC is not the issue.

100 secs to start does not seem right. My local system is likely 2x faster than the cloud VPS, and it starts in less than 5 secs for a comparable graph.

Now… so far I’ve been using a recent master compiled with Java 8 - I then switched to Java 11 but did not recompile.

I just rebuilt (recompiled and ran) 0.11, stopped and restarted it, and it started in just 7 secs. Let’s now see if I still get the issue. If so, I’m going to raise with the VPS provider the concern that something is going wrong - maybe disk I/O.

Questions:

  1. After GraphHopper startup, my understanding is that the whole graph is loaded into memory, hence there should be no further disk read access (excluding swap). Correct?
  2. If I run the same query multiple times, I should expect the performance to be similar, correct? (assuming nothing else is impacting CPU/memory)

Thanks

Correct. Also, you should ensure that you have enough memory for your application, the JVM, and the OS without swapping (disable swap).

If I run the same query multiple times, I should expect the performance to be similar, correct? (assuming nothing else is impacting CPU/memory)

Yes. (There are GC activities that influence this a bit, but for >99.9% of the time this is correct.)

Lots of learnings; key point: it is definitely not a GraphHopper-related issue.

The problem lies in how cloud VPS servers work: the memory of a VM can be swapped out on the host system, and this is likely to happen to processes that have been idle for a while, at least with my (cheap) provider. When this happens, the existing process memory needs to be swapped back in whenever the process requires it, and this takes time. This explains why, after a “warm-up” period of stressing GraphHopper with requests, performance goes back to the usual (good) level.

What is weird about this is that every system benchmark I ran returned good results and my VPS thinks everything is going great: behind the scenes, only the memory used by the existing GH process was swapped out.

One obvious way to work around the issue is to schedule a job that calculates a route every few minutes (at least during off-peak hours); I’m in touch with my provider to tune the system…
