How to use GraphHopper in Spark?

I’m trying to use GraphHopper in Spark to find routes from a very large data set. Does anyone know of any documentation/examples for doing this?

Alternatively, some specific questions which I’m stuck with:

  • calling importOrLoad is unlikely to work, as it refers to a folder on a single filesystem (which clearly won’t work distributed, but also means there might be problems with the locks used). It’d work best in Spark if I could have my *.pbf file in HDFS … but I’m not sure how I’d get that to work in GraphHopper.
  • alternatively it seems I could serialize a GraphHopper instance (with the OSM info already loaded), but I’m not sure if that’s possible …

PS - I’ve managed to get it running in Spark as a non-distributed task, but that’s not too useful.
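
For reference, the non-distributed version is just the standard local import on one machine, roughly like this (a sketch with placeholder paths; the exact method names, e.g. setOSMFile vs. setDataReaderFile, vary between GraphHopper versions):

```java
import com.graphhopper.GraphHopper;
import com.graphhopper.routing.util.EncodingManager;

public class LocalImport {
    public static void main(String[] args) {
        // Standard single-machine setup: everything lives on the local filesystem.
        GraphHopper hopper = new GraphHopper();
        hopper.setOSMFile("/data/map.osm.pbf");                    // local .pbf (placeholder)
        hopper.setGraphHopperLocation("/data/graphhopper-graph");  // local graph folder (placeholder)
        hopper.setEncodingManager(new EncodingManager("car"));
        hopper.importOrLoad(); // imports on the first run, loads the existing folder afterwards
    }
}
```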

I’ve no experience with Spark, but HDFS is just a distributed file system, so why not put the GraphHopper folder there? E.g. run importOrLoad once for a single GraphHopper instance; that creates the graphhopper folder, and you can then rely on the distributed nature of HDFS. I.e. the other GraphHopper instances do not need to import the OSM data again and will just load the data from that folder into RAM. You can even tell GH to avoid writing to this folder, just to be safe or if the file system is read-only (hopper.setAllowWrites(false)).
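
To make that concrete, the read-only loading side could look roughly like this (a sketch with a placeholder path, assuming a 0.x-style GraphHopper API):

```java
import com.graphhopper.GraphHopper;
import com.graphhopper.routing.util.EncodingManager;

public class ReadOnlyLoad {
    public static void main(String[] args) {
        // Assumes the graph folder was already created once via importOrLoad()
        // and is now visible to this machine (e.g. via a shared/distributed FS).
        GraphHopper hopper = new GraphHopper();
        hopper.setEncodingManager(new EncodingManager("car")); // must match the original import
        hopper.setAllowWrites(false); // never write to the (possibly read-only) graph folder
        if (!hopper.load("/shared/graphhopper-graph"))         // placeholder path
            throw new IllegalStateException("no GraphHopper graph found at that location");
    }
}
```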

Admittedly I’m new to HDFS/Spark too, but the “problem” is that HDFS is not exactly like a normal file system that just happens to be distributed: the data is split into chunks, which may live on different machines. GH’s/Java’s loading code, which assumes a normal filesystem, probably won’t work in this case. However, I suspect it’s tweakable, so maybe I’ll try to write a new DataReader which handles this. Alternatively, I may be able to just “broadcast” the graph - I’ll have to see.
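
The catch with broadcasting is that a full GraphHopper instance would have to be serializable, which it most likely isn’t. A workaround might be to only ship the graph folder location (a plain String) with the Spark closures and lazily load one GraphHopper per executor JVM, roughly like this (an untested sketch with placeholder names):

```java
import com.graphhopper.GraphHopper;
import com.graphhopper.routing.util.EncodingManager;

// Creates at most one GraphHopper per executor JVM instead of serializing it.
// Only the graph folder path travels with the Spark closure.
public class GraphHopperHolder {
    private static volatile GraphHopper instance;

    public static GraphHopper get(String graphFolder) {
        if (instance == null) {
            synchronized (GraphHopperHolder.class) {
                if (instance == null) {
                    GraphHopper hopper = new GraphHopper();
                    hopper.setEncodingManager(new EncodingManager("car")); // must match the import
                    hopper.setAllowWrites(false); // the graph folder is read-only on the executors
                    if (!hopper.load(graphFolder))
                        throw new IllegalStateException("could not load graph from " + graphFolder);
                    instance = hopper;
                }
            }
        }
        return instance;
    }
}
```

Each Spark task would then call GraphHopperHolder.get(...) with the same path, so the graph is loaded at most once per executor.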

I will update here with any progress.

Any updates on the progress of getting GraphHopper running in Spark?

I managed to get it working, but it was pretty painful and ugly. I’m sure it shouldn’t be too hard to tweak GraphHopper to make it all a little easier … but I’d need to be more familiar with the internals.

Cool! I’m looking forward to your results.

Would it help if GraphHopper could be told to load (not import) its graph and index from a location specified by a URL instead of a local file system path? Like (approximately):

graphHopper.load("hdfs://www.sirius-cybernetics.com:9000/user/john/graphhopper-location");

I understand that Java storage libraries often come with custom URL handlers (we wouldn’t have to include them - the user registers them with the JVM), so this may be a good interface to also allow loading a graph from Amazon S3 or some such.
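
For example, Hadoop ships a URL stream handler factory that, once registered with the JVM, makes hdfs:// URLs openable like any other URL. A rough sketch (host and path are placeholders, matching the example above):

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

public class HdfsUrlExample {
    public static void main(String[] args) throws Exception {
        // Register Hadoop's handler so the JVM understands hdfs:// URLs.
        // This may only be done once per JVM.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());

        URL url = new URL("hdfs://www.sirius-cybernetics.com:9000/user/john/graphhopper-location/properties");
        try (InputStream in = url.openStream()) {
            // GraphHopper could read its graph and index files from streams like this one.
            System.out.println("opened " + url + ", " + in.available() + " bytes available");
        }
    }
}
```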

Interesting idea. This was discussed and implemented here: “Readonly DataAccess for InputStreams instead of Files” (not in the master though) …


I’m not sure, @michaz - though that is how I interface with e.g. Hadoop/Spark. I was thinking (naively) of an overridable method for getting the bytes of the OSM file, leaving it up to the user how to provide them. A quick look at the link @karussell provided shows it kind of does this (with openStream).
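
Roughly what I had in mind, purely as a hypothetical extension point (none of these names exist in GraphHopper today):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical extension point - NOT part of the current GraphHopper API.
// The idea: GraphHopper would ask such a supplier for the raw OSM bytes,
// and the user decides where they come from (local disk, HDFS, S3, ...).
interface OsmBytesSupplier {
    InputStream openOsmStream() throws IOException;
}

// Example implementation reading the .pbf from HDFS via the Hadoop FileSystem API.
class HdfsOsmBytesSupplier implements OsmBytesSupplier {
    @Override
    public InputStream openOsmStream() throws IOException {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration()); // placeholder host
        return fs.open(new Path("/user/john/map.osm.pbf"));                                      // placeholder path
    }
}
```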

As an aside - larray looks very cool.

You can use the Hadoop/YARN distributed cache to make the GraphHopper files available to the Spark executors.

We have implemented GraphHopper-based routing in both Spark and MapReduce tasks; both used the distributed YARN cache to deliver the routing database onto each cluster node.

Nothing ugly about it, and it is relatively easy to implement. If you only have the .pbf available, you should prepare all the files in the Spark driver and put them into the cache before the executors start processing data.
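
In Spark this boils down to shipping the prepared graph folder as an archive via spark-submit and loading it read-only on the executors. A rough sketch (placeholder paths and names; Spark 2.x Java API, 0.x-style GraphHopper API):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import com.graphhopper.GHRequest;
import com.graphhopper.GHResponse;
import com.graphhopper.GraphHopper;
import com.graphhopper.routing.util.EncodingManager;

public class SparkRoutingJob {
    public static void main(String[] args) {
        // Submit with the prepared graph folder packed as an archive, e.g.:
        //   spark-submit --master yarn --archives graphhopper-graph.zip#gh ...
        // The YARN distributed cache unpacks it into every container's working
        // directory under the alias "gh".
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("gh-routing"));

        // Placeholder input: one "fromLat,fromLon,toLat,toLon" line per routing request.
        JavaRDD<double[]> odPairs = sc.textFile("hdfs:///user/john/od-pairs.csv")
                .map(line -> {
                    String[] p = line.split(",");
                    return new double[]{Double.parseDouble(p[0]), Double.parseDouble(p[1]),
                                        Double.parseDouble(p[2]), Double.parseDouble(p[3])};
                });

        JavaRDD<Double> distances = odPairs.mapPartitions(iter -> {
            // Load the graph read-only from the unpacked archive, once per partition
            // (caching one instance per executor JVM would be even better).
            GraphHopper hopper = new GraphHopper();
            hopper.setEncodingManager(new EncodingManager("car")); // must match the import
            hopper.setAllowWrites(false);
            if (!hopper.load("gh"))
                throw new IllegalStateException("GraphHopper graph not found in the YARN cache");

            List<Double> result = new ArrayList<>();
            while (iter.hasNext()) {
                double[] od = iter.next();
                GHResponse rsp = hopper.route(
                        new GHRequest(od[0], od[1], od[2], od[3]).setVehicle("car"));
                result.add(rsp.hasErrors() ? -1.0 : rsp.getBest().getDistance());
            }
            return result.iterator();
        });

        distances.saveAsTextFile("hdfs:///user/john/route-distances"); // placeholder output
        sc.stop();
    }
}
```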
