Test data for public transport routing


As we’ve discussed before, I’ve started implementing tools to produce test data. The project is here:

So far I’ve implemented the Public Transport Enabler module in the first draft:

This module is under GPLv3 due to the PTE license, so don’t link to it or derive from it.

This tool takes a CSV file with queries (datetime/from/to), executes them against the PTE provider (like bahn.de) and writes the results to CSV and JSON files.

Assume this is a CSV file with a single query:

c08b2cb1,2016-10-01T00:13:00,255606,"Blumenauer Weg",8.448225,49.555514,242306,Hombuschstraße,8.460743,49.516252

As a result you’ll get:

c08b2cb1,2016-10-01T00:13:00,255606,"Blumenauer Weg",8.448225,49.555514,242306,Hombuschstraße,8.460743,49.516252,2016-10-01T09:20:00,2016-10-01T09:53:00,5

(You depart at 9:20, arrive at 9:53 and there are 5 legs in this trip.)
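To make the row layout concrete, here is a minimal sketch of reading such a result row in Python. The column names are my assumption for illustration (only the last three match names used later in this thread), and I'm assuming coordinates are given as longitude, then latitude.

```python
import csv
import io
from datetime import datetime

# Result row copied from the example above.
ROW = (
    'c08b2cb1,2016-10-01T00:13:00,255606,"Blumenauer Weg",8.448225,49.555514,'
    '242306,Hombuschstraße,8.460743,49.516252,'
    '2016-10-01T09:20:00,2016-10-01T09:53:00,5'
)

# Hypothetical column names, chosen for illustration only.
COLUMNS = [
    "id", "date_time",
    "from_id", "from_name", "from_lon", "from_lat",
    "to_id", "to_name", "to_lon", "to_lat",
    "trip_first_departure_date_time",
    "trip_last_arrival_date_time",
    "trip_legs_count",
]

def parse_result_row(line):
    """Split one CSV line (respecting quoted fields) into a dict of strings."""
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(COLUMNS, values))

result = parse_result_row(ROW)
departure = datetime.fromisoformat(result["trip_first_departure_date_time"])
arrival = datetime.fromisoformat(result["trip_last_arrival_date_time"])
duration_minutes = (arrival - departure).total_seconds() / 60
```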

The tool uses a GTFS file as its basis, so the stops are taken from the stops.txt of the GTFS file.

However, (at least in the case of RNV) the problem is that bahn.de apparently uses slightly different data from what the GTFS file provides. So to match GTFS data with PTE data, we first need to generate a mapping between GTFS stops and PTE “locations”. The tool can handle this; you get a mapping file like:

202935,Ebertpark/Fichtestr.,8.42054,49.4917,STATION,507997,8420906,49491721,"Ludwigshafen a","Friesenheim Ebertpark/Fichtestraße",TRAM;BUS,50

This matching is done based on the coordinates; the last field gives the distance range in meters (rounded to 50 meters).
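As a rough illustration of the distance field, the sketch below computes the great-circle distance for the sample row and rounds it up to the next multiple of 50 m. It is not the tool's actual code: the haversine formula, the rounding direction (up) and the reading of the PTE coordinates as integer microdegrees are all assumptions based on the sample row.

```python
import math

EARTH_RADIUS_M = 6371000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def distance_bucket_m(distance, step=50):
    """Round a distance up to the next multiple of `step` (assumed rounding rule)."""
    return int(math.ceil(distance / step) * step)

# GTFS stop from the sample row: longitude, latitude in degrees.
gtfs_lon, gtfs_lat = 8.42054, 49.4917
# PTE location from the sample row, assumed to be integer microdegrees.
pte_lon, pte_lat = 8420906 / 1e6, 49491721 / 1e6

distance = haversine_m(gtfs_lat, gtfs_lon, pte_lat, pte_lon)
bucket = distance_bucket_m(distance)
```

Under these assumptions the two points are roughly 27 m apart, which lands in the 50 m range shown in the last field of the sample row.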

Based on this mapping file the tool can generate a CSV file with queries, i.e. randomly selected from/to stops and a timestamp from the given range.
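The query generation could look roughly like the sketch below. The function name, its parameters and the 8-hex-digit id format are hypothetical; I'm only assuming that from/to must be two different stops and that timestamps are picked with minute resolution.

```python
import random
from datetime import datetime, timedelta

def generate_queries(stop_ids, start, end, count, seed=None):
    """Return `count` random (id, datetime, from_stop, to_stop) tuples.

    Hypothetical helper: picks two distinct stops and a timestamp
    (minute resolution) between `start` and `end`, inclusive.
    """
    rng = random.Random(seed)
    span_minutes = int((end - start).total_seconds() // 60)
    queries = []
    for _ in range(count):
        from_id, to_id = rng.sample(stop_ids, 2)  # two distinct stops
        when = start + timedelta(minutes=rng.randint(0, span_minutes))
        query_id = "%08x" % rng.getrandbits(32)  # assumed id format
        queries.append((query_id, when.isoformat(), from_id, to_id))
    return queries

# Stop ids taken from the sample rows in this post.
queries = generate_queries(
    ["255606", "242306", "202935"],
    start=datetime(2016, 10, 1),
    end=datetime(2016, 10, 2),
    count=5,
    seed=42,
)
```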

Finally, having CSV files with stop/location mappings and queries, the tool executes the specified queries and writes the results to CSV and JSON files. The CSV contains the basic data we’ll need for testing; JSON files are created one per trip, so we can follow exactly how the query was routed by bahn.de.

Since bahn.de allows only a few trip queries per minute (I think it’s 8), I had to implement a basic retry strategy with exponential backoff: if the PTE provider responds with SERVICE_DOWN, we pause and retry, doubling the pause each time (starting at 125 ms). It’s not very fast, but it seems to work well.
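The retry strategy boils down to something like this sketch. `ServiceDown` and `run_query` are hypothetical stand-ins for the SERVICE_DOWN response and the actual PTE call, and the cap on attempts is my assumption (no limit is stated above).

```python
import time

class ServiceDown(Exception):
    """Hypothetical stand-in for a SERVICE_DOWN response from the provider."""

def query_with_backoff(run_query, initial_pause=0.125, max_attempts=8, sleep=time.sleep):
    """Call `run_query`, doubling the pause after each SERVICE_DOWN.

    `max_attempts` is an assumed safety cap; after the last failed
    attempt the exception is re-raised instead of retrying forever.
    """
    pause = initial_pause  # 125 ms, as described above
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except ServiceDown:
            if attempt == max_attempts:
                raise
            sleep(pause)
            pause *= 2
```

Passing `sleep` as a parameter is only there so the pauses can be checked in a test without actually waiting.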

I’m now running ca. 1000 queries for the RNV dataset (ca. 50%/1h remaining), can provide it for testing later on.

I’d be grateful for feedback, e.g. whether we’ll need more data in the results.

Best wishes,


Thanks, that all sounds reasonable. Of the information that we can grab, you currently use

trip_first_departure_date_time,trip_last_arrival_date_time and trip_legs_count

Here I think also the vehicle types (bus, train, …) and trip_last_departure_date_time as well as trip_first_arrival_date_time could be interesting?

Why trip_last_departure_date_time and trip_first_arrival_date_time? I mean, that won’t be a big deal to implement it, but I don’t quite see why it would be of much use at the moment.

I was considering whether we’d need/want to check not just departure/arrival but also the intermediate stops of the trip. This may be difficult as GTFS stops have *:0…1 mapping to PTE locations (several stops may correspond to one location). So translating PTE trips into GTFS stops may be ambiguous.

Why trip_last_departure_date_time and trip_first_arrival_date_time?

last_departure just for quality comparison reasons. And first_arrival I would find the most important? Maybe I misunderstood something as I see e.g. last_arrival not that important similar to last_departure.

I probably did not make this clear enough.

For each of the queries, the tool chooses just one trip (with the earliest possible arrival). So trip_first_departure_date_time, trip_last_arrival_date_time, trip_legs_count all refer to the same trip. trip_first_departure_date_time is the departure time of the first leg, trip_last_arrival_date_time is the arrival time of the last leg of this trip.

last_departure would be the departure on the last leg of the trip, and first_arrival the arrival on the first leg. I don’t think this is what we want.

What I guess you meant with first_arrival was “earliest arrival at the destination”. Which is exported as trip_last_arrival_date_time. I’m not quite sure about last_departure, I guess you mean “if there are several earliest arrival trips with the same arrival at destination, select the one with latest possible departure”. Did I get it correctly? If yes, then it’s trip_first_departure_date_time. (I have to implement this selection logic, however, at the moment the tool does not consider several EA trips.)

Ah, ok. So ‘last’ corresponds to the legs :)

What I meant is that it could be interesting for us (though I’m not sure if already at this stage) to somehow grab two routes from the Pareto set, e.g. the fastest and the one with the fewest transfers.

I think this can be done. In PTE there are optimization options: you can optimize for the shortest duration or for the minimum number of transfers. It’s not exactly “grab what we want from the Pareto set”, but it will be close enough.

So I can change this to generate the following results:

  • least_duration_trip_first_departure_date_time
  • least_duration_trip_last_arrival_date_time
  • least_duration_trip_legs_count
  • least_changes_trip_first_departure_date_time
  • least_changes_trip_last_arrival_date_time
  • least_changes_trip_legs_count

There’s also a LEAST_WALKING optimization option, but I’ll leave it out.

So basically we’ll run the query twice with two optimization options (LEAST_DURATION, LEAST_CHANGES). For each query PTE returns the three best matching trips. From those we’ll select the one with the earliest arrival at the destination and, if several trips have the same arrival, the latest departure from the source. Results will be output in 2x3 columns in the CSV file as well as two JSON files (<id>.least_duration.json and <id>.least_changes.json).
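The selection rule (earliest arrival, ties broken by latest departure) could be sketched like this. Representing trips as dicts with `departure` and `arrival` keys is an assumption for illustration, not the PTE data model.

```python
from datetime import datetime

def select_trip(trips):
    """Pick the trip with the earliest arrival; among ties, the latest departure.

    `trips` is a non-empty list of dicts with `departure` and `arrival`
    datetimes (hypothetical structure).
    """
    earliest_arrival = min(t["arrival"] for t in trips)
    candidates = [t for t in trips if t["arrival"] == earliest_arrival]
    return max(candidates, key=lambda t: t["departure"])

trips = [
    {"departure": datetime(2016, 10, 1, 9, 10), "arrival": datetime(2016, 10, 1, 9, 53)},
    {"departure": datetime(2016, 10, 1, 9, 20), "arrival": datetime(2016, 10, 1, 9, 53)},
    {"departure": datetime(2016, 10, 1, 9, 30), "arrival": datetime(2016, 10, 1, 10, 5)},
]
best = select_trip(trips)
```

Here the first two trips arrive at the same time, so the one departing at 9:20 is chosen over the one departing at 9:10.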

What do you think?

Sounds good.

The json file is for the details? Why then not put everything in those JSONs? (BTW: I would name them some_name.<id>.json for better overview in a directory or put them in a separate folder)

I’ve executed a few hundred queries for RNV. I don’t get different results for LEAST_DURATION and LEAST_CHANGES. It may be that bahn.de does not support this. But I’ll leave the structure as designed, with export of two trips (even if they are equal at the moment).

Why not everything in JSON?

I’ve mentioned that PTE uses “locations” which are different from “stops” in the GTFS file. It’s a 0…N:1 mapping, several stops may map to one location. For this reason resolving PTE trips from locations to stops would be ambiguous. Either we select stop by the best guess or we output a set of possible stops. None of this is really good for testing, either ambiguous or hard to match. At the same time re-mapping of PTE trips is quite some work. This is why I thought we could start with CSV files for testing and JSON files for human control.

I’m now creating a CSV file like rnv-trips.csv and writing JSON files to a structure like:

  • trips//query.json
  • trips//leastDurationTrip.json
  • trips//leastChangesTrip.json

I’m now running ca. 1000 queries for RNV, will upload the results later on.
