I recently started to play around with the different CHParameters.
First of all, it’s really great that it’s possible to customize these settings .
However, using them is not straight forward. I couldn’t find any documentation about these parameters, so I had to look at the code. Maybe it’s good that the parameters are not documented, since this is really low level stuff and it could produce more questions than actually helping .
Something that makes using these parameters really hard, at least for me, is that they are called percentage, e.g., LastNodesLazyUpdatePercentage, but this values isn’t really a percentage? Also the allowed values range from 0 to 100.
It is calculated by: long lastNodesLazyUpdates = Math.round(sortedNodes.getSize() / 100d * params.getLastNodesLazyUpdatePercentage());
On the other hand, NodesContractedPercentage is actually a percentage: Math.round((100 - params.getNodesContractedPercentage()) / 100d * sortedNodes.getSize()). When I pass 100, it will be 0, when I pass 0 it will be sortedNodes.getSize(). I can choose between the full range.
For both, LastNodesLazyUpdatePercentage and NodesContractedPercentage I can only pass values from 0 to 100. This makes sense for NodesContractedPercentage, but limits the value range that I can set for LastNodesLazyUpdatePercentage.
BTW: when setting LastNodesLazyUpdatePercentage to 0, it looks like there will be a division by 0? I haven’t tried this, but this looks like an issue.
Would it make sense to change these “non percentages” to actual percentages?
For example for LastNodesLazyUpdatePercentage, this could be similar to NodesContractedPercentage:
long lastNodesLazyUpdates = Math.round(((100-params.getLastNodesLazyUpdatePercentage()) * sortedNodes.getSize()) / 100d);
This one basically controls after how many contracted nodes we do a full refresh of the priority queue of nodes that shall be contracted (the sortedNodes). This number is called periodicUpdatesCount. Since such a refresh is the more expensive the more nodes there are this number is specified relative to the total number of nodes in the (initial) graph (by the way I am not entirely sure if this is such a good idea). We calculate it like this:
long periodicUpdatesCount = Math.round(Math.max(10, sortedNodes.getSize() / 100d * params.getPeriodicUpdatesPercentage()));
So specifying prepare.ch.updates.periodic=20 on a graph with a hundred nodes means we rebuild the queue every time we contracted 20 nodes. If the graph had 1000 nodes we would rebuild the queue every 200 nodes. The higher the value is the less updates there will be (this is confusing IMO). When we specify prepare.ch.updates.periodic=100 we do not do any periodic updates (or maybe one, not sure :)). What is even more confusing: There is also this:
if (params.getPeriodicUpdatesPercentage() == 0)
periodicUpdate = false;
So when we specify the lowest possible value there won’t be any periodic updates (although when using very low values (but not zero) there will be most updates.
This controls how often there will be lazy updates when contracting nodes: When we take a node from the priority queue we re-calculate its priority and compare it with the next one in the queue. If we find out that it now has a higher priority than the next one (i.e. its not up to date, note that we contract nodes with a lower ‘priority’ first!) we put it back into the queue and poll the queue again. We repeat this until we find a node that does not have a higher priority than the next. These lazy updates are only performed for a certain fraction of the last nodes we contract. We have:
// disable lazy updates for last x percentage of nodes as preparation is then a lot slower
// and query time does not really benefit
final long lastNodesLazyUpdates = Math.round(sortedNodes.getSize() / 100d * params.getLastNodesLazyUpdatePercentage());
First of all I think the comment is wrong ? It says we are disabling lazy updates for the last x percentage of nodes, but what we are really doing is we are enabling lazy updates if sortedNodes.getSize() < lastNodesLazyUpdates, so for the last lastNodesLazyUpdates nodes.
Anyway, prepare.ch.updates.lazy=100 means we do lazy updates for every node and prepare.ch.updates.lazy=10 means we are doing lazy updates for the last 10% of nodes. The higher the number we do the more updates we do, I think this makes sense.
There are two things that can be improved/tried IMO: 1) Instead (or in addition) of specifying how many of the last nodes should be subject to lazy updates rather specify a probability this happens. 2) Somehow limit the number of times we poll the queue without actually contracting the node we polled.
This one only controls how often there is logging output. It works the same as periodic updates: A value of 0 means no logging, a value of 20 means the logging interval is equal to 20% of the nodes in the graph (if the graph has 100 nodes we log every 20 contracted nodes) and a value of 100 means we never log.
Yes the allowed values are 0 to 100 and e.g. 20 means we do lazy updates for the last 20% of nodes, why is this not a percentage or should not be called a ‘percentage’ ?
This I do not understand ? Its the same formula except we use 100-x instead of x. Why is this more a ‘percentage’ than the formula for lazy updates ? For last lazy updates its the same (except the other way around): When you pass 0 its 0 and when you pass 100 its sortedNodes.size().
Ah here is the confusion: No! sortedNodes.getSize() / 100d * params.getLastNodesLazyUpdatePercentage() looks like sortedNodes.getSize() / (100d * params.getLastNodesLazyUpdatePercentage()), but really its the same as sortedNodes.getSize() * (params.getLastNodesLazyUpdatesPercentage() / 100d).
Yes since these are really expert settings I think its fine they are documented in the code only ? We could put the documentation into config-example.yml, but then we would have it in two places ? Not sure whats better.
Thanks so much for the detailed explanation this helps a lot .
Yes indeed, that’s a bit counterintuitive .
Ah yes, this one got me. Sorry for the troubles! Maybe we could add braces to the calculation to make clearer what happens and helps to avoid these kind of mistakes .
Yes that makes sense indeed. I think we should keep it only in the code and not in the config, but maybe we can improve the documentation in the code a bit .
Do you mean that it would work similar to updates.periodic that it would do this for every X node and if we find that the priority is too far off, we would recalculate the priorities for all nodes or start to lazy update from that point?
Maybe we can improve this somehow as well? Maybe 0 could mean maximum logging or even some form of verbose logging?
Also to me it feels like, if it’s called logMessagesPercentage, a value of 100 should be a lot of logs and a value of 0 should be no logs .
Yes that is also possible: Trigger a full update (like in periodic) whenever there were more than some threshold of lazy updates. I think counting the number of times a lazy update was necessary (simply by checking if the priority is higher than for the next node) is easier than using a threshold how far the priority can be off (because the value for such a threshold will strongly depend on the way the priority is calculated and will probably be hard to choose). But actually what I meant is always performing lazy updates, but using some probability for doing it or not on every node as with neighbor updates. Anyway trying these things can be quite cumbersome and we would have to make sure these changes work for different map sizes etc.
Yes this is the same as with periodic updates. We want to do a periodic update every time we contracted n nodes, but we want to specify n relative to the total number of nodes (so it works for every graph size). It would be more intuitive to have higher values = more updates.