The Unofficial OpenQuake Engine Blog

Half-million sites

The largest computation we have ever performed at GEM in terms of sheer number of sites is a computation for South America with 481,401 sites. It is actually a small computation compared to SHARE, because it has a trivial logic tree with a single realization, so it can be run in a decent amount of time, and it is a perfect test bed to understand the behavior of the tiling in the absence of logic tree complications.

I will run the analysis using the latest calculators, which are not in master yet. The newest calculators have some additional features that make a big difference in this calculation; in particular, it is possible to turn off the source filtering.

The problem with filtering is that in this case there are so many sites that the filtering time is very significant: since the source filtering is done on the controller node, most of the time is spent there, while the workers just wait for the filtered sources to arrive. This is an exceptional situation; normally, even with 100,000 sites, the filtering is fast enough that it makes sense to do it on the controller node, to save a lot of time in data transfer. Even with a lot of sites, if the source model is relatively small (for instance when there are few area sources) it still makes sense to filter up front.

Actually the sources are filtered twice: up front on the controller node and then again on the workers. This is not a waste; it is actually a key idea that improves performance a lot in all but the rarest situations. The up-front filtering has a few substantial advantages:

  1. it reduces the data transfer: only the relevant sources are sent
  2. it reduces the filtering work on the workers, since only the prefiltered sources arrive and are refiltered, not all of them
  3. it improves the task distribution: without up-front filtering, a source that is theoretically heavy could be filtered away on the workers and contribute next to nothing to the running time, so tasks that should be heavy in theory could turn out to be light

Point 3 is particularly important, because with a bad task distribution your computation will be stuck waiting for the slow tasks, wasting time. So up-front filtering is usually good, but here we are in a very special situation: the up-front filtering is so slow that its cost outweighs its advantages. To cope with this situation the new calculators have a filter_sources flag that can be set to False to disable the up-front filtering. Since the sources are no longer filtered on the controller node you have to send all of them, but you save time anyway, since the filtering time is larger than the data transfer time. Also, waiting for the slow tasks is less inefficient than doing the up-front filtering.
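
To make the trade-off concrete, here is a toy cost model of the decision (purely illustrative Python, not engine code; the variable names and the numbers are mine):

    def upfront_filtering_pays_off(filter_time, transfer_all, transfer_filtered,
                                   slow_task_penalty):
        """Toy model: is filtering on the controller node worth it?

        filter_time       -- time spent filtering the sources on the controller
        transfer_all      -- time to send all the sources to the workers
        transfer_filtered -- time to send only the prefiltered sources
        slow_task_penalty -- extra time lost waiting for slow tasks when the
                             task distribution is poor (no up-front filtering)
        """
        with_filtering = filter_time + transfer_filtered
        without_filtering = transfer_all + slow_task_penalty
        return with_filtering < without_filtering

    # In the half-million-site case the filtering time dominates everything
    # else, so skipping the up-front filtering (filter_sources=False) wins:
    print(upfront_filtering_pays_off(filter_time=10000, transfer_all=600,
                                     transfer_filtered=300, slow_task_penalty=900))
    # -> False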

Single tile

Before running the tiling calculator, for the sake of comparison I ran a test with a single tile. The computation is expected to be a lot slower, since the tile is so big. Indeed, with concurrent_tasks=2000 it takes 14 hours and 27 minutes and the data transfer is huge, over 100 GB, at the limits of rabbitmq:

operation              cumulative time
computing poes          11,138,363 s
making contexts          1,608,340 s
agg_curves                     404 s
managing curves                132 s
combine/save curves             12 s
total run time                 867 min

number of tasks: 2016

data transfer: 21.78 GB sent, 137.39 GB received

The reason for the huge data transfer from the controller node to the workers is that the site collection is large (11 MB) and we are transferring it 2016 times. This is the dominating factor, since the sources (divided into 2016 blocks) are small in size compared to a site collection with half a million sites. When tiling is enabled the opposite is true: the site collection is split, so a lot less data is transferred for it, whereas the sources are sent multiple times and become the objects dominating the data transfer.

As for the data transfer from the workers back to the controller node, this is a case with 481,401 sites, 19 levels and 8 bytes per float returned, so each task returns 69.8 MB of data. Multiplied by 2016 tasks we get the huge figure quoted above. With tiling we can reduce this number by a factor of N, where N is the number of tiles.
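
As a sanity check, here is the arithmetic behind those figures, using the numbers quoted above (a minimal sketch; the 11 MB site collection size is the rounded value given earlier):

    sites, levels, bytes_per_float, tasks = 481401, 19, 8, 2016

    per_task = sites * levels * bytes_per_float       # bytes returned by one task
    print(per_task / 1024 ** 2)                       # ~69.8 MB per task
    print(per_task * tasks / 1024 ** 3)               # ~137.4 GB received in total

    site_collection = 11 * 1024 ** 2                  # ~11 MB, sent once per task
    print(site_collection * tasks / 1024 ** 3)        # ~21.7 GB sent in total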

Tiles with 20,000 sites

My second attempt was to run the same computation split into 25 tiles and 1100 tasks:

operation              cumulative time
computing poes             324,242 s
making contexts            249,334 s
managing curves                376 s
agg_curves                     217 s
combine/save curves             14 s
total run time                  86 min

number of tasks: 1100

data transfer: 2.28 GB sent, 3.01 GB received

As you can see, the results are impressively better. The ratio computing poes/making contexts is down to 1.3, which is a decent value, but still larger than 1. This probably means that we can improve the figures by increasing the number of tiles further. The smaller the tile, the smaller the computing poes time, until the computation becomes dominated by making contexts and it no longer makes sense to reduce the tile size.

It should be noted that the time needed to save the curves is totally insignificant (14 seconds), because we are using the HDF5 datastore. The time to aggregate the curves (217 seconds) is fairly small, about half of what it was before, because we are aggregating smaller arrays.

The data transfer is about 10 times smaller than before in the controller->workers direction and 46 times smaller in the workers->controller direction. This is expected, since there are 25 tiles (i.e. 25 times smaller arrays) and the number of tasks is nearly half of what it was before.
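
The 46x figure can also be derived directly: the data returned by a task scales with the number of sites in its tile, so the total received drops roughly by the number of tiles times the reduction in the number of tasks (a rough estimate, assuming tasks of similar size):

    tiles, tasks_before, tasks_now = 25, 2016, 1100
    print(tiles * tasks_before / tasks_now)   # ~45.8, i.e. ~46 times less received

    # cross-check against the quoted figures
    print(137.39 / 3.01)                      # ~45.6x smaller workers->controller
    print(21.78 / 2.28)                       # ~9.6x smaller controller->workers
                                              # (the sources dominate this direction,
                                              #  so the factor is smaller than 25)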

Tiles with 5,000 sites

Then I tried to increase the number of tiles to 97, with 1337 tasks, to see if we could improve the figures again:

operation              cumulative time
computing poes             260,652 s
making contexts            284,608 s
managing curves              1,337 s
agg_curves                     258 s
combine/save curves             14 s
total run time                  72 min

number of tasks: 1358

data transfer: 7.18 GB sent, 0.98 GB received

The ratio computing poes/making contexts is now down to 0.92 and the data transfer back is three times smaller than before (because the tiles are smaller). Unfortunately, the data transfer to the workers is over three times larger, since there are many more tiles. The time spent in managing curves is 22 minutes out of a total of 72 minutes: it is clear that we cannot increase the number of tiles further, otherwise that operation will start dominating the computation. It would probably be better to reduce the number of tiles and/or tasks, since it seems that we are already seeing the effects of excessive tiling (the computing poes/making contexts ratio is below 1 and managing curves is significant).
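
The same kind of quick checks for the 97-tile run, using the figures above:

    print(260652 / 284608)    # ~0.92: computing poes / making contexts
    print(1337 / 60)          # ~22 minutes spent in managing curves (out of 72)
    print(7.18 / 2.28)        # ~3.1x more data sent to the workers than with 25 tiles
    print(3.01 / 0.98)        # ~3.1x less data received back than with 25 tiles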

As an exercise, I also tried to run the computation with tiles of 1,000 sites, i.e. with 481 tiles, but it was a disaster: the engine ran for several hours and then rabbitmq died from excessive data transfer, consuming over 35 GB of RAM and of disk space in the mnesia directory.
