
The new GMFs and hazard curves calculator

Historically, the biggest bottleneck of the event based hazard calculator was the storage of the ground motion fields. An HDF5 file is way more efficient than Postgres for this task, so the new calculator uses this approach. Moreover, in the case of complex logic trees, it avoids storing redundant data. The two optimizations combined produce an impressive (actually an insane) speedup. Let me show off the improvement by considering a few specific cases.

The case of a single realization

Let’s start from the good old Miriam’s island, which is a computation with 1,792 sites and a lot of ground motion fields (4.6 GB of them). In this case the old calculator gives the following figures, starting from a half-empty database and with the parameter concurrent_tasks set to 1000:

Operation Cum. Time Memory
computing gmfs 9,348 s 80 MB
saving gmfs 13,119 s 28 MB

As you can see, the saving time is bigger than the computing time. It is not much bigger, because in the past we made a lot of performance improvements, but originally the saving time was dozens of times bigger than the computing time. With the new calculator the numbers are much better:

Operation Cum. Time Memory
computing gmfs 4,450 s 1 MB
saving gmfs 11 s 0.19 MB

As you can see, the improvement in the saving time is incredible: from 13,119 s to 11 s, a factor of about 1,200x!

On top of that, the computing time has been cut by more than half: this is possible because we save a lot of time in memory allocation. The old calculator used dictionaries and lists of Django objects; the new calculator uses numpy arrays, so the same computation requires about 80 times less memory. The enormous difference in the saving time is explained by the fact that the GMFs are saved sequentially as numpy arrays in an HDF5 file instead of being bulk-imported into Postgres in parallel, with 256 workers writing simultaneously.
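To give an idea of what saving GMFs sequentially into HDF5 looks like, here is a minimal sketch using h5py; the file layout, dataset name and record fields below are hypothetical, not the engine's actual schema.

   import numpy as np
   import h5py

   # hypothetical GMF record: a rupture index plus one float per (GSIM, IMT) pair
   gmf_dt = np.dtype([('rup_idx', np.uint32), ('gmvs', np.float64, (4,))])

   def save_gmfs(path, chunks):
       """Append chunks of GMF records (numpy arrays) to an extendable dataset."""
       with h5py.File(path, 'w') as f:
           dset = f.create_dataset('gmfs', shape=(0,), maxshape=(None,), dtype=gmf_dt)
           for chunk in chunks:  # e.g. the arrays returned by the workers
               n = len(dset)
               dset.resize((n + len(chunk),))
               dset[n:] = chunk  # sequential write, no locking and no constraints

   # toy usage: save two fake chunks of 1,000 records each
   save_gmfs('/tmp/gmfs.hdf5', [np.zeros(1000, gmf_dt), np.zeros(1000, gmf_dt)])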

The case of a large logic tree

While Miriam’s island is a good test case, it is even more interesting to look at a real-life case, like the Germany computation, which is being used by one of our sponsors. This is a case with not many ruptures (only 48,918) but with a lot of GMFs, because there are 8,232 sites and a complex logic tree with 100 realizations (it is a SHARE computation). In such a case an even bigger speedup is expected, since the new calculator takes great care to avoid saving more ground motion fields than needed. To understand this point, you first must digest the concept of effective realizations: read the documentation before continuing.

The important point is that it is possible to generate the ground motion fields corresponding to all realizations from the (much smaller) set of GMFs corresponding to the realization associations. The size of the reduced GMFs can be computed with the following formula:

   nbytes = 0
   for rupcol, gsims in zip(collections, rlzs_assoc.get_gsims_by_col()):
       # 4 bytes for the rupture index + 8 bytes per (GSIM, IMT) float
       bytes_per_record = 4 + 8 * len(gsims) * num_imts
       for rup in rupcol.values():
           nbytes += bytes_per_record * num_affected_sites(rup)

Here collections is an array of dictionaries; each dictionary contains ruptures keyed by a string (the so-called “tag”). The rupture dictionaries are numbered, and their ordinal is called col_id (collection ID). The method rlzs_assoc.get_gsims_by_col returns the associations col_id -> list of GSIMs. Notice that the details will likely change in the near future; in particular, I may replace the dictionaries of ruptures with arrays of ruptures.
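To make the formula concrete, here is a self-contained toy version; the collections, tags, GSIM names and site counts below are invented for illustration and do not come from a real computation.

   num_imts = 2  # e.g. PGA and SA(0.1)

   # two rupture collections (e.g. one per tectonic region type);
   # the tags and rupture objects are placeholders
   collections = [
       {'tag-A-001': 'rupture1', 'tag-A-002': 'rupture2'},
       {'tag-B-001': 'rupture3'},
   ]
   # GSIMs associated to each collection, i.e. col_id -> list of GSIMs
   gsims_by_col = [['GSIM1', 'GSIM2'], ['GSIM3']]

   def num_affected_sites(rup):
       return 10  # pretend each rupture affects 10 sites

   nbytes = 0
   for rupcol, gsims in zip(collections, gsims_by_col):
       bytes_per_record = 4 + 8 * len(gsims) * num_imts
       for rup in rupcol.values():
           nbytes += bytes_per_record * num_affected_sites(rup)

   print(nbytes)  # 2 * 10 * 36 + 1 * 10 * 20 = 920 bytes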

The size of a GMF record in the HDF5 file is 4 bytes for the rupture index (it is an unsigned 32-bit integer) plus 8 bytes for each float; the number of floats per record is num_gsims * num_imts, where num_gsims=len(gsims) is the number of GSIMs associated to the given rupture collection and num_imts is the number of Intensity Measure Types. In the case of full enumeration there is a rupture collection for each tectonic region type, otherwise the number of rupture collections is equal to the number of samples times the number of tectonic region types.
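As a sanity check of the record size, one can build the corresponding numpy structured dtype and look at its itemsize; the field names below are made up for illustration.

   import numpy as np

   num_gsims, num_imts = 2, 3  # arbitrary example values

   # 4-byte rupture index + one 8-byte float per (GSIM, IMT) pair
   gmf_record = np.dtype([('rup_idx', np.uint32),
                          ('gmvs', np.float64, (num_gsims, num_imts))])

   assert gmf_record.itemsize == 4 + 8 * num_gsims * num_imts  # 52 bytes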

The size of the full GMFs (i.e. for all realizations) is given by the formula

   nbytes = 0
   for rlz in rlzs:
       col_ids = list(rlzs_assoc.csm_info.get_col_ids(rlz))
       for rupcol in collections[col_ids]:
           # 8 bytes for the index + 8 bytes per IMT float, for each realization
           bytes_per_record = 8 + 8 * num_imts
           for rup in rupcol.values():
               nbytes += bytes_per_record * num_affected_sites(rup)

and it is much larger. For instance, consider again the case of Germany, an event based calculation using the SHARE model with 100 realizations and 8,232 sites, i.e. a pretty large calculation.

Running the numbers, the compact format turns out to be 13+ times smaller than the full format, and the saving time is proportionally smaller. On top of that, the database storage is particularly inefficient in this case (technically, because we are storing short arrays), so the old calculator stores around 46 GB of data. This includes the ruptures, which however are smaller than the GMFs. The saving time on the database is affected by the high level of concurrency (I ran this computation with 1,000 tasks on a cluster with 256 cores), by the fact that the database was half full, and of course by the various database constraints (primary keys and others). So the HDF5 saving is expected to be much faster; in reality it is insanely fast. But let me first show the numbers for the computation times:

Operation Cum. Time Memory
computing gmfs (old) 3,769 s 29 MB
computing gmfs (new) 2,375 s 0.73 MB

The speedup when computing the GMFs is consistent with our earlier findings (it is significant); in particular, the new calculator requires about 40 times less memory. This is the advantage of using numpy arrays instead of Django objects.

The situation for the saving times is beyond imagination:

Operation Cum. Time Memory
saving gmfs (old) 636,613 s 4 MB
saving gmfs (new) 3.2 s 0.25 MB

We went from 636,613 s to 3.2 s: a speedup of about 200,000 times!

Generation of hazard curves

The new calculator is also able to generate hazard curves from the GMFs and to save the curves without saving the GMFs, just like the old calculator, but clearly faster (a sketch of the underlying curve computation is given at the end of this section). Here are some numbers for Miriam’s island with 256 tasks:

Operation Cum. Time Memory
(old) total compute_gmfs_and_curves 10,483 s 77 MB
(new) total compute_gmfs_and_curves 8,153 s 131 MB

It is not clear to me why we are using more memory; this is something I still have to investigate.
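For reference, here is a minimal sketch of how hazard curves can be derived from the GMFs at a single site, assuming a Poissonian model; the function name and parameters are illustrative, not the engine's actual API.

   import numpy as np

   def gmvs_to_poes(gmvs, imls, investigation_time, duration):
       # gmvs: all ground motion values computed at one site for one IMT
       # imls: intensity measure levels of the hazard curve
       # duration: total duration of the simulated catalogues, in years
       gmvs = np.asarray(gmvs)
       num_exceeding = np.array([(gmvs >= iml).sum() for iml in imls])
       # probability of at least one exceedance in the investigation time
       return 1.0 - np.exp(- investigation_time / duration * num_exceeding)

   # toy usage: 4 GMVs, 3 levels, 50 year investigation time,
   # 10,000 years of effective catalogue
   print(gmvs_to_poes([0.05, 0.12, 0.30, 0.07], [0.1, 0.2, 0.4], 50, 10000))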

The case of sampling

The new calculator is still not reliable in the case of sampling, so I will not provide any figures for this situation. I will write a separate blog post when this use case is under control.
