To split or not to split
10 Jan 2016

With the new year I will start a series of blog posts about the new hazard calculators. The reason is that the Postgres-based calculators will be removed soon.
Clearly, it is imperative to have the HDF5-based calculators ready and robust enough to replace the old ones, but there is a problem with the tiling calculator: its current version in oq-lite does not split the sources (in order to reduce the data transfer) and as a consequence it distributes badly, i.e. there are slow sources dominating the calculation time. I needed to fix this, so I spent the last weeks trying to solve the issue. As usual, it was more difficult than expected, and I ended up completely rewriting the source management mechanism for all of the hazard calculators.
In doing so I discovered several interesting things. The one I want to talk about today is the source splitting mechanism, which is subtler than one might think.
First of all, let me explain that the oq-lite code base contains utilities to split two kinds of sources:

- area sources, by using the function `area_to_point_sources`
- fault sources, by using the function `split_fault_source`
Such utilities were written by the hazard scientists a long time ago and they are the key to solving the problem of slow sources: the idea is to split a big source into a set of small sources and then distribute the small sources evenly among the available cores.
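To see why splitting helps, consider a toy model (the weights and core count are invented for illustration, not taken from a real calculation) where each source has a weight proportional to its runtime, each task runs on one core, and the calculation time is the duration of the slowest task:

```python
import math

# one slow source (weight 100) among nine fast ones (weight 1 each)
weights = [100] + [1] * 9
num_cores = 10

# without splitting, the slow source is a single task: one core works
# for 100 time units while the other nine finish almost immediately
makespan_unsplit = max(weights)

# after splitting the slow source into 100 pieces of weight 1, the
# 109 unit-weight pieces spread almost evenly over the ten cores
pieces = [1] * 100 + [1] * 9
makespan_split = math.ceil(len(pieces) / num_cores)

print(makespan_unsplit, makespan_split)  # → 100 11
```

Splitting alone turns a calculation dominated by one core into one that is nearly ten times faster, which is exactly the effect we want from the small sources.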
As usual, the devil is in the details:

- if you split too much (i.e. if the parameters `area_source_discretization`, `rupture_mesh_spacing` and `complex_fault_mesh_spacing` are too small) you will run into problems. First of all, your computation will take a lot more memory: for instance, if you split a fault source into 100,000 ruptures you will need 100,000 times more memory. Secondly, you will need to transfer a lot more data and you can easily reach the limits of the technologies we are using now (rabbitmq/celery), thus making the calculation unfeasible: it will just break.
- the hazard curves depend on the splitting: this is due to numerical rounding issues; the effect is small, but it may still break all of your tests unless you are very careful.
- for event based calculations the sampling of the ruptures depends on the splitting: this seems to be an artifact of the current implementation and I am still working to see if it is possible to remove such a dependence.
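To get an intuition for where the rounding enters, recall that classical PSHA combines ruptures multiplicatively: the probability of exceedance is 1 minus the product of the per-rupture probabilities of no exceedance. Splitting a source changes how the partial products are grouped, and in floating point a regrouped product can differ in the last digits. Here is a toy illustration with invented probabilities (not engine output), using 32 bit floats to make the effect easier to see:

```python
import numpy as np

# 205 hypothetical per-rupture probabilities of no exceedance
rng = np.random.RandomState(42)
pnes = rng.uniform(0.99, 1.0, 205).astype(np.float32)

# unsplit source: one long product over all the ruptures
poe_unsplit = 1 - np.prod(pnes)

# split source: a partial product per point source, then combined
chunks = [pnes[i:i + 21] for i in range(0, 205, 21)]
poe_split = 1 - np.prod([np.prod(c) for c in chunks])

# the two results agree to many digits but may differ in the last ones
print(poe_unsplit, poe_split)
```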
Here I want to focus in detail on the second issue and I will give an example where the problem is manifest. Consider our hazard demo `AreaSourceClassicalPSHA`. This is a case with a single area source. The engine and oq-lite split such area source into 205 point sources and then group such sources trying to generate 10 tasks, since the parameter `concurrent_tasks` is explicitly set to 10 in this demo.
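The grouping step can be pictured as follows; this is a minimal sketch, not the engine's actual implementation, and the helper name `group_in_blocks` is hypothetical:

```python
def group_in_blocks(sources, hint):
    """Group `sources` into about `hint` blocks of similar size."""
    block_size = -(-len(sources) // hint)  # ceiling division
    return [sources[i:i + block_size]
            for i in range(0, len(sources), block_size)]

# 205 point sources grouped with concurrent_tasks=10
blocks = group_in_blocks(list(range(205)), 10)
print([len(b) for b in blocks])  # → nine blocks of 21 plus one of 16
```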
What if we did not split the source? Would we get the same hazard curves or not? And what if we changed the `area_source_discretization` parameter and split into a different number of sources?
The answer is that the hazard curves are slightly different. I will now give a script that you can use to investigate the splitting dependency.
```python
from openquake.commonlib import readinput, datastore, sourceconverter
from openquake.hazardlib.calc.hazard_curve import hazard_curves_per_trt


def compute_and_save(sources, sitecol, imtls, gsims):
    dstore = datastore.DataStore()
    curves_by_gsim = hazard_curves_per_trt(
        sources, sitecol, imtls, gsims)
    # save the hazard curves generated by each GSIM
    for gsim, curves in zip(gsims, curves_by_gsim):
        dstore[str(gsim)] = curves
    dstore.close()
    print('See the results with hdfview %s' % dstore.hdf5path)


def compute_classical_psha(job_ini):
    oq = readinput.get_oqparam(job_ini)
    oq.region_grid_spacing = 50  # to reduce the number of sites
    sitecol = readinput.get_site_collection(oq)
    csm = readinput.get_composite_source_model(oq)
    [src] = csm.get_sources()  # there is a single source
    gsims = csm.get_rlzs_assoc().gsims_by_trt_id[src.trt_model_id]
    print('Calculation with area source unsplit')
    compute_and_save([src], sitecol, oq.imtls, gsims)
    sources = list(sourceconverter.split_source(src))
    print('Calculation with area source split in %d sources' %
          len(sources))
    compute_and_save(sources, sitecol, oq.imtls, gsims)


if __name__ == '__main__':
    compute_classical_psha('/home/michele/oq-risklib/demos/hazard/'
                           'AreaSourceClassicalPSHA/job.ini')
```
To run it, just edit the path to the `AreaSourceClassicalPSHA` demo. The output will be a couple of HDF5 files: the first one will contain the hazard curves in the case of no splitting, the second one the hazard curves in the case of splitting. For instance, considering the first site and the first intensity measure type (PGA) you will get the following numbers:
IML | poe no-split | poe split |
---|---|---|
level01 | 0.9999996528232017 | 0.9999996528232672 |
level02 | 0.9999992054272225 | 0.9999992054271548 |
level03 | 0.9999965335517379 | 0.9999965335482233 |
level04 | 0.9999714994022764 | 0.9999714993251188 |
level05 | 0.9996478597756364 | 0.9996478580821415 |
level06 | 0.9957172654762922 | 0.9957172376004891 |
level07 | 0.9649787725521564 | 0.964978525160557 |
level08 | 0.8370144485061319 | 0.8370134218900904 |
level09 | 0.5773185211490328 | 0.577316542903742 |
level10 | 0.3010764243425089 | 0.30107436649319863 |
level11 | 0.11731627553747037 | 0.11731491421623153 |
level12 | 0.03673955288550712 | 0.036738893816042895 |
level13 | 0.009330768261857747 | 0.009330520326357727 |
level14 | 0.0019203457614811459 | 0.0019202724005433769 |
level15 | 3.112012858415003E-4 | 3.111846470034152E-4 |
level16 | 3.9505076113499626E-5 | 3.950221319604097E-5 |
level17 | 3.80525863807879E-6 | 3.804896278825076E-6 |
level18 | 2.8646715921620824E-7 | 2.864325182594385E-7 |
level19 | 1.5438832212666398E-8 | 1.543652028423992E-8 |
As you see, the differences are small but visible. Please notice that this has nothing to do with the parallelization, since this example uses the hazardlib function `hazard_curves_per_trt`, which does not parallelize. In reality there may be even more differences, since parallelizing may introduce different errors: even an operation as trivial as the sum can give issues in a parallel environment, since `a + b + c` can be different from `a + c + b` if the floating point variables `a`, `b` and `c` come from different tasks and the order of the tasks is not specified. This is very visible when using 32 bit floating point numbers and it is the reason why we are still using 64 bit numbers for the hazard curves.
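The non-associativity is easy to demonstrate with NumPy; the values below are invented just to make the effect obvious:

```python
import numpy as np

a = np.float32(1e8)
b = np.float32(0.001)
c = np.float32(-1e8)

# left to right: 1e8 + 0.001 rounds back to 1e8 in 32 bit precision,
# so the small term is lost entirely
print((a + b) + c)  # → 0.0

# a different task order keeps the small term
print((a + c) + b)  # → 0.001
```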
The lesson learned is that the tests must have a tolerance high enough to be insensitive to the splitting issue.
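In practice that means comparing the curves with a relative tolerance rather than for exact equality, for instance with NumPy; the two values below are taken from the table above (levels 08 and 09):

```python
import numpy as np

# poes for level08 and level09, no-split vs split, from the table above
no_split = np.array([0.8370144485061319, 0.5773185211490328])
split = np.array([0.8370134218900904, 0.577316542903742])

# an exact comparison would fail, but a small relative tolerance passes
np.testing.assert_allclose(split, no_split, rtol=1e-5)
```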
In the next posts of this series I will discuss how I changed the splitting mechanism and how I got a much better distribution of the sources.