The Unofficial OpenQuake Engine Blog

To split or not to split

With the new year I will start a series of blog posts about the new hazard calculators. The reason is that the Postgres-based calculators will be removed soon.

Clearly, it is imperative to have the HDF5-based calculators ready and robust enough to replace the old ones, but there is a problem with the tiling calculator: its current version in oq-lite does not split the sources (a choice made to reduce the data transfer) and as a consequence it distributes the work badly, i.e. there are slow sources dominating the calculation time. I needed to fix this, so I spent the last weeks trying to solve the issue. As usual, it was more difficult than expected, and I ended up completely rewriting the source management mechanism for all of the hazard calculators.

In doing so I discovered several interesting things. The one I want to talk about today is the source splitting mechanism, which is subtler than one might think.

First of all, let me explain that the oq-lite code base contains utilities to split two kinds of sources:

  1. area sources, by using the function area_to_point_sources
  2. fault sources, by using the function split_fault_source

Such utilities were written by the hazard scientists a long time ago and they are the key to solving the problem of slow sources: the idea is to split a big source into a set of small sources and then distribute the small sources evenly among the available cores.
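A minimal sketch of that idea, where the `split` callable and the round-robin grouping are illustrative stand-ins, not the actual oq-lite API:

```python
def distribute(sources, split, num_cores):
    """Split each big source and deal the pieces round-robin across cores.

    `split` is any function mapping a source to an iterable of smaller
    sources (an illustrative stand-in for the oq-lite splitting utilities).
    """
    pieces = [piece for src in sources for piece in split(src)]
    buckets = [[] for _ in range(num_cores)]
    for i, piece in enumerate(pieces):
        buckets[i % num_cores].append(piece)
    return buckets

# toy example: one "slow" source split into 10 pieces, dealt over 4 cores
buckets = distribute(['big'], lambda src: [(src, i) for i in range(10)], 4)
print([len(b) for b in buckets])  # → [3, 3, 2, 2]
```

Without the split, one core would receive the whole slow source and the others would sit idle; after the split every core gets a comparable share of the work.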

As usual, the devil is in the details:

  1. if you split too much (i.e. if the parameters area_source_discretization, rupture_mesh_spacing, complex_fault_mesh_spacing are too small) you will run into problems. First of all, your computation will take a lot more memory. For instance, if you split a fault source into 100,000 ruptures you will need 100,000 times more memory. Secondly, you will need to transfer a lot more data and you can easily reach the limits of the technologies we are using now (rabbitmq/celery), thus making the calculation unfeasible: it will just break.

  2. the hazard curves depend on the splitting: this is due to numerical rounding issues; the effect is small, but it may still break all of your tests unless you are very careful.

  3. for event based calculations the sampling of the ruptures depends on the splitting: this seems to be an artifact of the current implementation and I am still working to see if it is possible to remove such dependence.

Here I want to focus in detail on the second issue and I will give an example where the problem is manifest. Consider our hazard demo AreaSourceClassicalPSHA. This is a case with a single area source. The engine and oq-lite split such area source into 205 point sources and then group such sources trying to generate 10 tasks, since the parameter concurrent_tasks is explicitly set to 10 in this demo.
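The grouping step can be approximated with a simple block-splitting helper (a simplified stand-in for the engine's actual task-grouping logic, not its real API):

```python
def split_in_blocks(items, hint):
    """Group `items` into about `hint` blocks of nearly equal size
    (an illustrative sketch, not the engine's real implementation)."""
    block_size = max(1, (len(items) + hint - 1) // hint)  # ceiling division
    for i in range(0, len(items), block_size):
        yield items[i:i + block_size]

# 205 point sources grouped with concurrent_tasks=10, as in the demo
blocks = list(split_in_blocks(list(range(205)), 10))
print(len(blocks), [len(b) for b in blocks])  # → 10 blocks of 21, 21, ..., 16
```

With this scheme the 205 point sources become 10 tasks of roughly 20 sources each, instead of a single task containing one big area source.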

What if we did not split the source? Would we get the same hazard curves or not? And what if we changed the area_source_discretization parameter and split in a different number of sources?

The answer is that the hazard curves are slightly different. I will give now a script that you can use to investigate the splitting-dependency.

from openquake.commonlib import readinput, datastore, sourceconverter
from openquake.hazardlib.calc.hazard_curve import hazard_curves_per_trt

def compute_and_save(sources, sitecol, imtls, gsims):
    dstore = datastore.DataStore()
    curves_by_gsim = hazard_curves_per_trt(
        sources, sitecol, imtls, gsims)
    # save the hazard curves generated by each GSIM
    for gsim, curves in zip(gsims, curves_by_gsim):
        dstore[str(gsim)] = curves
    dstore.close()
    print('See the results with hdfview %s' % dstore.hdf5path)

def compute_classical_psha(job_ini):
    oq = readinput.get_oqparam(job_ini)
    oq.region_grid_spacing = 50  # to reduce the number of sites
    sitecol = readinput.get_site_collection(oq)
    csm = readinput.get_composite_source_model(oq)
    [src] = csm.get_sources()  # there is a single source
    gsims = csm.get_rlzs_assoc().gsims_by_trt_id[src.trt_model_id]

    print('Calculation with area source unsplit')
    compute_and_save([src], sitecol, oq.imtls, gsims)

    sources = list(sourceconverter.split_source(src))
    print('Calculation with area source split in %d sources' %
          len(sources))
    compute_and_save(sources, sitecol, oq.imtls, gsims)

if __name__ == '__main__':
    compute_classical_psha('/home/michele/oq-risklib/demos/hazard/'
                           'AreaSourceClassicalPSHA/job.ini')

To run it, just edit the path to the AreaSourceClassicalPSHA demo. The output will be a couple of HDF5 files: the first one will contain the hazard curves in the case of no splitting, the second one the hazard curves in the case of splitting. For instance, considering the first site and the first intensity measure type (PGA) you will get the following numbers:

IML       PoE (no split)        PoE (split)
level01 0.9999996528232017 0.9999996528232672
level02 0.9999992054272225 0.9999992054271548
level03 0.9999965335517379 0.9999965335482233
level04 0.9999714994022764 0.9999714993251188
level05 0.9996478597756364 0.9996478580821415
level06 0.9957172654762922 0.9957172376004891
level07 0.9649787725521564 0.964978525160557
level08 0.8370144485061319 0.8370134218900904
level09 0.5773185211490328 0.577316542903742
level10 0.3010764243425089 0.30107436649319863
level11 0.11731627553747037 0.11731491421623153
level12 0.03673955288550712 0.036738893816042895
level13 0.009330768261857747 0.009330520326357727
level14 0.0019203457614811459 0.0019202724005433769
level15 3.112012858415003E-4 3.111846470034152E-4
level16 3.9505076113499626E-5 3.950221319604097E-5
level17 3.80525863807879E-6 3.804896278825076E-6
level18 2.8646715921620824E-7 2.864325182594385E-7
level19 1.5438832212666398E-8 1.543652028423992E-8

As you can see, the differences are small but visible. Please notice that this has nothing to do with parallelization, since this example uses the hazardlib function hazard_curves_per_trt, which does not parallelize. In reality there may be even more differences, since parallelizing may introduce additional errors: even an operation as trivial as a sum can give different results in a parallel environment, since a + b + c can differ from a + c + b if the floating point variables a, b and c come from different tasks and the order of the tasks is not deterministic. This is very visible when using 32 bit floating point numbers and it is the reason why we are still using 64 bit numbers for the hazard curves.
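The non-associativity of floating point addition is easy to reproduce; here is a small demonstration with NumPy float32 values (the specific numbers are chosen to make the effect obvious, not taken from a real calculation):

```python
import numpy as np

a = np.float32(1e8)
b = np.float32(1.0)
c = np.float32(-1e8)

left = (a + b) + c   # 1e8 + 1 rounds back to 1e8 in float32, so the 1 is lost
right = (a + c) + b  # the cancellation happens first, so the 1 survives
print(left, right)   # → 0.0 1.0
```

With float64 the same sums both give 1.0, which is why the engine keeps 64 bit numbers for the hazard curves.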

The lesson learned is that the tests must have a tolerance high enough to be insensitive to the splitting issue.
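For instance, using a few of the PoEs from the table above, a comparison with a relative tolerance absorbs the splitting noise, while a strict comparison does not (a sketch using NumPy's testing helpers):

```python
import numpy as np

# PoEs for levels 08-11, taken from the table above
poes_nosplit = np.array([0.8370144485061319, 0.5773185211490328,
                         0.3010764243425089, 0.11731627553747037])
poes_split = np.array([0.8370134218900904, 0.577316542903742,
                       0.30107436649319863, 0.11731491421623153])

# a strict comparison fails: the differences are around 1e-6...
assert not np.allclose(poes_nosplit, poes_split, rtol=0, atol=1e-7)
# ...but a relative tolerance of 1e-4 absorbs the splitting noise
np.testing.assert_allclose(poes_nosplit, poes_split, rtol=1e-4)
print('tolerant comparison passed')
```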

In the next posts of this series I will discuss how I changed the splitting mechanism and how I got a much better distribution of the sources.
