Big datasets

If you have data containing a lot of particles, then there are some config settings that will significantly speed up processing. Here are some pointers.

When processing, use the non-appending mode of pyopia.io.StatsToDisc by setting append = false in the output step:

    [steps.output]
    pipeline_class = 'pyopia.io.StatsToDisc'
    output_datafile = 'proc/test' # prefix path for output nc file
    append = false

Using the above output step in your pipeline will create a directory ‘proc’ filled with nc files conforming to the pattern: ‘test-Image-D*-STATS.nc’
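Before combining, it can be useful to check how many per-image files were actually written. The following is a minimal sketch using only the Python standard library; the helper name list_stats_files is hypothetical and not part of PyOPIA, and the directory and prefix are assumed to match the output_datafile setting shown above.

```python
from pathlib import Path


def list_stats_files(directory, prefix):
    """Return the per-image STATS files in `directory`, sorted by name.

    Hypothetical helper (not part of PyOPIA). It assumes the naming
    pattern '<prefix>-Image-D*-STATS.nc' described above.
    """
    pattern = f"{prefix}-Image-D*-STATS.nc"
    return sorted(Path(directory).glob(pattern))


# 'proc' and 'test' correspond to output_datafile = 'proc/test'
stats_files = list_stats_files("proc", "test")
print(f"Found {len(stats_files)} per-image STATS files")
```

If the count is lower than the number of raw images, some images may not have been processed yet.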

These can be combined using pyopia.io.merge_and_save_mfdataset() or the command line tool pyopia merge-mfdata, which produces a single -STATS.nc file for the whole dataset (for faster loading). Alternatively, you can combine them manually:

import pyopia.io
xstats, image_stats = pyopia.io.combine_stats_netcdf_files('proc/')

Then write a new nc file of the whole dataset for faster loading later:

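import pyopia.pipeline

# Pipeline settings taken from the combined dataset
settings = pyopia.pipeline.steps_from_xstats(xstats)

# Write one merged file: 'proc/test2-test-STATS.nc'
pyopia.io.write_stats(xstats.to_dataframe(),
                      'proc/test2-test',
                      settings,
                      image_stats=image_stats.to_dataframe())

# Later sessions can load the merged file directly
xstats = pyopia.io.load_stats('proc/test2-test-STATS.nc')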

Parallel processing

If you have data containing a lot of particles and/or a lot of raw images, you can use the --num-chunks option of the pyopia process command line tool, e.g.:

pyopia process config.toml --num-chunks 4

This will split the list of raw files into 4 chunks to be processed in parallel using multiprocessing. The tool organises the chunks of file names so that the appropriate background files are put in the correct places: for a moving background, the last average_window files of the previous chunk are added to the start of the next chunk; for a fixed background, the same initial average_window files are added to the top of each chunk.
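The chunking rule described above can be illustrated with a short sketch. This is not PyOPIA's implementation; the function name chunk_with_background and its arguments are hypothetical, and it assumes each chunk is at least average_window files long.

```python
def chunk_with_background(files, num_chunks, average_window, moving=True):
    """Split `files` into chunks, prepending background files to each chunk
    after the first.

    Illustrative only (hypothetical helper, not PyOPIA's own code).
    """
    if not files or num_chunks < 1:
        return []
    # Ceiling division, so every file lands in exactly one chunk body
    size = -(-len(files) // num_chunks)
    chunks = []
    for start in range(0, len(files), size):
        body = files[start:start + size]
        if start == 0:
            # The first chunk already begins with the initial files
            background = []
        elif moving:
            # Moving background: the files just before this chunk
            background = files[max(0, start - average_window):start]
        else:
            # Fixed background: the same initial files for every chunk
            background = files[:average_window]
        chunks.append(background + body)
    return chunks


# Hypothetical file names, for illustration only
files = [f"img{i:02d}" for i in range(10)]
for chunk in chunk_with_background(files, num_chunks=2, average_window=3):
    print(chunk)
```

With a moving background the second chunk starts with the three files that end the first chunk, so each worker can build its background without waiting for the others.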