Processing raw data


0) Activate the PyOPIA environment

If you installed PyOPIA in a virtual environment as described in the installation guide, activate that environment first, e.g.:

uv sync

and

source .venv/bin/activate

1) Create a new project folder with a config file and metadata template

To start a new image processing project with PyOPIA, use the ‘init-project’ command followed by a project name (here ‘myproject’):

pyopia init-project myproject

For help and additional options for this command, run: pyopia init-project --help

You should now have a new project folder (‘myproject’) containing a config file (‘config.toml’) and a README file with suggested steps to perform before you start processing. Several other input files and subfolders are also generated:

myproject/
├── auxillarydata
│   └── auxillary_data.csv
├── config.toml
├── images
├── metadata.json
├── processed
├── pyopia-default-classifier-20250409.keras
└── README

2) Make sure you are happy with your config file

Refer to the comments in the examples given in Pipeline config files.

If you need detailed help on arguments specific to a pipeline class, then you may wish to refer to the API documentation for that specific class.

Particle classification is provided by [steps.classifier], which points to a pre-trained Keras CNN model. The init-project command places a default classifier for PyOPIA in the project folder (‘pyopia-default-classifier-20250409.keras’ in the listing above).

3) Add project-relevant metadata

PyOPIA generates a self-describing netCDF file during processing which, in addition to particle statistics, contains some basic metadata. These are taken in part from the ‘metadata.json’ file generated in step 1.

The generated template file ‘metadata.json’ contains several items that should be filled out, such as ‘title’ and ‘creator_name’. Also check that you are happy with the default license proposed (CC BY-SA).

You can add your own metadata items in this file as well.
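The metadata file can be edited by hand, but it can also be filled in from a script. The sketch below assumes that ‘metadata.json’ is a flat JSON object of key/value pairs; the values shown and the extra ‘cruise_id’ item are hypothetical. The demonstration works on a stand-in template in a temporary folder so that it does not touch a real project.

```python
import json
import tempfile
from pathlib import Path


def update_metadata(path, updates):
    """Merge `updates` into the JSON metadata file at `path` and return the result."""
    path = Path(path)
    metadata = json.loads(path.read_text(encoding="utf-8"))
    metadata.update(updates)
    path.write_text(json.dumps(metadata, indent=4) + "\n", encoding="utf-8")
    return metadata


# Demonstration on a stand-in template (not the real generated file).
with tempfile.TemporaryDirectory() as tmp:
    template = Path(tmp) / "metadata.json"
    template.write_text('{"title": "", "creator_name": ""}', encoding="utf-8")
    result = update_metadata(
        template,
        {
            "title": "Example particle survey",  # hypothetical value
            "creator_name": "A. Researcher",     # hypothetical value
            "cruise_id": "EX-001",               # hypothetical extra item
        },
    )
    print(result)
```

In a real project you would call update_metadata("metadata.json", {...}) from inside the project folder.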

4) Add auxiliary data

A typical image dataset is associated with auxiliary data variables, e.g. temperature, salinity and depth for a profiling setup deployed at sea. This information can optionally be incorporated into the particle statistics netCDF file that PyOPIA generates, to ease post-processing. Add it as time series in the auxiliary data file (‘auxillarydata/auxillary_data.csv’). Each row in this file should consist of a time stamp and one or more auxiliary data values. The data are interpolated to the time of each image being processed, so the time stamps need not match the image times exactly, but they should cover the same time period. See the generated template file for more information.
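To illustrate why the time stamps only need to bracket the image times, here is a minimal sketch of interpolating one auxiliary variable to an image time. It assumes linear interpolation between the two nearest samples; this is an illustration of the idea, not PyOPIA's own implementation, and the depth values are made up.

```python
from bisect import bisect_right
from datetime import datetime


def interpolate_at(image_time, times, values):
    """Linearly interpolate `values` (sampled at sorted `times`) to `image_time`."""
    if not times[0] <= image_time <= times[-1]:
        raise ValueError("image time is outside the auxiliary data period")
    i = bisect_right(times, image_time)
    if i == len(times):  # image_time equals the last time stamp
        return values[-1]
    t0, t1 = times[i - 1], times[i]
    weight = (image_time - t0) / (t1 - t0)  # fraction of the way from t0 to t1
    return values[i - 1] + weight * (values[i] - values[i - 1])


# Made-up depth samples taken 10 seconds apart.
times = [datetime(2025, 1, 1, 12, 0, 0), datetime(2025, 1, 1, 12, 0, 10)]
depth = [10.0, 20.0]

# An image captured between the two samples gets an in-between depth.
print(interpolate_at(datetime(2025, 1, 1, 12, 0, 4), times, depth))
```

An image time outside the sampled period raises an error here, which mirrors the requirement that the auxiliary data cover the same period as the images.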

5) Process!

Run the processing from the command line. It only needs to know which config file to work on, e.g.:

pyopia process config.toml

6) Output

Expect an output folder at the location given by the output_datafile argument in the [steps.output] step. It contains either a single new .nc file or several .nc files, depending on whether you set the append = false option (intended for large datasets).

If you defined the export_outputpath argument in [steps.statextract], you will also have a folder containing a series of .h5 files that hold all the particle ROIs.
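A quick way to confirm that processing produced the expected files is to count them. This sketch uses only the Python standard library and assumes the outputs were written somewhere under the ‘processed’ folder of the project; substitute the folder implied by your own output_datafile and export_outputpath settings.

```python
from pathlib import Path


def find_outputs(folder):
    """Return sorted lists of netCDF (.nc) and HDF5 (.h5) files found under `folder`."""
    folder = Path(folder)
    if not folder.is_dir():
        return [], []
    nc_files = sorted(folder.rglob("*.nc"))
    h5_files = sorted(folder.rglob("*.h5"))
    return nc_files, h5_files


# "processed" is an assumed output location; adjust it to match your config.
nc_files, h5_files = find_outputs("processed")
print(f"{len(nc_files)} netCDF file(s), {len(h5_files)} ROI file(s)")
```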