
Commit ddaecaba authored by Clément Haëck's avatar Clément Haëck

Add docs

parent 812b23e4
@@ -9,34 +9,8 @@ Collocate submesoscale fronts to phytoplankton levels using satellite imagery
- Compute: Process data, compute diagnostics,...
- Plots: Plotting scripts
## Setup
The environment variable `$SUBMESO_COLOR_CODE_DIR` must be defined and point to
this repository location.
Bash scripts will source the `` file in it, which defines other necessary
variables, notably `$SUBMESO_COLOR_DATA_DIR` for the location of data and plots.
It also appends this repository to the `$PYTHONPATH` variable.
The organisation of `` is left to the user.
An example is given which works on a personal computer and on the [Ciclad]( cluster.
## Doc
## Requirements
- [dateloop](
A conda environment file is provided ([./env.yml](./env.yml))
Some documentation is available in orgmode format. See the [Index](./docs/index).
#+TITLE: Args
All scripts are written with command-line use in mind.
Parameters can be changed through command-line arguments, and their retrieval is largely automated.
All the magic happens in ~lib.get_args()~.
* Usage
Typically a script is written as:
#+begin_src python
import lib

def main(args):
    ...

if __name__ == '__main__':
    args = lib.get_args(['region', 'days', 'fixes'])
    main(args)
#+end_src
`args` is a dictionary containing argument names and their values.
* Default arguments
The list of argument names passed above is drawn from a list of default arguments.
I won't keep this documentation updated, so see the source code of ~lib.get_args()~ for the list of those.
These arguments have a default value that can be set in `lib/conf.ini`.
Some arguments go through additional processing.
** date
date is a YYYY/MM[/DD] format argument. The '/' separator can also be '-', a space, or nothing. ~args['date_str']~ is a tuple of the strings of year, month, and optionally day. ~args['date']~ stores a datetime object.
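As a rough sketch of the format described above (illustrative only; the actual parsing in ~lib.get_args()~ may differ):
#+begin_src python
import re
from datetime import datetime

def parse_date(arg):
    """Parse YYYY/MM[/DD]; separator may be '/', '-', a space, or absent."""
    parts = re.split(r"[-/ ]", arg)
    if len(parts) == 1:  # no separator: split on fixed widths
        s = parts[0]
        parts = [s[:4], s[4:6]]
        if len(s) > 6:
            parts.append(s[6:8])
    date_str = tuple(parts)
    numbers = [int(p) for p in parts]
    # a missing day defaults to the first of the month
    date = datetime(*(numbers + [1] * (3 - len(numbers))))
    return date_str, date
#+end_src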
** fixes
Fixes are supplied as '-fix <matcher> <string>'. They can then be found as a dictionary with the matcher names as keys. See [[*Fixes]]
* Adding other arguments
Other arguments can be retrieved like so:
#+begin_src python
def add_args(parser):
    parser.add_argument('-argument_name', type=int, default=0)

args = lib.get_args(['...'], add_args)
#+end_src
* Fixes
Fixes are an important feature of the filefinder package. When [[*Pregex][finding datafiles]], one can 'fix' a matcher (a part of the filename that varies, for instance the date for daily files) to a certain value.
That can be a string (which can be a regex!), or a value. Quick example, where we only want the month of May:
#+begin_src python
finder =
#+end_src
Now, we can do this automatically from the command line by supplying the appropriate argument: ~python -fix m 05~ (we can only supply a string, so it must match the date format). This will end up in the args dictionary, where it can be passed to ~get_finder()~ or ~get_data()~. So:
#+begin_src python
args = lib.get_args(['fixes'])
ds =
#+end_src
`ds` will only have data for the month of May. Convenient when testing a script on smaller data.
* Argument processing
You may have noticed that the functions described in [[*create_data][create_data]] have the following signature ~(args: Dict = None, **kwargs: Any)~.
Those functions process their arguments with a helper taking ~(args_names, args, replace_defaults=None, **kwargs)~. The idea is that we define the arguments the function *needs* with args_names. We then supply both args and kwargs; only the needed arguments are kept. If arguments are missing, a default value is used (defined either in lib/conf.ini or in replace_defaults). Arguments found in kwargs replace those from args.
This system is a bit heavy but convenient, as the args variable can be passed to all functions from different datasets without worrying about which needs what. And if arguments are missing, or must be overwritten, kwargs can still be used.
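The mechanism can be sketched as follows (names and default values are illustrative, not the actual lib code):
#+begin_src python
def process_args(args_names, args=None, replace_defaults=None, **kwargs):
    # Illustrative defaults, standing in for lib/conf.ini
    defaults = {'region': 'GS', 'days': 1, 'fixes': {}}
    if replace_defaults is not None:
        defaults.update(replace_defaults)
    args = args or {}
    out = {}
    for name in args_names:    # keep only the needed arguments
        if name in kwargs:     # kwargs take precedence over args
            out[name] = kwargs[name]
        elif name in args:
            out[name] = args[name]
        else:                  # fall back to the default value
            out[name] = defaults[name]
    return out
#+end_src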
#+TITLE: Boxes
We sometimes need to define a rectangular area (a zone clean of clouds, for instance).
`` defines several classes to help with this notion.
With those, it is easy to define a rectangular area, use it to slice a dataset, or plot it.
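The actual classes are in the module; a minimal sketch of the idea (hypothetical names, assuming an xarray-style dataset with ~lon~/~lat~ coordinates) could be:
#+begin_src python
from dataclasses import dataclass

@dataclass
class Box:
    """Hypothetical rectangular area, bounds in degrees."""
    lon_min: float
    lon_max: float
    lat_min: float
    lat_max: float

    def contains(self, lon, lat):
        return (self.lon_min <= lon <= self.lon_max
                and self.lat_min <= lat <= self.lat_max)

    def sel_kwargs(self):
        # keyword arguments for an xarray Dataset.sel() call
        return dict(lon=slice(self.lon_min, self.lon_max),
                    lat=slice(self.lat_min, self.lat_max))
#+end_src
A dataset could then be sliced with ~ds.sel(**box.sel_kwargs())~.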
#+TITLE: Datasets
* Location
All data is found in ~$SUBMESO_COLOR_DATA_DIR/[Region]/[dataset name / abbreviation]~.
The region folder is made to be able to work on different locations at the same time.
For now I am personally using 'GS' for 'Gulf-Stream'.
* Module file
To load a dataset in memory, its corresponding python file can be used. It can be found in `lib/data/[module].py`.
In this file can be found several things.
** Some constants
*** Directory arguments
A set of necessary arguments to get the root directory of data (typically ~{'region', 'days'}~). See [[]].
*** Pregex
The files' pre-regex. See [[*create_data]].
*** Grid
The grid for this data. See [[]].
** get_root
A function that returns the root directory of data (including the [[*Time dependency]] folder).
** create_data
At this point, we have all the information we need to find data files and load the data in memory. The steps described next are automated using the functions below. All functions receive the same arguments.
The first step is to retrieve the data files. Typically there is one file per day, or sometimes per set of parameters. We could use a simple loop or globbing, but that is nowhere near as fun as what we do.
We use the [[]] package. It allows finding, in the root directory, all files corresponding to a specific filename structure, specified with the [[*Pregex]]. See the package readme and documentation for its other nice features.
For that step we use the ~get_finder()~ function, which returns a Finder object.
The next step is to feed those files to Xarray with the function ~get_data()~.
It returns a dataset using ~xr.open_mfdataset()~.
These functions are created automatically, but some options can be passed to customize some behaviors. See the function docstring.
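To give an idea of the pre-regex mechanism, here is a toy re-implementation (this is *not* the filefinder API; names and the placeholder syntax are illustrative):
#+begin_src python
import re

# Toy pre-regex: '%(x)' placeholders stand for varying parts of filenames
PREGEX = r"sst_%(Y)%(m)%(d)\.nc"
MATCHERS = {'Y': r'\d{4}', 'm': r'\d{2}', 'd': r'\d{2}'}

def pregex_to_regex(pregex, fixes=None):
    """Expand placeholders; a 'fixed' matcher is replaced by its value."""
    fixes = fixes or {}
    def repl(match):
        name = match.group(1)
        return str(fixes.get(name, MATCHERS[name]))
    return re.compile(re.sub(r"%\((\w)\)", repl, pregex))

files = ['sst_20190504.nc', 'sst_20190604.nc', 'sst_20200105.nc']
may_only = pregex_to_regex(PREGEX, fixes={'m': '05'})
found = [f for f in files if may_only.fullmatch(f)]
#+end_src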
* Time dependency
Datasets are typically available daily, as n-days average, or as climatology.
Each variation has its own subfolder of form '1days', '{n}days', 'climato'.
For more on the climatology, see [[]].
* Zones
We define some static 'zones'; this dataset is quite special and has its own doc [[][here]].
#+TITLE: Grids
Because of the datasets that I use, I currently use 3 different grids.
* 1km
A 1km grid for the MODIS dataset.
* 4km
Two 4km grids for GlobColour and OSTIA.
Those grids have slightly different resolutions. But we need to overlap (pixel by pixel) the Chl-a data with the HI data (computed from SST), which lie on 2 different grids.
To avoid this hassle, we regrid the OSTIA SST onto the GlobColour grid with `Compute/`; we then no longer have to deal with 2 grids.
This was chosen recently, so for a time I dealt with 2 grids. This is why most scripts (and zone/land rasters, see [[]]) actually support choosing one grid or the other, using the grid argument.
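Since the two 4km grids are nearly aligned, the regridding can be sketched as a nearest-neighbour lookup (illustrative only; `Compute/` may do something different):
#+begin_src python
import numpy as np

def regrid_nearest(src_lat, src_lon, field, dst_lat, dst_lon):
    """Sample `field` (lat x lon) at the nearest source point of each destination point."""
    # index of the closest source latitude/longitude for each destination point
    i = np.abs(src_lat[:, None] - dst_lat[None, :]).argmin(axis=0)
    j = np.abs(src_lon[:, None] - dst_lon[None, :]).argmin(axis=0)
    return field[np.ix_(i, j)]
#+end_src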
** OSTIA grid
** GlobColour grid
* HI definition
* Compilation
How to compile.
How it works (the script).
#+TITLE: Histograms
* Method of computation
advantages / drawbacks of using xhistogram
#+TITLE: Index
This is this project's documentation.
It is far from covering everything, but should put you on the right track.
* Folders
** Compute
For scripts that compute stuff.
** Download
For downloading and pre-processing data.
** Docs
You're looking at it.
** lib
General modules
*** data
** Plots
Plots. The folder is a hot mess. Lots of scripts are not kept up to date. Use at your own risk.
* Other
** [[][Setup]]
#+TITLE: Setup
First step is to clone this repository into a folder of your choice.
* Requirements
** Python
Python >= 3.8 is necessary.
These packages:
- numpy
- scipy
- pandas
- xarray
- dask
- matplotlib
- netcdf4
- motuclient
- cftime
- shapely
- global-land-mask
- filefinder
A conda environment file with pinned versions is provided at [[file:../env.yml]].
To install: ~conda env create -f ./env.yml~ and to update: ~conda env update -f ./env.yml~.
** Dateloop
A small script to loop over dates. It is used in some scripts.
Installation details over at: [[]]
* Setup
Some other work is necessary to finalize the setup.
The environment variable ~$SUBMESO_COLOR_CODE_DIR~ must be defined and point to
this repository root directory.
Bash scripts will source the `` file in it, which defines other necessary
variables, notably ~$SUBMESO_COLOR_DATA_DIR~ for the location of data and plots.
It also appends this repository to the ~$PYTHONPATH~ variable, so that all modules are available.
The organisation of `` is left to the user. An example is given which works on a personal computer and on the [[][Ciclad]] cluster.
On Ciclad, Python scripts must be executed using [[file:../]].
#+TITLE: Zones and land
Zones and land rasters are accessible from lib.zones.
* Zones
We sometimes have to define static, geometric zones.
We use a custom Zone class. An instance is created by passing a list of point coordinates. We also keep a shapely object, which can be useful for some geometric considerations (see interactions with [[]]).
When running `lib/`, all zones are rasterized on the different [[][grids]] defined. Those rasters are saved to files that can be loaded as any other dataset.
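The rasterization could be sketched like this (a toy version using shapely's point-in-polygon test; the zone and grid are hypothetical, and the actual script may be vectorized differently):
#+begin_src python
import numpy as np
from shapely.geometry import Point, Polygon

# Hypothetical zone and coarse grid
zone = Polygon([(-75., 35.), (-60., 35.), (-60., 45.), (-75., 45.)])
lon = np.arange(-80., -50., 5.)
lat = np.arange(30., 50., 5.)

# Boolean raster: True where the grid point falls inside the zone
raster = np.array([[zone.contains(Point(lo, la)) for lo in lon]
                   for la in lat])
#+end_src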
* Land
It is useful to have a raster that says, for every pixel, whether there is land or not.
Those rasters can be loaded with ~lib.zones.get_land()~ which works as any other get_data() function.
Those rasters are created using `Compute/`. This script uses either downloaded data containing masks, or the [[][global_land_mask]] package.
#!/usr/bin/env bash
# Launch a python script on Ciclad
# All arguments are passed to the python script.