Commit c936271b authored by Clément Haëck's avatar Clément Haëck

Up dataset creation

More flexible: pregex can depend on options
Easier: argument processing is automated

Not backwards-compatible!

adapt lib.data.ostia dataset
Change documentation
parent 2e1ee8f6
......@@ -60,4 +60,4 @@ ds = lib.data.dataset.get_data(args)
You may have noticed that the function described in [[file:datasets.org::*~create_data~][~create_data~]] has the signature ~(args: Dict = None, **kwargs: Any)~.
Those functions process these arguments using ~lib.data.process_args(args_names, args, replace_defaults=None, **kwargs)~. The idea is that we define the arguments the function *needs* with ~args_names~. We then supply both args and kwargs, and only the needed arguments are kept. If arguments are missing, a default value is used (defined either in file:../lib/conf.ini or through ~replace_defaults~). Arguments found in kwargs replace those from args.
This system is a bit heavy but convenient, as the variable args can be passed to all functions of different datasets without worrying about which needs what. And if arguments are missing or must be overwritten, kwargs can still be used.
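As an illustration, the merging logic can be sketched like this (a minimal stand-in, not the actual implementation; the default values are hard-coded here instead of being read from file:../lib/conf.ini):

#+begin_src python
# Sketch of the argument merging: keep only the needed arguments,
# fill missing ones from defaults, let kwargs override args.
DEFAULTS = {'region': 'GS', 'days': 8}  # stand-in for conf.ini values

def process_args_sketch(args_names, args=None, replace_defaults=None, **kwargs):
    defaults = dict(DEFAULTS)
    if replace_defaults is not None:
        defaults.update(replace_defaults)
    merged = dict(args or {})
    merged.update(kwargs)  # kwargs take precedence over args
    out = {}
    for name in args_names:
        if name in merged:
            out[name] = merged[name]
        elif name in defaults:
            out[name] = defaults[name]
        else:
            raise KeyError("Missing argument '{}' with no default".format(name))
    return out

process_args_sketch({'region', 'days'}, {'region': 'EUR', 'extra': 0}, days=1)
# -> {'region': 'EUR', 'days': 1} ('extra' is dropped, kwargs win)
#+end_src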
......@@ -8,25 +8,49 @@ For now I am personally using 'GS' for 'Gulf-Stream'.
* Module file
To load a dataset in memory, its corresponding python file can be used. It can be found in ~lib/data/[module].py~.
In this file can be found several things.
** Some constants
*** Directory arguments
A set of arguments necessary to get the root directory of the data (typically ~{'region', 'days'}~). See [[file:args.org]].
*** Pregex
The files pre-regex. See [[*~create_data~]].
*** Grid
The grid for this data. See [[file:grids.org]].
** ~get_root~
A function that returns the root directory of data (including the [[*Time dependency][time dependency]] folder).
** ~create_data~
At this point, we have all the information we need to find data files and load the data in memory. The steps described next are automated using the ~lib.data.create_data~ function. All functions receive the same arguments.
The first step is to retrieve the data files. Typically there is one file per day, or sometimes per set of parameters. We could use a simple loop or globbing, but this is nowhere near as fun as what we do.
We use the [[https://github.com/Descanonge/filefinder]] package. It finds, in the root directory, all files corresponding to a specific filename structure, specified with the pre-regex. See the package readme and documentation for its other nice features.
For that step we use the ~get_finder()~ function, which returns a Finder object.
The next step is to feed those files to Xarray with the function ~get_data()~.
It returns a dataset using ~xr.open_mfdataset()~.
This file defines some basic information about the data. A function then uses this information to automatically create a number of functions.
** The basic information
*** Arguments names / Dataset parameters
Each dataset can depend on different parameters, supplied by [[file:args.org::Argument processing][arguments]]. The names of the arguments the dataset uses must be supplied as a set.
Typically:
#+begin_src python
ARGS = {'region', 'days'}
#+end_src
/ie/ the region and the temporal resolution.
The ~fixes~ and ~climato~ arguments are automatically added.
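So for the example above, the effective set of parameters handled by the generated functions would be the following (a sketch of the union performed inside ~lib.data.create_dataset~):

#+begin_src python
ARGS = {'region', 'days'}
# 'fixes' and 'climato' are added automatically
effective_args = ARGS | {'fixes', 'climato'}
#+end_src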
*** Root directory function
To create a dataset, we need a function that returns the root directory containing the data (which can depend on the parameters defined above). The function receives the processed arguments (again, see [[file:args.org]]). It should return a list of folder names that will be joined into a path. Example:
#+begin_src python
def ROOT(args):
    return [lib.root_data, args['region'], 'NAME',
            lib.data.get_time_folder(args)]
#+end_src
*** Pre-regex function
Typically, each dataset contains a number of files: one per time step, or one per set of parameters.
We could loop through files, or use globbing, but this is nowhere near as fun as what we do here.
We use the [[https://github.com/Descanonge/filefinder]] package. It finds, in the root directory, all files corresponding to a specific filename structure, specified with the pre-regex. See the package readme and documentation for its features, or look through the existing dataset files for working examples.
These functions are created automatically, but some options can be passed to customize some behaviors. See the function docstring.
To define the pre-regex (the structure of the file), as for the root directory, we define a function taking processed arguments. It should return the pre-regex as a string:
#+begin_src python
def PREGEX(args):
return "%(Y)/DATA_%(Y)%(m)%(d).nc"
#+end_src
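Here ~%(Y)~, ~%(m)~ and ~%(d)~ are filefinder matchers for year, month and day. As a rough illustration only (the actual translation is done by the filefinder package), such a pre-regex corresponds to a regular expression along these lines:

#+begin_src python
import re

# Crude stand-ins for filefinder's matchers, for illustration only
matchers = {'Y': r'\d{4}', 'm': r'\d{2}', 'd': r'\d{2}'}

pregex = "%(Y)/DATA_%(Y)%(m)%(d).nc"
regex = re.sub(r'%\((\w)\)', lambda m: matchers[m.group(1)], pregex)

re.fullmatch(regex, "2007/DATA_20070308.nc")  # this file matches
#+end_src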
** The created functions
With the information given above, we can automatically create a number of useful functions, detailed below.
All functions take as arguments ~(args=None, **kwargs)~.
*** get_root
This simply returns the directory containing all the data.
*** get_pregex
This simply returns the pre-regex string.
*** get_finder
This returns the ~Finder~ object (from the ~filefinder~ package). This object is in charge of finding the files corresponding to our selection.
*** get_filename
This returns a filename for the given arguments. Here the arguments correspond to the matchers' names: since we define our filename structure with the pre-regex, we must fill in its blanks. So for the pre-regex example above:
#+begin_src python
lib.data.name.get_filename(Y=2007, m=3, d=8)
#+end_src
*** get_data
This returns a dataset, obtained with ~xr.open_mfdataset()~. Extra keyword arguments for ~xr.open_mfdataset()~ can be supplied via the ~open_mf_kw~ argument of ~create_dataset~.
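From the code, the handling of ~open_mf_kw~ defaults can be sketched as follows (illustration only; ~parallel=True~ is the single library-wide default visible in the code, and dataset-specific values take precedence over it):

#+begin_src python
# Sketch of how open_mf_kw is combined with the library defaults
def merge_open_mf_kw(open_mf_kw=None):
    open_mf_kw = dict(open_mf_kw or {})
    for key, value in dict(parallel=True).items():
        open_mf_kw.setdefault(key, value)  # keep user values if present
    return open_mf_kw

merge_open_mf_kw({'parallel': False, 'combine': 'by_coords'})
# -> {'parallel': False, 'combine': 'by_coords'}
#+end_src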
* Time dependency
Datasets are typically available daily, as n-days average, or as climatology.
......
import functools
import importlib
import logging
from typing import Any, Dict, List, Optional
from os import path
from typing import Any, Callable, Dict, List, Optional, Set
from filefinder import Finder
import xarray as xr
......@@ -12,62 +14,97 @@ log = logging.getLogger(__name__)
logging.basicConfig()
def create_data(module, pregex, get_root, ARGS_DIR,
defaults=None,
open_mf_kw: Optional[Dict] = None):
def create_dataset(module: str, pregex_func: Callable,
root_func: Callable, args_names: Set[str],
defaults: Optional[Dict] = None,
open_mf_kw: Optional[Dict] = None):
"""Create a dataset.
def get_pregex(pregex, args):
if args.get('climato', None) is not None:
pregex = '{0}/{0}_%(m)%(d).nc'.format(args['climato'])
return pregex
def get_args_fix(finder):
Set various functions in `module`.
See doc on datasets for more details.
"""
# Fixes and climato are default arguments names
args_names |= {'fixes', 'climato'}
# Process arguments before feeding it to func
# See process_args below for details
def with_args(func):
@functools.wraps(func)
def decorated(args=None, **kwargs):
args = process_args(args_names, args, defaults, **kwargs)
return func(args)
return decorated
# Create function to obtain pre-regex
get_pregex = with_args(pregex_func)
# Create function to obtain data root directory
@with_args
def get_root(args):
return path.join(*root_func(args))
def get_matchers_names(finder):
"""Return list of matchers names in a Finder."""
return list(set([m.name for m in finder.matchers]))
def get_finder(args=None, **kwargs):
args = process_args(ARGS_DIR | {'fixes', 'climato'}, args,
replace_defaults=defaults, **kwargs)
# Create function to get finder
@with_args
def get_finder(args):
root = get_root(args)
finder = Finder(root, get_pregex(pregex, args))
pregex = get_pregex(args)
finder = Finder(root, pregex)
finder.fix_matchers({n: v for n, v in args['fixes'].items()
if n in get_args_fix(finder)})
if n in get_matchers_names(finder)})
return finder
def get_filename(args=None, finder=None, **kwargs):
if finder is None:
finder = get_finder(args, **kwargs)
# Add DEFAULTS['fixes'] to defaults.
# Create function to get filename
def get_filename(args=None, **kwargs):
finder = get_finder(args, **kwargs)
# Add lib.DEFAULTS['fixes'] to dataset-specific defaults args
if defaults is None:
defaults_ = {}
else:
defaults_ = defaults.copy()
defaults_.update(lib.DEFAULTS.get('fixes', {}))
args = process_args(get_args_fix(finder), args,
replace_defaults=defaults_, **kwargs)
# Process args, their names are the matchers
args = process_args(get_matchers_names(finder),
args, defaults_, **kwargs)
return finder.get_filename(args)
# Default values for the open_mf_kw dictionary
if open_mf_kw is None:
open_mf_kw = dict()
open_mf_kw_def: Dict = dict(
parallel=True
)
open_mf_kw_def.update(open_mf_kw)
def get_data(args=None, **kwargs):
finder = get_finder(args, **kwargs)
if 'preprocess_finder' in open_mf_kw_def:
pp_args = open_mf_kw_def.pop('preprocess_finder_args', [])
pp_kw = open_mf_kw_def.pop('preprocess_finder_kwargs', {})
open_mf_kw_def['preprocess'] = finder.get_func_process_filename(
open_mf_kw_def.pop('preprocess_finder'), *pp_args, **pp_kw)
for k, v in open_mf_kw_def.items():
open_mf_kw.setdefault(k, v)
# Create function to obtain data
@with_args
def get_data(args):
finder = get_finder(args)
# preprocess_finder is special case
# Apply finder.get_func_process_filename
if 'preprocess_finder' in open_mf_kw:
pp_args = open_mf_kw.pop('preprocess_finder_args', [])
pp_kw = open_mf_kw.pop('preprocess_finder_kwargs', {})
open_mf_kw['preprocess'] = finder.get_func_process_filename(
open_mf_kw.pop('preprocess_finder'), *pp_args, **pp_kw)
if len(finder.files) == 0:
raise RuntimeError("No files found. \n{:s}".format(str(finder)))
ds = xr.open_mfdataset(finder.get_files(), **open_mf_kw_def)
ds = xr.open_mfdataset(finder.get_files(), **open_mf_kw)
return ds
mod = importlib.import_module(module)
setattr(mod, 'get_pregex', get_pregex)
setattr(mod, 'get_root', get_root)
setattr(mod, 'get_finder', get_finder)
setattr(mod, 'get_filename', get_filename)
setattr(mod, 'get_data', get_data)
......@@ -87,7 +124,7 @@ def get_time_folder(args) -> str:
return '{:d}days'.format(days)
def process_args(args_names: List[str],
def process_args(args_names: Set[str],
args: Optional[Dict] = None,
replace_defaults: Optional[Dict] = None,
**kwargs: Any) -> Dict:
......@@ -102,7 +139,7 @@ def process_args(args_names: List[str],
return args
def get_data_args(args_names: Optional[List[str]] = None,
def get_data_args(args_names: Optional[Set[str]] = None,
args: Optional[Dict] = None,
**kwargs: Any) -> Dict:
"""Put `kwargs` into args.
......@@ -127,7 +164,7 @@ def get_data_args(args_names: Optional[List[str]] = None,
return args
def put_defaults(args_names: List[str],
def put_defaults(args_names: Set[str],
args: Optional[Dict] = None, **kwargs: Any) -> Dict:
"""Put defaults arguments in args.
......
......@@ -5,24 +5,26 @@ Initially downloaded from CMEMS
https://resources.marine.copernicus.eu/product-detail/SST_GLO_SST_L4_REP_OBSERVATIONS_010_024/INFORMATION
"""
from os import path
import lib
import lib.data
ARGS_DIR = {'region', 'days'}
pregex = '%(Y)/SST_processed_%(Y)%(m)%(d).nc'
pregex_original = '%(Y)/SST_%(Y)%(m)%(d).nc'
grid = '4km_EPSG32662'
grid_original = '4km_EPSG4326'
ARGS = {'region', 'days', 'processed'}
DEFAULTS = dict(processed=True)
def PREGEX(args):
pregex = "%(Y)/SST{}_%(Y)%(m)%(d).nc".format(
'_processed' if args['processed'] else '')
return pregex
def get_root(args=None, **kwargs):
args = lib.data.process_args(ARGS_DIR | {'climato'}, args, **kwargs)
root = path.join(lib.root_data, args['region'], 'OSTIA',
lib.data.get_time_folder(args))
def ROOT(args):
root = [lib.root_data, args['region'],
'OSTIA', lib.data.get_time_folder(args)]
return root
lib.data.create_data(__name__, pregex, get_root, ARGS_DIR)
grid = '4km_EPSG32662'
grid_original = '4km_EPSG4326'
lib.data.create_dataset(__name__, PREGEX, ROOT, ARGS, DEFAULTS)
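As a quick check, the new option-dependent ~PREGEX~ reproduces the two pre-regexes that were previously hard-coded constants (the function is repeated here so the snippet stands alone):

#+begin_src python
def PREGEX(args):
    pregex = "%(Y)/SST{}_%(Y)%(m)%(d).nc".format(
        '_processed' if args['processed'] else '')
    return pregex

PREGEX({'processed': True})   # -> '%(Y)/SST_processed_%(Y)%(m)%(d).nc'
PREGEX({'processed': False})  # -> '%(Y)/SST_%(Y)%(m)%(d).nc'
#+end_src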