-
Lionel GUEZ authoredLionel GUEZ authored
documentation.tex 11.51 KiB
\documentclass[a4paper,french]{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage{babel}
\usepackage{graphicx}
\usepackage[super]{nth}
\usepackage[np,autolanguage]{numprint}
\usepackage{hyperref}
\hypersetup{pdftitle={Documentation convert Matlab}, pdfauthor={Lionel
Guez}, hypertexnames=false}
%---------------
\graphicspath{{Graphiques/}}
\title{Documentation for scripts converting Matlab data}
\author{Lionel GUEZ}
%---------------
\begin{document}
\maketitle
\tableofcontents
\section{Identification of instantaneous eddies}
\label{sec:identification}
Instantaneous eddies with a given orientation are identified in two
equivalent ways: either by a couple date $d$ and eddy index $e$ at
that date, or by a unique integer identifier $n$, which we can call a
\og node index\fg{}.
For a given date, let $d$ be the corresponding number of days since
January \nth{1}, 1950. Let $d_\mathrm{init}$ be the value of $d$ for
the first date of the dataset. We call date index the integer:
\begin{equation*}
k = d - d_\mathrm{init}
\end{equation*}
So the first value of $k$ is 0.
Let $e_\mathrm{max}(k)$ be the number of instantaneous eddies at date
index $k$. $e_\mathrm{max}(k)$ is stored in Matlab variables Nanti and
Ncyclo. We assume we know an overestimate $E$ of $e_\mathrm{max}(k)$
for all $k$. That is, we guarantee that:
\begin{equation*}
\forall k, e_\mathrm{max}(k) \le E
\end{equation*}
($E$ may not be the same for cyclones and anticyclones.)
The eddy indices $e$ for eddies at date index $k$ start at 1 and
increment 1 by 1, without any jump. So they go from 1 to
$e_\mathrm{max}(k)$.
The relation between node index $n$, date index $k$ and eddy index $e$
is:
\begin{equation*}
n = k E + e
\end{equation*}
So $n \ge 1$ and usually jumps at each change of date. The
anticyclones at the first date have a node index between 1 and $E$, at
the second date between $E + 1$ and $2 E$, and so on. Same for
cyclones. Cf. table (\ref{tab:eddy_id_Matlab}).
\begin{table}[htbp]
\centering
\begin{tabular}{lll}
date index 0 & & \\
& anticyclones & $n = 1, \dots, e_\mathrm{max,anti}(0)$ \\
& cyclones & $n = 1, \dots, e_\mathrm{max,cyclo}(0)$ \\
date index k & & \\
& anticyclones & $n = k E + 1, \dots, k
E + e_\mathrm{max,anti}(k)$ \\
& cyclones & $n = k E + 1, \dots, k E
+ e_\mathrm{max,cyclo}(k)$
\end{tabular}
\caption{Node indices. $k$ is the date index, starting at 0.}
\label{tab:eddy_id_Matlab}
\end{table}
Conversely, from the definition of $E$, knowing $n$ and $E$, we can
obtain $(k, e)$:
\begin{equation*}
k = \left \lfloor \frac{n - 1}{E} \right \rfloor
\end{equation*}
\begin{align*}
e & = n - k E \\
& = 1 + (n - 1) \bmod E
\end{align*}
In the Matlab program, the node index $n'$ is different:
\begin{equation*}
n' = k' E + e
\end{equation*}
where $k'$ is the index, starting at 0, of the date $d$ in the suite
of processed dates. So $k' \ne k$ if there is a missing date in the
suite of processed dates.
\begin{equation*}
k' = \left \lfloor \frac{n' - 1}{E} \right \rfloor
\end{equation*}
The suite of processed dates is stored in the array
\verb+date_num+. From $n'$ we can compute $k'$ and $e$, and then $d$
from \verb+date_num[k']+, without any search. But the reverse
conversion needs a search: from $d$ and $e$, we need to search
\verb+date_num+ to get $k'$, then $n'$.
To convert from $n'$ to $n$, we note that:
\begin{equation*}
k = \mathtt{date\_num[k']} - \mathtt{date\_num[0]}
\end{equation*}
and:
\begin{equation*}
n = n' + (k - k') E
\end{equation*}
Note that $E$ is specific to a given orientation.
\section{Instantaneous eddies}
The data for instantaneous eddies is stored in shapefiles, in the
directories \verb+SHPC_(anti|cyclo)+. Cf. figure \ref{fig:convert_Matlab}.
\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{convert_Matlab}
\caption{Correspondance between input Matlab files and files
converted from them. Below the names of Matlab files are indicated
the names of the variables which are used inside those files, and
the fields used inside those variables.}
\label{fig:convert_Matlab}
\end{figure}
The directory \verb+SHPC_(anti|cyclo)+ contains a set of four
shapefiles: centroid, extremum, \verb+max_speed_contour+ and
\verb+outermost_contour+. The four shapefiles correspond to four \og
layers\fg{} of eddies. (\og layers\fg{} is a term that you can often
find in the documentation of software dealing with geographical data.)
Each layer corresponds to a given type of geometry. Here we have only
two types of geometry: points and polygons. The layers centroid and
extremum contain points, while the layers \verb+max_speed_contour+ and
\verb+outermost_contour+ contain polygons. The centroid layer is for
the geometric center of the maximum-speed contour. The extremum layer
is for the position of the extremum of SSH, which is called center in
the Matlab files.
A shapefile contains \og shapes\fg{}. The shapes are identified by a
shape index $i$, starting at 0. Each eddy has a shape, at the same
shape index $i$, in the four layers. Cf. figure \ref{fig:SHPC}.
\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{SHPC}
\caption{Contents of a shapefile collection: SHPC\_(anti|cyclo).}
\label{fig:SHPC}
\end{figure}
The shapefile collection in \verb+SHPC_(anti|cyclo)+ contains data for
all the dates. The dates are stored successively in the layers. From
the computer point of view, each shapefile is made up of three files,
ending with suffixes \verb+.shp+, \verb+.dbf+ and \verb+.shx+. The
\verb+.shp+ file contains the positions, the \verb+.dbf+ file contains
the metadata, and the \verb+.shx+ file is an index. So the \verb+.shp+
file is the largest file of the three. But the three files form a
logical unit and you should never separate them.
The instantaneous eddies are identified, for a given orientation, by a
date and an eddy index at that date. See §
\ref{sec:identification}. These are fields (column headers) in the dbf
files. The date is given as the number of days since January \nth{1},
1950, in the field \verb+days_1950+. The eddy index is between 1 and
the number of eddies at the date, in the field \verb+eddy_index+.
The file \verb+ishape_last.txt+ in the directory
\verb+SHPC_(anti|cyclo)+ is used to access directly any instantaneous
eddy at any date. Let $d_0$ be the first date in an SHPC. Let $l(d)$
be the shape index in the SHPC of the last instantaneous eddy at date
$d$:
\begin{equation*}
l(d) = \sum_{d' = d_0} ^d e_\mathrm{max}(d') - 1
\end{equation*}
So the shape index $i$ of the eddy with eddy index $e$ at date $d$ is:
\begin{equation*}
i =
\begin{array}{|ll}
e - 1 & \mathrm{if}\ d = d_0 \\
l(d - 1) + e & \mathrm{if}\ d > d_0
\end{array}
\end{equation*}
Also, note that, if $d > d_0$ then:
\begin{equation*}
e_\mathrm{max}(d) = l(d) - l(d - 1)
\end{equation*}
\verb+ishape_last.txt+ gives $l(d)$ for all $d$ in an SHPC.
Finally, the file \verb+grid_nml.txt+ in the directory
\verb+SHPC_(anti|cyclo)+ gives (in Fortran namelist format) the grid
of SSH data from which the eddies were detected.
The shapefiles are in binary format, so you need special software to
read them. There is actually a large number of programs to read
them. The metadata, in the \verb+.dbf+ file, is a spreadsheet, and you
can open it with LibreOffice. There are also shell commands that can
read shapefiles: dbfinfo, shpinfo, dbfdump and shpdump (analogous to
ncdump for NetCDF files). If you have a Linux Debian-type distribution
(Ubuntu, LinuxMint etc.), you can install these tools with:
\begin{verbatim}
sudo apt-get install shapelib
\end{verbatim}
The tools are available on Ciclad. You can also access the shapefiles
programmatically with a number of computer languages. From Python, one
of the modules you can use is
\href{https://fiona.readthedocs.io/en/latest/}{Fiona}. Matlab has the
shaperead function.
Sur le domaine Eurec4A NRT, toutes les dates (soit 117 dates), au
total pour les deux orientations, \verb+inst_eddies_v6.py+ prend
environ 12 mn et produit 3 MiB, pour 2951 tourbillons instantanés.
Sur le domaine global, pour une seule date, \verb+inst_eddies_v6.py+
prend 25 s (dont environ 21 s de conversion au format
v6). \verb+inst_eddies_HDF5.py+ prend 38 s. Il reste le problème du
stockage intermédiaire pour les fichiers v6. Le fichier produit par
\verb+inst_eddies.m+ prend environ le quart de l'espace du fichier de
départ.
\section{Overlapping}
Pour convertir les données Matlab, les scripts Matlab de conversion et
les fichiers d'entrée doivent être dans le répertoire courant, où je
dois avoir l'autorisation d'écriture. Créer donc des liens symboliques.
\begin{verbatim}
ln -s .../Detection_eddies/Convert_Matlab/get_N_eddies.m
ln -s .../Parameters_Anticyclonic_Eddies.mat
ln -s .../Parameters_Cyclonic_Eddies.mat
matlab -batch get_N_eddies
\end{verbatim}
Puis le traitement est indépendant pour chaque orientation. Par
exemple pour les anticyclones :
\begin{verbatim}
ln -s .../Trajectories/Association_eddies_Anti_max.mat \
Association_eddies_max.mat
ln -s .../Detection_eddies/Convert_Matlab/get_id_child.m
matlab -batch get_id_child
ln -s Nanti.mat N_eddies.mat
.../Detection_eddies/Convert_Matlab/overlap_v6.py
mv edgelist.csv edgelist_anti.csv
\end{verbatim}
The data on overlapping instantaneous eddies is in the file
\verb+edgelist.csv+. The overlapping of eddies is represented
by an abstract graph. In the graph, a node is an instantaneous
eddy. Two nodes of the graph are connected by an edge if the two
corresponding eddies overlap. The graph is directed, which means that
the edges have a direction: the direction is chronological.
The file \verb+edgelist.csv+ stores, for a given orientation
of eddies, the whole graph as a list of edges. (This is a common graph
storage format.) Each line stores one edge: the origin node of the
edge followed by the target node of the edge. A node is identified by
a node index. Knowledge of the node index is equivalent to knowledge
of the date and the eddy index. See § \ref{sec:identification}.
Sur le domaine Eurec4A NRT, toutes les dates (soit 117 dates), au
total pour les deux orientations. \verb+get_N_eddies.m+ prend 0
mn. \verb+overlap_v6.py+ prend 0 mn et produit 48 KiB, pour 2793
arêtes.
Sur le domaine \verb+PhD_Lax+, le script \verb+overlap_HDF5.py+ prend
2 h 21 mn au total pour les deux graphes. Le script
\verb+overlap_v6.py+ prend 3 mn, auxquelles il faut ajouter 12 mn de
conversion au format v6 (total pour les fichiers nécessaires aux deux
graphes).
Sur le domaine global, toutes les dates, pour chaque
orientation. \verb+get_id_child.m+ prend environ 1 h et a besoin de 12
GiB de mémoire virtuelle. \verb+overlap_v6.py+
prend 7 mn et produit \np{0.5} GiB, pour \np{2.8e7} arêtes.
\section{Survival}
The last part of the Matlab program has the task of recognizing
several instantaneous eddies as a same evolving physical object. We
call this part of the analysis the survival part. The result of this
analysis is a dictionary of trajectories, stored in a file
\verb+traj_(anti|cyclo).json+. The key for each trajectory is the
identifying number of the trajectory and the value is the
corresponding list of instantaneous eddies. Here, an instantaneous
eddy is identified by its node index.
Sur le domaine Eurec4A NRT, toutes les dates (soit 117 dates), au
total pour les deux orientations. \verb+survival.m+ prend 0 mn.
\verb+survival.py+ prend 0 mn et produit 64 KiB pour 2863 n\oe{}uds et
169 trajectoires.
\end{document}