\documentclass[a4paper,french]{article} \usepackage[utf8]{inputenc} \usepackage{amsmath} \usepackage[T1]{fontenc} \usepackage{lmodern} \usepackage{babel} \usepackage{graphicx} \usepackage[super]{nth} \usepackage[np,autolanguage]{numprint} \usepackage{hyperref} \hypersetup{pdftitle={Documentation convert Matlab}, pdfauthor={Lionel Guez}, hypertexnames=false} %--------------- \graphicspath{{Graphiques/}} \title{Documentation for scripts converting Matlab data} \author{Lionel GUEZ} %--------------- \begin{document} \maketitle \tableofcontents \section{Identification of instantaneous eddies} \label{sec:identification} Instantaneous eddies with a given orientation are identified in two equivalent ways: either by a couple date $d$ and eddy index $e$ at that date, or by a unique integer identifier $n$, which we can call a \og node index\fg{}. For a given date, let $d$ be the corresponding number of days since January \nth{1}, 1950. Let $d_\mathrm{init}$ be the value of $d$ for the first date of the dataset. We call date index the integer: \begin{equation*} k = d - d_\mathrm{init} \end{equation*} So the first value of $k$ is 0. Let $e_\mathrm{max}(k)$ be the number of instantaneous eddies at date index $k$. $e_\mathrm{max}(k)$ is stored in Matlab variables Nanti and Ncyclo. We assume we know an overestimate $E$ of $e_\mathrm{max}(k)$ for all $k$. That is, we guarantee that: \begin{equation*} \forall k, e_\mathrm{max}(k) \le E \end{equation*} ($E$ may not be the same for cyclones and anticyclones.) The eddy indices $e$ for eddies at date index $k$ start at 1 and increment 1 by 1, without any jump. So they go from 1 to $e_\mathrm{max}(k)$. The relation between node index $n$, date index $k$ and eddy index $e$ is: \begin{equation*} n = k E + e \end{equation*} So $n \ge 1$ and usually jumps at each change of date. The anticyclones at the first date have a node index between 1 and $E$, at the second date between $E + 1$ and $2 E$, and so on. Same for cyclones. Cf. table (\ref{tab:eddy_id_Matlab}). \begin{table}[htbp] \centering \begin{tabular}{lll} date index 0 & & \\ & anticyclones & $n = 1, \dots, e_\mathrm{max,anti}(0)$ \\ & cyclones & $n = 1, \dots, e_\mathrm{max,cyclo}(0)$ \\ date index k & & \\ & anticyclones & $n = k E + 1, \dots, k E + e_\mathrm{max,anti}(k)$ \\ & cyclones & $n = k E + 1, \dots, k E + e_\mathrm{max,cyclo}(k)$ \end{tabular} \caption{Node indices. $k$ is the date index, starting at 0.} \label{tab:eddy_id_Matlab} \end{table} Conversely, from the definition of $E$, knowing $n$ and $E$, we can obtain $(k, e)$: \begin{equation*} k = \left \lfloor \frac{n - 1}{E} \right \rfloor \end{equation*} \begin{align*} e & = n - k E \\ & = 1 + (n - 1) \bmod E \end{align*} In the Matlab program, the node index $n'$ is different: \begin{equation*} n' = k' E + e \end{equation*} where $k'$ is the index, starting at 0, of the date $d$ in the suite of processed dates. So $k' \ne k$ if there is a missing date in the suite of processed dates. \begin{equation*} k' = \left \lfloor \frac{n' - 1}{E} \right \rfloor \end{equation*} The suite of processed dates is stored in the array \verb+date_num+. From $n'$ we can compute $k'$ and $e$, and then $d$ from \verb+date_num[k']+, without any search. But the reverse conversion needs a search: from $d$ and $e$, we need to search \verb+date_num+ to get $k'$, then $n'$. To convert from $n'$ to $n$, we note that: \begin{equation*} k = \mathtt{date\_num[k']} - \mathtt{date\_num[0]} \end{equation*} and: \begin{equation*} n = n' + (k - k') E \end{equation*} Note that $E$ is specific to a given orientation. \section{Instantaneous eddies} The data for instantaneous eddies is stored in shapefiles, in the directories \verb+SHPC_(anti|cyclo)+. Cf. figure \ref{fig:convert_Matlab}. \begin{figure}[htbp] \centering \includegraphics[width=\textwidth]{convert_Matlab} \caption{Correspondance between input Matlab files and files converted from them. Below the names of Matlab files are indicated the names of the variables which are used inside those files, and the fields used inside those variables.} \label{fig:convert_Matlab} \end{figure} The directory \verb+SHPC_(anti|cyclo)+ contains a set of four shapefiles: centroid, extremum, \verb+max_speed_contour+ and \verb+outermost_contour+. The four shapefiles correspond to four \og layers\fg{} of eddies. (\og layers\fg{} is a term that you can often find in the documentation of software dealing with geographical data.) Each layer corresponds to a given type of geometry. Here we have only two types of geometry: points and polygons. The layers centroid and extremum contain points, while the layers \verb+max_speed_contour+ and \verb+outermost_contour+ contain polygons. The centroid layer is for the geometric center of the maximum-speed contour. The extremum layer is for the position of the extremum of SSH, which is called center in the Matlab files. A shapefile contains \og shapes\fg{}. The shapes are identified by a shape index $i$, starting at 0. Each eddy has a shape, at the same shape index $i$, in the four layers. Cf. figure \ref{fig:SHPC}. \begin{figure}[htbp] \centering \includegraphics[width=\textwidth]{SHPC} \caption{Contents of a shapefile collection: SHPC\_(anti|cyclo).} \label{fig:SHPC} \end{figure} The shapefile collection in \verb+SHPC_(anti|cyclo)+ contains data for all the dates. The dates are stored successively in the layers. From the computer point of view, each shapefile is made up of three files, ending with suffixes \verb+.shp+, \verb+.dbf+ and \verb+.shx+. The \verb+.shp+ file contains the positions, the \verb+.dbf+ file contains the metadata, and the \verb+.shx+ file is an index. So the \verb+.shp+ file is the largest file of the three. But the three files form a logical unit and you should never separate them. The instantaneous eddies are identified, for a given orientation, by a date and an eddy index at that date. See § \ref{sec:identification}. These are fields (column headers) in the dbf files. The date is given as the number of days since January \nth{1}, 1950, in the field \verb+days_1950+. The eddy index is between 1 and the number of eddies at the date, in the field \verb+eddy_index+. The file \verb+ishape_last.txt+ in the directory \verb+SHPC_(anti|cyclo)+ is used to access directly any instantaneous eddy at any date. Let $d_0$ be the first date in an SHPC. Let $l(d)$ be the shape index in the SHPC of the last instantaneous eddy at date $d$: \begin{equation*} l(d) = \sum_{d' = d_0} ^d e_\mathrm{max}(d') - 1 \end{equation*} So the shape index $i$ of the eddy with eddy index $e$ at date $d$ is: \begin{equation*} i = \begin{array}{|ll} e - 1 & \mathrm{if}\ d = d_0 \\ l(d - 1) + e & \mathrm{if}\ d > d_0 \end{array} \end{equation*} Also, note that, if $d > d_0$ then: \begin{equation*} e_\mathrm{max}(d) = l(d) - l(d - 1) \end{equation*} \verb+ishape_last.txt+ gives $l(d)$ for all $d$ in an SHPC. Finally, the file \verb+grid_nml.txt+ in the directory \verb+SHPC_(anti|cyclo)+ gives (in Fortran namelist format) the grid of SSH data from which the eddies were detected. The shapefiles are in binary format, so you need special software to read them. There is actually a large number of programs to read them. The metadata, in the \verb+.dbf+ file, is a spreadsheet, and you can open it with LibreOffice. There are also shell commands that can read shapefiles: dbfinfo, shpinfo, dbfdump and shpdump (analogous to ncdump for NetCDF files). If you have a Linux Debian-type distribution (Ubuntu, LinuxMint etc.), you can install these tools with: \begin{verbatim} sudo apt-get install shapelib \end{verbatim} The tools are available on Ciclad. You can also access the shapefiles programmatically with a number of computer languages. From Python, one of the modules you can use is \href{https://fiona.readthedocs.io/en/latest/}{Fiona}. Matlab has the shaperead function. Sur le domaine Eurec4A NRT, toutes les dates (soit 117 dates), au total pour les deux orientations, \verb+inst_eddies_v6.py+ prend environ 12 mn et produit 3 MiB, pour 2951 tourbillons instantanés. Sur le domaine global, pour une seule date, \verb+inst_eddies_v6.py+ prend 25 s (dont environ 21 s de conversion au format v6). \verb+inst_eddies_HDF5.py+ prend 38 s. Il reste le problème du stockage intermédiaire pour les fichiers v6. Le fichier produit par \verb+inst_eddies.m+ prend environ le quart de l'espace du fichier de départ. \section{Overlapping} Pour convertir les données Matlab, les scripts Matlab de conversion et les fichiers d'entrée doivent être dans le répertoire courant, où je dois avoir l'autorisation d'écriture. Créer donc des liens symboliques. \begin{verbatim} ln -s .../Detection_eddies/Convert_Matlab/get_N_eddies.m ln -s .../Parameters_Anticyclonic_Eddies.mat ln -s .../Parameters_Cyclonic_Eddies.mat matlab -batch get_N_eddies \end{verbatim} Puis le traitement est indépendant pour chaque orientation. Par exemple pour les anticyclones : \begin{verbatim} ln -s .../Trajectories/Association_eddies_Anti_max.mat \ Association_eddies_max.mat ln -s .../Detection_eddies/Convert_Matlab/get_id_child.m matlab -batch get_id_child ln -s Nanti.mat N_eddies.mat .../Detection_eddies/Convert_Matlab/overlap_v6.py mv edgelist.csv edgelist_anti.csv \end{verbatim} The data on overlapping instantaneous eddies is in the file \verb+edgelist.csv+. The overlapping of eddies is represented by an abstract graph. In the graph, a node is an instantaneous eddy. Two nodes of the graph are connected by an edge if the two corresponding eddies overlap. The graph is directed, which means that the edges have a direction: the direction is chronological. The file \verb+edgelist.csv+ stores, for a given orientation of eddies, the whole graph as a list of edges. (This is a common graph storage format.) Each line stores one edge: the origin node of the edge followed by the target node of the edge. A node is identified by a node index. Knowledge of the node index is equivalent to knowledge of the date and the eddy index. See § \ref{sec:identification}. Sur le domaine Eurec4A NRT, toutes les dates (soit 117 dates), au total pour les deux orientations. \verb+get_N_eddies.m+ prend 0 mn. \verb+overlap_v6.py+ prend 0 mn et produit 48 KiB, pour 2793 arêtes. Sur le domaine \verb+PhD_Lax+, le script \verb+overlap_HDF5.py+ prend 2 h 21 mn au total pour les deux graphes. Le script \verb+overlap_v6.py+ prend 3 mn, auxquelles il faut ajouter 12 mn de conversion au format v6 (total pour les fichiers nécessaires aux deux graphes). Sur le domaine global, toutes les dates, pour chaque orientation. \verb+get_id_child.m+ prend environ 1 h et a besoin de 12 GiB de mémoire virtuelle. \verb+overlap_v6.py+ prend 7 mn et produit \np{0.5} GiB, pour \np{2.8e7} arêtes. \section{Survival} The last part of the Matlab program has the task of recognizing several instantaneous eddies as a same evolving physical object. We call this part of the analysis the survival part. The result of this analysis is a dictionary of trajectories, stored in a file \verb+traj_(anti|cyclo).json+. The key for each trajectory is the identifying number of the trajectory and the value is the corresponding list of instantaneous eddies. Here, an instantaneous eddy is identified by its node index. Sur le domaine Eurec4A NRT, toutes les dates (soit 117 dates), au total pour les deux orientations. \verb+survival.m+ prend 0 mn. \verb+survival.py+ prend 0 mn et produit 64 KiB pour 2863 n\oe{}uds et 169 trajectoires. \end{document}