Commit 84397ba5 authored by Maude Le Jeune

adamis presentation import

%% abstract
%% The pipelet tool
% In data processing, once the first two obstacles have been cleared,
% namely access to the data and the design of the algorithm to apply to
% them, there remains the painful step of implementation and of
% obtaining results.
% This part, generally dominated by constraints on data volume,
% computing time and pipeline complexity, gives pride of place to
% computing, unfortunately too often at the expense of physics.
% The pipelet tool was developed to ease this step, by providing an
% infrastructure that addresses the problem through two key features:
% - native parallelisation of the processing, and saving of the
% intermediate products
% - a web interface that eases the comparison and traceability of pipelines.
% This seminar is a technical presentation of the pipelet tool, built
% around a simple example of ``CMB'' data processing.
\documentclass[hyperref={colorlinks=true}]{beamer}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage[utf8]{inputenc}
\usepackage{multicol}
\usepackage{ulem}
\usepackage{color}
\usepackage{xspace}
\usepackage{listings}
\usepackage{wasysym}
\useoutertheme{infolines}
\usepackage{hangcaption}
\newcommand{\pipelet}{\textbf{\scriptsize{PIPELET}}\xspace}
\title[]{The \pipelet software (1.0)}
\author[Pipelet]{ Maude \textsc{Le Jeune}, Marc \textsc{Betoule}}
\institute[v1.0]{}
\date[2010/11/26]{November, 26th, 2010}
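% Unused and incomplete definition, kept commented out so that it cannot
% swallow the following \begin{document}:
% \newcommand{\unnumberedcaption}%
% {\@dblarg{\@unnumberedcaption\@captype}}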
\begin{document}
\lstset{language=Python}
\begin{frame}
\titlepage
\end{frame}
\begin{frame}{\pipelet}
\tableofcontents
\end{frame}
\section{Context}
\begin{frame}
\tableofcontents[currentsection]
\end{frame}
\begin{frame}{Context and needs}
Usually in scientific data processing:
\begin{itemize}
\item Large data sets and/or long CPU time
\item Optimal parameters unknown
\item Complex processing (multiple interdependent steps)
\end{itemize}
\begin{centering} $\rightarrow$ Computation \textbf{and development} cost a lot.\\
\end{centering}
\begin{figure}
\includegraphics[width=0.50\textwidth]{img/pipelet_scheme_small2.pdf}
\end{figure}
The \pipelet framework helps with these 3 points:
\begin{itemize}
\item Native parallelisation and CPU time saving\\
\hspace{0.5cm} (recompute only the needed parts)
\item Offer comparison facilities
\item Take care of traceability
\end{itemize}
\end{frame}
\begin{frame}[fragile]
{Usual issues}
\begin{enumerate}
\item The processing is cut into 2 or 3 steps, intermediate products are
saved on disk with approximate filenames\\
\vspace{0.2cm}
\begin{scriptsize}
\hspace{0.5cm}\verb!map-nside2048-ps5sigma-masksmall!\\
\hspace{0.5cm}\verb!map-nside2048-ps5sigma-masksmall-nodip!\\
\hspace{0.5cm}\verb!map-nside2048-ps5sigma-masksmall-nodip-2!\\
\end{scriptsize}
\vspace{0.2cm}
\item In the best case, low-level routines are documented and can be
reused by someone else, but pipelines are always thrown away!
\vspace{0.2cm}
\item A prototype pipeline is built using a high-level programming
language and smaller dimensions. What's next?
\begin{itemize}
\item use another programming language to perform real processing
\item use interfacing
\item what about code parallelisation?
\item what about portability? (SMP machines, clusters, ...)
\end{itemize}
\end{enumerate}
\end{frame}
\begin{frame}{Pipe scheme and intermediate products}
\begin{enumerate}
\item \textit{The processing is cut into 2 or 3 steps, intermediate products are
saved on disk with approximate filenames}\\
\end{enumerate}
\begin{columns}
\begin{column}{0.45\textwidth}
$\rhd$ Cut the whole processing into \textcolor{blue}{segments} \\
\vspace{0.1cm}
$\rhd$ Save intermediate products on disk \\
\vspace{0.1cm}
$\rhd$ Use a unique identifier with respect to code, parameters and I/Os.
\begin{figure}
\includegraphics[width=1.5\textwidth]{img/pipelet_scheme_small3.pdf}
\end{figure}
\end{column}
\begin{column}{0.55\textwidth}
$\Longrightarrow$ Filenames are provided by the \pipelet engine. \\
\hspace{0.3cm} - key inputs (not the key itself) readable from the web
interface. \\
\vspace{0.5cm}
$\Longrightarrow$ The pipe scheme is defined by the user (any directed
acyclic graph is allowed) \\
\hspace{0.3cm} - small number of segments to save disk space \\
\hspace{0.3cm} - right number of segments for readability \\
\end{column}
\end{columns}
\end{frame}
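The unique-identifier idea of this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipelet implementation: the helper name \verb!segment_key!, the use of SHA-1 and the JSON serialisation are all hypothetical choices made for the example.

```python
import hashlib
import json


def segment_key(code, params, parent_keys):
    """Return a short hexadecimal key identifying one segment execution.

    The key changes whenever the code, a parameter or an upstream key changes.
    (Hypothetical helper, not part of the pipelet API.)
    """
    payload = json.dumps(
        {"code": code, "params": params, "parents": sorted(parent_keys)},
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical two-segment chain: 'cmb' feeds 'clcmb'.
cmb_key = segment_key("make_map()", {"nside": 512}, [])
clcmb_key = segment_key("estimate_cl()", {"lmax": 1024}, [cmb_key])

# Changing an upstream parameter changes the downstream key as well,
# so a stale intermediate product cannot be picked up by mistake.
cmb_key_v2 = segment_key("make_map()", {"nside": 1024}, [])
clcmb_key_v2 = segment_key("estimate_cl()", {"lmax": 1024}, [cmb_key_v2])
```

Because each key includes the keys of its parents, a change anywhere upstream propagates down the graph, while untouched branches keep their keys and their products on disk can be reused.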
\begin{frame}{Permanence and collaborative work}
\begin{enumerate}[\hspace{0.3cm}2.]
\item \textit{In the best case, low-level routines are documented and can be
reused by someone else, but pipelines are always thrown away!}
\end{enumerate}
\begin{columns}
\begin{column}{0.5\textwidth}
$\rhd$ Pipeline scheme written and displayed using the Graphviz dot language. \\
\vspace{1cm}
$\rhd$ Segments can be documented using Python docstring syntax. \\
\vspace{1cm}
$\rhd$ Collaborative work eased by a web interface and \textit{code
repositories}.
\end{column}
\begin{column}{0.5\textwidth}
\begin{figure}
\includegraphics[width=0.45\textwidth]{img/p.pdf}
\includegraphics[width=0.35\textwidth]{img/p2.pdf}
\end{figure}
$\Longrightarrow$ Non-trivial dependency schemes are easy to read. \\
\vspace{0.5cm}
$\Longrightarrow$ Results can be accessed by different users with a minimal amount of information (tags).\\
\end{column}
\end{columns}
\end{frame}
\begin{frame}{Prototyping and parallelization}
\begin{enumerate}[\hspace{0.3cm}3.]
\item \textit{A prototype pipeline is built using a high-level programming
language and smaller dimensions. What's next?}
\end{enumerate}
\begin{figure}
\includegraphics[width=0.50\textwidth]{img/parral.pdf}
\end{figure}
\begin{columns}
\begin{column}{0.5\textwidth}
$\rhd$ Native parallelisation scheme applied to the data to process
(\textbf{tasks}). \\
\vspace{1cm}
$\rhd$ \textbf{Workers} empty the \textbf{task queue} in different
modes (sequential for debugging, using processes on an SMP machine, batch
mode on clusters).
\end{column}
\begin{column}{0.5\textwidth}
$\Longrightarrow$ Parallelisation at the highest level. No need to
learn OpenMP/MPI. \\
\vspace{1cm}
$\Longrightarrow$ Scalability and portability offered by the different running
modes (\textbf{launchers}).
\end{column}
\end{columns}
\end{frame}
\section{How it works}
\begin{frame}
\tableofcontents[currentsection]
\end{frame}
\begin{frame}{The \pipelet big scheme}
\begin{figure}
\includegraphics[width=0.90\textwidth]{img/pipelet_scheme.pdf}
\end{figure}
\end{frame}
\subsection{Building a pipeline}
\begin{frame}[fragile]{Building a pipeline}
\begin{verbatim}P = Pipeline(pipedot, codedir='./', prefix='/data/...')
\end{verbatim}
\begin{figure}
\includegraphics[width=0.5\textwidth]{img/pipelet_scheme_small.pdf}
\end{figure}
\begin{itemize}
\item \verb!pipedot! is the string description of the pipeline
\begin{verbatim}pipedot = """
1->2->4;
3->4;
"""
\end{verbatim}
\item \verb!codedir! is the path of the processing code files (\verb!.py!)
\item \verb!prefix! is the path of the processed data repository
\end{itemize}
\end{frame}
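To make the role of the \verb!pipedot! string concrete, the sketch below parses the same chain syntax into a dependency graph and derives a valid execution order. It is only an illustration: \verb!parse_pipedot! and \verb!execution_order! are hypothetical helpers written for this example, and the way pipelet itself reads the string may differ.

```python
def parse_pipedot(pipedot):
    """Parse chains such as '1->2->4;' into a {segment: set(parents)} mapping."""
    parents = {}
    for statement in pipedot.split(";"):
        chain = [name.strip() for name in statement.split("->") if name.strip()]
        for name in chain:
            parents.setdefault(name, set())  # register segments without parents too
        for parent, child in zip(chain, chain[1:]):
            parents[child].add(parent)
    return parents


def execution_order(parents):
    """Return the segments so that each one comes after all of its parents."""
    order, done = [], set()
    remaining = dict(parents)
    while remaining:
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cycle detected: the pipe scheme must be acyclic")
        for segment in ready:
            order.append(segment)
            done.add(segment)
            del remaining[segment]
    return order


pipedot = """
1->2->4;
3->4;
"""
graph = parse_pipedot(pipedot)
print(execution_order(graph))  # ['1', '3', '2', '4']
```

Segment 4 is scheduled last because it depends on both 2 and 3, which is the constraint an engine has to respect whatever order it actually picks among the independent segments.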
\subsection{Writing segment scripts}
\begin{frame}[fragile]{Writing segment scripts}
\begin{itemize}
\item A segment is a Python script (\verb!.py! file)
\end{itemize}
\begin{figure}
\includegraphics[width=0.9\textwidth]{img/seg_scheme.pdf}
\end{figure}
\begin{itemize}
\item It benefits from an improved namespace to:
\begin{itemize}
\item provide filenames, save and load I/Os;
\item save and load parameters;
\item execute or include subprocesses (and log them);
\item control the pipe parallelisation scheme.
\end{itemize}
\end{itemize}
\end{frame}
\subsection{Running a pipeline}
\begin{frame}[fragile]{Running a pipeline}
The pipe engine turns each (processing code, data to process) pair into a
task and gathers the tasks in a \textcolor{blue}{task list}.
\begin{figure}
\includegraphics[width=0.80\textwidth]{img/task_scheme.pdf}
\end{figure}
One can empty the \textcolor{blue}{task list} by launching \textcolor{blue}{workers} in different
modes:
\begin{itemize}
\item the process/thread mode (for SMP machines)\\
\hspace{2cm}\begin{scriptsize}\verb!python main.py -p 4!\end{scriptsize}
\item the interactive mode (or debugging mode)\\
\hspace{2cm}\begin{scriptsize}\verb!ipython: %pdb!\\
\hspace{2cm}\verb!ipython: run main.py -d!\end{scriptsize}
\item the batch mode (for clusters)\\
\hspace{2cm}\begin{scriptsize}\verb!python main.py -b!\\
\hspace{2cm}\verb!python main.py -a 8!\end{scriptsize}
\end{itemize}
\end{frame}
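The worker and task-list vocabulary of this slide can be illustrated with the standard library alone. The sketch below is an assumption-laden toy, not the pipelet launchers: \verb!run_tasks! is a hypothetical function, it uses threads rather than processes or batch jobs, and it ignores dependencies between tasks.

```python
import queue
import threading


def run_tasks(tasks, process, n_workers=4):
    """Let n_workers threads empty a queue of tasks; return {task: result}."""
    task_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = task_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to do: this worker stops
            result = process(task)
            with lock:  # protect the shared result dictionary
                results[task] = result

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


# Hypothetical task list: one task per (segment, simulation id) pair.
tasks = [("cmb", sim_id) for sim_id in range(1, 7)]
results = run_tasks(tasks, process=lambda task: "{}-{} done".format(*task))
```

Calling \verb!run_tasks! with \verb!n_workers=1! gives a sequential run, which mirrors the idea of a debugging mode: the task list is the same, only the way it is emptied changes.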
\subsection{Browsing a pipeline}
\begin{frame}[fragile]{Browsing a pipeline: \href{http://localhost:8080}{http://localhost:8080}}
\begin{figure}
\includegraphics[width=0.70\textwidth]{img/snapshot.png}
\end{figure}
From the web interface one can: \\
\vspace{0.5cm}
\begin{tabular}{ll}
$\bullet$ Filter/delete pipe instances & from the pipeline page\\
$\bullet$ Highlight dependencies & from the segment page\\
$\bullet$ Read code & from the segment page\\
$\bullet$ Read log files & from the log page\\
$\bullet$ Download/visualize/delete product files & from the product page\\
\end{tabular}
\end{frame}
\section{A CMB demo pipeline}
\begin{frame}
\tableofcontents[currentsection]
\end{frame}
\subsection{Problem statement}
\begin{frame}
{Problem statement}
Evaluate the performance of inverse noise weighting for
spectral estimation, via simulations.\\
\vspace{0.2cm}
The wish list is:
\begin{itemize}
\item Design a prototype pipeline to get a first result.
\item Perform Monte Carlo studies to get error bars. \\
\end{itemize}
\hspace{0.8cm}$\rhd$ Test different weighting masks. \\
\hspace{0.8cm}$\rhd$ Avoid recomputing the coupling matrix as much as possible.
\begin{figure}
\includegraphics[width=0.35\textwidth]{img/map.png}
\hspace{1cm}
\includegraphics[width=0.3\textwidth]{img/cl.jpg}
\end{figure}
\end{frame}
\subsection{A pipelet solution}
\begin{frame}[fragile]
{A \pipelet solution}
\only<1>{
\begin{figure}
\includegraphics[width=0.7\textwidth]{img/cmbdemo4.pdf}
\includegraphics[width=0.3\textwidth]{img/blank.png}
\end{figure}}
\only<2>{
\begin{figure}
\includegraphics[width=0.7\textwidth]{img/cmbdemo3.pdf}
\includegraphics[width=0.3\textwidth]{img/blank.png}
\end{figure}}
\only<3>{
\begin{figure}
\includegraphics[width=0.7\textwidth]{img/cmbdemo2.pdf}
\includegraphics[width=0.3\textwidth]{img/blank.png}
\end{figure}}
\only<4>{
\begin{figure}
\includegraphics[width=0.7\textwidth]{img/cmbdemo1.pdf}
\includegraphics[width=0.3\textwidth]{img/blank.png}
\end{figure}}
\only<5>{
\begin{figure}
\includegraphics[width=0.7\textwidth]{img/cmbdemo.pdf}
\includegraphics[width=0.3\textwidth]{img/cls-plot.pdf}
\end{figure}}
\end{frame}
\subsection{Zooming in on the code}
\begin{frame}[fragile]
{Zooming in on the code (1/3): main.py}
\scriptsize{\begin{lstlisting}
pipe_dot = """
cmb->clcmb->clplot;
noise->clcmb;
noise->clnoise->clplot;
"""
P = Pipeline(pipe_dot, code_dir='./', prefix='./cmb')
P.push(cmb=[1, 2, 3, 4, 5, 6])  ## push as many inputs as CMB realizations
if options.debug:  ## Interactive mode
    w, t = launch_interactive(P)
    w.run()
elif options.process:  ## Process mode
    launch_process(P, options.process)
else:  ## Batch mode
    launch_pbs(P, 10, job_name="pipelet job", cpu_time="00:30:00")
\end{lstlisting}}
\end{frame}
\begin{frame}[fragile]
{Zooming in on the code (2/3): cmb.py}
\scriptsize{\begin{lstlisting}
""" cmb.py
Generate a CMB map from a lambda-CDM power spectrum.
"""
import healpy as hp
import pylab as pl
lst_par = ['lmax', 'nside']
sim_id = seg_input[0]  ## pushed from main
nside = 512  ## illustrative value (not set in this excerpt)
lmax = 2*nside
input_cl = "lambda_best_fit.txt"
cmb_cl = pl.loadtxt(input_cl)[0:lmax+1, 0]  ## load cl
cmb_map = hp.synfast(cmb_cl, nside, lmax=lmax)  ## make a map
cmb_map_fn = get_data_fn('map_cmb.fits')
hp.write_map(cmb_map_fn, cmb_map)
cmb_map_fig = cmb_map_fn.replace('.fits', '.png')
hp.mollview(cmb_map, title="cmb")
pl.savefig(cmb_map_fig)
seg_output = [sim_id]  ## forward one child task per sim id
\end{lstlisting}}
\end{frame}
\begin{frame}[fragile]
{Zooming in on the code (3/3): clplot.py}
\scriptsize{\begin{lstlisting}
""" clplot.py
Make a plot.
"""
import pylab as pl
### Gather all sim_ids
#multiplex cross_prod group_by '0'
### Retrieve some global parameters
load_param('cmb', globals(), ["lmax"])
load_param("noise", globals(), ["noise_power"])
## Get mean cl values and error bars
pseudo_cls = glob_seg('clcmb', 'cls.txt')
mll_cls = glob_seg('clcmb', 'cl*mll*.txt')
nsims = len(mll_cls)
for sim_id in range(nsims):
    ...
## make a plot
...
\end{lstlisting}}
\end{frame}
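The body of the loop is elided on the slide and is left as is. As a separate, generic illustration of what ``mean cl values and error bars'' over Monte Carlo simulations can look like, here is a small standard-library sketch. The function name \verb!mean_and_scatter!, the choice of the sample standard deviation as error bar and the numbers are all assumptions made for the example; they are not taken from the demo pipeline.

```python
import math
import statistics


def mean_and_scatter(spectra):
    """Per-multipole mean and sample standard deviation over simulations.

    spectra: list of equal-length lists, one per simulation (at least 2).
    """
    means, scatters = [], []
    for values in zip(*spectra):  # one tuple of values per multipole
        means.append(statistics.mean(values))
        scatters.append(statistics.stdev(values))
    return means, scatters


# Made-up numbers, for illustration only: 3 simulations, 2 multipoles.
spectra = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]
means, scatters = mean_and_scatter(spectra)

# Uncertainty on the mean itself: scatter divided by sqrt(number of sims).
errors_on_mean = [s / math.sqrt(len(spectra)) for s in scatters]
```

The scatter describes how much a single realisation fluctuates, whereas the error on the mean shrinks as more simulations are pushed through the pipeline; which of the two to plot depends on the question being asked.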
\subsection{Browsing the result}
\begin{frame}{Browsing the result}
\href{http://localhost:8080}{http://localhost:8080}
\end{frame}
%%
%%
\subsection{Deployment}
\begin{frame}[fragile]
{Deployment}
\begin{enumerate}
\item Development phase: on your laptop.
\begin{itemize}
\item use the \pipelet interactive mode and the Python debugger \verb!%pdb!
\end{itemize}
\item Production phase: on a desktop machine / adamis cluster.
\begin{itemize}
\item use the process or batch mode to dispatch tasks between cores
\end{itemize}
\item Release phase: @ CCin2p3 or Magique3.
\begin{itemize}
\item use a customized segment environment (DMC database)
\item use a customized batch launcher (BQS)
\end{itemize}
\end{enumerate}
\begin{figure}
\includegraphics[width=0.9\textwidth]{img/frise.pdf}
\end{figure}
\end{frame}
\section{Getting started}
\begin{frame}
\tableofcontents[currentsection]
\end{frame}
\begin{frame}[fragile]{Getting \pipelet}
$\rhd$ Download from \url{http://gitorious.org/pipelet}
\begin{itemize}
\item Git repository\\
\begin{centering}\verb!git clone git://gitorious.org/pipelet/pipelet.git!
\end{centering}
\item Open wiki including documentation
\end{itemize}
\vspace{0.5cm}
$\rhd$ Features and bugs are tracked from the IN2P3 forge.\\
\hspace{0.3cm}Don't hesitate to subscribe to the \pipelet project to give your feedback and follow the project news.\\
\vspace{0.5cm}
$\rhd$ CMB demo pipeline under: \verb!pipelet/test/cmb!
\end{frame}
\begin{frame}[fragile]{A word on Python}
\includegraphics[width=0.2\textwidth]{img/python-logo.png} is a free programming language:
\begin{itemize}
\item high-level language, easy to use (compared to C) \\
\url{http://mathesaurus.sourceforge.net}
\item runs on Linux/Unix (native), Windows, Mac OS X
\item fast (compared to Matlab, Octave, IDL)
\item interfaces nicely with other languages: C/C++ extensions or SWIG/Boost wrapping
\item extensive standard library
\item and most of all: Pythonic!
\end{itemize}
Python tools:
\begin{itemize}
\item interpreter: ipython
\item numerical libraries: numpy + matplotlib $\rightarrow$ pylab
\item CMB libraries: healpy, spherelib
\end{itemize}
\end{frame}
\begin{frame}{Advanced usage and discussion}
\begin{itemize}
\item Customizing the segment environment \\
\hspace{0.5cm}$\rhd$ Segment
environment is a namespace, provided with default utilities
(filenames, parameters, subprocesses, ...) \\
\hspace{0.5cm}$\rhd$ This namespace is loaded from a Python \textcolor{blue}{object}
which can be subclassed (inheritance) or replaced. \\
\item Customizing launchers \\
\hspace{0.5cm}$\rhd$ Currently available: interactive, process, batch
(PBS, BQS) \\
\item Using repositories for segment scripts:\\
\hspace{0.5cm}$\rhd$ Git, CVS, SVN, ... \\
\item Improving the web interface.
\item Suggestions are welcome!
\end{itemize}
\end{frame}
\end{document}
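#!/usr/bin/env python
"""Convert every Dia diagram of the current directory to EPS, then to PDF."""
import glob
import subprocess


def main():
    for dia_file in glob.glob("./*.dia"):
        eps_file = dia_file.replace(".dia", ".eps")
        # Export the diagram to EPS, then convert the EPS file to PDF.
        # Argument lists avoid shell quoting issues with unusual filenames.
        subprocess.call(["dia", "-e", eps_file, dia_file])
        subprocess.call(["epstopdf", eps_file])


if __name__ == "__main__":
    main()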