Commit 33ccac29 authored by Maude Le Jeune

remove 2010-nov slideshow

parent 0eba1ebe
#!/usr/bin/env python
def main():
    import glob
    import os
    # Convert every dia diagram in the current directory to EPS, then to PDF.
    dia = glob.glob("./*.dia")
    for f in dia:
        eps = f.replace(".dia", ".eps")
        os.system("dia -e %s %s"%(eps,f))
        os.system("epstopdf %s"%eps)
if __name__ == "__main__":
    main()
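The script above builds its shell commands by string interpolation, which breaks on filenames containing spaces. A minimal sketch of the same two-step conversion using argument lists, assuming (as the original script does) that the `dia` and `epstopdf` command-line tools are installed; the helper names `conversion_commands` and `convert_all` are hypothetical:

```python
import glob
import subprocess


def conversion_commands(dia_file):
    """Return the two commands that turn one .dia file into a PDF."""
    eps_file = dia_file[: -len(".dia")] + ".eps"
    return [
        ["dia", "-e", eps_file, dia_file],  # export the diagram to EPS
        ["epstopdf", eps_file],             # convert the EPS file to PDF
    ]


def convert_all(pattern="./*.dia"):
    """Convert every diagram matching the pattern, stopping on the first failure."""
    for dia_file in sorted(glob.glob(pattern)):
        for command in conversion_commands(dia_file):
            # check=True raises CalledProcessError if a tool exits with an error.
            subprocess.run(command, check=True)
```

Calling `convert_all()` from the slideshow directory would reproduce the behaviour of `main()`, with the difference that a failing tool aborts the run instead of being silently ignored.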
digraph pipeline {1->2->4;
3->4;
}
\ No newline at end of file
%% abstract
%% The pipelet tool
% In data processing, once the first two obstacles have been cleared,
% namely access to the data and the design of the algorithm to apply
% to them, there remains the painful step of implementation and of
% obtaining results.
% This part, generally dominated by constraints related to data
% volume, computing time and pipeline complexity, gives pride of
% place to computing. Unfortunately, too often at the expense of
% the physics.
% The pipelet tool was developed to ease this step, by providing an
% infrastructure that addresses the problem through two key
% features:
% - native parallelisation of the processing, and saving of
% intermediate products
% - a web interface that eases comparison and traceability of pipelines.
% This seminar is a technical presentation of the pipelet tool, which
% I will build around a simple example of ``CMB'' data
% processing.
\documentclass[hyperref={colorlinks=true}]{beamer}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage[utf8]{inputenc}
\usepackage{multicol}
\usepackage{ulem}
\usepackage{color}
\usepackage{xspace}
\usepackage{listings}
\usepackage{wasysym}
\useoutertheme{infolines}
\usepackage{hangcaption}
\newcommand{\pipelet}{\textbf{\scriptsize{PIPELET}}\xspace}
\title[]{The \pipelet software (1.0)}
\author[Pipelet]{ Maude \textsc{Le Jeune}, Marc \textsc{Betoule}}
\institute[v1.0]{}
\date[2010/11/26]{November 26th, 2010}
% \newcommand{\unnumberedcaption}%
% {\@dblarg{\@unnumberedcaption\@captype}}
\begin{document}
\lstset{language=Python}
\begin{frame}
\titlepage
\end{frame}
\begin{frame}{\pipelet}
\tableofcontents
\end{frame}
\section{Context}
\begin{frame}
\tableofcontents[currentsection]
\end{frame}
\begin{frame}{Context and needs}
Usually in scientific data processing:
\begin{itemize}
\item Big data sets and/or big CPU time
\item Complex processing (multiple interdependent steps)
\item Optimal parameters unknown
\end{itemize}
\begin{centering} $\rightarrow$ Computation \textbf{and development} cost a lot.\\
\end{centering}
\begin{figure}
\includegraphics[width=0.50\textwidth]{pipelet_scheme_small2.pdf}
\end{figure}
The \pipelet software addresses the three items above:
\begin{itemize}
\item Native parallelisation and CPU time savings
\item Guaranteed traceability
\item Comparison facilities
\end{itemize}
\end{frame}
\begin{frame}[fragile]
{Usual issues}
\begin{enumerate}
\item The processing is split into 2 or 3 steps, and intermediate products are
saved on disk with ad hoc filenames\\
\vspace{0.2cm}
\begin{scriptsize}
\hspace{0.5cm}\verb!map-nside2048-ps5sigma-masksmall!\\
\hspace{0.5cm}\verb!map-nside2048-ps5sigma-masksmall-nodip!\\
\hspace{0.5cm}\verb!map-nside2048-ps5sigma-masksmall-nodip-2!\\
\end{scriptsize}
\vspace{0.2cm}
\item A prototype pipeline is built using a high-level programming
language and reduced data dimensions. What's next?
\begin{itemize}
\item use another programming language to perform real processing
\item use interfacing
\item what about code parallelisation ?
\item what about portability ? (smp machines, clusters, ...)
\end{itemize}
\vspace{0.2cm}
\item In the best case, low-level routines are documented and can be
reused by someone else, but pipelines are always thrown away!
\end{enumerate}
\end{frame}
\begin{frame}{The \pipelet software}
\begin{enumerate}
\item \textit{The processing is split into 2 or 3 steps, and intermediate products are
saved on disk with ad hoc filenames}\\
\end{enumerate}
\begin{itemize}
\item Cut the whole processing into \textbf{segments}
\item Save intermediate products on disk
\item Use a unique identifier derived from the code, parameters and I/Os.
\begin{figure}
\includegraphics[width=0.75\textwidth]{pipelet_scheme_small3.pdf}
\end{figure}
\end{itemize}
\hspace{0.5cm}$\rhd$ The pipe scheme is defined by the user. Any directed acyclic graph is allowed. \\
\hspace{0.5cm}$\rhd$ Filenames are provided by the \pipelet engine. \\
\end{frame}
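One way to picture an identifier that depends on code, parameters and inputs is to hash all three. The sketch below is an illustration under stated assumptions, not the scheme actually used by the pipelet engine: the function names, the choice of SHA-1, the 12-character truncation and the `prefix/segment_key/basename` layout are all hypothetical.

```python
import hashlib
import json


def segment_key(code, params, parent_keys):
    """Hash the code text, the parameters and the keys of the parent segments."""
    digest = hashlib.sha1()
    digest.update(code.encode("utf-8"))
    # Serialise parameters with sorted keys so that dict ordering is irrelevant.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    for key in sorted(parent_keys):
        digest.update(key.encode("utf-8"))
    return digest.hexdigest()[:12]


def product_filename(prefix, segment, key, basename):
    """Build a product path of the form prefix/segment_key/basename."""
    return "/".join([prefix.rstrip("/"), "%s_%s" % (segment, key), basename])
```

With such a key, changing a parameter or a line of code yields a new directory, so earlier products are kept rather than overwritten, and unchanged segments can be recognised and skipped.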
\begin{frame}{The \pipelet software}
\begin{enumerate}[\hspace{0.3cm}2.]
\item \textit{A prototype pipeline is built using a high-level programming
language and reduced data dimensions. What's next?}
\end{enumerate}
\begin{itemize}
\item \pipelet segments are written in Python.
\item Native parallelisation scheme applied to the data to process (\textbf{tasks}).
\item Tasks can be executed in different modes (sequentially
for debugging, using processes on an SMP machine, in batch mode on
clusters).
\end{itemize}
\begin{figure}
\includegraphics[width=0.50\textwidth]{parral.pdf}
\end{figure}
\hspace{0.5cm}$\rhd$ Python interfaces nicely with
other languages: C/C++ extensions or SWIG/Boost wrapping. \\
\hspace{0.5cm}$\rhd$ Parallelisation at the highest level. No need to
learn OpenMP/MPI. \\
\hspace{0.5cm}$\rhd$ Portability offered by the different running
modes (\textbf{launchers}).
\end{frame}
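The idea of running the same tasks through interchangeable launchers can be sketched as follows. This is not the pipelet launcher code: `run_sequential`, `run_pool` and `square` are hypothetical names, and a thread pool from the standard library stands in for the process and batch modes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequential(func, tasks):
    """Debugging mode: run the tasks one after the other in the caller."""
    return [func(task) for task in tasks]


def run_pool(func, tasks, workers=4):
    """Pooled mode: dispatch the tasks to a fixed number of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the order of the input tasks.
        return list(pool.map(func, tasks))


def square(x):
    """Toy task standing in for one segment applied to one input."""
    return x * x
```

Because both launchers take the same arguments and return the same list, the processing code does not change when moving from a debugging session to a parallel run.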
\begin{frame}{The \pipelet software}
\begin{enumerate}[\hspace{0.3cm}3.]
\item \textit{In the best case, low-level routines are documented and can be
reused by someone else, but pipelines are always thrown away!}
\end{enumerate}
\begin{itemize}
\item Segments can be documented using Python docstring syntax.
\item Pipeline scheme written and displayed using the Graphviz DOT language.
\item Collaborative work eased by a web interface and \textit{code
repositories}.
\end{itemize}
\begin{figure}
\includegraphics[width=0.25\textwidth]{p.pdf}
\includegraphics[width=0.15\textwidth]{p2.pdf}
\end{figure}
\hspace{0.5cm}$\rhd$ Non-trivial dependency schemes are easy to read. \\
\hspace{0.5cm}$\rhd$ Results can be accessed by different users with minimal guidance (tags, docstrings).\\
\end{frame}
\section{How it works}
\begin{frame}
\tableofcontents[currentsection]
\end{frame}
\begin{frame}{The \pipelet big scheme}
\begin{figure}
\includegraphics[width=0.90\textwidth]{pipelet_scheme.pdf}
\end{figure}
\end{frame}
\subsection{Building a pipeline}
\begin{frame}[fragile]{Building a pipeline}
\begin{verbatim}P = Pipeline(pipedot, codedir='./', prefix='/data/...')
\end{verbatim}
\begin{figure}
\includegraphics[width=0.5\textwidth]{pipelet_scheme_small.pdf}
\end{figure}
\begin{itemize}
\item \verb!pipedot! is the string description of the pipeline
\begin{verbatim}pipedot = """
1->2->4;
3->4;
"""
\end{verbatim}
\item \verb!codedir! is the path of the processing code files (.py)
\item \verb!prefix! is the path of the processed data repository
\end{itemize}
\end{frame}
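To make the `pipedot` string concrete, the sketch below parses the `a->b->c;` statements into a parent table and derives an order in which every segment runs after its parents. It only illustrates the directed acyclic graph idea; `parse_pipedot` and `execution_order` are hypothetical helpers, not part of the pipelet API.

```python
def parse_pipedot(pipedot):
    """Return {segment: sorted list of parents} from 'a->b->c;' statements."""
    parents = {}
    for statement in pipedot.replace("\n", " ").split(";"):
        chain = [name.strip() for name in statement.split("->") if name.strip()]
        for name in chain:
            parents.setdefault(name, set())
        # Each consecutive pair in a chain is one parent -> child edge.
        for parent, child in zip(chain, chain[1:]):
            parents[child].add(parent)
    return {name: sorted(deps) for name, deps in parents.items()}


def execution_order(parents):
    """Order segments so that each one comes after all of its parents."""
    order, done = [], set()
    remaining = dict(parents)
    while remaining:
        ready = sorted(n for n, deps in remaining.items() if set(deps) <= done)
        if not ready:
            raise ValueError("cycle detected: not a directed acyclic graph")
        for name in ready:
            order.append(name)
            done.add(name)
            del remaining[name]
    return order
```

For the pipeline of this slide, segments 1 and 3 have no parents and can start immediately, segment 2 waits for 1, and segment 4 waits for both 2 and 3.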
\subsection{Writing segment scripts}
\begin{frame}[fragile]{Writing segment scripts}
\begin{itemize}
\item A segment is a Python script (\verb!.py! file)
\end{itemize}
\begin{figure}
\includegraphics[width=0.9\textwidth]{seg_scheme.pdf}
\end{figure}
\begin{itemize}
\item It runs in an enriched namespace that lets it:
\begin{itemize}
\item control the pipe parallelisation scheme;
\item save and load I/O's and provide filenames;
\item save and load parameters;
\item execute or include subprocesses.
\end{itemize}
\end{itemize}
\end{frame}
\subsection{Running a pipeline}
\begin{frame}[fragile]{Running a pipeline}
The pipe engine turns each (processing code, data to process) pair
into a task and gathers the tasks in a \textcolor{blue}{task list}.
\begin{figure}
\includegraphics[width=0.80\textwidth]{task_scheme.pdf}
\end{figure}
The \textcolor{blue}{task list} can be processed in different modes:
\begin{itemize}
\item the interactive mode (or debugging mode)
\item the process/thread mode (for SMP machines)
\item the batch mode (for clusters)
\end{itemize}
\end{frame}
\subsection{Browsing a pipeline}
\begin{frame}[fragile]{Browsing a pipeline : \href{http://localhost:8080}{http://localhost:8080}}
\begin{figure}
\includegraphics[width=0.70\textwidth]{snapshot.png}
\end{figure}
From the web interface one can: \\
\vspace{0.5cm}
\begin{tabular}{ll}
$\bullet$ Filter/delete pipe instances & from the pipeline page\\
$\bullet$ Highlight dependencies & from the segment page\\
$\bullet$ Read code & from the segment page\\
$\bullet$ Read log files & from the log page\\
$\bullet$ Download/visualize/delete product files & from the product page\\
\end{tabular}
\end{frame}
\section{A CMB demo pipeline}
\subsection{Problem statement}
\begin{frame}
{Problem statement}
One wants to perform some spectral estimation using inverse noise
weighting on simulated data. \\
The wish list is:
\begin{itemize}
\item Design a prototype pipeline to get a first result.
\item Test different weighting masks.
\item Avoid recomputing the coupling matrix as much as possible.
\item Perform Monte Carlo studies to get some error bars.
\end{itemize}
\begin{figure}
\includegraphics[width=0.35\textwidth]{map.png}
\hspace{1cm}
\includegraphics[width=0.3\textwidth]{cl.jpg}
\end{figure}
\end{frame}
\subsection{A pipelet solution}
\begin{frame}[fragile]
{A \pipelet solution}
\begin{figure}
\includegraphics[width=0.7\textwidth]{cmbdemo.pdf}
\includegraphics[width=0.3\textwidth]{cls-plot.pdf}
\end{figure}
\end{frame}
\subsection{Zooming in on the code}
\begin{frame}[fragile]
{Zooming in on the code (1/3): main.py}
\scriptsize{\begin{lstlisting}
pipe_dot = """
noise->clcmb->clplot;
cmb->clcmb;
noise->clnoise;
"""
P = Pipeline(pipe_dot, code_dir='./', prefix='./cmb')
nside = 512 ## pipeline input
sim_ids = [1,2,3,4,5,6] ## pipeline input
cmbin = []
for sim_id in sim_ids:
    cmbin.append((nside, sim_id))
P.push(cmb=cmbin) ## push as many inputs as CMB realizations
P.push(noise=[nside])
if options.debug: ## Interactive mode
    w, t = launch_interactive(P)
    w.run()
elif options.process: ## Process mode
    launch_process(P, options.process)
else: ## Batch mode
    launch_pbs(P, 10, job_name=job_name, cpu_time=cpu_time, job_header=job_header)
\end{lstlisting}}
\end{frame}
\begin{frame}[fragile]
{Zooming in on the code (2/3): cmb.py}
\scriptsize{\begin{lstlisting}
""" cmb.py
Generate a cmb map from lambda-CDM power spectrum.
"""
import healpy as hp
import pylab as pl
lst_par = ['lmax', 'nside']
nside = seg_input.values()[0][0] ## pushed from main
sim_id = seg_input.values()[0][1] ## pushed from main
lmax = 2*nside
input_cl = "lambda_best_fit.txt"
cmb_cl = pl.loadtxt(input_cl)[0:lmax+1,0] ## load cl
cmb_map = hp.synfast(cmb_cl, nside, lmax=lmax) ## make a map
cmb_map_fn = get_data_fn ('map_cmb.fits')
hp.write_map(cmb_map_fn, cmb_map)
cmb_map_fig = cmb_map_fn.replace('.fits', '.png')
hp.mollview(cmb_map, title="cmb in %s"%cmb_unit)
pl.savefig (cmb_map_fig)
seg_output = [sim_id] ## forward as many children as sim ids
\end{lstlisting}}
\end{frame}
\begin{frame}[fragile]
{Zooming in on the code (3/3): clplot.py}
\scriptsize{\begin{lstlisting}
""" clplot.py
Make a plot.
"""
import pylab as pl
### Gather all sim_ids
#multiplex cross_prod group_by '0'
### Retrieve some global parameters
load_param('cmb', globals(), ["input_cl", "lmax", "cmb_unit"])
load_param("noise", globals(), ["noise_unit", "noise_power"])
## Get mean cl values and error bars
pseudo_cls = glob_seg('clcmb', 'cls.txt')
mll_cls = glob_seg('clcmb', 'cl*mll*.txt')
nsims = len(mll_cls)
for sim_id in range(nsims):
    ...
## make a plot
...
\end{lstlisting}}
\end{frame}
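The slide elides the averaging step. As an illustration only, and not the content of `clplot.py`, the Monte Carlo mean and error bars of a set of simulated spectra can be estimated per multipole with the standard library; `mc_mean_and_error` is a hypothetical helper and each spectrum is assumed to be a plain list of $C_\ell$ values of equal length.

```python
from statistics import mean, stdev


def mc_mean_and_error(spectra):
    """Return the per-multipole mean and sample standard deviation.

    spectra is a list of simulations, each a list of C_l values of equal length.
    """
    if len(spectra) < 2:
        raise ValueError("at least two simulations are needed for error bars")
    per_multipole = list(zip(*spectra))  # regroup the values by multipole
    means = [mean(values) for values in per_multipole]
    errors = [stdev(values) for values in per_multipole]
    return means, errors
```

The standard deviation returned here is the scatter of a single realization; dividing it by the square root of the number of simulations would give the error on the mean instead.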
\subsection{Browsing the result}
\subsection{Deployment}
\begin{frame}[fragile]
{Deployment}
\begin{enumerate}
\item Development phase: on your laptop.
\begin{itemize}
\item single CMB realization.
\item include \verb!clnoise.py! segment for cross check.
\item run the pipe using the python debugger \verb!%pdb!
\item and the \pipelet interactive mode (sequential)
\end{itemize}
\item Production phase: on a desktop machine or the adamis cluster.
\begin{itemize}
\item 100 CMB realizations (a single $m_{\ell \ell}$ computation !)
\item use the process mode to dispatch tasks between cores.
\item use the batch mode to dispatch tasks between cores and nodes.
\end{itemize}
\item Release phase: at CCin2p3 or Magique3.
\begin{itemize}
\item 1000 CMB realizations.
\item use a customized segment environment (DMC database)
\item use a customized batch launcher (BQS)\\
(Both customizations are already available.)
\end{itemize}
\end{enumerate}
\end{frame}
\section{Getting started}
\begin{frame}
\tableofcontents[currentsection]
\end{frame}
\begin{frame}[fragile]{Getting \pipelet}
$\rhd$ Download from \url{http://gitorious.org/pipelet}
\begin{itemize}
\item Git repository\\
\begin{centering}\verb!git clone git@gitorious.org:pipelet/pipelet.git!
\end{centering}
\item Open wiki including documentation
\end{itemize}
\vspace{0.5cm}
$\rhd$ Features and bugs are tracked from the IN2P3 forge.\\
\hspace{0.3cm}Feel free to subscribe to the \pipelet project to give feedback and follow the project news.\\
\vspace{0.5cm}
$\rhd$ CMB demo pipeline under : \verb!pipelet/test/cmb!
\end{frame}
\begin{frame}[fragile]{A word on Python}
Python is a free programming language:
\begin{itemize}
\item runs on Linux/Unix (native), Windows, Mac OS X
\item high level language easy to use (as compared to C)
\item fast (as compared to Matlab, Octave, IDL)
\end{itemize}
Python tools:
\begin{itemize}
\item interpreter: ipython
\item numerical libraries: numpy, matplotlib
\item CMB libraries: healpy, spherelib
\end{itemize}
\end{frame}
\end{document}
\ No newline at end of file