#+TITLE: The Pipelet Readme
#+STYLE: <link rel="stylesheet" type="text/css" href="org.css" />

The Pipelet Readme

Pipelet is a free framework allowing for the creation, execution and
browsing of scientific data processing pipelines. It provides:

+ easy chaining of interdependent elementary tasks,
+ web access to data products,
+ branch handling,
+ automated distribution of computational tasks.

Both engine and web interface are written in Python.

* Tutorial
** Introduction
*** Why use pipelines

The pipeline mechanism allows you to apply a sequence of processing
steps to your data, so that the input of each step is the output of
the previous one. Making these processing steps visible, in the right
order, is essential in data analysis: it keeps track of what you did
and ensures that the whole processing remains consistent.

*** How it works

Pipelet is built around the ability to save to disk every intermediate
input and output of your pipeline. This is usually not a strong
constraint and offers a lot of benefits: you can stop the processing
whenever you want and begin it again without recomputing the whole
thing; you just take the last products you have on disk and continue
the processing where it stopped. This logic pays off whenever the
computation cost is higher than the cost of the disk space required by
intermediate products.

*** The Pipelet functionalities

Pipelet is a free framework which helps you:
+ write and manipulate pipelines with any dependency scheme,
+ dispatch the computational tasks on parallel architectures,
+ keep track of what processing has been applied to your data and perform comparisons.

** Getting started
*** Pipelet installation 
**** Dependencies 

+ Running the Pipelet engine requires Python >= 2.6.

+ The web interface of Pipelet requires the installation of the
  cherrypy3 Python module (on Debian: aptitude install python-cherrypy3).

You may find it useful to install some generic scientific tools that interact nicely with Pipelet:
+ numpy
+ matplotlib
+ latex 

**** Getting Pipelet

There is no published stable release of Pipelet right now.

=git clone git://gitorious.org/pipelet/pipelet.git -b v1.0=

**** Installing Pipelet

=sudo python setup.py install=

*** Running a simple test pipeline

1. Run the test pipeline

   =cd test/first_test=

   =python main.py=

2. Add this pipeline to the web interface

   =pipeweb track test ./.sqlstatus=

3. Set the access control and launch the web server

   =pipeutils -a username -l 2 .sqlstatus=

   =pipeweb start=

4. You should be able to browse the result on the web page
   http://localhost:8080

*** Getting a new pipe framework

To get a new pipeline framework, with example main and segment scripts:

=pipeutils -c pipename=

This command creates a directory named pipename which contains:
+ a main script (main.py) providing the functionality to execute
  your pipeline in various modes (debug, parallel, batch mode, ...),
+ an example segment script (seg_default_code.py) which illustrates
  the pipelet utilities with comments.

The next section describes these two files in more detail.

** Writing Pipes
*** Pipeline architecture

The definition of a data processing pipeline consists of:
+ a succession of python scripts, called segments, coding each step
  of the actual processing,
+ a main script that defines the dependency scheme between segments
  and launches the processing.

The dependencies between segments must form a directed acyclic
graph. This graph is described by a string using a subset of the
graphviz dot language (http://www.graphviz.org). For example, the string:

#+begin_src python
"""
a -> b -> d;
c -> d;
c -> e;
"""
#+end_src

defines a pipeline with 5 segments ={"a", "b", "c", "d", "e"}=. The
relation =a->b= ensures that the processing of segment =a= will be
done before the processing of its child segment =b=. Also, the output
of =a= will be fed as input to =b=. In the given example, the node
=d= has two parents, =b= and =c=. Both will be executed before =d=. As
there is no relation between =b= and =c=, which of the two will be
executed first is not defined.
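
This ordering constraint can be checked with a toy topological sort of
the example graph (plain python, independent of pipelet's actual
scheduler):

#+begin_src python
# Toy topological sort of the example graph: parents always run first.
edges = [("a", "b"), ("b", "d"), ("c", "d"), ("c", "e")]
nodes = {n for edge in edges for n in edge}

order = []
remaining = set(nodes)
while remaining:
    # schedule every node whose parents have all been scheduled
    ready = {n for n in remaining
             if all(p in order for (p, c) in edges if c == n)}
    order.extend(sorted(ready))
    remaining -= ready
#+end_src

Here =b= and =c= are unordered relative to each other, exactly as
described above.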

When executing the segment =seg=, the engine looks for a python script
named =seg.py=. If not found, it looks iteratively for script files
named =se.py= and =s.py=. This way, different segments of the pipeline
can share the same code if they are given names with a common root
(this mechanism is useful to write generic segments, and is complemented
by the hooking system described in the advanced usage section). The code
is then executed in a specific namespace (see below [[*The%20segment%20environment][The execution
environment]]).

*** The Pipeline object

Practically, the creation of a Pipeline object needs 3 arguments:

#+begin_src python
P = Pipeline(pipedot, codedir="./", prefix="./")
#+end_src

- =pipedot= is the string description of the pipeline,
- =codedir= is the path where the segment scripts can be found,
- =prefix= is the path to the data repository (see below [[*Hierarchical%20data%20storage][Hierarchical data storage]]).

It is possible to output the graphviz representation of the pipeline
(this requires graphviz to be installed). First, save the graph string
to a .dot file with the pipelet function:

#+begin_src python
P.to_dot('pipeline.dot')
#+end_src

Then, convert it to an image file with the dot command: 

=dot -Tpng -o pipeline.png pipeline.dot=

*** Dependencies between segments

The modification of the code of one segment will trigger its
recalculation and the recalculation of all the segments which
depend on it.

The output of a segment is a list of python objects. If a segment has
no particular output, this list can be empty and does not need to be
specified. Elements of the list are allowed to be any kind of
pickleable python objects. However, a good practice is to fill the
list with the minimal set of characteristics relevant to describe the
output of the segment, and to defer the storage of the data to
appropriate structures and file formats. For example, a segment which
performs computation on large images could in principle pass the results
of its computation to the following segment using the output list. It
is better practice to store the resulting image in a dedicated file
and to pass in the list only the information allowing an unambiguous
identification of this file (like its name or part of it) to the
following segments.

The input of a child segment is taken from a set built from the output
lists of its parents. The content of the input set is tunable
using the multiplex directive (see below). However, the simplest and
default behavior of the engine is to form the Cartesian product of
the output lists of its parents.

To illustrate this behavior, let us consider the following pipeline,
built from three segments:

#+begin_src python
"""
knights -> melt;
quality -> melt;
"""
#+end_src

and assume that the respective output lists of segments knights and
quality are:

#+begin_src python
["Lancelot", "Galahad"]
#+end_src
and:
#+begin_src python
['the Brave', 'the Pure']
#+end_src

The Cartesian product of the previous lists is:
#+begin_src python
[('Lancelot', 'the Brave'), ('Lancelot', 'the Pure'),
 ('Galahad', 'the Brave'), ('Galahad', 'the Pure')]
#+end_src

Four instances of segment =melt= will thus be run, each one receiving
as input one of the four 2-tuples.

At the end of the execution of all the instances of a segment, their
output lists are concatenated. If the action of segment =melt= is to
concatenate the two strings it receives, separated by a space, the
final output set of segment =melt= will be:

#+begin_src python
[('Lancelot the Brave'), ('Lancelot the Pure'), ('Galahad the Brave'), ('Galahad the Pure')].
#+end_src
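
This default multiplexing can be reproduced in a few lines of plain
python (a sketch, independent of pipelet itself):

#+begin_src python
from itertools import product

# parent output lists
knights = ["Lancelot", "Galahad"]
quality = ["the Brave", "the Pure"]

# default multiplexing: one melt task per element of the Cartesian product
tasks = list(product(knights, quality))

# each melt instance joins its two inputs; instance outputs are concatenated
melt_output = [" ".join(task) for task in tasks]
#+end_src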

This default behavior can be altered by specifying a =#multiplex=
directive in the commentary of the segment code. See section [[*Multiplex%20directive][Multiplex
directive]] for more details.

As the segment execution order is not uniquely determined by the pipe
scheme (several paths may exist), it is not possible to retrieve an
ordered input tuple. To overcome this issue, segment inputs are
dictionaries, with keywords matching parent segment names. In the
above example, one can read =melt= inputs using:

#+begin_src python
k = seg_input["knights"]
q = seg_input["quality"]
#+end_src

See section [[*The%20segment%20environment]['The segment environment']] for more details.

*** Orphan segments

By default, orphan segments have no input argument (an empty list),
and are therefore executed once.
It is possible to push an input list to an orphan segment.
If =P= is an instance of the Pipeline object:

#+begin_src python
P.push(segname=[1,2,3])
#+end_src

From the segment environment, inputs can be retrieved from the
usual dictionary, using the keyword =segnamephantom=:

#+begin_src python
id = seg_input['segnamephantom']
#+end_src
or
#+begin_src python
id = seg_input.values()[0]
#+end_src

See section [[*The%20segment%20environment][The segment environment]] for more details.

*** Hierarchical data storage

The framework provides versioning of your data and easy access
through the web interface. It is also used to keep track of the code,
of the execution logs, and of various meta-data of the processing. Of
course, you can still bypass the hierarchical storage and store
your actual data elsewhere, but you will lose the benefit of
automated versioning, which proves to be quite convenient.

The storage is organized as follows:

- all pipeline instances are stored below a root which corresponds to
  the prefix parameter of the Pipeline object:
      =/prefix/=
- all segment meta data are stored below a root whose name corresponds
  to a unique hash of the segment code:
      =/prefix/segname_YFLJ65/=
- the segment meta data are:
  - a copy of the segment python script,
  - a copy of all segment hook scripts,
  - a parameter file (.args) which contains the segment parameter values,
  - a meta data file (.meta) which contains some extra meta data.
- all segment instance data and meta data are stored in a specific
  subdirectory whose name corresponds to a string representation of its input:
      =/prefix/segname_YFLJ65/data/1/=
- if there is a single segment instance, data are stored in:
      =/prefix/segname_YFLJ65/data/=
- if a segment has at least one parent, its root will be located below
  one of its parents':
      =/prefix/segname_YFLJ65/segname2_PLMBH9/=
- etc.
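The naming scheme can be illustrated with a toy path computation (a
sketch only: the 6-character tag and the hash function used here are
illustrative, the real naming is a pipelet internal):

#+begin_src python
import hashlib
import os

def seg_dir(prefix, segname, code):
    # hypothetical: derive a short tag from a hash of the segment code,
    # so that changing the code changes the storage directory
    tag = hashlib.md5(code.encode()).hexdigest()[:6].upper()
    return os.path.join(prefix, "%s_%s" % (segname, tag))

d1 = seg_dir("/prefix", "segname", "seg_output = []")
d2 = seg_dir("/prefix", "segname", "seg_output = [1]")
#+end_src

Two versions of the same segment code thus get two distinct
directories, which is what makes the automated versioning possible.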
*** The segment environment

The segment code is executed in a specific environment that provides:

1. access to the segment input and output
   - =seg_input=: this variable is a dictionary containing the input of the segment.

     In the general case, =seg_input= is a python dictionary which
     contains as many keywords as parent segments. In the case of an orphan
     segment, the keyword used is suffixed with the word =phantom=.
     One exception to this comes from the use of the =group_by=
     directive, which alters the origin of the inputs. In this case,
     =seg_input= contains the resulting class elements.

   - =seg_output=: this variable has to be a list.

2. Functionalities to use the automated hierarchical data storage system:
   - =get_data_fn(basename)=: complete the filename with the path to the working directory.
   - =glob_seg(seg, regexp)=: return the list of filenames in the working
     directory of segment =seg= matching regexp.
   - =get_tmp_fn()=: return a temporary filename.

3. Functionalities to use the automated parameter handling:
   - =lst_par=: list of parameter names of the segment to save in the meta data.
   - =lst_tag=: list of parameter names which will be made visible from the web interface.
   - =load_param(seg, globals(), lst_par)=: retrieve parameters from the meta data.

4. Various convenient functionalities:
   - =save_products(filename, lst_par)=: use pickle to save a
     part of the current namespace.
   - =load_products(filename, lst_par)=: update the namespace by
     unpickling the requested objects from the file.
   - =logged_subprocess(lst_args)=: execute a subprocess and log its
     output in =processname.log= and =processname.err=.
   - =logger=: a standard =logging.Logger= object that can be used to
     log the processing.

5. Hooking support:
   Pipelet enables you to write reusable generic
   segments by providing a hooking system via the hook function.
   =hook(hookname, globals())=: execute the Python script =segname_hookname.py= and update the namespace.
   See the section [[*The%20hooking%20system][Hooking system]] for more details.
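Putting these pieces together, a typical segment body follows the
pattern below (a sketch: outside pipelet we stub the =seg_input= and
=get_data_fn= names that the engine normally injects, with hypothetical
values):

#+begin_src python
import os
import pickle
import tempfile

# --- stubs standing in for the pipelet-provided environment ---
seg_input = {"knights": "Lancelot"}        # hypothetical parent output
workdir = tempfile.mkdtemp()               # stands in for the segment data dir

def get_data_fn(basename):
    # the real function prepends the segment working directory
    return os.path.join(workdir, basename)

# --- a typical segment body ---
name = seg_input["knights"]
result = name.upper()                      # placeholder for a heavy computation
fn = get_data_fn(name + ".pkl")
with open(fn, "wb") as f:
    pickle.dump(result, f)                 # store the bulky product on disk
seg_output = [name]                        # pass only a light identifier on
#+end_src

Note how the segment stores its actual product on disk and outputs only
the identifier, as recommended in the Dependencies section above.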


*** The example pipelines
**** fft

***** Highlights

This example illustrates a very simple image processing use case.
The problem is the following: one wants to apply a Gaussian
filter in the Fourier domain to several 2D images.

The pipe scheme is: 

#+begin_src python
pipedot = """
mkgauss->convol;
fftimg->convol;
"""
#+end_src

where segment =mkgauss= computes the Gaussian filter, =fftimg= computes the
Fourier transforms of the input images, and =convol= performs the
filtering in the Fourier domain and the inverse transform of the filtered
images.

#+begin_src python
P = pipeline.Pipeline(pipedot, code_dir=op.abspath('./'), prefix=op.abspath('./'))
P.to_dot('pipeline.dot')
#+end_src

The pipe scheme is output as a .dot file, which can be converted to an
image file with the command line:

=dot -Tpng -o pipeline.png pipeline.dot=

To apply this filter to several images (in our case 4 input images),
the pipe data parallelism is used.
From the main script, a 4-element list is pushed to the =fftimg=
segment.

#+begin_src python
P.push(fftimg=[1,2,3,4]) 
#+end_src

At execution, 4 instances of the =fftimg= segment will be
created, and each of them outputs one element of this list:

#+begin_src python
img = seg_input.values()[0] #(fftimg.py - line 16)
seg_output = [img]          #(fftimg.py - line 41)
#+end_src

On the other side, a single instance of the =mkgauss= segment will be
executed, as there is only one filter to apply.

The last segment, =convol=, which depends on the two others, will be
executed with a number of instances that is the Cartesian product of
its 4+1 inputs (i.e. 4 instances).

The instance identifier, which is set by the =fftimg= output, can be
retrieved with the following instruction:

#+begin_src python
img = seg_input['fftimg']   #(convol.py - line 12)
#+end_src
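
The filtering step itself can be sketched with numpy (an illustration
of the algorithm, not the actual convol.py code):

#+begin_src python
import numpy as np

def gauss_filter_fourier(img, sigma):
    # build a Gaussian window in frequency space and apply it to the FFT
    ny, nx = img.shape
    fy = np.fft.fftfreq(ny)[:, None]
    fx = np.fft.fftfreq(nx)[None, :]
    win = np.exp(-2.0 * (np.pi * sigma) ** 2 * (fx ** 2 + fy ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * win))

img = np.random.rand(64, 64)
filtered = gauss_filter_fourier(img, 2.0)
#+end_src

The window equals 1 at zero frequency, so the mean of the image is
preserved while high-frequency content is attenuated.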

***** Running the pipe

Follow the same procedure as for the first test pipeline to run
this pipe and browse the results.


**** cmb
***** Running the pipe

This CMB pipeline depends on two external python modules:
+ healpy   :  http://code.google.com/p/healpy/
+ spherelib:  http://gitorious.org/spherelib

***** Problem

This example illustrates a very simple CMB data processing use case.

The problem is the following: one wants to characterize the
inverse noise weighting spectral estimator (as applied to the WMAP 1yr
data). A first demo pipeline is built to check that the algorithm
has been correctly implemented. Then, Monte Carlo simulations are used
to compute error bar estimates.

***** A design pipeline

The design pipe scheme is: 

#+begin_src python
pipe_dot = """ 
cmb->clcmb->clplot;
noise->clcmb;
noise->clnoise->clplot;
"""
#+end_src

where: 
+ =cmb=: generate a CMB map from LCDM power spectrum. 
+ =noise=: compute the mode coupling matrix from the input hit-count map
+ =clnoise=: compute the empirical noise power spectrum from a noise
  realization. 
+ =clcmb=: generate two noise realizations, add them to the CMB map, to compute a
  first cross spectrum estimator. Then weighting mask and mode
  coupling matrix are applied to get the inverse noise weighting
  estimator
+ =clplot=: make a plot to compare pure cross spectrum vs inverse noise
  weighting estimators.

As the first two orphan segments depend on a single shared parameter,
the map resolution =nside=, this argument is pushed from the main script.

Another input argument of the =cmb= segment is its simulation identifier,
which will be used later for Monte Carlo. In order to push two
inputs to a single segment instance, we use the python tuple data type.

#+begin_src python
P.push(cmb=[(nside, 1)])
P.push(noise=[nside])
#+end_src

From the segment, those inputs are retrieved with:

#+begin_src python
nside  = seg_input.values()[0][0] ##(cmb.py line 13)
sim_id = seg_input.values()[0][1] ##(cmb.py line 14)
nside  = seg_input.values()[0]  ##(noise.py line 16)
#+end_src

The last segment produces a plot in which we compare:
+ the input LCDM power spectrum,
+ the binned cross spectrum of the noisy CMB maps,
+ the binned cross spectrum to which we applied the hit-count weighting
  and mode coupling matrix,
+ the noise power spectrum computed by the =clnoise= segment.

In this plot we check that both estimators are correct, and that the
noise level is the expected one.

***** From the design pipeline to Monte Carlo

As a second step, Monte Carlo simulations are used to compute error
bars. 

The =clnoise= segment is no longer needed, so the new pipe scheme
reads:

#+begin_src python
pipe_dot = """ 
cmb->clcmb->clplot;
noise->clcmb;
"""
#+end_src

We now use the native data parallelization scheme of the pipe to build
many instances of the =cmb= and =clcmb= segments. 

#+begin_src python
cmbin = []
for sim_id in [1,2,3,4,5,6]:
    cmbin.append((nside, sim_id))
P.push(cmb=cmbin)
#+end_src
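
The same input list can be built with a comprehension; with a single
=noise= output, the Cartesian product then yields one =clcmb= instance
per simulation (the =nside= value below is hypothetical):

#+begin_src python
nside = 512  # hypothetical map resolution
cmbin = [(nside, sim_id) for sim_id in range(1, 7)]

# one noise output x six cmb outputs -> six clcmb instances
n_clcmb = 1 * len(cmbin)
#+end_src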


** Running Pipes

*** The sample main file

A sample main file is made available when creating a new Pipelet
framework. It is copied from the reference file:

=pipelet/pipelet/static/main.py=

This script illustrates various ways of running pipes. It describes
the different parameters, and also how to write a main python script
that can be used like any binary from the command line (including
option parsing).

*** Common options
Some options are common to all running modes.
**** log level

The logging system is handled by the python logging module.
This module defines the following log levels:
+ =DEBUG=
+ =INFO=
+ =WARNING=
+ =ERROR=
+ =CRITICAL=

All logging messages are saved in the different Pipelet log files,
available from the web interface (rotating file logging). It is also
possible to print those messages on the standard output (stream
logging), by setting the desired log level in the launcher options.
For example:
#+begin_src python
import logging
launch_process(P, N, log_level=logging.DEBUG)
#+end_src

If set to 0, stream logging will be disabled.
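
The level filtering behaves as in the standard logging module; a
minimal check, independent of pipelet:

#+begin_src python
import logging

logger = logging.getLogger("pipelet-demo")
logger.setLevel(logging.INFO)   # INFO and above pass, DEBUG is filtered out

debug_enabled = logger.isEnabledFor(logging.DEBUG)
warning_enabled = logger.isEnabledFor(logging.WARNING)
#+end_src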

**** matplotlib

The matplotlib documentation says: 

"Many users report initial problems trying to use matplotlib in web
application servers, because by default matplotlib ships configured to
work with a graphical user interface which may require an X11
connection. Since many barebones application servers do not have X11
enabled, you may get errors if you don’t configure matplotlib for use
in these environments. Most importantly, you need to decide what kinds
of images you want to generate (PNG, PDF, SVG) and configure the
appropriate default backend. For 99% of users, this will be the Agg
backend, which uses the C++ antigrain rendering engine to make nice
PNGs. The Agg backend is also configured to recognize requests to
generate other output formats (PDF, PS, EPS, SVG). The easiest way to
configure matplotlib to use Agg is to call:

=matplotlib.use('Agg')=

The =matplotlib= and =matplotlib_interactive= options turn the matplotlib
backend to Agg in order to allow execution in non-interactive
environments.

Those two options are set to =True= by default in the sample main
script. Setting them to =matplotlib_interactive=
TODO

*** The interactive mode
This mode has been designed to ease debugging. If =P= is an instance of
the pipeline object, the syntax reads:
Marc Betoule's avatar
Marc Betoule committed
592

Marc Betoule's avatar
Marc Betoule committed
593
#+begin_src python
Marc Betoule's avatar
Marc Betoule committed
594 595 596
from pipelet.launchers import launch_interactive
w, t = launch_interactive(P)
w.run()
Marc Betoule's avatar
Marc Betoule committed
597
#+end_src
Marc Betoule's avatar
Marc Betoule committed
598 599 600 601

In this mode, each tasks will be computed in a sequential way. 
Do not hesitate to invoque the Python debugger from IPython : %pdb

To use the interactive mode, run:
=main.py -d=


*** The process mode

In this mode, several tasks can run simultaneously (if the pipe
scheme allows it). The number of subprocesses is set by the =N=
parameter:

#+begin_src python
from pipelet.launchers import launch_process
launch_process(P, N)
#+end_src

To use the process mode, run:
=main.py=
or
=main.py -p 4=

*** The batch mode

In this mode, batch jobs are submitted to execute the tasks.
The number of jobs is set by the =N= parameter:

#+begin_src python
import os
from pipelet.launchers import launch_pbs
launch_pbs(P, N, address=(os.environ['HOST'], 50000))
#+end_src

It is possible to specify some job submission options like:
+ job name
+ job header: this string is prepended to the PBS job scripts. You may
  want to add some environment-specific paths. Log and error files are
  automatically handled by the pipelet engine, and made available from
  the web interface.
+ CPU time: syntax is "hh:mm:ss"

The =server= option can be disabled to add some workers to an existing
scheduler.

To use the batch mode, run:
=main.py -b=

to start the server, and:

=main.py -a 4=

to add 4 workers.

** Browsing Pipes
*** The pipelet webserver and ACL

The pipelet webserver allows the browsing of multiple pipelines.
Each pipeline has to be registered using:

=pipeweb track <shortname> sqlfile=

To remove a pipeline from the tracked list, use:

=pipeweb untrack <shortname>=

As pipeline browsing implies disk parsing, some basic security has to
be set up as well. All users have to be registered with a specific
access level (1 for read-only access, 2 for write access):

=pipeutils -a <username> -l 2 sqlfile=

To remove a user from the user list:

=pipeutils -d <username> sqlfile=

Start the web server using:

=pipeweb start=

The web application will then be available at http://localhost:8080

To stop the web server:

=pipeweb stop=

*** The web application

To ease the comparison of different processings, the web interface
displays various views of the pipeline data:

**** The index page

The index page displays a tree view of all pipeline instances. Each
segment may be expanded or collapsed via the +/- buttons.

The parameters used in each segment are summarized and displayed with
the date of execution and the number of related tasks, ordered by
status.

A check-box allows operations to be performed on multiple segments:
  - deletion: to clean unwanted data
  - tag: to tag remarkable data

The filter panel allows the segment instances to be displayed with
respect to two criteria:
  - tag
  - date of execution

**** The code page

Each segment name is a link to its code page. From this page the user
can view the code of all Python scripts which have been applied to the data.

The tree view is reduced to the current segment and its related
parents.

The root path corresponding to the data storage is also displayed.


**** The product page

The number of related tasks, ordered by status, is a link to the product
pages, where the data can be directly displayed (if images or text
files) or downloaded.
From this page it is also possible to delete a specific product and
its dependencies. 


**** The log page

The log page can be accessed via the log button of the filter panel.
Logs are ordered by date.



* Advanced usage

** Multiplex directive

The default behavior can be altered by specifying a =#multiplex=
directive in the commentary of the segment code. If several multiplex
directives are present in the segment code, the last one is retained.

The multiplex directive can be one of:

+ =#multiplex cross_prod= : the default behavior, returns the
  Cartesian product.
+ =#multiplex union= : makes the union of the inputs.

Moreover, the =#multiplex cross_prod= directive admits filtering and
grouping by class, similar to SQL queries:

#+begin_src python
#multiplex cross_prod where "condition" group_by "class_function"
#+end_src

=condition= and =class_function= are Python code evaluated for each
element of the product set.

The argument of =where= is a condition. The element will be part of
the input set only if it evaluates to =True=.

The =group_by= directive groups elements into classes according to
the result of the evaluation of the given class function. The input
set contains all the resulting classes. For example, if the function
is a constant, the input set will contain only one element: the class
containing all elements.

During the evaluation, the values of the tuple elements are accessible
as variables bearing the names of the corresponding parents.


Given the Cartesian product set:
#+begin_src python
[('Lancelot','the Brave'), ('Lancelot','the Pure'), ('Galahad','the Brave'), ('Galahad','the Pure')]
#+end_src

one can use:
#+begin_src python
#multiplex cross_prod where "quality=='the Brave'"
#+end_src
to get 2 instances of the following segment (=melt=) running on:
#+begin_src python
('Lancelot','the Brave'), ('Galahad','the Brave')
#+end_src

#+begin_src python
#multiplex cross_prod group_by "knights"
#+end_src
to get 2 instances of the =melt= segment running on:
#+begin_src python
('Lancelot'), ('Galahad')
#+end_src

#+begin_src python
#multiplex cross_prod group_by "0"
#+end_src
to get 1 instance of the =melt= segment running on: (0)
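
The semantics of =where= and =group_by= can be illustrated in plain
Python (a sketch only, not the Pipelet implementation; the variable
names =knights= and =quality= stand in for the parent segments of the
examples above):

#+begin_src python
from itertools import product

# Plain-Python sketch of the multiplex semantics (illustration only,
# not Pipelet internals).
knights = ['Lancelot', 'Galahad']
quality = ['the Brave', 'the Pure']

cross = list(product(knights, quality))          # cross_prod

# where "quality=='the Brave'": keep only matching tuples
brave = [(k, q) for (k, q) in cross if q == 'the Brave']

# group_by "knights": one class per value of the class function
classes = sorted({k for (k, q) in cross})

print(brave)    # [('Lancelot', 'the Brave'), ('Galahad', 'the Brave')]
print(classes)  # ['Galahad', 'Lancelot']
#+end_src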

Note that to make use of =group_by=, elements of the output set have
to be hashable.

Another caution on the use of grouping: the segment input data type
is no longer a dictionary in those cases, as the original tuple is
lost and replaced by the result of the class function.

See section [[*The%20segment%20environment][The segment environment]] for more details.

** Depend directive

As explained in the introduction section, Pipelet offers the
possibility to save CPU time by storing intermediate products on
disk. We call intermediate products the input/output data files of
the different segments.

Each segment repository is identified by a unique key which depends
on: 
- the segment processing code and parameters (segment and hook
  scripts)
- the input data (identified from the key of the parent segments)

Every change made to a segment (new parameter or new parent) will
then yield a different key, and tell the Pipelet engine to compute a
new segment instance.

It is possible to add some external dependencies to the key
computation using the depend directive: 

#+begin_src python
#depend file1 file2
#+end_src

At the very beginning of the pipeline execution, all dependencies are
stored, to prevent any change (code edition) between the key
computation and the actual processing.

Note that this mechanism works only for segment and hook
scripts. External dependencies are also read at the beginning of the
pipeline execution, but only used for the key computation.
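
The idea behind the key computation can be sketched as follows (an
illustration only, not the actual Pipelet hashing code; the function
=segment_key= and its arguments are hypothetical): the key is a digest
of the segment code, the parent keys, and the content of the declared
dependencies, so a change in any of them yields a new key.

#+begin_src python
import hashlib

# Illustrative sketch, not Pipelet internals: a segment key built from
# the code, the parent keys and the external dependency contents.
def segment_key(code, parent_keys, dep_contents):
    h = hashlib.sha1()
    h.update(code.encode())
    for key in sorted(parent_keys):   # order-independent over parents
        h.update(key.encode())
    for content in dep_contents:      # bytes of each #depend file
        h.update(content)
    return h.hexdigest()

k1 = segment_key("plot(x)", ["abc"], [b"v1"])
k2 = segment_key("plot(x)", ["abc"], [b"v2"])  # dependency changed
print(k1 != k2)  # True: a new key, hence a new segment instance
#+end_src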

** Database reconstruction

In case of an unfortunate loss of the pipeline SQL database, it is
possible to reconstruct it from the disk:

#+begin_src python
import pipelet
pipelet.utils.rebuild_db_from_disk(prefix, sqlfile)
#+end_src

All information will be retrieved, but with new identifiers.

** The hooking system

As described in the 'segment environment' section, Pipelet supports
a hooking system which allows the use of generic processing code and
code sectioning.

If a set of instructions has to be systematically applied at the end
of a segment (post processing), one can put those instructions in a
separate script file named, for example, =segname_postproc.py=, and
call the hook function:

#+begin_src python
hook('postproc', globals())
#+end_src

A specific dictionary can be passed to the hook script to avoid
confusion.

The hook scripts are included in the hash key computation.
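
The underlying mechanism can be sketched in plain Python (an
illustration of the idea, not the exact Pipelet calling convention):
the hook script body is executed within the segment's namespace, so it
can read and modify its variables.

#+begin_src python
# Sketch of the hooking idea (illustration only): running a separate
# script body inside the caller's namespace, here simulated with exec.
seg_namespace = {'seg_output': [1, 2, 3]}
postproc_code = "seg_output = [x * 2 for x in seg_output]"
exec(postproc_code, seg_namespace)
print(seg_namespace['seg_output'])  # [2, 4, 6]
#+end_src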

** Writing custom environments

The Pipelet software provides a set of default utilities available
from the segment environment. It is possible to extend this default
environment or even write a completely new one.

*** Extending the default environment

The different environment utilities are actually methods of the class
=Environment=. It is possible to add new functionalities by using the
Python inheritance mechanism:

File: =myenvironment.py=

#+begin_src python
from pipelet.environment import *

class MyEnvironment(Environment):
    def my_function(self):
        """ My function does nothing. """
        return
#+end_src

The Pipelet engine objects (segments, tasks, pipeline) are available
from the worker attribute =self._worker=. See section [[*The%20Pipelet%20actors][The Pipelet actors]]
for more details about the Pipelet machinery.


*** Writing a new environment

In order to start with a completely new environment, extend the base
environment:

File: =myenvironment.py=
#+begin_src python
from pipelet.environment import *

class MyEnvironment(EnvironmentBase):
    def my_get_data_fn(self, x):
        """ New name for get_data_fn. """
        return self._get_data_fn(x)

    def _close(self, glo):
        """ Post-processing code. """
        return glo['seg_output']
#+end_src

From the base environment, the basic functionalities for getting file
names and executing hook scripts are still available through:
- =self._get_data_fn=
- =self._hook=

The segment input argument is also stored in =self._seg_input=.
The segment output argument has to be returned by the
=_close(self, glo)= method.

The pipelet engine objects (segments, tasks, pipeline) are available
from the worker attribute =self._worker=. See section [[*The%20Pipelet%20actors][The Pipelet actors]]
for more details about the Pipelet machinery.


*** Loading another environment