WARNING Pipelet is currently under active development and highly
unstable. There is a good chance that it becomes incompatible from one
commit to another.

Pipelet is a free framework allowing for the creation, manipulation,
execution and browsing of scientific data processing pipelines. It
provides:

+ easy chaining of interdependent elementary tasks,
+ web access to data products,
+ branch handling,
+ automated distribution of computational tasks.

Both the engine and the web interface are written in Python.

* Tutorial
** Introduction
*** Why use pipelines

The pipeline mechanism allows you to apply a sequence of processing
steps to your data, in such a way that the input of each step is the
output of the previous one. Making these different processing steps
visible, in the right order, is essential in data analysis: it keeps
track of what you did and ensures that the whole processing remains
consistent.

*** How it works

Pipelet is based on the possibility of saving on disk every
intermediate input or output of your pipeline. This is not a strong
constraint, and it offers a lot of benefits: you can stop the
processing whenever you want and start it again without recomputing
the whole thing, since you just take the last products you have on
disk and continue the processing where it stopped. This logic pays off
when the computation cost is higher than the cost of the disk space
required by the intermediate products.

*** The Pipelet functionalities

Pipelet implements an automated tool which helps you:
+ write and manipulate pipelines with any dependency scheme,
+ dispatch the computational tasks on parallel architectures,
+ keep track of what processing has been applied to your data and perform comparisons.

** Getting started

*** Obtaining Pipelet

There is no published stable release of Pipelet right now.

git clone git://gitorious.org/pipelet/pipelet.git

*** Installing Pipelet
The web interface of Pipelet requires the installation of the
cherrypy3 Python module (on Debian: aptitude install python-cherrypy3).
Pipelet requires Python >= 2.6.

sudo python setup.py install

*** Running the test pipeline

1. Run the test pipeline

cd test
python main.py

2. Add this pipeline to the web interface

pipeweb track test ./.sqlstatus

3. Set the access control and launch the web server

pipeutils -a username -l 2 .sqlstatus
pipeweb start

4. You should be able to browse the results at
   http://localhost:8080

*** Getting a new pipe framework

To get a new pipe framework, with a sample main script and segment scripts:

pipeutils -c pipename



** Writing Pipes

*** Pipeline architecture

The definition of a data processing pipeline consists of:
+ a succession of Python scripts, called segments, coding each step
  of the actual processing.
+ a main script that defines the dependency scheme between segments,
  and launches the processing.

The dependencies between segments must form a directed acyclic
graph. This graph is described by a character string using a subset of
the graphviz dot language (http://www.graphviz.org). For example, the
string:

"""
a -> b -> d;
c -> d;
c -> e;
"""

defines a pipeline with 5 segments {"a", "b", "c", "d", "e"}. The
relation "a->b" ensures that the processing of segment "a" will be
done before the processing of its child segment "b". The output of "a"
will also be fed as input to "b". In the given example, the node "d"
has two parents "b" and "c". Both will be executed before "d". As
there is no relation between "b" and "c", which of the two will be
executed first is not defined.

*** The Pipeline object

In practice, the creation of a Pipeline object needs 3 arguments:

Pipeline(pipedot, codedir=, prefix=)

- pipedot is the string description of the pipeline
- codedir is the path of the code of the segments
- prefix  is the path of the data repository
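
For example, with the dependency string from the previous section, the
pipeline object could be built as in the sketch below. This is only a
sketch: the import path pipelet.pipeline and the ./segments and ./data
directories are assumptions to adapt to your own layout.

from pipelet.pipeline import Pipeline

pipedot = """
a -> b -> d;
c -> d;
c -> e;
"""

P = Pipeline(pipedot, codedir='./segments', prefix='./data')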

*** Dependencies between segments

The modification of the code of one segment will trigger its
recalculation and the recalculation of all the segments which
depend on it.

The output of a segment is a list of Python objects. If a segment has
no particular output the list can be empty. A child segment receives
its input from a set built from its parents' output sets. An instance
of the child segment is executed for each element in the set. The
default behaviour of the engine is to form the Cartesian product of
the output sets of its parents. This means that if the output of the
"b" segment is the list of strings ["Lancelot", "Galahad"] and the
output of "c" is the list of strings ['the Brave', 'the Pure'], four
instances of segment "d" will be run. Their inputs will be the four
2-tuples: ('Lancelot', 'the Brave'), ('Lancelot', 'the Pure'),
('Galahad', 'the Brave'), ('Galahad', 'the Pure'). At the end of the
execution of all the instances of the segment, their output sets are
concatenated. If the action of segment "d" is to concatenate the two
strings it receives separated by a space, the final output set of
segment "d" will be: [('Lancelot the Brave'), ('Lancelot the Pure'),
('Galahad the Brave'), ('Galahad the Pure')].
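
For illustration only, the set arithmetic performed by this default
behaviour can be reproduced with plain Python; the sketch below mimics
what the engine does with the output sets, it is not pipelet code.

from itertools import product

# Output sets of the parent segments "b" and "c".
b_output = ['Lancelot', 'Galahad']
c_output = ['the Brave', 'the Pure']

# Inputs of the four instances of segment "d" (Cartesian product).
d_inputs = list(product(b_output, c_output))
# [('Lancelot', 'the Brave'), ('Lancelot', 'the Pure'),
#  ('Galahad', 'the Brave'), ('Galahad', 'the Pure')]

# If each instance of "d" joins its two input strings with a space,
# the concatenated output set of "d" becomes:
d_output = [' '.join(t) for t in d_inputs]
# ['Lancelot the Brave', 'Lancelot the Pure',
#  'Galahad the Brave', 'Galahad the Pure']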


*** Multiplex directive

This default behaviour can be altered by specifying an @multiplex
directive in the comments of the segment code. If several multiplex
directives are found, the last one is retained.

- @multiplex cross_prod : activates the default behaviour

- @multiplex zip : similar to the Python zip function. The input set
    is a list of tuples, where each tuple contains the i-th element of
    each parent's sorted output list. If the lists have different
    sizes, the shortest one is used.

- @multiplex union : the input set contains all the outputs.

- @multiplex gather : the input set contains one tuple of all the outputs.
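
For illustration, a hypothetical segment script carrying the directive
could look like the sketch below. Only the @multiplex comment line is
the mechanism described above; the rest of the body is made up for the
example.

# @multiplex zip : pair the i-th outputs of the parents instead of
#                  forming their Cartesian product.
#
# With the "b" and "c" outputs of the previous section, this segment
# would be run once per pair of i-th outputs instead of four times.

# Pass the received input through unchanged; seg_input and seg_output
# are described in "The segment environment" below.
seg_output = [seg_input]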


*** Depend directive



*** Orphan segments

TODO TBD

*** Hierarchical data storage

This system provides versioning of your data and easy access through
the web interface. It is also used to keep track of the code, the
execution logs, and various meta data of the processing. Of course,
you remain able to bypass the hierarchical storage and store your
actual data elsewhere, but you will lose the benefit of automated
versioning, which proves to be quite convenient.

The storage is organized as follows:

- all pipeline instances are stored below a root which corresponds to
  the prefix parameter of the Pipeline object.
      /prefix/
- all segment meta data are stored below a root whose name corresponds
  to a unique match of the segment code.
      /prefix/seg_segname_YFLJ65/
- a segment's meta data are:
  - a copy of the segment Python script
  - a copy of all segment hook scripts
  - a parameter file (.args) which contains the segment parameter values
  - a meta data file (.meta) which contains some extra meta data
- all data and meta data of a segment instance are stored in a specific
  subdirectory whose name corresponds to a string representation of its
  input
      /prefix/seg_segname_YFLJ65/data/1/
- if there is a single segment instance, the data are stored in
      /prefix/seg_segname_YFLJ65/data/
- if a segment has at least one parent, its root will be located below
  one of its parents' roots:
      /prefix/seg_segname_YFLJ65/seg_segname2_PLMBH9/
- etc.

*** The segment environment

The segment code is executed in a specific environment that provides:

1. access to the segment input and output
   - seg_input: this variable is a dictionary containing the input of the segment
   - get_input(): 
   - seg_output: this variable has to be set to a list containing the output of the segment

2. Functionalities to use the automated hierarchical data storage system.
   - get_data_fn(basename): completes the filename with the path to the working directory.
   - glob_seg(regexp, seg): returns the list of filenames matching regexp from segment seg
   - get_tmp_fn(): returns a temporary filename.

3. Functionalities to use the automated parameter handling
   - var_key: list of parameter names of the segment
   - var_tag: list of parameter names which will be made visible from the web interface
   - load_param(seg, var_names)

4. Various convenient functionalities
   - save_products(filename, var_names='*'): use pickle to save a
     part of a given namespace.
   - load_products(filename, var_names): update the namespace by
     unpickling the requested objects from the file.
   - logged_subprocess(lst_args): execute a subprocess and log its output.
   - log is a standard logging.Logger object that can be used to log the processing.

5. Hooking support 
   Pipelet enables you to write reusable generic segments by
   providing a hooking system via the hook function.
   - hook(hookname, globals()): execute the Python script
     seg_segname_hookname.py and update the namespace.
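
Putting a few of these facilities together, a segment script might
look like the sketch below. It is purely illustrative: the parameter
sigma, the file name result.txt and the way the input is written out
are made up for the example, not part of the pipelet API.

# Hypothetical segment script using the environment described above.

# Parameters handled by the automated parameter system; sigma is also
# made visible from the web interface.
sigma = 2.5
var_key = ['sigma']
var_tag = ['sigma']

# Store a product in the hierarchical data storage.
fn = get_data_fn('result.txt')
with open(fn, 'w') as f:
    f.write('input: %s, sigma: %f\n' % (str(seg_input), sigma))

# Log the processing and pass the file name to the child segments.
log.info('wrote %s' % fn)
seg_output = [fn]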




Loading another environment






** Running Pipes

*** The interactive mode
This mode has been designed to ease debugging. If P is an instance of
the Pipeline object, the syntax reads:

from pipelet.launchers import launch_interactive
w, t = launch_interactive(P)
w.run()

In this mode, each task is computed sequentially.
Do not hesitate to invoke the Python debugger from IPython: %pdb

*** The process mode
In this mode, one can run several tasks simultaneously (if the pipe
scheme allows it).
The number of subprocesses is set by the N parameter:

from pipelet.launchers import launch_process
launch_process(P, N)


*** The batch mode
In this mode, one can submit batch jobs to execute the tasks.
The number of jobs is set by the N parameter:

import os
from pipelet.launchers import launch_pbs
launch_pbs(P, N, address=(os.environ['HOST'], 50000))

** Browsing Pipes
   
*** The pipelet webserver and ACL

The pipelet webserver allows the browsing of multiple pipelines.
Each pipeline has to be registered using:

pipeweb track <shortname> sqlfile

As pipeline browsing implies disk parsing, some basic security has to
be set up as well. All users have to be registered with a specific
access level (1 for read-only access, 2 for write access).

pipeutils -a <username> -l 2 sqlfile

Start the web server using:

pipeweb start

The web application will then be available at http://localhost:8080

*** The web application

In order to ease the comparison of different processing runs, the web
interface displays various views of the pipeline data:

**** The index page 

The index page displays a tree view of all pipeline instances. Each
segment may be expanded or collapsed via the +/- buttons.

The parameters used in each segment are summarized and displayed along
with the date of execution and the number of related tasks, ordered by
status.

A checkbox allows operations to be performed on multiple segments:
  - deletion: to clean unwanted data
  - tag: to tag remarkable data

The filter panel allows the segment instances to be displayed according
to two criteria:
  - tag
  - date of execution

**** The code page

Each segment name is a link to its code page. From this page the user
can view all the Python scripts that have been applied to the data.

The tree view is reduced to the current segment and its related
parents.

The root path corresponding to the data storage is also displayed.


**** The product page

The number of related tasks, ordered by status, is a link to the
product pages, where the data can be displayed directly (for images or
text files) or downloaded.
From this page it is also possible to delete a specific product and
its dependencies. 


**** The log page

The log page can be accessed via the log button of the filter panel.
Logs are ordered by date.




* Advanced usage

** Database reconstruction

In case of an unfortunate loss of the pipeline SQL database, it is
possible to reconstruct it from the disk:

import pipelet.utils
pipelet.utils.rebuild_db_from_disk(prefix, sqlfile)

All information will be retrieved, but with new identifiers.

** The hooking system
** Writing custom environments

** Using custom dependency schemes

** Launching pipeweb behind apache

Pipeweb uses the CherryPy web framework server and can be run behind
an Apache web server, which brings essentially two advantages:
- HTTPS support.
- faster serving of static files.


* The pipelet actors

** The Repository object
** The Pipeline object
** The Task object
** The Scheduler object
** The Tracker object
** The Worker object