WARNING Pipelet is currently under active development and highly
unstable. There is a good chance that it will become incompatible from
one commit to another.

Pipelet is a free framework allowing for the creation, manipulation,
execution and browsing of scientific data processing pipelines. It
provides:

+ easy chaining of interdependent elementary tasks,
+ web access to data products,
+ branch handling,
+ automated distribution of computational tasks.

Both the engine and the web interface are written in Python.

* Tutorial
** Introduction
*** Why use pipelines

The pipeline mechanism allows you to apply a sequence of processing
steps to your data, in such a way that the input of each step is the
output of the previous one. Making these different processing steps
visible, in the right order, is essential in data analysis: it keeps
track of what you did and ensures that the whole processing remains
consistent.

*** How it works

Pipelet relies on the possibility to save on disk every intermediate
input or output of your pipeline. This is not a strong constraint and
it offers a lot of benefits: you can stop the processing whenever you
want and resume it later without recomputing the whole thing, simply
taking the last products you have on disk and continuing the
processing where it stopped. This approach pays off when the
computation cost is higher than the cost of the disk space required by
the intermediate products.

*** The Pipelet functionalities

Pipelet implements an automatic tool which helps you:
+ write and manipulate pipelines with any dependency scheme,
+ dispatch the computational tasks on parallel architectures,
+ keep track of what processing has been applied to your data and perform comparisons.

** Getting started

*** Obtaining Pipelet

There is no published stable release of Pipelet at the moment.

git clone git://gitorious.org/pipelet/pipelet.git

*** Installing Pipelet
The web interface of Pipelet requires the cherrypy3 Python module (on
Debian: aptitude install python-cherrypy3).
Pipelet requires Python >= 2.6.

sudo python setup.py install

*** Running the test pipeline

1. Run the test pipeline

cd test
python main.py

2. Add this pipeline to the web interface

pipeweb track test ./.sqlstatus

3. Set the access control and launch the web server

pipeutils -a username -l 2 .sqlstatus
pipeweb start

4. You should be able to browse the results at
   http://localhost:8080

*** Getting a new pipe framework

To get a new pipe framework, with sample main and segment scripts:

pipeutils -c pipename



** Writing Pipes

*** Pipeline architecture

The definition of a data processing pipeline consists of:
+ a succession of Python scripts, called segments, each coding one step
  of the actual processing,
+ a main script that defines the dependency scheme between segments
  and launches the processing.

The dependencies between segments must form a directed acyclic
graph. This graph is described by a string using a subset of the
graphviz dot language (http://www.graphviz.org). For example, the string:

"""
a -> b -> d;
c -> d;
c -> e;
"""

defines a pipeline with 5 segments {"a", "b", "c", "d", "e"}. The
relation "a->b" ensures that the processing of segment "a" will be
done before the processing of its child segment "b". The output of "a"
will also be fed as input to "b". In the given example, the node "d"
has two parents, "b" and "c". Both will be executed before "d". As
there is no relation between "b" and "c", which of the two will be
executed first is not defined.

*** The Pipeline object

Practically, the creation of a Pipeline object needs 3 arguments (a
minimal example is sketched after the argument list below):

Pipeline(pipedot, codedir=, prefix=)

- pipedot is the string description of the pipeline
- codedir is the path of the code of the segments
- prefix  is the path of the data repository
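As a hedged illustration, a main script could create the pipeline
described above roughly as follows. The import path pipelet.pipeline
and the directory names are assumptions made for this sketch, not
taken from this document:

   # hypothetical main script: build the pipeline described above
   from pipelet.pipeline import Pipeline

   pipedot = """
   a -> b -> d;
   c -> d;
   c -> e;
   """

   # codedir holds the segment scripts, prefix is the root of the
   # data repository
   P = Pipeline(pipedot, codedir='./segments', prefix='./data')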

*** Dependencies between segments

The modification of the code of one segment will trigger its
recalculation and the recalculation of all the segments which
depend on it.

The output of a segment is a list of Python objects. If a segment has
no particular output, the list can be empty. A child segment receives
its input from a set built from its parents' output sets. An instance
of the child segment is executed for each element in the set. The
default behaviour of the engine is to form the Cartesian product of
the output sets of its parents. This means that if the output of
segment "b" is the list of strings ["Lancelot", "Galahad"] and the
output of "c" is the list of strings ['the Brave', 'the Pure'], four
instances of segment "d" will be run. Their inputs will be,
respectively, the four 2-tuples: ('Lancelot', 'the Brave'),
('Lancelot', 'the Pure'), ('Galahad', 'the Brave'),
('Galahad', 'the Pure'). At the end of the execution of all the
instances of the segment, their output sets are concatenated. If the
action of segment "d" is to concatenate the two strings it receives,
separated by a space, the final output set of segment "d" will be:
['Lancelot the Brave', 'Lancelot the Pure', 'Galahad the Brave',
'Galahad the Pure'].
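For illustration only, the default combination of the parents' outputs
behaves like a Cartesian product in plain Python. This sketch is not
Pipelet API; it just mimics the behaviour described above:

   import itertools

   b_output = ["Lancelot", "Galahad"]     # output set of segment "b"
   c_output = ['the Brave', 'the Pure']   # output set of segment "c"

   # default behaviour: one instance of "d" per element of the product
   inputs_for_d = list(itertools.product(b_output, c_output))
   # [('Lancelot', 'the Brave'), ('Lancelot', 'the Pure'),
   #  ('Galahad', 'the Brave'), ('Galahad', 'the Pure')]

   # if each instance of "d" joins its two strings with a space and the
   # instance outputs are concatenated, the final output set of "d" is:
   d_output = [' '.join(t) for t in inputs_for_d]
   # ['Lancelot the Brave', 'Lancelot the Pure',
   #  'Galahad the Brave', 'Galahad the Pure']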


*** Multiplex directive

This default behaviour can be altered by specifying an @multiplex
directive in the commentary of the segment code. If several multiplex
directives are found, the last one is retained. (A plain Python
illustration of these policies is sketched after the list below.)

- @multiplex cross_prod : activate the default behaviour.

- @multiplex zip : similar to the Python zip function. The input set is
    a list of tuples, where each tuple contains the i-th element from
    each of the parents' sorted output lists. If the lists have
    different sizes, the shortest one is used.

- @multiplex union : the input set contains all the outputs.

- @multiplex gather : the input set contains one tuple of all the outputs.
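For illustration only, here is how the different policies would combine
the same two parent output sets in plain Python. This is a sketch of
the described behaviour, not Pipelet API:

   b_output = ["Lancelot", "Galahad"]
   c_output = ['the Brave', 'the Pure']

   # @multiplex zip : pair elements by rank, like Python's zip()
   zip_inputs = list(zip(b_output, c_output))
   # [('Lancelot', 'the Brave'), ('Galahad', 'the Pure')]

   # @multiplex union : the input set contains all the parent outputs
   union_inputs = b_output + c_output
   # ['Lancelot', 'Galahad', 'the Brave', 'the Pure']

   # @multiplex gather : a single tuple holding all the outputs
   gather_inputs = [tuple(b_output + c_output)]
   # [('Lancelot', 'Galahad', 'the Brave', 'the Pure')]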


*** Depend directive



*** Orphan segments

TODO TBD

*** Hierarchical data storage

This system provides versioning of your data and easy access through
the web interface. It is also used to keep track of the code, of the
execution logs, and of various meta data of the processing. Of course,
you remain free to bypass the hierarchical storage and store your
actual data elsewhere, but you will lose the benefit of automated
versioning, which proves to be quite convenient.

The storage is organized as follows:

- all pipeline instances are stored below a root which corresponds to
  the prefix parameter of the Pipeline object:
      /prefix/
- all segment meta data are stored below a root whose name corresponds
  to a unique match of the segment code:
      /prefix/seg_segname_YFLJ65/
- a segment's meta data are:
  - a copy of the segment Python script
  - a copy of all segment hook scripts
  - a parameter file (.args) which contains the segment parameter values
  - a meta data file (.meta) which contains some extra meta data
- all segment instance data and meta data are stored in a specific
  subdirectory whose name corresponds to a string representation of its input:
      /prefix/seg_segname_YFLJ65/data/1/
- if there is a single segment instance, the data are stored in
      /prefix/seg_segname_YFLJ65/data/
- if a segment has at least one parent, its root will be located below
  one of its parents':
      /prefix/seg_segname_YFLJ65/seg_segname2_PLMBH9/
- etc.

*** The segment environment

The segment code is executed in a specific environment that provides:

1. Access to the segment input and output
   - seg_input: this variable is a dictionary containing the input of the segment
   - seg_output: this variable has to be set to a list containing the output of the segment

2. Functionalities to use the automated hierarchical data storage system
   - get_data_fn(basename): complete the filename with the path to the working directory
   - glob_seg(regexp, seg): return the list of filenames matching regexp from segment seg
   - get_tmp_fn(): return a temporary filename

3. Functionalities to use the automated parameter handling
   - lst_par: list of parameter names of the segment
   - lst_tag: list of parameter names which will be made visible from the web interface
   - load_param(seg, lst_par)

4. Various convenient functionalities
   - save_products(filename, lst_par='*'): use pickle to save a part of
     a given namespace
   - load_products(filename, lst_par): update the namespace by
     unpickling the requested objects from the file
   - logged_subprocess(lst_args): execute a subprocess and log its output
   - logger: a standard logging.Logger object that can be used to log the processing

5. Hooking support
   Pipelet enables you to write reusable generic segments by providing
   a hooking system via the hook function.
   - hook(hookname, globals()): execute the Python script
     'seg_segname_hookname.py' and update the namespace

A sketch of a segment script using these facilities is given below.
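The following sketch is illustrative only: the file names are made up,
the exact content of seg_input depends on the multiplex directive, and
the save_products signature is taken from the description above, so
treat it as a hedged example rather than a working segment.

   # hypothetical segment script
   logger.info("segment starting, input: %s" % str(seg_input))

   # build an output file name inside the segment working directory
   fn = get_data_fn('result.txt')
   with open(fn, 'w') as f:
       f.write(str(seg_input))

   # pickle part of the namespace for later inspection
   save_products(get_data_fn('products.pkl'), lst_par='*')

   # the output set of this segment
   seg_output = [fn]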








** Running Pipes
   
*** The interactive mode
This mode has been designed to ease debugging. If P is an instance of
the pipeline object, the syntax reads:

from pipelet.launchers import launch_interactive
w, t = launch_interactive(P)
w.run()

In this mode, each task will be computed sequentially.
Do not hesitate to invoke the Python debugger from IPython: %pdb

*** The process mode
In this mode, one can run several tasks simultaneously (if the pipe
scheme allows it).
The number of subprocesses is set by the N parameter:

from pipelet.launchers import launch_process
launch_process(P, N)


*** The batch mode
In this mode, one can submit batch jobs to execute the tasks.
The number of jobs is set by the N parameter:

from pipelet.launchers import launch_pbs
launch_pbs(P, N , address=(os.environ['HOST'],50000))

** Browsing Pipes
   
*** The pipelet webserver and ACL

The pipelet webserver allows the browsing of multiple pipelines.
Each pipeline has to be registered using:

pipeweb track <shortname> sqlfile

As pipeline browsing implies parsing the disk, some basic access
control also has to be set up. All users have to be registered with a
specific access level (1 for read-only access, 2 for write access):

pipeutils -a <username> -l 2 sqlfile

Start the web server using:

pipeweb start

The web application will then be available at http://localhost:8080

*** The web application

In order to ease the comparison of different processings, the web
interface displays various views of the pipeline data:

**** The index page 

The index page displays a tree view of all pipeline instances. Each
segment may be expanded or collapsed via the +/- buttons.

The parameters used in each segment are summarized and displayed with
the date of execution and the number of related tasks, ordered by
status.

A checkbox allows operations to be performed on multiple segments:
  - deletion: to clean unwanted data
  - tag: to tag remarkable data

The filter panel allows the segment instances to be displayed with
respect to two criteria:
  - tag
  - date of execution

**** The code page

Each segment name is a link to its code page. From this page the user
can view the code of all Python scripts which have been applied to the data.

The tree view is reduced to the current segment and its related
parents.

The root path corresponding to the data storage is also displayed.


**** The product page

The number of related tasks, ordered by status, is a link to the
product pages, where the data can be directly displayed (for images or
text files) or downloaded.
From this page it is also possible to delete a specific product and
its dependencies. 


**** The log page

The log page can be accessed via the log button of the filter panel.
Logs are ordered by date. 




* Advanced usage

** Database reconstruction

In case of an unfortunate loss of the pipeline SQL database, it is
possible to reconstruct it from the disk:

import pipelet
pipelet.utils.rebuild_db_from_disk (prefix, sqlfile)

All information will be retrieved, but with new identifiers.

** The hooking system
** Writing custom environments

The pipelet software provides a set of default utilities available
from the segment environment. It is possible to extend this default
environment or even to write a completely new one.

*** Extending the default environment

The different environment utilities are actually methods of the class
Environment. It is possible to add new functionalities by using the
Python inheritance mechanism:

File: myenvironment.py

   from pipelet.environment import *

   class MyEnvironment(Environment):
       def my_function(self):
           """ My function does nothing.
           """
           return


The pipelet engine objects (segments, tasks, pipeline) are available
from the worker attribute self._worker. See the section "The pipelet
actors" for more details about the pipelet machinery.


*** Writing a new environment

In order to start with a completely new environment, extend the base
environment: 

File: myenvironment.py

   from pipelet.environment import *

   class MyEnvironment(EnvironmentBase):
       def my_get_data_fn(self, x):
           """ New name for get_data_fn.
           """
           return self._get_data_fn(x)

       def _close(self, glo):
           """ Post-processing code.
           """
           return glo['seg_output']

From the base environment, the basic functionalities for getting file
names and executing hook scripts are still available through:
- self._get_data_fn
- self._hook

The segment input argument is also stored in self._seg_input.
The segment output argument has to be returned by the _close(self, glo)
method.

The pipelet engine objects (segments, tasks, pipeline) are available
from the worker attribute self._worker. See the section "The pipelet
actors" for more details about the pipelet machinery.


*** Loading another environment

To load another environment, set the pipeline environment attribute
accordingly:

Pipeline(pipedot, codedir=, prefix=, env=MyEnvironment)


** Using custom dependency schemes

** Launching pipeweb behind apache

Pipeweb uses the CherryPy web framework server and can be run behind
an Apache webserver, which brings essentially two advantages:
- https support,
- faster serving of static files.


* The pipelet actors

** The Repository object
** The Pipeline object
** The Task object
** The Scheduler object
** The Tracker object
** The Worker object