Pipelet is a free framework for the creation, manipulation, execution
and browsing of scientific data processing pipelines. It provides:

+ easy chaining of interdependent elementary tasks,
+ web access to data products,
+ branch handling,
+ automated distribution of computational tasks.

WARNING: Pipelet is currently under active development and highly
unstable. There is a good chance that it becomes incompatible from one
commit to the next.

Both the engine and the web interface are written in Python.

* Tutorial
** Introduction
*** Why use pipelines

The pipeline mechanism allows you to apply a sequence of processing
steps to your data, in such a way that the input of each step is the
output of the previous one. Making these different processing steps
visible, in the right order, is essential in data analysis: it keeps
track of what you did and makes sure that the whole processing remains
consistent.

*** How it works

Pipelet relies on saving to disk every intermediate input or output of
your pipeline. This is not a strong constraint and offers a lot of
benefits: you can stop the processing whenever you want and start it
again without recomputing the whole thing, by taking the last products
available on disk and continuing the processing where it stopped. This
logic pays off when the computational cost is higher than the cost of
the disk space required by the intermediate products.

*** The Pipelet functionalities

Pipelet provides an automatic tool which helps you:
+ write and manipulate pipelines with any dependency scheme,
+ dispatch the computational tasks on parallel architectures,
+ keep track of what processing has been applied to your data and
  perform comparisons.

** Getting started

*** Obtaining Pipelet

There is no published stable release of Pipelet right now. Get the
development version with:

git clone git://gitorious.org/pipelet/pipelet.git

*** Installing Pipelet
The web interface of Pipelet requires the cherrypy3 Python module (on
Debian: aptitude install python-cherrypy3). Pipelet itself requires
Python >= 2.6.

sudo python setup.py install

*** Running the test pipeline

1. Run the test pipeline

cd test
python main.py

2. Add this pipeline to the web interface

pipeweb track test ./.sqlstatus

3. Set the access control and launch the web server

pipeutils -a username -l 2 .sqlstatus
pipeweb start

4. You should be able to browse the result on the web page
   http://localhost:8080

** Writing Pipes

*** Pipeline architecture

The definition of a data processing pipeline consists of:
+ a succession of python scripts, called segments, each coding one
  step of the actual processing;
+ a main script that defines the dependency scheme between segments
  and launches the processing.

The dependencies between segments must form a directed acyclic
graph. This graph is described by a string using a subset of the
graphviz dot language (http://www.graphviz.org). For example, the string:

"""
a -> b -> d;
c -> d;
c -> e;
"""

defines a pipeline with 5 segments {"a", "b", "c", "d", "e"}. The
relation "a->b" ensures that the processing of segment "a" will be
done before the processing of its child segment "b", and that the
output of "a" will be fed as input to "b". In the given example, the
node "d" has two parents, "b" and "c"; both will be executed before
"d". As there is no relation between "b" and "c", which of the two
will be executed first is not defined.
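For illustration, a minimal main script for this pipeline could look
like the sketch below. The Pipeline constructor arguments and the
launch_interactive launcher mirror the test pipeline shipped with the
source, but the exact signatures may change between commits, so
test/main.py remains the authoritative reference:

from pipelet.pipeline import Pipeline
from pipelet.launchers import launch_interactive

# Dependency scheme, using the dot-language subset described above.
pipedot = """
a -> b -> d;
c -> d;
c -> e;
"""

# code_dir: directory holding the segment scripts (a.py, b.py, ...)
# prefix: directory receiving the data products and the status database
P = Pipeline(pipedot, code_dir='./', prefix='./')

# Execute all the tasks sequentially in the current process.
w, t = launch_interactive(P)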

The modification of the code of one segment will trigger its
recalculation and the recalculation of all the segments which
depend on it.

The output of a segment is a list of python objects. If a segment has
no particular output, the list can be empty. A child segment receives
its input from a set built from its parents' output sets, and an
instance of the child segment is executed for each element in the
set. The default behaviour of the engine is to form the Cartesian
product of the output sets of its parents. This means that if the
output of the "b" segment is the list of strings ["Lancelot",
"Galahad"] and the output of "c" is the list of strings ['the Brave',
'the Pure'], four instances of segment "d" will be run. Their inputs
will be respectively the four 2-tuples ('Lancelot', 'the Brave'),
('Lancelot', 'the Pure'), ('Galahad', 'the Brave') and ('Galahad',
'the Pure'). At the end of the execution of all the instances of a
segment, their output sets are concatenated. If the action of segment
"d" is to concatenate the two strings it receives, separated by a
space, the final output set of segment "d" will be: ['Lancelot the
Brave', 'Lancelot the Pure', 'Galahad the Brave', 'Galahad the Pure'].
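The Cartesian product logic can be pictured with plain Python. The
snippet below only mimics what the engine does with the "b", "c" and
"d" outputs of the example; it does not use Pipelet itself:

from itertools import product

b_output = ['Lancelot', 'Galahad']
c_output = ['the Brave', 'the Pure']

# One instance of segment "d" is run per element of the Cartesian product.
d_inputs = list(product(b_output, c_output))

# Each instance joins its two strings; the engine then concatenates the
# per-instance output sets into the final output set of "d".
d_output = [' '.join(pair) for pair in d_inputs]
print(d_output)
# ['Lancelot the Brave', 'Lancelot the Pure',
#  'Galahad the Brave', 'Galahad the Pure']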

*** Multiplex directive

This default behaviour can be altered by specifying a @multiplex
directive in a comment of the segment code.

If a segment code has to be applied to several pieces of data, the
pipe engine creates as many subtasks as there are elements in the
dataset. This behaviour is obtained by setting a list in the output
variable of the upstream segment: there will then be one task per
element of the list, and each task will receive one list element as
its input.
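For instance, using the input and output variables that the engine
injects into the segment namespace (described below; the file names
are purely illustrative):

# Upstream segment: setting a list in "output" makes the engine create
# one downstream task per list element.
output = ['data_1.fits', 'data_2.fits', 'data_3.fits']

# Downstream segment: each task receives a single element of the list
# in its "input" variable, e.g. 'data_1.fits'.
filename = input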

*** Depend directive

If a segment code needs several outputs to run, the output variable of
the upstream segments has to be set to None.

*** Default segment environment

Some useful functionalities are available from the segment script
environment.

Filename tools:

+ fullname = get_data_fn(shortname): complete the filename with the
  path to the working directory.
+ fullname = get_tmp_fn(): return a temporary filename.
+ lst_file = glob_seg(regexp, seg): return the list of filenames
  matching regexp from segment seg.

Parameter tools:

+ input: the output value from the upstream segment.
+ output: the input value of the downstream segment.
+ load_param(seg, globals(), lst_par): update the namespace with
  parameters of segment seg.
+ save_products(filename, globals(), lst_par): use pickle to save a
  part of a given namespace.
+ load_products(filename, globals(), lst_par): update the namespace by
  unpickling the requested objects from the file.

Code dependency tools:

+ logged_subprocess(lst_args): execute a subprocess and log its output.
+ hook(hookname, globals()): execute the Python script
  'seg_segname_hookname.py' and update the namespace.
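Put together, a segment script might use these tools as in the sketch
below. The signatures are the ones listed above; the segment name "a",
the parameter names and the file name are illustrative assumptions,
and nothing is imported because the engine injects these functions
into the segment namespace:

# Import the parameters 'nside' and 'seed' defined by the parent
# segment 'a' (hypothetical names).
load_param('a', globals(), ['nside', 'seed'])

# Build a full filename inside the pipeline working directory.
fullname = get_data_fn('result.dat')

# Run an external command and keep its output in the task log.
logged_subprocess(['echo', 'producing ' + fullname])

# Pickle the 'result' object so that a child segment can reload it with
# load_products(fullname, globals(), ['result']).
result = [nside, seed]
save_products(fullname, globals(), ['result'])

output = [fullname]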
*** Loading another environment