Pipelet is a free framework allowing for the creation, manipulation,
execution and browsing of scientific data processing pipelines. It
provides:

+ easy chaining of interdependent elementary tasks,
+ web access to data products,
+ branch handling,
+ automated distribution of computational tasks.

WARNING: Pipelet is currently under active development and highly
unstable. There is a good chance that it will become incompatible from
one commit to another.

Both the engine and the web interface are written in Python.

* Tutorial
** Introduction
*** Why use pipelines

The pipeline mechanism allows you to apply a sequence of processing
steps to your data, in such a way that the input of each process is the
output of the previous one. Making these different processing steps
visible, in the right order, is essential in data analysis to keep
track of what you did and to make sure that the whole processing
remains consistent.

*** How it works

Pipelet is based on the possibility to save on disk every intermediate
input or output of your pipeline, which is not a strong constraint but
offers a lot of benefits. It means that you can stop the processing
whenever you want and start it again without recomputing the whole
thing: you just take the last products you have on disk and continue
the processing where it stopped. This logic is interesting when the
cost of the computation is higher than the cost of the disk space
required by intermediate products.

*** The Pipelet functionalities

Pipelet implements an automatic tool which helps you:
+ to write and manipulate pipelines with any dependency scheme,
+ to dispatch the computational tasks on parallel architectures,
+ to keep track of what processing has been applied to your data and perform comparisons.

** Getting started

*** Obtaining Pipelet

There is no published stable release of Pipelet right now.

git clone git://gitorious.org/pipelet/pipelet.git

*** Installing Pipelet
The web interface of Pipelet requires the cherrypy3 Python module (on
Debian: aptitude install python-cherrypy3). Pipelet requires Python >=
2.6.

sudo python setup.py install

*** Running the test pipeline

1. Run the test pipeline

cd test
python main.py

2. Add this pipeline to the web interface

pipeweb track test ./.sqlstatus

3. Set the access control and launch the web server

pipeutils -a username -l 2 .sqlstatus
pipeweb start

4. You should now be able to browse the results on the web page
   http://localhost:8080

** Writing Pipes

*** Pipeline architecture

The definition of a data processing pipeline consists of:
+ a succession of python scripts, called segments, coding each step
  of the actual processing,
+ a main script that defines the dependency scheme between segments
  and launches the processing.

The dependencies between segments must form a directed acyclic
graph. This graph is described by a character string using a subset of
the graphviz dot language (http://www.graphviz.org). For example, the string:

"""
a -> b -> d;
c -> d;
c -> e;
"""

defines a pipeline with 5 segments {"a", "b", "c", "d", "e"}. The
relation "a->b" ensures that the processing of segment "a" will be
done before the processing of its child segment "b". Also the output
of "a" will be fed as input to "b". In the given example, the node
"d" has two parents, "b" and "c". Both will be executed before "d". As
there is no relation between "b" and "c", which of the two will be
executed first is not defined.

*** The Pipeline object

Practically, the creation of a Pipeline object needs 3 arguments:

Pipeline(pipedot, codedir=, prefix=)

- pipedot is the string description of the pipeline,
- codedir is the path to the code of the segments,
- prefix  is the path to the data repository.
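
For instance, a minimal main script could look like the following
sketch. It assumes that the Pipeline class is importable from
pipelet.pipeline and that the segment scripts live in a ./segments
directory; both are assumptions made for illustration, not taken from
this document.

from pipelet.pipeline import Pipeline

# Dependency scheme of the example above, in the dot subset
# understood by Pipelet.
pipedot = """
a -> b -> d;
c -> d;
c -> e;
"""

# codedir holds the segment scripts, prefix is the root of the data
# repository used by the hierarchical storage (illustrative paths).
P = Pipeline(pipedot, codedir='./segments', prefix='/data/pipelet_test')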

*** Dependencies between segments

The modification of the code of one segment will trigger its
recalculation and the recalculation of all the segments which
depend on it.

The output of a segment is a list of python objects. If a segment has
no particular output, the list can be empty. A child segment receives
its input from a set built from the output sets of its parents. An
instance of the child segment is executed for each element in the set.
The default behaviour of the engine is to form the Cartesian product
of the output sets of its parents. This means that if the output of
the "b" segment is the list of strings ["Lancelot", "Galahad"] and the
output of "c" is the list of strings ['the Brave', 'the Pure'], four
instances of segment "d" will be run. Their inputs will be,
respectively, the four 2-tuples: ('Lancelot', 'the Brave'),
('Lancelot', 'the Pure'), ('Galahad', 'the Brave'), ('Galahad', 'the
Pure'). At the end of the execution of all the instances of the
segment, their output sets are concatenated. If the action of segment
"d" is to concatenate the two strings it receives separated by a
space, the final output set of segment "d" will be: ['Lancelot the
Brave', 'Lancelot the Pure', 'Galahad the Brave', 'Galahad the Pure'].
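
As a sketch (not taken from the Pipelet distribution), the code of
segment "d" in this example could be as short as the following, using
the seg_input and seg_output variables described in the segment
environment section below; the exact layout of seg_input assumed here
(a dictionary keyed by parent segment name) is illustrative.

# Sketch of the "d" segment: concatenate the two strings received
# from the parents "b" and "c". seg_input is assumed here to expose
# the parent outputs keyed by parent segment name.
name = seg_input['b']       # e.g. 'Lancelot'
epithet = seg_input['c']    # e.g. 'the Brave'

# The outputs of all instances of "d" are concatenated by the engine,
# yielding the final output set of the segment.
seg_output = [name + ' ' + epithet]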

*** Multiplex directive

This default behaviour can be altered by specifying an @multiplex
directive in the commentary of the segment code. If several multiplex
directives are found, the last one is retained.

- @multiplex cross_prod : activate the default behaviour.

- @multiplex zip : similar to the python zip function. The input set is
    a list of tuples, where each tuple contains the i-th element from
    each of the sorted parent output lists. If the lists have different
    sizes, the shortest one is used.

- @multiplex union : the input set contains all the outputs.

- @multiplex gather : the input set contains one tuple of all the outputs.
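
For instance, continuing the knights example above, placing the
following directive in the commentary of segment "d" would pair the
parent outputs index by index instead of forming the Cartesian
product (a sketch: only the @multiplex line matters, the surrounding
comments are illustrative).

# seg "d": pair the i-th outputs of "b" and "c" instead of crossing them.
# @multiplex zip
#
# With the outputs of the example above, the instances of "d" would
# then receive only ('Lancelot', 'the Brave') and ('Galahad', 'the Pure').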


*** Orphan segments

TODO TBD

*** Hierarchical data storage

This system provides versioning of your data and easy access through
the web interface. It is also used to keep track of the code, of the
execution logs, and of various meta-data of the processing. Of course,
you remain able to bypass the hierarchical storage and store your
actual data elsewhere, but you will lose the benefit of automated
versioning, which proves to be quite convenient.

The storage is organized as follows: all data are stored below a root
directory given by the prefix argument of the Pipeline object.

*** The segment environment

The segment code is executed in a specific environment that provides:

1. access to the segment input and output
   - seg_input: this variable is a dictionary containing the input of the segment
   - get_input(): return the input of the segment
   - seg_output: this variable has to be set to the list of outputs of the segment

2. functionalities to use the automated hierarchical data storage system
   - get_data_fn(basename): complete the filename with the path to the working directory
   - glob_seg(regexp, seg): return the list of filenames matching regexp from segment seg
   - get_tmp_fn(): return a temporary filename

3. various convenient functionalities
   - load_param(seg, var_names): update the namespace with parameters of segment seg
   - save_products(filename='', var_names='*'): use pickle to save a
     part of a given namespace
   - load_products(filename, var_names): update the namespace by
     unpickling requested objects from the file
   - logged_subprocess(lst_args): execute a subprocess and log its output
   - log: a standard logging.Logger object that can be used to log the processing

4. hooking support
   Pipelet enables you to write reusable generic segments by providing
   a hooking system via the hook function.
   hook(hookname, globals()): execute the Python script
   'seg_segname_hookname.py' and update the namespace.


fullname = get_tmp_fn(): return a temporary filename
lst_file = glob_seg(regexp, seg): return the list of filenames matching regexp from segment seg

Parameter tools

output: the value of seg_output becomes the input value of the downstream segments
load_param(seg, globals(), lst_par): update the namespace with parameters of segment seg
save_products(filename, globals(), lst_par): use pickle to save the requested objects of the namespace to filename
load_products(filename, globals(), lst_par): update the namespace by unpickling requested objects from the file

Code dependency tools

Loading another environment
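
As an illustration, a segment script could use these facilities as in
the following sketch. The computation and the file name are made up;
seg_input, seg_output, get_data_fn and log are the names provided by
the segment environment as listed above.

# Sketch of a segment script executed in the Pipelet segment environment.
log.info('received input: %s' % str(seg_input))

# Build a file name inside the storage directory of this segment.
result_fn = get_data_fn('result.txt')
with open(result_fn, 'w') as f:
    f.write('some result')

# Make the product file name available to the downstream segments.
seg_output = [result_fn]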

*** Depend directive

** Running Pipes

*** The interactive mode
This mode has been designed to ease debugging. If P is an instance of
the Pipeline object, the syntax reads:

from pipelet.launchers import launch_interactive
w, t = launch_interactive(P)
w.run()

In this mode, each task is computed sequentially.
Do not hesitate to invoke the Python debugger from IPython: %pdb
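
Combined with the Pipeline construction shown earlier, a complete
debugging session could thus look like the following sketch (the
codedir and prefix values are the illustrative ones used above).

from pipelet.pipeline import Pipeline
from pipelet.launchers import launch_interactive

pipedot = "a -> b -> d; c -> d; c -> e;"
P = Pipeline(pipedot, codedir='./segments', prefix='/data/pipelet_test')

# Run all tasks sequentially in the current process, which makes it
# easy to drop into the debugger when a segment fails.
w, t = launch_interactive(P)
w.run()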

*** The process mode

*** The batch mode

** Browsing Pipes

*** The pipelet webserver

*** The web application
- The various views (index, pipeline, segment, tasks)
*** ACL

* Advanced usage

** Database reconstruction

** The hooking system

** Writing custom environments

** Using custom dependency schemes

** Launching pipeweb behind apache

Pipeweb uses the CherryPy web framework server and can be run behind an
Apache web server, which brings essentially two advantages:
- https support,
- faster serving of static files.