Pipelet is a free framework allowing for the creation, manipulation,
execution and browsing of scientific data processing pipelines. It
provides:
+ easy chaining of interdependent elementary tasks,
+ web access to data products,
+ branch handling,
+ automated distribution of computational tasks.
Both the engine and the web interface are written in Python.
WARNING: Pipelet is currently under active development and highly
unstable. There is a good chance that it will become incompatible from
one commit to the next.
* Tutorial
** Introduction
*** Why use pipelines
The pipeline mechanism allows you to apply a sequence of processing
steps to your data, in such a way that the input of each process is
the output of the previous one. Making these processing steps
explicit, in the right order, is essential in data analysis: it keeps
track of what you did and ensures that the whole processing remains
consistent.
*** How it works
Pipelet relies on saving to disk every intermediate input or output of
your pipeline. This is not a strong constraint, and it offers a lot of
benefits: you can stop the processing whenever you want and resume it
later without recomputing everything, simply taking the last products
available on disk and continuing the processing from where it
stopped. This logic pays off whenever the computation cost exceeds the
cost of the disk space required by the intermediate products.
*** The Pipelet functionalities
Pipelet provides an automatic tool which helps you:
+ write and manipulate pipelines with any dependency scheme,
+ dispatch the computational tasks on parallel architectures,
+ keep track of what processing has been applied to your data and perform comparisons.
** Getting started
*** Obtaining Pipelet
There is no published stable release of Pipelet right now. Get the
development version from the git repository:
git clone git://gitorious.org/pipelet/pipelet.git
*** Installing Pipelet
Pipelet requires Python >= 2.6. The web interface requires the
cherrypy3 Python module (on Debian: aptitude install
python-cherrypy3). Install with:
sudo python setup.py install
*** Running the test pipeline
1. Run the test pipeline
cd test
python main.py
2. Add this pipeline to the web interface
pipeweb track test ./.sqlstatus
3. Set the access control and launch the web server
pipeutils -a username -l 2 .sqlstatus
pipeweb start
4. You should now be able to browse the results at
http://localhost:8080
** Writing Pipes
*** Pipeline architecture
The definition of a data processing pipeline consists of:
+ a succession of python scripts, called segments, each coding one
  step of the actual processing,
+ a main script that defines the dependency scheme between segments
  and launches the processing.
The dependencies between segments must form a directed acyclic
graph. This graph is described by a character string using a subset of
the graphviz dot language (http://www.graphviz.org). For example the
string:
"""
a -> b -> d;
c -> d;
c -> e;
"""
defines a pipeline with 5 segments {"a", "b", "c", "d", "e"}. The
relation "a->b" ensures that the processing of segment "a" will be
done before the processing of its child segment "b", and the output of
"a" will be fed as input to "b". In the given example, the node "d"
has two parents, "b" and "c"; both will be executed before "d". As
there is no relation between "b" and "c", which of the two is executed
first is not defined.
The modification of the code of one segment will trigger its
recalculation and the recalculation of all the segments which
depend on it.
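To make this concrete, here is a minimal sketch of a main script tying
such a dependency string to the engine. It is modelled on the test
pipeline shipped in the test directory; the Pipeline and
launch_interactive names and their exact arguments are assumptions
about the current API and may differ:

from pipelet.pipeline import Pipeline
from pipelet.launchers import launch_interactive

pipedot = """
a -> b -> d;
c -> d;
c -> e;
"""

# code_dir: where the segment scripts live; prefix: where data
# products are written (both argument names are assumptions).
P = Pipeline(pipedot, code_dir='./', prefix='./')

# Run the tasks interactively in the current process.
w, t = launch_interactive(P)
w.run()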
The output of a segment is a list of python objects. If a segment has
no particular output the list can be empty. A child segment receives
its input from a set built from its parents' output sets, and one
instance of the child segment is executed for each element in the
set. The default behaviour of the engine is to form the Cartesian
product of the output sets of its parents. This means that if the
output of segment "b" is the list of strings ['Lancelot', 'Galahad']
and the output of "c" is the list of strings ['the Brave', 'the
Pure'], four instances of segment "d" will be run. Their inputs will
be the four 2-tuples ('Lancelot', 'the Brave'), ('Lancelot', 'the
Pure'), ('Galahad', 'the Brave'), ('Galahad', 'the Pure'). At the end
of the execution of all the instances of a segment, their output sets
are concatenated. If the action of segment "d" is to concatenate the
two strings it receives separated by a space, the final output set of
segment "d" will be ['Lancelot the Brave', 'Lancelot the Pure',
'Galahad the Brave', 'Galahad the Pure'].
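This default combination can be reproduced in plain Python; the
following self-contained snippet is only an illustration of what the
engine computes, not Pipelet code:

from itertools import product

output_b = ['Lancelot', 'Galahad']
output_c = ['the Brave', 'the Pure']

# One instance of segment "d" per element of the Cartesian product,
# each concatenating its two input strings with a space.
final_d = [' '.join(pair) for pair in product(output_b, output_c)]
print(final_d)
# ['Lancelot the Brave', 'Lancelot the Pure',
#  'Galahad the Brave', 'Galahad the Pure']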
*** Multiplex directive
This default behaviour can be altered by specifying a @multiplex
directive in the commentary of the segment code.
If a segment code has to be applied to several data, the pipe engine
creates as many subtasks as there are elements in the dataset. This
behaviour is specified by setting a list as the output variable of the
upstream segment: there will then be one task per element of the list,
and each task will receive one list element as its input.
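For instance, an upstream segment script could end with the following
line (the filenames are purely illustrative):

# Publishing a list as the segment output makes the engine create one
# downstream task per element; each task receives one element as input.
output = ['patch_1.dat', 'patch_2.dat', 'patch_3.dat']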
*** Depend directive
If a segment code needs several outputs to run, the output variable of
the upstream segments has to be set to None.
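A sketch of the corresponding upstream segment ending:

# Setting output to None prevents the engine from creating one
# downstream task per list element (see the multiplex behaviour above).
output = None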
*** Default segment environment
Some useful functionalities are available from the segment script
environment.
Filename tools:
+ fullname = get_data_fn(shortname): complete the filename with the
  path to the working directory.
+ fullname = get_tmp_fn(): return a temporary filename.
+ lst_file = glob_seg(regexp, seg): return the list of filenames
  matching regexp from segment seg.
Parameter tools:
+ input: the output value of the upstream segment.
+ output: the input value of the downstream segment.
+ load_param(seg, globals(), lst_par): update the namespace with
  parameters of segment seg.
+ save_products(filename, globals(), lst_par): use pickle to save a
  part of a given namespace.
+ load_products(filename, globals(), lst_par): update the namespace by
  unpickling the requested objects from the file.
Code dependency tools:
+ logged_subprocess(lst_args): execute a subprocess and log its output.
+ hook(hookname, globals()): execute the Python script
  'seg_segname_hookname.py' and update the namespace.
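Putting these helpers together, a segment script body could look like
the sketch below; the helpers' exact semantics are only those
summarized above, so treat the details as assumptions:

# Sketch of a segment script using the default environment.
# 'input' holds the output value of the upstream segment (here a string).
knight = input

# Build a full path inside this segment's working directory.
fullname = get_data_fn(knight + '.txt')

f = open(fullname, 'w')
f.write('processed ' + knight)
f.close()

# 'output' becomes the input of the downstream segment.
output = [knight]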
*** Loading another environment