WARNING Pipelet is currently under active development and highly
unstable. There is a good chance that it will become incompatible from
one commit to the next.
Pipelet is a free framework allowing for the creation, execution and
browsing of scientific data processing pipelines. It provides:
+ easy chaining of interdependent elementary tasks,
+ web access to data products,
Both the engine and the web interface are written in Python.
** Introduction
*** Why use pipelines
The pipeline mechanism allows you to apply a sequence of processing
steps to your data, in a way that the input of each process is the
output of the previous one. Making these different processing steps
visible, in the right order, is essential in data analysis to keep
track of what has been done and to keep the whole processing
consistent.
*** How it works
Pipelet is based on the possibility of saving on disk every intermediate
input or output of your pipeline, which is usually not a strong
constraint but offers a lot of benefits. It means that you can stop
the processing whenever you want, and begin it again without
recomputing the whole thing: you just take the last products you have
on disk, and continue the processing where it stopped. This logic is
interesting when the computation cost is higher than the cost of disk
space required by intermediate products.
*** The Pipelet functionalities
Pipelet is a free framework which helps you:
+ to write and manipulate pipelines with any dependency scheme,
+ to dispatch the computational tasks on parallel architectures,
+ to keep track of what processing has been applied to your data and perform comparisons.
** Getting started
*** Pipelet installation
**** Dependencies
+ Running the Pipelet engine requires Python >= 2.6.
+ The web interface of Pipelet requires the cherrypy3 Python module
  (on Debian: aptitude install python-cherrypy3).
You may find it useful to install some generic scientific tools that interact nicely with Pipelet:
+ numpy
+ matplotlib
+ latex
**** Getting Pipelet
There is no published stable release of Pipelet right now.
git clone git://gitorious.org/pipelet/pipelet.git
**** Installing Pipelet
sudo python setup.py install
To start the web interface:
pipeweb start
*** Getting a new pipe framework
To get a new pipeline framework, with example main and segment scripts:
pipeutils -c pipename
This command ends up with the creation of a directory named pipename which contains:
+ a main script (named main.py) providing the functionality to execute
  your pipeline in various modes (debug, parallel, batch, ...)
+ an example segment script (seg_default_code.py) which illustrates
  the pipelet utilities with comments.
The next section describes these two files in more detail.
** Writing Pipes
*** Pipeline architecture
The definition of a data processing pipeline consists of:
The output of "a" will be fed as input to "b". In the given example,
there is no relation between "b" and "c", so which of the two will be
executed first is not defined.
When executing the segment "seg", the engine looks for a python script
named seg.py. If not found, it looks iteratively for script files
named "se.py" and "s.py". This way, different segments of the pipeline
can share the same code, if they are given names with a common root
(this mechanism is useful to write generic segments, and is completed
by the hooking system described in the advanced usage section). The
code is then executed in a specific namespace (see below The execution
environment).
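To illustrate, the lookup can be sketched in a few lines of python (an
illustrative reimplementation for clarity, not the engine's actual
code; the function name is ours):

import os

def find_segment_script(codedir, seg):
    # Try "seg.py", then shorten the name one character at a time
    # ("se.py", "s.py", ...) until an existing script is found.
    name = seg
    while name:
        path = os.path.join(codedir, name + ".py")
        if os.path.exists(path):
            return path
        name = name[:-1]
    raise IOError("no script found for segment '%s'" % seg)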
*** The Pipeline object
Practically, the creation of a Pipeline object needs 3 arguments:
P = Pipeline(pipedot, codedir=, prefix=)
- pipedot is the string description of the pipeline
- codedir is the path where the segment scripts can be found
- prefix is the path to the data repository (see below Hierarchical data storage)
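For example, with the knights/quality pipeline used below, and assuming
the Pipeline class is importable from pipelet.pipeline (the paths are
placeholders), the construction could read:

from pipelet.pipeline import Pipeline

# Pipeline description in the dot syntax shown below.
pipedot = """
knights -> melt;
quality -> melt;
"""
P = Pipeline(pipedot, codedir="./segments", prefix="/data/pipelet_prod")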
*** Dependencies between segments
Changing the code of a segment will trigger its recalculation and the
recalculation of all the segments which depend on it.
The output of a segment is a list of python objects. If a segment has
no particular output this list can be empty and does not need to be
specified. Elements of the list are allowed to be any kind of
pickleable python objects. However, a good practice is to fill the
list with the minimal set of characteristics relevant to describe the
output of the segment, and to defer the storage of the data to
appropriate structures and file formats. For example, a segment which
performs computation on large images could in principle pass the
results of its computation to the following segment using the output
list. It is a better practice to store the resulting image in a
dedicated file and to pass in the list only the information allowing
an unambiguous identification of this file (like its name or part of
it) for the following segments.
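A minimal, pipelet-agnostic sketch of this practice (the function and
file names are illustrative only):

import os
import numpy as np

def compute_and_store(workdir):
    # Compute a large image, store it in a dedicated file, and return
    # only the small piece of information needed to identify it.
    img = np.random.rand(1024, 1024)          # stand-in for a large result
    fn = os.path.join(workdir, "result.npy")  # dedicated file for the data
    np.save(fn, img)
    return [fn]   # the output list: a small, pickleable identifier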
The input of a child segment is taken from a set built from the
output lists of its parents. The content of the input set is
tunable using the multiplex directive (see below); however, the
simplest and default behaviour of the engine is to form the Cartesian
product of the output lists of its parents.
To illustrate this behaviour, let us consider the following pipeline,
built from three segments:
knights -> melt;
quality -> melt;
and assume that the respective output lists of segments knights and
quality are:
["Lancelot", "Galahad"]
and:
['the Brave', 'the Pure']
The Cartesian product of these two lists is:
[('Lancelot','the Brave'), ('Lancelot','the Pure'), ('Galahad','the Brave'), ('Galahad','the Pure')]
Four instances of segment "melt" will thus be run, each one receiving
as input one of the four 2-tuples.
At the end of the execution of all the instances of a segment, their
output lists are concatenated. If the action of segment "melt" is to
join the two strings it receives with a space, the final output list
of segment "melt" will be:
['Lancelot the Brave', 'Lancelot the Pure', 'Galahad the Brave', 'Galahad the Pure']
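This default behaviour matches the itertools.product function of the
python standard library:

from itertools import product

knights = ["Lancelot", "Galahad"]
quality = ["the Brave", "the Pure"]

# One instance of "melt" per element of the Cartesian product.
inputs = list(product(knights, quality))
# [('Lancelot', 'the Brave'), ('Lancelot', 'the Pure'),
#  ('Galahad', 'the Brave'), ('Galahad', 'the Pure')]

# Each instance joins its two strings; the concatenated output list is:
outputs = [" ".join(t) for t in inputs]
# ['Lancelot the Brave', 'Lancelot the Pure',
#  'Galahad the Brave', 'Galahad the Pure']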
*** Multiplex directive
This default behavior can be altered by specifying a #multiplex
directive in a comment of the segment code. If several multiplex
directives are present in the segment code, the last one is retained.
- #multiplex cross_prod : activate the default behaviour
- #multiplex zip : similar to the zip python command. The input set is
  a list of tuples, where each tuple contains the i-th element from
  each of the parents' sorted output lists. If the lists have different
  sizes, the shortest is used.
- #multiplex union : The input set contains all the outputs.
- #multiplex : activate the default behaviour (cross_prod)
- #multiplex cross_prod group by 0 : The input set contains one tuple of all the outputs.
The "group by" variants gather outputs that are identical. To make
use of group, elements of the output set have to be hashable.
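For example, switching the "melt" segment to zip multiplexing only
requires a comment line in its code. The following sketch shows the
directive and the input set it would produce on the knights/quality
example:

# At the top of the segment script (e.g. seg_melt.py):
#multiplex zip

# Resulting input set (the parents' output lists are sorted first):
knights = sorted(["Lancelot", "Galahad"])    # ['Galahad', 'Lancelot']
quality = sorted(["the Brave", "the Pure"])  # ['the Brave', 'the Pure']
inputs = list(zip(knights, quality))
# [('Galahad', 'the Brave'), ('Lancelot', 'the Pure')]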
*** Orphan segments
By default, orphan segments have no input argument (an empty list),
but they can be fed an input through the pipeline's push method. If P
is an instance of the pipeline object:
P.push(segname=seg_input)
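For instance, feeding the orphan segment "knights" of the example
pipeline could read:

# The keyword is the segment name, the value its input list.
P.push(knights=["Lancelot", "Galahad"])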
*** Hierarchical data storage
This system provides versioning of your data and easy access through
the web interface.
*** The execution environment
The segment code is executed in a specific environment that provides a
number of dedicated utilities.
** Running Pipes
*** The interactive mode
This mode has been designed to ease debugging. If P is an instance of
the pipeline object, the syntax reads:
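A sketch, by analogy with the launch_process and launch_pbs launchers
shown below (the launch_interactive name and its return values are
assumptions, not confirmed here):

# Assumed by analogy with launch_process/launch_pbs below.
from pipelet.launchers import launch_interactive
w, t = launch_interactive(P)
w.run()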
*** The process mode
The number of subprocesses is set by the N parameter:
from pipelet.launchers import launch_process
launch_process(P, N)
*** The batch mode
In this mode, one can submit batch jobs to execute the tasks.
The number of jobs is set by the N parameter:
import os
from pipelet.launchers import launch_pbs
launch_pbs(P, N, address=(os.environ['HOST'], 50000))
** Browsing Pipes
*** The pipelet webserver and ACL
The pipelet webserver allows the browsing of multiple pipelines.
Logs are ordered by date.
* Advanced usage
** Database reconstruction
In case of an unfortunate loss of the pipeline sql database, it is
possible to reconstruct it from the data stored on disk.
Pipeline(pipedot, codedir=, prefix=, env=MyEnvironment)
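A hypothetical sketch of such a custom environment (the
pipelet.environment module, the Environment base class, and the method
shown are assumptions for illustration, not the documented API):

# Hypothetical sketch: module, base class and method names are
# assumptions, not the documented API.
from pipelet.environment import Environment

class MyEnvironment(Environment):
    def my_utility(self):
        # extra helper made available to the segment code
        return "custom behaviour"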
** Using custom dependency schemes
** Launching pipeweb behind apache