of "a" will be fed as input for "b". In the given example, there is
no relation between nodes "b" and "c", so which of the two will be
executed first is not defined.
*** The Pipeline object
Practically, the creation of a Pipeline object needs 3 arguments (see
the sketch after this list):
Pipeline(pipedot, codedir=..., prefix=...)
- pipedot is the string description of the pipeline
- codedir is the path to the code of the segments
- prefix is the path to the data repository
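A minimal sketch, assuming the Pipeline class lives in
pipelet.pipeline and using the "a", "b", "c", "d" pipeline described
above (both paths are placeholders):

from pipelet.pipeline import Pipeline
pipedot = """
a -> b -> d;
a -> c -> d;
"""
P = Pipeline(pipedot, codedir='./segments', prefix='/data/mypipe')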
*** Dependencies between segments
The modification of the code of one segment will trigger its
recalculation and the recalculation of all the segments which
depend on it.
strings it receives separated by a space, the final output set of
segment "d" will be: [('Lancelot the Brave'), ('Lancelot the Pure'),
('Galahad the Brave'), ('Galahad the Pure')].
*** Multiplex directive
This default behaviour can be altered by specifying a @multiplex
directive in the commentary of the segment code. If several multiplex
directives are found, the last one is retained.
- @multiplex cross_prod : activate the default behaviour
- @multiplex zip : similar to the Python zip function. The input set is
a list of tuples, where each tuple contains the i-th element from
each of the parents' sorted output lists. If the lists have different
sizes, the shortest is used.
- @multiplex union : The input set contains all the outputs.
- @multiplex gather : The input set contains one tuple of all the outputs.
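To illustrate, here is a plain Python sketch (not pipelet code) of
the input sets the four directives would build from the outputs of
segments "b" and "c" in the example above:

import itertools
b_out = ['Lancelot', 'Galahad']    # sorted output of segment "b"
c_out = ['the Brave', 'the Pure']  # sorted output of segment "c"
cross_prod = list(itertools.product(b_out, c_out))
# [('Lancelot', 'the Brave'), ('Lancelot', 'the Pure'),
#  ('Galahad', 'the Brave'), ('Galahad', 'the Pure')]
zipped = list(zip(b_out, c_out))
# [('Lancelot', 'the Brave'), ('Galahad', 'the Pure')]
union = b_out + c_out
# ['Lancelot', 'Galahad', 'the Brave', 'the Pure']
gather = [tuple(b_out + c_out)]
# [('Lancelot', 'Galahad', 'the Brave', 'the Pure')]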
*** Orphan segments
If a segment code has to be applied to several data sets, the pipe
engine creates as many subtasks as there are elements in the data
set. This behaviour is specified by setting a list in the output
variable of the upstream segment. There will then be one task per
element of the list, and each task will receive one list element as
its input.
TODO TBD
Depend directive
*** Hierarchical data storage
This system provides versioning of your data and easy access through
the web interface. It is also used to keep track of the code, of the
execution logs, and of various meta-data of the processing. Of course,
you remain able to bypass the hierarchical storage and store your
actual data elsewhere, but you will lose the benefit of automated
versioning, which proves to be quite convenient.
The storage is organized as follows: all data are stored below a
root, the data repository given by the prefix argument of the
Pipeline object.
*** The segment environment
The segment code is executed in a specific environment that provides:
1. access to the segment input and output
- seg_input: this variable is a dictionary containing the input of the segment
- get_input():
- seg_output: this variable has to be set to a list containing the
outputs of the segment
2. Functionalities to use the automated hierarchical data storage system.
- get_data_fn(basename): complete the filename with the path to the working directory.
- glob_seg(regexp, seg): return the list of filenames matching regexp in the storage of segment seg.
- get_tmp_fn(): return a temporary filename.
3. Various convenient functionalities
- load_param(seg, var_names): update the namespace with the parameters of segment seg.
- save_products(filename, var_names='*'): use pickle to save a
part of a given namespace.
- load_products(filename, var_names): update the namespace by
unpickling requested object from the file.
- logged_subprocess(lst_args): execute a subprocess and log its output.
- log is a standard logging.Logger object that can be used to log the processing
4. Hooking support
Pipelet enables you to write reusable generic
segments by providing a hooking system via the hook function.
hook (hookname, globals()): execute Python script ‘seg_segname_hookname.py’ and update the namespace.
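As an illustration, a hypothetical segment script could combine these
facilities as follows (the file name and the content are
placeholders; the functions above are provided by the environment, so
no import is needed):

knight = list(seg_input.values())[0]  # input produced by the parent segment
fn = get_data_fn('greeting.txt')      # file stored in the segment data directory
with open(fn, 'w') as f:
    f.write('Hello %s\n' % knight)
log.info('wrote %s', fn)
seg_output = [knight]                 # one downstream task per list element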
*** Depend directive
** Running Pipes
*** The interactive mode
This mode has been designed to ease debugging. If P is an instance of the Pipeline object, the syntax reads:
from pipelet.launchers import launch_interactive
w, t = launch_interactive(P)
w.run()
In this mode, each task will be computed sequentially.
Do not hesitate to invoke the Python debugger from IPython: %pdb
*** The process mode
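A sketch, assuming a launch_process helper exists in
pipelet.launchers analogous to launch_interactive:

from pipelet.launchers import launch_process
launch_process(P, 4)  # assumed signature: run tasks with 4 parallel worker processes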
*** The batch mode
** Browsing Pipes
*** The pipelet webserver
*** The web application
- The various views (index, pipeline, segment, tasks)
*** ACL
* Advanced usage
** Database reconstruction
** The hooking system
** Writing custom environments
** Using custom dependency schemes
** Launching pipeweb behind apache
Pipeweb uses the CherryPy web framework server and can be run behind
an Apache web server, which brings essentially two advantages:
- HTTPS support.
- faster static file serving.
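As a sketch, an Apache virtual host forwarding requests to a local
CherryPy instance could look as follows (server name, port and the
use of mod_ssl/mod_proxy are assumptions, not a tested
configuration):

<VirtualHost *:443>
    ServerName pipeweb.example.org
    SSLEngine on
    # forward everything to the CherryPy server, assumed to listen on port 8080
    ProxyPass / http://127.0.0.1:8080/
    ProxyPassReverse / http://127.0.0.1:8080/
</VirtualHost>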