Commit 3f2d2b52 authored by Maude Le Jeune

update README with v1.1 features

parent 59c87d3f
@@ -17,8 +17,8 @@ language, prior knowledge of this language is required.
** Introduction
*** Why use pipelines
The pipeline mechanism allows one to apply a sequence of processing
steps to some data, in such a way that the input of each step is the
output of the previous one. Making these different processing steps
visible, in the right order, is essential in data analysis to keep track
of what you did, and to make sure that the whole processing remains
@@ -27,7 +27,7 @@ consistent.
*** How it works
Pipelet is based on the possibility of saving to disk every intermediate
input or output of a pipeline, which is usually not a strong
constraint but offers a lot of benefits. It means that you can stop
the processing whenever you want and start it again without
recomputing the whole thing: you just take the last products you have
@@ -53,8 +53,18 @@ Pipelet is a free framework which helps you:
*** What's new in v1.1
- The =glob_seg= behavior has been modified for the sake of coherence,
  convenience, and performance. See [[*The%20segment%20environment][The segment environment]].
- Speed improvements during execution and navigation, to handle pipelines
  of 100,000 tasks.
- Task repository versioning, to manage the =group_by= directive when it
  uses different parent task lists.
- New =glob_seg=-type utility to search data files from parent tasks only,
  plus improved I/O and parameter utilities. See
  [[*The%20segment%20environment][The segment environment]].
- Improved management of external dependencies: the =depend= directive
  triggers a copy of the external dependencies, and the version numbers
  (together with the RCS revision, if any) of the imported modules are
  output.
- Pickle files can be rendered from the web interface.
** Getting started
*** Pipelet installation
@@ -65,11 +75,18 @@ Pipelet is a free framework which helps you:
+ The web interface of Pipelet requires the installation of the
cherrypy3 Python module (on Debian: aptitude install python-cherrypy3).
+ Although the default Python installation provides the sqlite3 module,
  you may not be able to use it. In that case, you can manually
  install the pysqlite2 module (a quick compatibility check is sketched
  after the list below).
You may find it useful to install some generic scientific tools that
interact nicely with Pipelet:
+ numpy
+ matplotlib
+ latex
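If you are unsure which SQLite binding is usable on your system, a quick
check is sketched below (plain Python, not part of Pipelet; the fallback to
=pysqlite2= mirrors the installation note above):
#+begin_src python
## Minimal check: prefer the standard sqlite3 module, fall back to pysqlite2.
try:
    import sqlite3
except ImportError:
    from pysqlite2 import dbapi2 as sqlite3  # drop-in DB-API replacement

# Opening an in-memory database is enough to verify that the binding works.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE probe (id INTEGER)")
conn.close()
#+end_src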
**** Getting Pipelet
***** Software status
@@ -237,7 +254,7 @@ and:
The Cartesian product of the previous set is:
#+begin_src python
[('Lancelot', 'the Brave'), ('Lancelot', 'the Pure'),
 ('Galahad', 'the Brave'), ('Galahad', 'the Pure')]
#+end_src
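For intuition only, the same product can be reproduced in plain Python with
=itertools= (this is an illustration, not Pipelet code):
#+begin_src python
from itertools import product

knights = ["Lancelot", "Galahad"]
qualities = ["the Brave", "the Pure"]

# Cartesian product of the two parent outputs listed above.
print(list(product(knights, qualities)))
#+end_src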
@@ -268,6 +285,12 @@ k = seg_input["knights"]
q = seg_input["quality"]
#+end_src
One can also use dedicated segment routines:
#+begin_src python
k = get_input("knights")
q = get_input("quality")
#+end_src
See section [[*The%20segment%20environment]['The segment environment']] for more details.
*** Orphan segments
@@ -283,14 +306,10 @@ P.push(segname=[1,2,3])
#+end_src
From the segment environment, inputs can be retrieved from the
dedicated routine:
#+begin_src python
id = get_input()
#+end_src
In this scheme, it is important to uniquely identify the child tasks
@@ -348,7 +367,7 @@ The segment code is executed in a specific environment that provides:
1. Access to the segment input and output
- =get_input(seg)=: return the input coming from segment =seg=. If no
  segment is specified, take the first one. This utility replaces the
  =seg_input= variable, whose type could vary as described below.
- =seg_input=: this variable is a dictionary containing the input of the segment.
In the general case, =seg_input= is a python dictionary which
@@ -358,14 +377,14 @@ The segment code is executed in a specific environment that provides:
directive, which alters the origin of the inputs. In this case,
=seg_input= contains the resulting class elements.
- =set_output(o)=: set the segment output as a list. If =o= is not a
  list, the output is set to the one-element list =[o]=.
- =seg_output=: this variable has to be a list.
2. Functionalities to use the automated hierarchical data storage system.
- =get_data_fn(basename, seg)=: complete the filename with the path to
  the working directory of the segment (default is the current segment).
- =glob_parent(regexp, segs)=: return the list of filenames matching
  the pattern =regexp= in the data directories of direct parent tasks. It
  is possible to restrict the search to a specific segment list =segs=.
@@ -380,8 +399,8 @@ The segment code is executed in a specific environment that provides:
- =get_tmp_fn()=: return a temporary filename.
3. Functionalities to use the automated parameter handling
- =save_param(lst)=: the listed parameters will be saved in a dedicated file.
- =expose(lst)=: the listed parameters will be exposed in the web interface.
- =load_param(seg, globals(), lst)=: retrieve parameters from the meta data.
4. Various convenient functionalities
@@ -422,18 +441,18 @@ Now, we create the 3 segment files =knights.py=, =quality.py= and
=melt.py=. The only action we expect from the segment =knights= is simply to
provide a list of knights. Its code is very simple:
#+begin_src python
set_output(["Lancelot", "Galahad"])
#+end_src
Same thing for the segment quality:
#+begin_src python
set_output(['the Brave', 'the Pure'])
#+end_src
As explained, the segment =melt= will be executed four times. We expect
it to concatenate its inputs and write the result to a file, so the code is:
#+begin_src python
knight, quality = get_input('knights'), get_input('quality')
f = open(get_data_fn('result.txt'), 'w')
f.write(knight + ' ' + quality+'\n')
f.close()
@@ -463,9 +482,6 @@ mode that enables the exploitation of data parallelism (in this case
running the four independent instances of the melt segment in
parallel), and how to provide web access to the results.
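As a rough sketch, a main script for this toy pipe could look as follows;
the arrow-based pipeline description string and the =launch_interactive=
helper are assumptions to be checked against the Running Pipes section:
#+begin_src python
## Sketch only: the pipeline string syntax and the launcher call are
## assumptions; see the Running Pipes section for the exact invocation.
from pipelet.pipeline import Pipeline
from pipelet.launchers import launch_interactive  # assumed helper

pipedot = """
knights -> melt;
quality -> melt;
"""

P = Pipeline(pipedot, code_dir="./", prefix="./")
w, t = launch_interactive(P)  # assumed to return a worker and its thread
w.run()
#+end_src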
*** The example pipelines
**** fft
**** cmb
** Running Pipes
@@ -724,8 +740,8 @@ At execution, 4 instances of the =fftimg= segment will be
created, and each of them outputs one element of this list:
#+begin_src python
img = get_input()   #(fftimg.py - line 15)
set_output(img)     #(fftimg.py - line 38)
#+end_src
On the other hand, a single instance of the =mkgauss= segment will be
@@ -739,7 +755,7 @@ The instance identifier, which is set by the =fftimg= output, can be
retrieved with the following instruction:
#+begin_src python
img = get_input('fftimg') #(convol.py - line 12)
#+end_src
**** Running the pipe
@@ -806,9 +822,8 @@ P.push(noise=[nside])
From the segment, these inputs are retrieved with:
#+begin_src python
(nside, sim_id) = get_input() ##(cmb.py line 14)
nside = get_input()           ##(noise.py line 15)
#+end_src
The last segment produces a plot in which we compare:
@@ -888,7 +903,7 @@ as variables bearing the names of the corresponding parents.
Given the Cartesian product set:
#+begin_src python
[('Lancelot', 'the Brave'), ('Lancelot', 'the Pure'),
 ('Galahad', 'the Brave'), ('Galahad', 'the Pure')]
#+end_src
@@ -988,7 +1003,6 @@ confusion.
The hook scripts are included in the hash key computation.
** Segment script repository
*** Local repository
By default, segment scripts are read from a local directory, specified
@@ -1006,27 +1020,10 @@ of each segment.
It is generally a good idea to make this directory controlled by an
RCS, to ease the reproducibility of the pipeline (even if the pipelet
engine makes a copy of the segment script in the segment output
directory).
If using Git, the revision number will be stored at the beginning of
the copy of the segment script.
** Writing custom environments
@@ -1088,8 +1085,8 @@ The segment output argument has to be returned by the =_close(self, glo)=
method.
The pipelet engine objects (segments, tasks, pipeline) are available
from the worker attribute =self._worker=. See the Doxygen documentation for
more details about the Pipelet machinery.
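As a purely schematic sketch of a custom environment (the base class name and
import path below are hypothetical; only the =_close(self, glo)= signature and
the =self._worker= attribute come from the text above):
#+begin_src python
## Schematic only: the base class and its import path are hypothetical.
from pipelet.environments import EnvironmentBase  # hypothetical name/path

class MyEnvironment(EnvironmentBase):
    def _close(self, glo):
        ## glo is assumed to hold the namespace of the executed segment
        ## script; the segment output has to be returned by this method.
        return glo.get("seg_output", [])
#+end_src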
@@ -1117,7 +1114,7 @@ giving hints about each. We describe here an example case using
=mod_rewrite= and virtual hosting.
1. The first thing we need is a working installation of Apache with
=mod_rewrite= activated. On a Debian-like distribution this is usually
obtained by:
=sudo a2enmod rewrite=
=sudo a2enmod proxy=
@@ -1127,21 +1124,22 @@ mod_rewrite and virtual hosting.
application except for its static files, which will be served
directly. Here is a sample configuration file for a
dedicated virtual host named pipeweb with pipelet installed under
=/usr/local/lib/python2.6/dist-packages/=.
#+begin_src apache
<VirtualHost pipeweb:80>
    ServerAdmin pipeweb_admin@localhost
    DocumentRoot /usr/local/lib/python2.6/dist-packages/pipelet
    # ErrorLog /some/custom/error_file.log
    # CustomLog /some/custom/access_file.log common
    RewriteEngine on
    RewriteRule ^/static/(.*) /usr/local/lib/python2.6/dist-packages/pipelet/static/$1 [L]
    RewriteRule ^(.*) http://127.0.0.1:8080$1 [proxy]
</VirtualHost>
#+end_src
3. Restart Apache and start the pipeweb application to serve on the
specified address and port:
=pipeweb start -H 127.0.0.1=
@@ -11,8 +11,7 @@ import healpy as hp
save_param(['lmax', 'nside', 'cmb_unit', 'sim_id', 'input_cl'])
expose(['lmax', 'nside', 'cmb_unit'])
(nside, sim_id) = get_input() ## pushed from main
lmax = 2*nside
### Generate a cmb map