WARNING: Pipelet is currently under active development and highly unstable. There is a good chance that it becomes incompatible from one commit to another.

Pipelet is a free framework allowing for the creation, manipulation, execution and browsing of scientific data processing pipelines. It provides:

+ easy chaining of interdependent elementary tasks,
+ web access to data products,
+ branch handling,
+ automated distribution of computational tasks.

Both the engine and the web interface are written in Python.

* Tutorial
** Introduction
*** Why use pipelines

The pipeline mechanism allows you to apply a sequence of processing steps to your data, in such a way that the input of each process is the output of the previous one. Making these different processing steps visible, in the right order, is essential in data analysis: it keeps track of what you did and makes sure that the whole processing remains consistent.

*** How it works

Pipelet is based on the possibility of saving to disk every intermediate input or output of your pipeline. This is not a strong constraint, and it offers a lot of benefits: you can stop the processing whenever you want and resume it later without recomputing everything, simply by taking the last products available on disk and continuing from where it stopped. This logic is interesting whenever the computation cost is higher than the cost of the disk space required by the intermediate products.

*** The Pipelet functionalities

Pipelet implements an automatic tool which helps you:
+ to write and manipulate pipelines with any dependency scheme,
+ to dispatch the computational tasks on parallel architectures,
+ to keep track of what processing has been applied to your data and to perform comparisons.

** Getting started
*** Obtaining Pipelet

There is no published stable release of Pipelet right now.

  git clone git://gitorious.org/pipelet/pipelet.git

*** Installing Pipelet

The web interface of Pipelet requires the cherrypy3 Python module (on Debian: aptitude install python-cherrypy3). Pipelet requires Python >= 2.6.

  sudo python setup.py install

*** Running the test pipeline

1. Run the test pipeline:

   cd test
   python main.py

2. Add this pipeline to the web interface:

   pipeweb track test ./.sqlstatus

3. Set the access control and launch the web server:

   pipeutils -a username -l 2 .sqlstatus
   pipeweb start

4. You should now be able to browse the results at http://localhost:8080

*** Getting a new pipe framework

To get a new pipe framework, with sample main and segment scripts:

  pipeutils -c pipename

** Writing Pipes
*** Pipeline architecture

The definition of a data processing pipeline consists in:
+ a succession of python scripts, called segments, each coding one step of the actual processing,
+ a main script that defines the dependency scheme between segments and launches the processing.

The dependencies between segments must form a directed acyclic graph. This graph is described by a character string using a subset of the graphviz dot language (http://www.graphviz.org). For example, the string:

  """
  a -> b -> d;
  c -> d;
  c -> e;
  """

defines a pipeline with 5 segments {"a", "b", "c", "d", "e"}. The relation "a->b" ensures that segment "a" will be processed before its child segment "b", and that the output of "a" will be fed as input to "b". In the given example, the node "d" has two parents, "b" and "c"; both will be executed before "d". As there is no relation between "b" and "c", the order in which these two are executed is not defined.

*** The Pipeline object

In practice, the creation of a Pipeline object needs 3 arguments:

  P = Pipeline(pipedot, codedir=, prefix=)

- pipedot is the string description of the pipeline
- codedir is the path to the code of the segments
- prefix is the path to the data repository
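Putting these pieces together, a minimal main script could look like the sketch below. This is only an illustration: the import path pipelet.pipeline, the directory paths and the use of the interactive launcher (described in the Running Pipes section below) are assumptions of the example, not requirements.

  from pipelet.pipeline import Pipeline            # assumed import path
  from pipelet.launchers import launch_interactive

  # dependency scheme written in the dot subset described above
  pipedot = """
  a -> b -> d;
  c -> d;
  c -> e;
  """

  # codedir holds the segment scripts, prefix the data repository (illustrative paths)
  P = Pipeline(pipedot, codedir='./', prefix='/tmp/pipelet_demo')

  # run every task sequentially in the current process (see Running Pipes below)
  w, t = launch_interactive(P)
  w.run()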
*** Dependencies between segments

The modification of the code of one segment will trigger its recalculation, as well as the recalculation of all the segments which depend on it.

The output of a segment is a list of python objects. If a segment has no particular output, the list can be empty.

A child segment receives an input set built from its parents' output sets, and one instance of the child segment is executed for each element of this input set. The default behaviour of the engine is to form the Cartesian product of the parents' output sets. This means that if the output of segment "b" is the list of strings ["Lancelot", "Galahad"] and the output of "c" is the list of strings ['the Brave', 'the Pure'], four instances of segment "d" will be run. Their inputs will be, respectively, the four 2-tuples ('Lancelot', 'the Brave'), ('Lancelot', 'the Pure'), ('Galahad', 'the Brave'), ('Galahad', 'the Pure').

At the end of the execution of all the instances of a segment, their output sets are concatenated. If the action of segment "d" is to concatenate the two strings it receives, separated by a space, the final output set of segment "d" will be: ['Lancelot the Brave', 'Lancelot the Pure', 'Galahad the Brave', 'Galahad the Pure'].

*** Multiplex directive

This default behaviour can be altered by specifying a #multiplex directive in the commentary of the segment code. If several multiplex directives are found, the last one is retained.

- #multiplex cross_prod : activate the default behaviour.
- #multiplex zip : similar to the python zip function. The input set is a list of tuples, where each tuple contains the i-th element of each parent's sorted output list. If the lists have different sizes, the shortest one is used.
- #multiplex union : the input set contains all the outputs.
- #multiplex cross_prod group by 0 : the input set contains a single tuple gathering all the outputs.
- #multiplex cross_prod group by ... : compute the cross product and group the tasks that are identical. To make use of group, the elements of the output set have to be hashable.

*** Depend directive

*** Orphan segments

By default, orphan segments (segments without parents) have no input argument (an empty list), and are therefore executed once. It is however possible to push an input list to an orphan segment. If P is an instance of the Pipeline object:

  P.push (segname=seg_input)
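For illustration, assuming P has an orphan segment named "a" (the segment name and the values below are purely hypothetical), an input list is pushed like this:

  # the pushed list plays the role of an input set for the orphan segment "a"
  P.push(a=['file_1.txt', 'file_2.txt', 'file_3.txt'])

The pushed list is then handled following the input set mechanism described in the Dependencies between segments section above.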
*** Hierarchical data storage

This system provides versioning of your data and easy access through the web interface. It is also used to keep track of the code, of the execution logs, and of various meta data of the processing. Of course, you remain able to bypass the hierarchical storage and store your actual data elsewhere, but you will lose the benefit of automated versioning, which proves to be quite convenient.

The storage is organized as follows:

- all pipeline instances are stored below a root which corresponds to the prefix parameter of the Pipeline object:
  /prefix/
- all segment meta data are stored below a root whose name corresponds to a unique match of the segment code:
  /prefix/seg_segname_YFLJ65/
- the segment meta data are:
  - a copy of the segment python script
  - a copy of all segment hook scripts
  - a parameter file (.args) which contains the segment parameter values
  - a meta data file (.meta) which contains some extra meta data
- all segment instance data and meta data are stored in a specific subdirectory whose name corresponds to a string representation of its input:
  /prefix/seg_segname_YFLJ65/data/1/
- if there is a single segment instance, the data are stored in:
  /prefix/seg_segname_YFLJ65/data/
- if a segment has at least one parent, its root will be located below one of its parents':
  /prefix/seg_segname_YFLJ65/seg_segname2_PLMBH9/
- etc.

*** The segment environment

The segment code is executed in a specific environment that provides:

1. Access to the segment input and output
   - seg_input: this variable is a dictionary containing the input of the segment
   - seg_output: this variable has to be a list

2. Functionalities to use the automated hierarchical data storage system
   - get_data_fn(basename): complete the filename with the path to the working directory
   - glob_seg(seg, regexp): return the list of filenames matching regexp from segment seg
   - get_tmp_fn(): return a temporary filename

3. Functionalities to use the automated parameter handling
   - lst_par: list of parameter names of the segment
   - lst_tag: list of parameter names which will be made visible from the web interface
   - load_param(seg, globals(), lst_par): load the parameters listed in lst_par from segment seg into the given namespace

4. Various convenient functionalities
   - save_products(filename, lst_par='*'): use pickle to save a part of a given namespace
   - load_products(filename, lst_par): update the namespace by unpickling the requested objects from the file
   - logged_subprocess(lst_args): execute a subprocess and log its output in processname.log and processname.err
   - logger: a standard logging.Logger object that can be used to log the processing

5. Hooking support
   Pipelet enables you to write reusable generic segments by providing a hooking system via the hook function.
   - hook(hookname, globals()): execute the python script seg_segname_hookname.py and update the namespace
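As an illustration, a segment script built on this environment could look like the sketch below. The file name, the log message and the content written to disk are purely illustrative; only the variables and functions listed above are assumed to be provided by the engine.

  # Illustrative segment script. It runs inside the pipelet segment
  # environment, so seg_input, seg_output, get_data_fn and logger are
  # provided by the engine and need not be imported.

  # build an output filename inside the segment working directory
  output_fn = get_data_fn('result.txt')

  logger.info('processing input: %s' % str(seg_input))

  # write a (dummy) product to the hierarchical data storage
  f = open(output_fn, 'w')
  f.write(str(seg_input))
  f.close()

  # the output set propagated to the child segments
  seg_output = [output_fn]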
** Running Pipes
*** The interactive mode

This mode has been designed to ease debugging. If P is an instance of the Pipeline object, the syntax reads:

  from pipelet.launchers import launch_interactive
  w, t = launch_interactive(P)
  w.run()

In this mode, each task is computed sequentially. Do not hesitate to invoke the python debugger from IPython: %pdb

*** The process mode

In this mode, one can run simultaneous tasks (if the pipe scheme allows it). The number of subprocesses is set by the N parameter:

  from pipelet.launchers import launch_process
  launch_process(P, N)

*** The batch mode

In this mode, one can submit batch jobs to execute the tasks. The number of jobs is set by the N parameter:

  from pipelet.launchers import launch_pbs
  launch_pbs(P, N, address=(os.environ['HOST'], 50000))
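The three launchers can be combined in a single main script. The sketch below is only an illustration: the command-line handling, the value 4 for N and the import path of Pipeline are assumptions; only the launcher calls themselves are taken from the snippets above.

  import os
  import sys
  from pipelet.pipeline import Pipeline   # assumed import path
  from pipelet.launchers import launch_interactive, launch_process, launch_pbs

  # pipeline built as in the Writing Pipes section (scheme and paths are illustrative)
  P = Pipeline("a -> b;", codedir='./', prefix='/tmp/pipelet_demo')

  mode = sys.argv[1] if len(sys.argv) > 1 else 'interactive'
  if mode == 'interactive':
      # debugging: every task is computed sequentially in this process
      w, t = launch_interactive(P)
      w.run()
  elif mode == 'process':
      # local parallel execution with 4 worker subprocesses
      launch_process(P, 4)
  elif mode == 'batch':
      # submission of 4 batch jobs through PBS
      launch_pbs(P, 4, address=(os.environ['HOST'], 50000))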
** Browsing Pipes
*** The pipelet webserver and ACL

The pipelet webserver allows the browsing of multiple pipelines. Each pipeline has to be registered using:

  pipeweb track pipename sqlfile

As pipeline browsing implies disk parsing, some basic security has to be set up as well. All users have to be registered with a specific access level (1 for read-only access, 2 for write access):

  pipeutils -a username -l 2 sqlfile

Start the web server using:

  pipeweb start

The web application will then be available at http://localhost:8080

*** The web application

In order to ease the comparison of different processings, the web interface displays various views of the pipeline data:

**** The index page

The index page displays a tree view of all pipeline instances. Each segment may be expanded or collapsed via the +/- buttons. The parameters used in each segment are summarized and displayed together with the date of execution and the number of related tasks, ordered by status.

A checkbox allows operations to be performed on multiple segments:
- deletion: to clean unwanted data
- tag: to tag remarkable data

The filter panel allows the segment instances to be displayed with respect to 2 criteria:
- tag
- date of execution

**** The code page

Each segment name is a link to its code page. From this page the user can view the code of all python scripts which have been applied to the data. The tree view is reduced to the current segment and its parents. The root path corresponding to the data storage is also displayed.

**** The product page

The number of related tasks, ordered by status, is a link to the product pages, where the data can be directly displayed (for images or text files) or downloaded. From this page it is also possible to delete a specific product and its dependencies.

**** The log page

The log page can be accessed via the log button of the filter panel. Logs are ordered by date.

* Advanced usage
** Database reconstruction

In case of an unfortunate loss of the pipeline sql database, it is possible to reconstruct it from the disk:

  import pipelet
  pipelet.utils.rebuild_db_from_disk(prefix, sqlfile)

All information will be retrieved, but with new identifiers.

** The hooking system
** Writing custom environments

The pipelet software provides a set of default utilities available from the segment environment. It is possible to extend this default environment or even to write a completely new one.

*** Extending the default environment

The different environment utilities are actually methods of the class Environment. It is possible to add new functionalities by using the python inheritance mechanism.

File: myenvironment.py

  from pipelet.environment import *

  class MyEnvironment(Environment):
      def my_function(self):
          """ My function does nothing """
          return

The pipelet engine objects (segments, tasks, pipeline) are available from the worker attribute self._worker. See the section "The pipelet actors" for more details about the pipelet machinery.

*** Writing a new environment

In order to start with a completely new environment, extend the base environment.

File: myenvironment.py

  from pipelet.environment import *

  class MyEnvironment(EnvironmentBase):
      def my_get_data_fn(self, x):
          """ New name for get_data_fn """
          return self._get_data_fn(x)

      def _close(self, glo):
          """ Post processing code """
          return glo['seg_output']

From the base environment, the basic functionalities for getting file names and executing hook scripts are still available through:
- self._get_data_fn
- self._hook

The segment input argument is stored in self._seg_input. The segment output argument has to be returned by the _close(self, glo) method.

The pipelet engine objects (segments, tasks, pipeline) are available from the worker attribute self._worker. See the section "The pipelet actors" for more details about the pipelet machinery.

*** Loading another environment

To load another environment, set the pipeline environment attribute accordingly:

  Pipeline(pipedot, codedir=, prefix=, env=MyEnvironment)

** Using custom dependency schemes
** Launching pipeweb behind apache

Pipeweb uses the cherrypy web framework server and can be run behind an apache webserver, which brings essentially two advantages:
- https support,
- faster static file serving.

* The pipelet actors
** The Repository object
** The Pipeline object
** The Task object
** The Scheduler object
** The Tracker object
** The Worker object
** The Environment object