Commit d0967eb1 authored by Maude Le Jeune

new name for glob_seg. README updated.

parent db310f0f
......@@ -293,6 +293,13 @@ or
id = seg_input.values()[0]
#+end_src
In this scheme, it is important to uniquely identify the child tasks
of the orphan segment by setting a dedicated output.
#+begin_src python
seg_output = id
#+end_src
See section [[*The%20segment%20environment][The segment environment]] for more details.
*** Hierarchical data storage
......@@ -319,7 +326,8 @@ The storage is organized as follows:
- a meta data file (.meta) which contains some extra meta data
- all segment instances data and meta data are stored in a specific subdirectory
which name corresponds to a string representation of its input
=/prefix/segname_YFLJ65/data/1/=
prefix by its identifier number
=/prefix/segname_YFLJ65/data/1_a/=
- if there is a single segment instance, then data are stored in
=/prefix/segname_YFLJ65/data/=
- If a segment has at least one parent, its root will be located below
......@@ -350,14 +358,16 @@ The segment code is executed in a specific environment that provides:
- =seg_output=: this variable has to be a list.
2. Functionalities to use the automated hierarchical data storage system.
- =get_data_fn(basename)=: complete the filename with the path to the working directory.
- =glob_seg(seg, regexp)=: Return the list of filename matching the pattern y in the
data directory of parent tasks from the parent segment x.
- =glob_seg_all(seg, regexp)=: Return the list of filename matching
y in the working directory of segment x independantly of
whether the file comes from a task related to the current
task. glob_seg_all is provided to reproduce the behaviour of
old glob_seg for backward compatibility. Its usage should be limited as it:
- =get_data_fn(basename)=: complete the filename with the path to
the working directory.
- =glob_parent(regexp, segs)=: Return the list of filenames matching
the pattern regexp in the data directory of direct parent tasks. It
is possible to restrict the search to a specific list of segments, segs.
- =glob_seg(seg, regexp)=: Return the list of filenames matching the
pattern regexp in the data directory of parent segment seg (all task
directories are searched, independently of whether the file
comes from a task related to the current task). Its usage
should be limited as it:
- potentially breaks the dependency scheme.
- may hurt performance as all task directories of the segment
seg will be searched.
......
......@@ -2,10 +2,21 @@
I see at least three projects to complete before making the first release:
* The task_id project is not closed:
This is release critical.
- [ ] There is some dark zone (sideeffects):
- [ ] how segment without seg_output are treated (no task is stored, what happened when we delete these kind of segs ...)
- [ ] Any problem with Orphan tasks ?
- [ ] problem for parents giving same str_input outside the special case of groud_by
- [ ] how segments without seg_output are treated (no task is
stored; what happens when we delete this kind of seg
...).
- [ ] Any problem with orphan tasks? -> Main issue here:
tasks are identified by their parent list. For orphan tasks the
current solution is to use the product instead, but there are two
exceptions here: group_by and constant output (None for
example). In those two cases, if seg_input changes from
main, there is no recomputation. Is it possible to store the parent of
an orphan task (phantom)?
- [X] problem for parents giving the same str_input outside the
special case of group_by -> compute as many tasks as parents
even if str_input is the same.
- [ ] Does the tracking on disk allow to reconstruct the database
- [ ] I modified the task format to allow for easy glob_seg. It may have break other things (At least the database reconstruction)
- [X] Is the treatment of redundant tasks resulting from group_by OK
......@@ -15,14 +26,15 @@ I see at least before three projects to complete before making the first release
release critical (this is the case of the glob_seg project I think),
API changes for which compatibility could be maintained by
publishing some kind of OldEnvironment classes are not.
- [ ] I started a project (glob_seg/glob_seg_all separation) to make
- [X] I started a project (glob_seg/glob_seg_all separation) to make
glob_seg easier to use and more efficient when a large number of
tasks are present (look only in the requested directory). This is not finalized:
- It allows only to glob the direct parent. Going further causes a
big "depth" problem in the current state.
- It is probably buggy with side effects (see what happened on
segment without seg_output.
- Its calling signature should be rethought
segment without seg_output) -> no bug here, but the search gives no
result. I think this is not an issue (Maude)
- [ ] Its calling signature should be rethought
- The name should be changed in glob_parents and glob_seg
- [ ] Are we satisfied with seg_input, is this convenient, should we
provide extra function to ease the retrieval of inputs.
......
......@@ -182,10 +182,10 @@ class Environment(EnvironmentBase):
self.logger.info ("hooking %s"%hook_name)
return self._hook(hook_name, glo)
def glob_seg(self, y, segs=None):
def glob_parent(self, y, segs=None):
""" Globbing limited to the fatherhood
For unlimited globbing see glob_seg_all.
For unlimited globbing see glob_seg.
Parameters
----------
......@@ -212,11 +212,11 @@ class Environment(EnvironmentBase):
res+=glob(path.join(self._worker.pipe.get_data_dir(segx),t,y))
return res
def glob_seg_all(self, x, y):
def glob_seg(self, x, y):
""" Return the list of filename matching y in the working
directory of segment x.
Usage of glob_seg_all should be limited:
Usage of glob_seg should be limited:
- potentially breaks the dependancy scheme
- May hurt performances as all task directories of the segment x will be searched
......
print seg_input
import os
os.system("touch %s"%get_data_fn("file%d.dat"%seg_input.values()[0]))
seg_output = [seg_input.values()[0]]
#seg_output = ['aa', 'bb']
import os
print seg_input
seg_output = [(seg_input.values()[0], "pp")]
os.system("touch %s"%get_data_fn("bbbb%d.dat"%seg_input.values()[0]))
#seg_output = [(seg_input.values()[0], "pp")]
#multiplex cross_prod group_by "0"
print seg_input
import os
f = glob_seg("a", "file*.dat")
for file in f:
os.system("cp %s %s"%(file, get_data_fn("fromglobseg_%s"%os.path.basename(file))))
g = glob_parent("b*.dat")
for file in g:
os.system("cp %s %s"%(file, get_data_fn("fromglobparent_%s"%os.path.basename(file))))
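
The difference between the two renamed calls can be illustrated outside the pipeline. The following is only a sketch under stated assumptions, not the Environment implementation: the helper names =glob_parent_sketch= and =glob_seg_sketch=, the toy directory layout and the choice of parent tasks are all hypothetical, and it is written for Python 3 rather than the Python 2 of the segment scripts. It mimics the idea that the parent-limited search joins the data directory, a task subdirectory and the pattern, while the segment-wide search visits every task subdirectory.

```python
import os
import tempfile
from glob import glob


def glob_parent_sketch(data_dir, parent_task_dirs, pattern):
    """Hypothetical helper: search only the listed task subdirectories."""
    res = []
    for task_dir in parent_task_dirs:
        res += glob(os.path.join(data_dir, task_dir, pattern))
    return sorted(res)


def glob_seg_sketch(data_dir, pattern):
    """Hypothetical helper: search every task subdirectory of the segment."""
    return sorted(glob(os.path.join(data_dir, "*", pattern)))


# Toy layout (an assumption): <root>/data/<task>/file<task>.dat, tasks 1 to 3.
root = tempfile.mkdtemp()
data_dir = os.path.join(root, "data")
for task in ("1", "2", "3"):
    os.makedirs(os.path.join(data_dir, task))
    open(os.path.join(data_dir, task, "file%s.dat" % task), "w").close()

# Assume the current task descends from tasks 1 and 2 only.
from_parents = glob_parent_sketch(data_dir, ["1", "2"], "file*.dat")
from_segment = glob_seg_sketch(data_dir, "file*.dat")

print([os.path.basename(f) for f in from_parents])
print([os.path.basename(f) for f in from_segment])
```

In this toy layout the parent-limited search returns the files of tasks 1 and 2, while the segment-wide search also returns the file of the unrelated task 3, which is why the README advises limiting the use of glob_seg. The sketch ignores the case of a single segment instance, where data are stored directly in the data directory.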