WIP: Prototype for splitting an External API search into sub-tasks
Will hopefully, eventually, fix #511.
@cgenie I'm basically out of hours for this week, but could you give this a high-level glance and validate (or suggest changes to) the approach? I have taken a stab at a more granular architecture for importing documents into GGTX when an external API provider is queried.
I have introduced the DataProducer abstraction, which supports a variety of ways of fetching data from external sources, so it's backward compatible with the old way of streaming things via Conduit. For the HAL producer, however, I had a play with the DataAsyncBatchProducer, which generates jobs during a first pass; these are then executed to perform the actual HTTP calls to HAL. This means we can split a big import into multiple (independent) jobs, which can be scheduled in our queue system and scaled by adding or removing workers as we see fit.
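For reference, here is a minimal sketch of the shape I have in mind (the names and types are simplified/hypothetical, not the exact ones in this PR): one constructor keeps the old Conduit streaming path, while the other first enumerates independent batch jobs that the queue can pick up.

```haskell
-- Sketch only: simplified, hypothetical types illustrating the idea.
module Sketch.DataProducer where

import Conduit (ConduitT)

-- | A unit of work produced by the first pass, e.g. "fetch page N of
--   the HAL results for this query". The actual HTTP call happens
--   when the job is executed by a worker.
data BatchJob doc = BatchJob
  { batchDescription :: String
  , runBatch         :: IO [doc]
  }

-- | Two ways of producing documents from an external source.
data DataProducer m doc
  = -- Old behaviour: stream documents directly (backward compatible).
    DataStreamProducer (ConduitT () doc m ())
    -- New behaviour: a cheap first pass yields independent jobs,
    -- which workers can then execute (and scale) separately.
  | DataAsyncBatchProducer (m [BatchJob doc])
```

For HAL, the first pass would only compute how many result pages the query spans and emit one BatchJob per page, so no heavy HTTP work happens until the jobs are actually scheduled.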
This "kinda works" for some simple tests, but there are a few wrinkle I didn't have time to investigate:
- I tried searching for `OCaml`, but I got a weird error where document titles seem to have generated loops(!!) rather than ngrams;
- I don't see any tokenisation happening in the logs (i.e. the NLP server is not being called, for some reason);
- The progress reporting is broken, because we now have a job that spawns other jobs, and we have no way, as far as I know, to "wait" for child jobs or have child jobs report progress towards a "master" job? One possible shape is sketched after this list.
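To make the last point concrete, here is one possible direction, purely a sketch and not something this PR implements: each child job bumps a shared counter when it finishes, and the master job blocks until the count reaches the number of jobs it spawned. I'm using a `TVar` for illustration; in the real queue system this counter would presumably live in the database instead.

```haskell
-- Sketch only: one possible parent/child progress scheme,
-- not implemented in this PR.
import Control.Concurrent.STM

data ParentProgress = ParentProgress
  { ppTotal :: Int        -- number of child jobs spawned
  , ppDone  :: TVar Int   -- how many children have finished so far
  }

newParentProgress :: Int -> IO ParentProgress
newParentProgress total = ParentProgress total <$> newTVarIO 0

-- Called by each child job when it completes.
reportChildDone :: ParentProgress -> IO ()
reportChildDone pp = atomically $ modifyTVar' (ppDone pp) (+ 1)

-- The "master" job blocks until every child has reported.
waitForChildren :: ParentProgress -> IO ()
waitForChildren pp = atomically $ do
  done <- readTVar (ppDone pp)
  check (done >= ppTotal pp)
```

This would also give us overall progress for free (`done / total`), but I don't know whether our queue system exposes anything like this already.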
Thank you in advance!