Automate the harvesters
Currently, each group runs its harvesters manually. This development will run the harvesters for each group periodically.
- The harvesters run once a week.
- The logs are stored in the database and kept for one month.
- The logs can be viewed using the current harvester views.
- The automated process can be switched off.
- Each harvester can be activated or deactivated individually in the automated process.
- This development relies on the web2py task scheduler.
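The weekly period and the one-month retention can be written down as plain values, as in the minimal sketch below. The key names `start_time`, `period` and `repeats` mirror keyword arguments of the web2py scheduler's `queue_task`. The helper names are hypothetical, and reading "one month" as 30 days is an assumption.

```python
from datetime import datetime, timedelta

WEEK_IN_SECONDS = 7 * 24 * 3600      # the scheduler period is given in seconds
LOG_RETENTION = timedelta(days=30)   # assumption: "one month" read as 30 days


def weekly_task_options(start_time):
    """Build keyword arguments for a task repeated once a week (hypothetical helper)."""
    return {
        "start_time": start_time,
        "period": WEEK_IN_SECONDS,
        "repeats": 0,  # in the web2py scheduler, 0 means repeat without limit
    }


def is_expired(log_time, now):
    """Return True when a log entry is older than the retention period."""
    return now - log_time > LOG_RETENTION
```

In a web2py model these options would be passed along the lines of `scheduler.queue_task(task_function, **weekly_task_options(start))`, where `task_function` is the harvesting task.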
Roadmap
- Refactor harvester
- Add an application parameter for automated harvesting
- Set up the scheduler with a skeleton task function for automated harvesting
- Phase 1: Create a scheduler task for automated harvesting
- If the global automated harvester parameter is not yes or true, return from the task
- Iterate over all harvester group entries
- If harvest is False, continue
- Harvest group using process_url
- Convert logs and collection_logs to JSON
- Use the logging system for debug information
- Add an application parameter to define the execution scheduling
- Queue or dequeue automatic harvesting task according to application parameter values
- Requeue the automatic harvesting task with the new start time if the scheduling is modified
- Phase 2: Create DB tables
- Create a table to hold automatic harvesting logs
- Write the JSON logs and info into the table
- Erase logs older than one month
- Update the DB schema graphic
- Phase 3: Create view for the logs
- Create Selector for harvesting logs display
- Create Controller function for harvesting logs
- Add menu command to display harvesting logs
- Get logs from the database
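The Phase 1 task body can be sketched as below. This is an illustration under assumptions, not the actual implementation: the function names are hypothetical, each group is modelled as a dictionary with `name` and `harvest` keys, and `harvest_group` stands in for the call to process_url, assumed here to return the pair (logs, collection_logs).

```python
import json
import logging

logger = logging.getLogger("automated_harvest")


def is_enabled(value):
    """Treat 'yes' or 'true' (any case), or the boolean True, as enabled."""
    return str(value).strip().lower() in ("yes", "true")


def run_automated_harvest(parameter, groups, harvest_group):
    """Harvest every enabled group and return one JSON record per group.

    harvest_group is a stand-in for process_url (assumed signature).
    """
    if not is_enabled(parameter):
        logger.debug("automated harvesting is disabled")
        return []

    records = []
    for group in groups:
        if not group.get("harvest"):
            logger.debug("skipping group %s", group["name"])
            continue
        logs, collection_logs = harvest_group(group)
        records.append({
            "group": group["name"],
            "logs": json.dumps(logs),
            "collection_logs": json.dumps(collection_logs),
        })
    return records
```

Passing the harvesting call in as an argument keeps the loop testable without a running web2py application. In Phase 2 the returned records would be written to the log table.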
Conclusions
From this prototype, we identified all the pieces required to run the harvesters periodically:
- task scheduler
- scheduler tables
- task modules
- an additional controller to manipulate the task and to give access to the logs
It also appears that we have to simplify the interface exposed to the user.
A possible evolution is to create a separate application, SCAN, connected to the task scheduler:
- Give access to the scheduler tables
- Contain the logic to authorize the running of the harvester for a given track_publications_xxx database
- Contain the logic to balance the load between the different track_publications_xxx applications
For each track_publications_xxx application, the user will have access to:
- a switch to enable or disable the periodic scan
- a switch for each harvester
- an action to consult the logs. It gives access to the date and the harvester log for each team. The layout is a grid in which rows are grouped per team. Each row contains the date and a hyperlink pointing to the harvester log.
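The grid layout described above can be sketched as a grouping step over the stored log entries. The entry fields (`team`, `date`, `id`) and the link pattern `/harvester_logs/view/<id>` are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from datetime import date


def group_logs_by_team(entries):
    """Return [(team, [(iso_date, link), ...]), ...] with rows grouped per team."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["team"]].append(entry)

    grid = []
    for team in sorted(grouped):
        # Most recent log first within each team.
        rows = sorted(grouped[team], key=lambda e: e["date"], reverse=True)
        grid.append((
            team,
            [(e["date"].isoformat(), "/harvester_logs/view/%d" % e["id"])
             for e in rows],
        ))
    return grid
```

Each tuple in the inner list corresponds to one row of the grid: the date and the target of the hyperlink to the harvester log.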