Automate the harvesters
Currently, each group runs its harvesters manually. This development will run the harvesters for each group periodically.
- The harvesters run once a week.
- The logs are stored in the database and kept for one month.
- The logs can be viewed using the current harvester views.
- The automated process can be switched off.
- Each harvester can be activated or deactivated in the automated process.
- This development relies on the web2py task scheduler.
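The timing requirements above can be sketched as plain Python values. This is a minimal sketch, not the project's code: the helper names, the one-hour timeout and the 30-day reading of "one month" are assumptions. The keys of the returned dictionary correspond to the `start_time`, `period`, `repeats` and `timeout` arguments accepted by `queue_task` in the web2py scheduler.

```python
from datetime import datetime, timedelta

# Periodicity: once every week, expressed in seconds as the scheduler expects.
WEEK_IN_SECONDS = 7 * 24 * 3600

# Assumption: "one month" is approximated as 30 days.
LOG_RETENTION = timedelta(days=30)


def build_queue_arguments(start_time):
    """Return keyword arguments for queuing the weekly harvesting task.

    Hypothetical helper: the result is meant to be passed as keyword
    arguments to the web2py scheduler's queue_task method.
    """
    return {
        "start_time": start_time,
        "period": WEEK_IN_SECONDS,
        "repeats": 0,      # 0 means the task is repeated indefinitely
        "timeout": 3600,   # assumption: one hour is enough for a full run
    }


def log_expiry_cutoff(now):
    """Return the date before which stored logs should be erased."""
    return now - LOG_RETENTION


if __name__ == "__main__":
    print(build_queue_arguments(datetime(2024, 1, 1, 3, 0)))
    print(log_expiry_cutoff(datetime(2024, 1, 31)))
```

Keeping these values in one place makes it easier to drive them later from an application parameter.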
Roadmap
- Refactor harvester
- Add automated harvester application parameter
- Set up the scheduler with a skeleton automated harvesting task function
- Phase 1: Create a scheduler task for automated harvesting
  - If the global automated harvester parameter is not yes or true, return from the task
  - Iterate over all harvester group entries
    - If harvest is False, continue
    - Harvest the group using process_url
    - Convert logs and collection_logs to JSON
  - Use the logging system for debug information
  - Add an application parameter to define the execution scheduling
  - Queue or dequeue the automatic harvesting task according to the application parameter values
  - Requeue the automatic harvesting task with the new start time if the scheduling is modified
- Phase 2: Create DB tables
  - Create a table to hold automatic harvesting logs
  - Write JSON logs and info into the table
  - Erase logs older than one month
  - Update the DB schema graphic
- Phase 3: Create view for the logs
  - Create selector for harvesting logs display
  - Create controller function for harvesting logs
  - Add menu command to display harvesting logs
  - Get logs from the database
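The Phase 1 steps of the roadmap can be sketched as a single task function. This is only an illustration under stated assumptions: the real signature and return value of process_url are not given here, so the sketch receives it as a callable that takes one harvester entry and returns a (logs, collection_logs) pair. The entry keys "group" and "harvest" and the function names are hypothetical as well.

```python
import json
import logging

logger = logging.getLogger("automated_harvesting")


def is_enabled(value):
    """Interpret the global application parameter as an on/off switch."""
    return str(value).strip().lower() in ("yes", "true")


def run_automated_harvesting(enabled, harvesters, process_url):
    """Harvest every active entry and return one record per harvested entry.

    enabled     -- value of the global automated harvester parameter
    harvesters  -- iterable of dicts with hypothetical keys "group", "harvest"
    process_url -- callable standing in for the real process_url; assumed
                   to return a (logs, collection_logs) pair of
                   JSON-serializable objects
    """
    if not is_enabled(enabled):
        logger.debug("automated harvesting is switched off")
        return []

    records = []
    for entry in harvesters:
        if not entry.get("harvest"):
            # This harvester is deactivated in the automated process.
            logger.debug("skipping %s", entry.get("group"))
            continue

        logs, collection_logs = process_url(entry)
        records.append({
            "group": entry["group"],
            "logs": json.dumps(logs),
            "collection_logs": json.dumps(collection_logs),
        })
    return records
```

Returning the records instead of writing them directly keeps the task easy to test before the log table of Phase 2 exists.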
Conclusions
From this prototype, we identified all the pieces required to run the harvesters periodically:
- task scheduler
- scheduler tables
- task modules
- additional controller to manipulate the tasks and to give access to the logs
It also appears that we have to simplify the interface exposed to the user.
A possible evolution is to create a separate application, SCAN, connected to the task scheduler:
- Give access to the scheduler tables
- Contain the logic to authorize the running of the harvester for a given track_publications_xxx database
- Contain the logic to balance the load between the different track_publications_xxx applications
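How SCAN would balance the load is not decided here. One possible approach, shown only as a sketch, is to spread the start times of the applications evenly over the weekly period so that they do not all harvest at once. The function name and the application names in the example are hypothetical.

```python
from datetime import datetime, timedelta


def stagger_start_times(applications, first_start, window=timedelta(days=7)):
    """Assign each application a start time spread evenly over the window.

    applications -- iterable of application names (hypothetical values)
    first_start  -- start time given to the first application
    window       -- period over which the starts are spread (one week here)
    """
    names = sorted(applications)
    step = window / max(len(names), 1)
    return {name: first_start + i * step for i, name in enumerate(names)}


if __name__ == "__main__":
    apps = ["track_publications_a", "track_publications_b"]
    for name, start in stagger_start_times(apps, datetime(2024, 1, 1)).items():
        print(name, start)
```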
For each track_publications_xxx application, the user will have access to:
- a switch to enable or disable the periodic scan
- a switch for each harvester
- an action to consult the logs. It gives access to the date and the harvester log for each team. The layout is a grid where rows are grouped per team. Each row contains the date and a hyperlink pointing to the harvester log.
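The grid described in the last item can be prepared by grouping the log rows per team before rendering. The sketch below assumes each row is a dictionary with the hypothetical keys "team", "date" (an ISO date string) and "log_url"; the real column names are not defined here.

```python
from itertools import groupby
from operator import itemgetter


def group_logs_by_team(rows):
    """Return {team: [(date, log_url), ...]} with the newest log first.

    rows -- iterable of dicts with hypothetical keys "team", "date", "log_url"
    """
    # Sort by date (newest first), then by team. The second sort is stable,
    # so the date order is preserved inside each team.
    ordered = sorted(rows, key=itemgetter("date"), reverse=True)
    ordered.sort(key=itemgetter("team"))

    return {
        team: [(row["date"], row["log_url"]) for row in team_rows]
        for team, team_rows in groupby(ordered, key=itemgetter("team"))
    }
```

Each key of the result corresponds to one group of rows in the grid, and each tuple provides the date and the target of the hyperlink.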