[DCP - Ingestion] Clear the redis cache after each import#76
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces Redis cache clearing functionality to the ingestion workflow and configures the ingestion helper service with Redis connection details and VPC access. Feedback indicates that the clear_cache_step is incorrectly nested within the enable_datacommons_services conditional block and should be moved to ensure it runs independently and before service restarts. Additionally, it is recommended to pass null instead of empty strings for Redis host and port variables when Redis is disabled to prevent overriding default module values.
| template: | ||
| labels: | ||
| restarted-at: '$${string(int(sys.now()))}' | ||
| %{if var.enable_redis_cache_clearing} |
There was a problem hiding this comment.
The clear_cache_step is currently nested inside the enable_datacommons_services conditional block. This means the Redis cache will only be cleared if the Data Commons services are also enabled. If Redis is used independently or if the services are managed elsewhere, the cache might not be cleared as expected. Additionally, it is generally safer to clear the cache before triggering the service restart to ensure that new service instances do not encounter stale data during their startup. Consider moving this block outside and before the restart_service step. Furthermore, ensure that any changes to this workflow are consistent with the single-import limitation (using input.importName) and that parameter names for downstream jobs match the exact keys expected.
References
- The ingestion workflow currently supports only a single import. Ensure that any changes are consistent with this limitation, using input.importName rather than input.importList.
- When defining parameters for downstream jobs (like Dataflow Flex Templates) within Google Workflows, ensure the parameter names match the exact keys expected by the job, regardless of the terminology used in the pull request title.
| redis_host = var.redis_config.enable && length(module.redis) > 0 ? module.redis[0].redis_host : "" | ||
| redis_port = var.redis_config.enable && length(module.redis) > 0 ? tostring(module.redis[0].redis_port) : "" |
There was a problem hiding this comment.
When Redis is disabled, redis_host and redis_port are passed as empty strings (""). This will override the default values (e.g., 6379 for the port) in the ingestion_helper_service module. It is better to pass null when Redis is disabled, which allows the module to use its defined default values or handle the absence of configuration more gracefully. Additionally, ensure that environment variables specific to this service are defined in its own configuration and that any mandatory dependencies on other stacks are explicitly documented in variable descriptions.
redis_host = var.redis_config.enable && length(module.redis) > 0 ? module.redis[0].redis_host : null
redis_port = var.redis_config.enable && length(module.redis) > 0 ? tostring(module.redis[0].redis_port) : null
References
- Environment variables that are intentionally specific to one service should be defined directly within that service's configuration rather than in a shared configuration file.
- When an infrastructure component or workflow step has a mandatory dependency on another stack, ensure this requirement is explicitly documented in Terraform variable descriptions or comments to prevent confusion regarding its optionality.
| display_name = "Data Commons Ingestion Workflow SA" | ||
| } | ||
|
|
||
| resource "google_workflows_workflow" "ingestion_orchestrator" { |
There was a problem hiding this comment.
I see that we have the complete definition of the workflow added here. How do we ensure it stays in sync with the workflow definition for base DC? For example, I plan to make changes to the workflow to make aggregations asynchronous. How will that be incorporated here? We should avoid code duplication. Also, having the complete definition of a workflow embedded in a Terraform file does not seem a good idea to me.
There was a problem hiding this comment.
We do have the complete definition. For the record, the DCP workflow and the base DC one were supposed to have significant differences at the start since there was no aggregations needed for DCP etc.
Now...well we had to come back on some of those and it's true the two are likely going to be the same at the end.
Even the stuff we ve only added for DCP, I assume that's also needed for base.
Id love to find a way to share that, but as a follow up to these crazy timelines :)
Temporary duplication, but we do need to merge it back in.
There was a problem hiding this comment.
Sure...let's push for a combined solution after the release deadline.
This is based on new PR which adds the clear_redis_cache action type to ingestion helper.
datacommonsorg/data#2035