Executor: Cross-year student ID matching#60
Conversation
| cross_year_output = self.cross_year_pass(self.output_sets[0]) | ||
| self.output_sets.append(cross_year_output) | ||
|
|
||
| self.upload_artifact(artifact.MATCH_RATES) |
There was a problem hiding this comment.
If the cross year pass generates unmatched IDs, it seems the file is being saved at non-ods/input_no_student_id_match.csv within the job directory, instead of at the top level where the app expects it.
FWIW, we could definitely tighten up the handoff of the unmatched IDs file... but I feel that's not really worth it in this project and soon we should be moving away from this paradigm anyway.
There was a problem hiding this comment.
Thanks for flagging. I think in principle it's actually uploaded to both locations, but not being handled correctly due to the bug we discussed in slack. Still, I'm going to filter out the non-JSONL files from teh output set upload so that there aren't duplicated copies sent to S3
| # if we don't return the unmatched students file at all | ||
| self.logger.debug("too many unmatched students. Skipping upload") | ||
| artifact.UNMATCHED_STUDENTS.needs_upload = False | ||
| raise ValueError("insufficient ID matches to continue (highest rate {self.highest_match_rate} < required {config.REQUIRED_ID_MATCH_RATE}; ID column name: {self.highest_match_id_name}; Ed-Fi ID type: {self.highest_match_id_type})") |
There was a problem hiding this comment.
Does this need to be an f string?
| first_run_id_type = self.highest_match_id_type | ||
|
|
||
| first_run_output_dir = os.path.abspath(config.OUTPUT_DIR_FIRST_RUN) | ||
| os.rename(self.output_dir, first_run_output_dir) |
There was a problem hiding this comment.
Just checking my understanding: if the first run is successful and kicks out unmatched students to be processed again for xyear matches, and that second run is unsuccessful, we do not want to present the original unmatched students from run 1? My intuition says yes that is correct, and also I'm not sure how run 2 could fail.
There was a problem hiding this comment.
Your understanding of the desired behavior is correct, and I agree it seems unlikely to occur in practice.
|
|
||
| def stream_to_file(session, url, dest_path, max_attempts=3): | ||
| """GET url as a stream and write the body to dest_path""" | ||
| for attempt in range(1, max_attempts + 1): |
There was a problem hiding this comment.
I tested the retry and it works well. Had the app interrupt the first two attempts after 10 rows and succeed on the third. The executor detected the interrupted stream and tried again.
# Attempt 1
[EarthbeamApiController] cross-year roster fetch failed for run 30: test error
[ExecutorLocalPythonService] Executor stdout: stream_to_file: attempt 1/3 failed (ChunkedEncodingError); retrying in 5s
# Attempt 2
[EarthbeamApiController] cross-year roster fetch failed for run 30: test error
[ExecutorLocalPythonService] Executor stdout: stream_to_file: attempt 2/3 failed (ChunkedEncodingError); retrying in 10s
# Attempt 3: Success
[EarthbeamApiController] cross-year roster: partnerId=ea tenantCode=grand_bend rowCount=1060 durationMs=756
[ExecutorLocalPythonService] Executor stderr: 2026-05-22 15:22:13.053 runway INFO cross-year pass: matching on TSDS UID / Local
4940802 to
5c38d4e
Compare
Open question: I think it's technically possible for the second pass to not match anything at all. It would probably produce a result set of empty files which we may not want to upload. Do we just skip in that case?
Implement cross-year matching in the executor; companion to #62. The core logic here is
Other notes: