Skip to content

Executor: Cross-year student ID matching#60

Open
johncmerfeld wants to merge 21 commits into
developmentfrom
executor/x-year-matching
Open

Executor: Cross-year student ID matching#60
johncmerfeld wants to merge 21 commits into
developmentfrom
executor/x-year-matching

Conversation

@johncmerfeld
Copy link
Copy Markdown
Collaborator

@johncmerfeld johncmerfeld commented May 13, 2026

Open question: I think it's technically possible for the second pass to not match anything at all. It would probably produce a result set of empty files which we may not want to upload. Do we just skip in that case?

Implement cross-year matching in the executor; companion to #62. The core logic here is

  • IF the app says there is a cross-year roster is available
    • AND this is an ODS job
      • AND there are unmatched students from Earthmover
        • THEN run earthmover a second time on those unamtched students, using a roster we pull from EDU as a year-agnostic source of IDs
    • AND this is NOT and ODS
      • THEN perform a single Earthmover pass, using the EDU roster as the source of IDs
  • Otherwise behave normally

Other notes:

@johncmerfeld johncmerfeld self-assigned this May 13, 2026
@johncmerfeld johncmerfeld marked this pull request as ready for review May 14, 2026 22:46
cross_year_output = self.cross_year_pass(self.output_sets[0])
self.output_sets.append(cross_year_output)

self.upload_artifact(artifact.MATCH_RATES)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cross year pass generates unmatched IDs, it seems the file is being saved at non-ods/input_no_student_id_match.csv within the job directory, instead of at the top level where the app expects it.

FWIW, we could definitely tighten up the handoff of the unmatched IDs file... but I feel that's not really worth it in this project and soon we should be moving away from this paradigm anyway.

Copy link
Copy Markdown
Collaborator Author

@johncmerfeld johncmerfeld May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for flagging. I think in principle it's actually uploaded to both locations, but not being handled correctly due to the bug we discussed in slack. Still, I'm going to filter out the non-JSONL files from teh output set upload so that there aren't duplicated copies sent to S3

Comment thread executor/executor/executor.py Outdated
# if we don't return the unmatched students file at all
self.logger.debug("too many unmatched students. Skipping upload")
artifact.UNMATCHED_STUDENTS.needs_upload = False
raise ValueError("insufficient ID matches to continue (highest rate {self.highest_match_rate} < required {config.REQUIRED_ID_MATCH_RATE}; ID column name: {self.highest_match_id_name}; Ed-Fi ID type: {self.highest_match_id_type})")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be an f string?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

first_run_id_type = self.highest_match_id_type

first_run_output_dir = os.path.abspath(config.OUTPUT_DIR_FIRST_RUN)
os.rename(self.output_dir, first_run_output_dir)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking my understanding: if the first run is successful and kicks out unmatched students to be processed again for xyear matches, and that second run is unsuccessful, we do not want to present the original unmatched students from run 1? My intuition says yes that is correct, and also I'm not sure how run 2 could fail.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding of the desired behavior is correct, and I agree it seems unlikely to occur in practice.


def stream_to_file(session, url, dest_path, max_attempts=3):
"""GET url as a stream and write the body to dest_path"""
for attempt in range(1, max_attempts + 1):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the retry and it works well. Had the app interrupt the first two attempts after 10 rows and succeed on the third. The executor detected the interrupted stream and tried again.

# Attempt 1
[EarthbeamApiController] cross-year roster fetch failed for run 30: test error
[ExecutorLocalPythonService] Executor stdout: stream_to_file: attempt 1/3 failed (ChunkedEncodingError); retrying in 5s

# Attempt 2
[EarthbeamApiController] cross-year roster fetch failed for run 30: test error
[ExecutorLocalPythonService] Executor stdout: stream_to_file: attempt 2/3 failed (ChunkedEncodingError); retrying in 10s

# Attempt 3: Success
[EarthbeamApiController] cross-year roster: partnerId=ea tenantCode=grand_bend rowCount=1060 durationMs=756
[ExecutorLocalPythonService] Executor stderr: 2026-05-22 15:22:13.053 runway INFO cross-year pass: matching on TSDS UID / Local

@johncmerfeld johncmerfeld force-pushed the executor/x-year-matching branch from 4940802 to 5c38d4e Compare May 29, 2026 18:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants