PEDS-1257 Add Difference class with diff#9

Open
sonali-shaw wants to merge 1 commit into main from feature/diff

Conversation

@sonali-shaw

Created the class Difference in pfb_to_zip.py, which generates the difference between two datasets and writes it out as an Avro file. This functionality can be used from the command line with -d (diff.zip will be downloaded). A second file is also downloaded, containing relevant console output (such as withdrawn consent, which indicates deleted subjects).

@grugna grugna changed the title Add Difference class with diff PEDS-1618 Add Difference class with diff Jun 11, 2026
@grugna grugna requested a review from sgchoe June 11, 2026 16:15
Comment thread pfb_to_zip/pfb_to_zip.py
with open(file_path, 'rb') as f:
    with PFBReader(f) as reader:
        for record in reader:
            submitter_id = record['object'].get('submitter_id')

The submitter_id is preserved across data versions only for the subject node; for other nodes it may change. So the diff should be made on all the values, and the attribute subjects.submitter_id will be stable as a reference to the subject node.
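
One way to read this suggestion, as a rough sketch rather than the PR's actual implementation: match records by their full content instead of by submitter_id, then group whatever changed under the stable subject reference. The function names `fingerprint`, `diff_by_content`, and `group_by_subject` are hypothetical, as is the assumption that each record is a dict with a `name` and an `object` dict, and that the subject link is stored under the key `subjects.submitter_id` inside `object`. The `ignore` parameter is also an assumption, added so that unstable fields can be left out of the comparison.

```python
import json
from collections import defaultdict

# Assumed key for the link to the subject node inside record["object"].
SUBJECT_REF = "subjects.submitter_id"


def fingerprint(record, ignore=frozenset()):
    """Canonical string built from the node name and all non-ignored values."""
    obj = {k: v for k, v in record["object"].items() if k not in ignore}
    # sort_keys gives a stable ordering; default=str covers non-JSON values.
    return json.dumps([record.get("name"), obj], sort_keys=True, default=str)


def diff_by_content(old_records, new_records, ignore=frozenset()):
    """Return the new records whose content does not appear among the old ones."""
    seen = {fingerprint(r, ignore) for r in old_records}
    return [r for r in new_records if fingerprint(r, ignore) not in seen]


def group_by_subject(records):
    """Group records by subject reference; records without one land under None."""
    groups = defaultdict(list)
    for r in records:
        groups[r["object"].get(SUBJECT_REF)].append(r)
    return dict(groups)
```

Passing `ignore={"submitter_id"}` would keep a record whose only difference is a regenerated submitter_id from being reported as changed; whether that is the desired behavior depends on how the reviewers want to treat such records.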

@grugna grugna changed the title PEDS-1618 Add Difference class with diff PEDS-1257 Add Difference class with diff Jun 11, 2026
Comment thread pfb_to_zip/pfb_to_zip.py
if submitter_id not in old:
    diff_records.append(new_record)
else:
    if new_record['object'] != old[submitter_id]['object']:

I would suggest not including the 'created_datetime' and 'updated_datetime' attributes within 'object' when checking to see if the record has changed.
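
A minimal sketch of that comparison, assuming `record['object']` is a plain dict as in the snippet above. The names `VOLATILE_FIELDS`, `comparable`, and `has_changed` are hypothetical and do not appear in the PR.

```python
# Timestamp attributes named in the review comment; they are skipped when comparing.
VOLATILE_FIELDS = frozenset({"created_datetime", "updated_datetime"})


def comparable(obj):
    """Return a copy of a record's 'object' dict without the timestamp fields."""
    return {k: v for k, v in obj.items() if k not in VOLATILE_FIELDS}


def has_changed(new_record, old_record):
    """True when the records differ in anything other than the timestamps."""
    return comparable(new_record["object"]) != comparable(old_record["object"])
```

The existing check could then read `if has_changed(new_record, old[submitter_id]):`, so a record that was merely re-ingested with fresh timestamps is not reported as changed.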

Comment thread pfb_to_zip/pfb_to_zip.py
for sid in sorted(deleted_submitter_ids):
    line = f" - {sid}"
    print(line)
    log_lines.append(line)

I think it would be useful to capture and output the removed records similar to the way 'diff_records' contains added/changed records.
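
As a hedged illustration of what that could look like: the sketch below assumes `old` and `new` are dicts mapping a string key (for example a submitter_id) to a record dict, which matches how `old` is indexed in the earlier snippet. The function name `split_diff` and the `removed_records` list are hypothetical.

```python
def split_diff(old, new):
    """Return (diff_records, removed_records) for two key -> record mappings."""
    # Added or changed: key is new, or the 'object' payload differs.
    diff_records = [
        rec for key, rec in new.items()
        if key not in old or rec["object"] != old[key]["object"]
    ]
    # Removed: key existed in the old dataset but is gone from the new one.
    # The full old record is kept, not only its ID, so it can be written out.
    removed_records = [old[key] for key in sorted(old.keys() - new.keys())]
    return diff_records, removed_records
```

Keeping the complete removed records would let them be written to the output alongside `diff_records` instead of appearing only as IDs in the log.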

@sgchoe sgchoe left a comment

Please see my line-level comments regarding the output of removed records and reconsidering how record changes are determined.
