CoRetweetsNumpyApproach produces incorrect results compared to CoRetweetsApproach #1

Description

@cerquide

Labels: bug, priority:high, correctness


The NumPy-optimized version of the co-retweets detection approach (CoRetweetsNumpyApproach) does not produce results equivalent to those of the standard implementation (CoRetweetsApproach), despite passing the synthetic correctness tests.

Location

  • Implementation: src/approaches/coretweets_numpy.py
  • Standard version: src/approaches/coretweets.py
  • Correctness tests: examples/test_numpy_correctness.py

Problem

While the NumPy implementation passes all synthetic tests in test_numpy_correctness.py, it appears to produce different results on real datasets compared to the standard approach. This suggests:

  1. The synthetic tests may not cover all edge cases present in real data
  2. The vectorized operations may handle certain scenarios subtly differently from the loop-based code
  3. Timestamp handling or floating-point precision may differ between the two implementations
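
To make the third hypothesis concrete, here is a minimal sketch of how floating-point precision can erase a time window. It assumes timestamps are Unix epoch seconds and that some intermediate array is stored as float32. Nothing in this issue confirms that coretweets_numpy.py does this, and the timestamp values are hypothetical.

```python
import numpy as np

# Two retweets 60 seconds apart, as Unix epoch seconds (hypothetical values).
t0 = 1_600_000_000
t1 = t0 + 60

# int64 and float64 both preserve the 60-second gap exactly.
gap_i64 = np.int64(t1) - np.int64(t0)
gap_f64 = np.float64(t1) - np.float64(t0)

# float32 has a 24-bit significand, so near 1.6e9 adjacent representable
# values are 128 apart and both timestamps round to the same number.
spacing_f32 = np.spacing(np.float32(t0))
gap_f32 = np.float32(t1) - np.float32(t0)

print(gap_i64, gap_f64, spacing_f32, gap_f32)
```

If the standard implementation compares absolute timestamps in 64-bit precision while the vectorized one downcasts them anywhere, pairs near the window boundary could be classified differently. Synthetic tests that use small timestamps (for example, seconds counted from 0) would not expose this.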

Expected Behavior

CoRetweetsNumpyApproach should produce identical results to CoRetweetsApproach for all datasets and parameters, with the only difference being improved performance.

Current Status

  • ✅ Synthetic tests pass (10/10 test cases)
  • ❌ Real dataset results differ from standard approach
  • ⚠️ Performance optimization is effective but correctness is not guaranteed

Impact

  • Users should not use coretweets_numpy in production until this is resolved
  • The standard coretweets approach remains reliable and should be used for all experiments
  • This issue affects reproducibility and comparability of results

Reproduction

# Run correctness tests (these pass)
python examples/test_numpy_correctness.py

# Benchmark on real data (results will differ)
python examples/benchmark_coretweets.py /path/to/dataset/Processed/

# Run both approaches on same dataset and compare
python bin/run_experiments.py --approaches coretweets coretweets_numpy --datasets Armenia

Investigation Needed

  1. Identify discrepancies: Run both approaches on real datasets and compare pair-by-pair results
  2. Debug vectorization logic: Review lines 68-86 in coretweets_numpy.py for issues with:
    • Time difference calculations (relative vs absolute timestamps)
    • Boolean masking and early termination logic
    • Handling of edge cases (boundary times, same user pairs)
  3. Enhance test coverage: Add tests with real data characteristics:
    • Timestamp precision edge cases
    • Large time gaps between retweets
    • Viral tweets with many retweets
    • Users retweeting same tweet multiple times
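
The pair-by-pair comparison in step 1 could be sketched as follows. It assumes each approach's output can be reduced to a dict mapping a user pair to a co-action count. That format is an assumption for illustration, not the documented return type of either approach, and the helper names normalize and diff_pairs are hypothetical.

```python
def normalize(pairs):
    """Key each pair by sorted users so (a, b) and (b, a) compare equal.

    Counts for both orientations of the same pair are summed, which
    assumes the results are meant to be undirected.
    """
    out = {}
    for (u, v), count in pairs.items():
        key = (u, v) if u <= v else (v, u)
        out[key] = out.get(key, 0) + count
    return out


def diff_pairs(reference, candidate):
    """Return pairs missing from candidate, extra in candidate, and count mismatches."""
    ref, cand = normalize(reference), normalize(candidate)
    missing = sorted(set(ref) - set(cand))
    extra = sorted(set(cand) - set(ref))
    mismatched = {
        key: (ref[key], cand[key])
        for key in set(ref) & set(cand)
        if ref[key] != cand[key]
    }
    return missing, extra, mismatched


# Hypothetical outputs: reference from the standard approach, candidate from NumPy.
reference = {("a", "b"): 2, ("b", "c"): 1}
candidate = {("b", "a"): 2, ("c", "d"): 1}
missing, extra, mismatched = diff_pairs(reference, candidate)
print(missing, extra, mismatched)
```

Splitting the differences into missing, extra, and mismatched pairs helps narrow the cause: missing pairs point toward over-aggressive masking or early termination, extra pairs toward a window that is too permissive, and count mismatches toward duplicate handling.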

Possible Fixes

  1. Revert to non-vectorized approach for correctness
  2. Fix the vectorization logic to handle all edge cases
  3. Add comprehensive integration tests with real data
  4. Consider alternative vectorization strategies (e.g., using pandas or polars)
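
As a starting point for option 2, here is a minimal sketch of a loop-based reference and a vectorized counterpart for the retweets of a single tweet. This is not the code in coretweets.py or coretweets_numpy.py. It assumes integer user ids, integer epoch-second timestamps, and an inclusive window (a gap of exactly window_sec counts), and both function names are hypothetical. The real implementations may use a different boundary convention.

```python
import numpy as np


def coretweet_pairs_loop(user_ids, timestamps, window_sec):
    """Reference: unordered pairs of distinct users within window_sec of each other."""
    pairs = set()
    n = len(user_ids)
    for i in range(n):
        for j in range(i + 1, n):
            same_user = user_ids[i] == user_ids[j]
            in_window = abs(timestamps[i] - timestamps[j]) <= window_sec
            if in_window and not same_user:
                pairs.add(tuple(sorted((user_ids[i], user_ids[j]))))
    return pairs


def coretweet_pairs_numpy(user_ids, timestamps, window_sec):
    """Vectorized version of the same rule, kept in int64 to avoid rounding."""
    users = np.asarray(user_ids, dtype=np.int64)
    times = np.asarray(timestamps, dtype=np.int64)
    # All index pairs (i, j) with i < j, mirroring the nested loops above.
    i, j = np.triu_indices(len(users), k=1)
    keep = (np.abs(times[i] - times[j]) <= window_sec) & (users[i] != users[j])
    lo = np.minimum(users[i[keep]], users[j[keep]])
    hi = np.maximum(users[i[keep]], users[j[keep]])
    # Building a set collapses repeats from users who retweeted more than once.
    return set(zip(lo.tolist(), hi.tolist()))


# User 1 retweets twice; user 4 retweets long after everyone else.
users = [1, 2, 1, 3, 4]
times = [0, 60, 61, 120, 100_000]
loop_result = coretweet_pairs_loop(users, times, window_sec=60)
numpy_result = coretweet_pairs_numpy(users, times, window_sec=60)
print(loop_result == numpy_result, sorted(numpy_result))
```

The index arrays from np.triu_indices grow quadratically with the number of retweets, so this form is meant for checking equivalence rather than for speed. A tweet with many retweets would need chunking or a sort-based window scan.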

Workaround

Until fixed, use the standard coretweets approach:

# DO use this
approach = ApproachFactory.create('coretweets', window_sec=60, min_coactions=1)

# DO NOT use this (yet)
# approach = ApproachFactory.create('coretweets_numpy', window_sec=60, min_coactions=1)

Priority

High - This affects correctness of detection results and could lead to invalid research conclusions if used in production.

Related Files

  • src/approaches/coretweets_numpy.py - NumPy implementation
  • src/approaches/coretweets.py - Standard implementation (reference)
  • examples/test_numpy_correctness.py - Current test suite
  • examples/benchmark_coretweets.py - Performance comparison tool

Next Steps

  1. Create integration test with real dataset
  2. Compare output pair-by-pair between implementations
  3. Identify specific scenarios where results differ
  4. Fix vectorization logic or revert to standard approach
  5. Verify fix with comprehensive test suite
  6. Document any performance trade-offs
