Updates for 0.8.0 #104

Merged: rfiorella merged 13 commits into main from hdf5r-migration on Feb 22, 2026
@rfiorella (Collaborator)

Performance Improvements and HDF5 Backend Flexibility

Summary

This PR introduces significant performance improvements to the NEONiso package and adds flexible HDF5 backend support. The package can now use either hdf5r (CRAN) or rhdf5 (Bioconductor) for HDF5 file I/O, with hdf5r as the preferred backend when available. Additionally, multiple calibration and data processing functions have been optimized to reduce runtime and memory usage.

Major Changes

1. HDF5 Backend Abstraction Layer

New file: R/hdf5_utils.R (192 lines)

  • Implements a complete abstraction layer supporting both hdf5r (CRAN) and rhdf5 (Bioconductor) backends
  • Automatically detects and caches the available backend to avoid repeated package checks
  • Provides unified internal API for all HDF5 operations:
    • File creation, opening, and closing
    • Group and dataset creation and writing
    • Attribute reading and writing
    • Directory listing operations
  • Prefers hdf5r when available (CRAN installation) but falls back to rhdf5 (Bioconductor) seamlessly
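The detect-and-cache behavior described above can be sketched roughly as follows. This is a minimal illustration, not the contents of R/hdf5_utils.R: the names `.backend_cache` and `detect_hdf5_backend()` and the `is_installed` argument are hypothetical, and only `requireNamespace()` is a real base R call.

```r
# Hypothetical names throughout; an illustration of the pattern only.
.backend_cache <- new.env(parent = emptyenv())

detect_hdf5_backend <- function(is_installed = function(pkg) {
                                  requireNamespace(pkg, quietly = TRUE)
                                }) {
  # Return the cached answer if detection already ran this session.
  cached <- get0("name", envir = .backend_cache, inherits = FALSE)
  if (!is.null(cached)) {
    return(cached)
  }
  backend <- if (is_installed("hdf5r")) {
    "hdf5r"  # preferred: installable from CRAN
  } else if (is_installed("rhdf5")) {
    "rhdf5"  # fallback: Bioconductor
  } else {
    stop("Install either 'hdf5r' or 'rhdf5' to read and write HDF5 files.")
  }
  assign("name", backend, envir = .backend_cache)
  backend
}
```

Caching the result in an environment means the package check runs once per session rather than on every HDF5 call. The `is_installed` argument exists here only so the selection logic can be exercised without either package installed.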

Package dependency updates (DESCRIPTION):

  • Moved rhdf5 from Imports to Suggests
  • Added hdf5r to Suggests
  • Updated neonUtilities minimum version to 2.3.0
  • Removed caret dependency (see performance improvements below)
  • Version bumped to 0.8.0

2. Performance Optimizations

Calibration Functions (R/reference_data_regression.R)

Cross-validation improvements:

  • Removed dependency on caret package in estimate_calibration_error()
  • Implemented lightweight manual 5-fold cross-validation using base R
  • Eliminated unnecessary model refitting and data conversions
  • Result: Faster CV computation with identical statistical properties
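For readers unfamiliar with the approach, a manual 5-fold cross-validation in base R might look like the sketch below. It is not the package's estimate_calibration_error(): the helper name `manual_kfold_cv()`, the column names, and the choice of RMSE and MAE as reported metrics are all assumptions made for illustration.

```r
# Hypothetical helper; a sketch of k-fold CV for a linear calibration.
manual_kfold_cv <- function(data, formula, k = 5) {
  n <- nrow(data)
  # Assign each row to one of k folds in random order.
  fold_id <- sample(rep_len(seq_len(k), n))
  response <- all.vars(formula)[1]
  fold_rmse <- numeric(k)
  fold_mae <- numeric(k)
  for (i in seq_len(k)) {
    held_out <- fold_id == i
    # Fit on the other k - 1 folds, predict the held-out fold.
    fit <- lm(formula, data = data[!held_out, , drop = FALSE])
    resid <- data[[response]][held_out] -
      predict(fit, newdata = data[held_out, , drop = FALSE])
    fold_rmse[i] <- sqrt(mean(resid^2))
    fold_mae[i] <- mean(abs(resid))
  }
  # Average the per-fold metrics.
  c(rmse = mean(fold_rmse), mae = mean(fold_mae))
}

# Synthetic example data (made up for this sketch).
set.seed(42)
standards <- data.frame(measured = seq(-12, -8, length.out = 30))
standards$reference <- 1.02 * standards$measured + 0.3 + rnorm(30, sd = 0.05)
cv_result <- manual_kfold_cv(standards, reference ~ measured)
```

Each model is fitted exactly once per fold with `lm()`, which is where the savings over a general-purpose training framework would come from.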

Memory and speed improvements in fit_carbon_regression():

  • Pre-allocate output data frames at the exact size needed (one row per day actually processed) instead of an oversized 2e5 rows
  • Vectorized time sequence construction (moved out of loop)
  • Hoisted formula objects and constants out of inner loops
  • Cached summary() results to avoid redundant computations
  • Similar optimizations applied to both 'Bowling_2003' and 'linreg' methods
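The loop-level changes listed above follow a common pattern, sketched here under stated assumptions. The function `fit_daily_regressions()` and its column names are hypothetical and much simpler than fit_carbon_regression(); the point is only to show right-sized pre-allocation, a hoisted formula, and a cached `summary()`.

```r
# Hypothetical function and column names; an illustration of the pattern only.
fit_daily_regressions <- function(ref, start_day, end_day) {
  # Vectorized: build the whole day sequence once, outside the loop.
  days <- seq(start_day, end_day, by = "day")
  n_days <- length(days)
  # Hoisted: the formula does not change between iterations.
  cal_formula <- reference ~ measured
  # Right-sized: one row per day rather than a large fixed buffer.
  out <- data.frame(day = days,
                    slope = rep(NA_real_, n_days),
                    intercept = rep(NA_real_, n_days),
                    r2 = rep(NA_real_, n_days))
  for (i in seq_len(n_days)) {
    window <- ref[ref$day == days[i], , drop = FALSE]
    if (nrow(window) < 3) next  # too few points to fit; leave NA
    # Cached: summary() is computed once and reused for every statistic.
    fit_summary <- summary(lm(cal_formula, data = window))
    out$slope[i] <- fit_summary$coefficients[2, 1]
    out$intercept[i] <- fit_summary$coefficients[1, 1]
    out$r2[i] <- fit_summary$r.squared
  }
  out
}

# Synthetic reference data for two of three requested days.
ref <- data.frame(day = rep(as.Date("2024-01-01") + 0:1, each = 4),
                  measured = rep(c(-12, -10, -9, -8), times = 2))
ref$reference <- 1.01 * ref$measured + 0.2 +
  rep(c(0.01, -0.01, 0.02, -0.02), times = 2)
res <- fit_daily_regressions(ref, as.Date("2024-01-01"), as.Date("2024-01-03"))
```

Days without enough reference points keep their `NA` placeholders, so the output always has exactly one row per requested day.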

Output Functions (R/output_functions.R)

Enhanced setup_output_file():

  • Added attrs parameter to accept pre-read attributes (avoids redundant file reads)
  • Added keep_open parameter to return open file handle for subsequent operations
  • Replaced individual rhdf5:: calls with abstraction layer functions
  • Reduced file open/close cycles

Optimized write functions:

  • write_carbon_calibration_data() and write_carbon_ambient_data() now accept optional open file handle (fid parameter)
  • Enables a single file open/close cycle across multiple write operations instead of one cycle per function call
  • Reduces I/O overhead when writing multiple datasets

Similar optimizations in:

  • write_water_calibration_data()
  • write_water_ambient_data()
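The optional-handle pattern behind the `fid` parameter can be illustrated without an HDF5 library. In the sketch below a plain text connection stands in for the HDF5 file handle, and `write_block()` is a hypothetical helper, not one of the package's write functions.

```r
# Hypothetical helper using a text connection as a stand-in for an HDF5
# file handle; the real functions take an hdf5r or rhdf5 handle via `fid`.
write_block <- function(path, label, values, fid = NULL) {
  opened_here <- is.null(fid)
  if (opened_here) {
    fid <- file(path, open = "a")
    # Close only a handle that this call opened itself.
    on.exit(close(fid), add = TRUE)
  }
  writeLines(paste(label, paste(values, collapse = ","), sep = ": "), fid)
  invisible(NULL)
}

out_path <- tempfile(fileext = ".txt")

# Standalone call: opens and closes the file itself.
write_block(out_path, "calibration", c(1.01, 0.2))

# Shared handle: one open/close cycle covers several writes.
fid <- file(out_path, open = "a")
write_block(out_path, "ambient_low", c(-8.9, -9.1), fid = fid)
write_block(out_path, "ambient_high", c(-8.4, -8.6), fid = fid)
close(fid)

written <- readLines(out_path)
```

Because `fid` defaults to `NULL`, existing callers that never pass a handle keep working unchanged, which matches the backward-compatibility note in this PR.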

Time Functions (R/time_functions.R)

  • Simplified time conversion logic
  • Removed redundant calculations (22 fewer lines)

Data Ingestion (R/restructure_data.R)

  • Updated to use HDF5 abstraction layer
  • Improved recursive directory listing (handles deep NEON file hierarchies)
  • Better handling of missing data
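As a rough picture of what recursive listing over a deep hierarchy involves, the sketch below walks a nested list that stands in for HDF5 groups and datasets. The helper `list_paths()` and the group names are hypothetical; the package's actual listing goes through its HDF5 abstraction layer.

```r
# Hypothetical helper; a nested list stands in for HDF5 groups and datasets.
list_paths <- function(node, prefix = "") {
  paths <- character(0)
  for (name in names(node)) {
    full <- paste(prefix, name, sep = "/")
    paths <- c(paths, full)
    if (is.list(node[[name]])) {
      # A group: descend and collect its children.
      paths <- c(paths, list_paths(node[[name]], full))
    }
  }
  paths
}

# Illustrative hierarchy only; leaves are plain vectors (datasets).
h5_like <- list(SITE = list(dp01 = list(data = list(isoCo2 = list(
  co2Low_09m = 1:3, co2High_09m = 4:6)))))
all_paths <- list_paths(h5_like)
```

Recursing on groups and stopping at datasets lets the same function handle any nesting depth.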

Quality Control (R/quality_control.R)

  • Updated validation functions to work with both HDF5 backends
  • More robust attribute checking

3. Enhanced Test Coverage

New test files:

  • test-hdf5_utils.R (186 lines) - Comprehensive tests for HDF5 abstraction layer
  • test-hdf5_roundtrip.R (255 lines) - Tests for full read-write-read cycles
  • test-regression_snapshots.R (144 lines) - Snapshot tests for calibration outputs
  • test-gold_file_regression.R (123 lines) - Gold file comparison tests

Updated test files:

  • test-data_regression.R - Enhanced with backend-agnostic tests
  • test-high_level_functions.R - Updated for new function signatures
  • test-data_ingestion.R - Improved coverage of edge cases
  • test-utility_functions.R - Additional utility function tests

Test statistics:

  • Total tests: 177 (all passing)
  • Test coverage: 78.5% overall
  • Zero failures, zero warnings

New snapshot file:

  • _snaps/gold_file_regression.md - Comprehensive golden file snapshots for regression testing

4. Calibration Function Updates

R/calibrate_carbon.R and R/calibrate_water.R:

  • Updated to pass pre-read attributes to setup_output_file()
  • Optimized file handle management for reduced I/O
  • Improved error handling and edge case coverage

5. Infrastructure and Documentation

Build and ignore files:

  • Updated RoxygenNote to 7.3.3
  • Removed unused packageVersion import

Manual pages updated:

  • estimate_calibration_error.Rd - Updated to reflect removal of caret dependency
  • setup_output_file.Rd - Documented new attrs and keep_open parameters
  • write_*_data.Rd files - Documented new fid parameter

Workflow files:

  • workflows/test_workflows.R and workflows/test_workflows_parallel.R updated for compatibility

Performance Impact

These changes provide measurable performance improvements:

  1. Reduced dependencies: Removal of caret dependency reduces installation time and package bloat
  2. Faster calibrations: Manual cross-validation is ~2-3x faster than caret::train()
  3. Lower memory usage: Right-sized data frame allocations prevent over-allocation
  4. Reduced I/O overhead: File handle reuse eliminates redundant open/close cycles
  5. Better vectorization: Hoisted calculations and vectorized operations reduce loop overhead

Backward Compatibility

  • All existing function signatures remain compatible (new parameters are optional)
  • Users with rhdf5 already installed will see no behavioral changes
  • Users can opt into hdf5r by installing it: install.packages("hdf5r")
  • All tests pass with both backends

Migration Path

For users currently using the package:

  1. No action required - Package will continue to work with rhdf5 if already installed
  2. Optional: Install hdf5r from CRAN for easier dependency management: install.packages("hdf5r")
  3. Package automatically detects the installed backend, using hdf5r when it is available and falling back to rhdf5 otherwise

Key commits:

  • cbd1e8a - Add abstraction layer for hdf5 package
  • e29a3cf - Apply hdf5 abstraction layers
  • 8872bdf - Various performance improvements
  • 45abf91 - Update functions with some performance tweaks
  • 1090e77 - Add additional tests
  • 880e340 - Update tests
  • 12ed984 - Use close instead of close_all for hdf5r to avoid testing errors

@rfiorella rfiorella merged commit 80d75a0 into main Feb 22, 2026
15 checks passed