Fast file filtering using awk for efficient data loading with
data.table::fread.
filterRead is an R package that provides a high-performance interface for
loading GWAS summary statistics from file. Its main features are the
standardized interface and fast filtering prior to loading in memory,
powered by awk. This simplifies loading summary statistics from
different sources by transparently and efficiently handling differences in column
names, compression, field separator, and more, returning a data.table with
standardized names for downstream analyses.
- Fast filtering: Leverages awk to filter rows before loading into R
- Automatic file detection: Handles CSV, TSV, and gzipped files automatically
- Column naming: Automatically converts columns to standardized names
- Missing columns: Missing columns (e.g., pos) are inferred from existing columns (e.g., variant_id)
- Genomic data support: Built-in support for chromosome/position filtering and RSID matching
- R-style syntax: Write filter conditions using familiar R expressions
- Complex conditions: Supports logical operators (&, |), comparisons, and pattern matching
- Memory efficient: Only loads filtered rows into memory
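
To illustrate the idea behind filtering before loading, here is a minimal sketch in base R. It is not filterRead's actual implementation: the file contents, the column order, and the hand-written awk command are all assumptions made for the example. It needs awk on the PATH, so it will not run on a stock Windows installation.

```r
# Write a tiny tab-separated example file (made-up values).
path <- tempfile(fileext = ".tsv")
writeLines(c(
  "chr\tpos\tpval",
  "1\t100\t1e-9",
  "1\t200\t0.5",
  "2\t300\t1e-10"
), path)

# awk keeps the header (NR == 1) plus rows with chr == 1 and pval < 5e-8.
# The field numbers ($1, $3) are hard-coded for this example file.
awk_cmd <- paste(
  "awk -F'\\t' 'NR == 1 || ($1 == 1 && $3 < 5e-8)'",
  shQuote(path)
)

# R only parses the rows that passed the filter.
hits <- read.delim(pipe(awk_cmd))
print(hits)
```

data.table::fread accepts a shell command in the same way through its cmd argument, so the rejected rows never have to be held in R's memory.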
# Install from GitHub
remotes::install_github("JonSulc/filterRead")
library(filterRead)
# Create a file interface
# This doesn't load any data, only checks the file formatting
finterface <- new_file_interface("gwas_summary_stats.txt.gz")
# Check how the processed table will look
head(finterface)
# Filter using R-style conditions and load into memory
significant_hits <- finterface[chr == 1 & pval < 5e-8]
genomic_regions <- finterface[
  (chr == 2 & 12345 < pos & pos < 23456) |
    (chr == 3 & 42 < pos & pos < 4242)
]

- File interface creation: Automatically detects file format, separator, column names, and structure
- Condition parsing: R expressions are parsed and translated to awk-compatible filter conditions
- Efficient filtering: awk processes the file and filters rows before data reaches R
- Data loading: Only filtered rows are loaded using data.table::fread
- Header reconstruction: The header names are filled in with standard names
- Missing data: Where possible, missing fields are reconstructed from other columns (e.g., extracting pos from MarkerName)
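
The condition parsing step can be pictured with a small sketch. The helper below, to_awk_condition, is hypothetical and not part of filterRead. It assumes the column order is already known and only handles comparisons, & and |, parentheses, and literal values, which is far less than the package itself has to cover.

```r
# Hypothetical helper (not a filterRead function): translate a simple R
# filter expression into an awk condition, given the file's column order.
to_awk_condition <- function(expr, columns) {
  if (is.call(expr)) {
    op <- as.character(expr[[1]])
    # Keep explicit parentheses so grouping survives the translation.
    if (op == "(") {
      return(paste0("(", to_awk_condition(expr[[2]], columns), ")"))
    }
    # R's vectorized & and | become awk's && and ||.
    awk_ops <- c("&" = "&&", "|" = "||", "==" = "==", "!=" = "!=",
                 "<" = "<", "<=" = "<=", ">" = ">", ">=" = ">=")
    stopifnot(op %in% names(awk_ops), length(expr) == 3)
    return(paste(
      to_awk_condition(expr[[2]], columns),
      awk_ops[[op]],
      to_awk_condition(expr[[3]], columns)
    ))
  }
  if (is.name(expr)) {
    # Column names become awk field references ($1, $2, ...).
    index <- match(as.character(expr), columns)
    stopifnot(!is.na(index))
    return(paste0("$", index))
  }
  # Anything else is treated as a literal value.
  deparse(expr)
}

columns <- c("chr", "pos", "pval")
simple <- to_awk_condition(quote(chr == 1 & pval < 5e-8), columns)
grouped <- to_awk_condition(quote((chr == 2 & 12345 < pos) | chr == 3), columns)
print(simple)
print(grouped)
```

Because comparisons bind more tightly than the logical operators in both R and awk, and & binds more tightly than | just as && does over ||, walking the parsed expression and swapping operators preserves its meaning for these simple cases.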
- variant_id: Variant identifier
- chr: Chromosome
- pos: Position
- ref: Reference (non-effect) allele
- alt: Alternate (effect) allele
- allele1, allele2: Unordered allele pair, resolved into ref/alt via dbSNP matching
- rsid: dbSNP reference SNP ID
- effect: Estimated effect of the allele
- effect_se: Standard error of the effect
- pval: P-value
- log10p: Negative log10 p-value
- zscore: Z-score
- odds_ratio: Odds ratio
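
As a rough guide to how the statistical columns usually relate to one another, the sketch below derives them from effect and effect_se in base R. It uses made-up values and assumes a two-sided Wald test, and that effect is a log odds ratio when odds_ratio is computed. It does not describe which of these conversions filterRead performs.

```r
# Made-up summary statistics for two variants.
stats <- data.frame(
  effect    = c(0.20, -0.05),
  effect_se = c(0.04, 0.05)
)

# Z-score: effect divided by its standard error (Wald statistic).
stats$zscore <- stats$effect / stats$effect_se

# Two-sided p-value under a standard normal null distribution.
stats$pval <- 2 * pnorm(-abs(stats$zscore))

# Negative log10 p-value: larger values mean stronger evidence.
stats$log10p <- -log10(stats$pval)

# Odds ratio: only meaningful if effect is a log odds ratio (assumption).
stats$odds_ratio <- exp(stats$effect)

print(stats)
```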
- R (>= 4.1.0)
- awk (available on most Unix-like systems)
- tabix (for RSID matching functionality)
- curl
- data.table (>= 1.16.0)
- purrr
- rlang
- stringr
GPL (>= 3)
Jonathan Sulc (jonsulc@gmail.com)
Contributions are welcome! Please feel free to submit issues or pull requests.