filterRead

Fast file filtering using awk for efficient data loading with data.table::fread.

Overview

filterRead is an R package that provides a high-performance interface for loading GWAS summary statistics from files. Its main features are a standardized interface and fast, awk-powered filtering before the data are loaded into memory. This simplifies loading summary statistics from different sources by transparently handling differences in column names, compression, field separators, and more, and returning a data.table with standardized column names for downstream analyses.

Features

Fast filtering: Leverages awk to filter rows before loading into R
Automatic file detection: Handles CSV, TSV, and gzipped files automatically
Column naming: Automatically converts columns to standardized names
Missing columns: Missing columns (e.g., pos) are inferred from existing columns (e.g., variant_id)
Genomic data support: Built-in support for chromosome/position filtering and RSID matching
R-style syntax: Write filter conditions using familiar R expressions
Complex conditions: Supports logical operators (&, |), comparisons, and pattern matching
Memory efficient: Only loads filtered rows into memory
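
To illustrate the idea behind inferring missing columns, here is a minimal base-R sketch. It assumes variant identifiers of the form chr:pos:ref:alt, which is only one of several conventions found in real files. The helper split_variant_id is hypothetical and is not part of filterRead; the package's own logic may differ.

```r
# Hypothetical helper (not a filterRead function): recover chr and pos
# from variant IDs that are ASSUMED to look like "chr:pos:ref:alt".
split_variant_id <- function(variant_id) {
  parts <- strsplit(variant_id, ":", fixed = TRUE)
  # Return the i-th field of each ID, or NA if the ID has too few fields
  field <- function(i) {
    vapply(
      parts,
      function(p) if (length(p) >= i) p[[i]] else NA_character_,
      character(1)
    )
  }
  data.frame(
    chr = sub("^chr", "", field(1)),  # drop an optional "chr" prefix
    pos = as.integer(field(2)),
    stringsAsFactors = FALSE
  )
}

inferred <- split_variant_id(c("1:12345:A:G", "chrX:42:T:C"))
print(inferred)
```

Chromosomes are kept as character strings here so that values such as X or MT survive the conversion.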

Installation

# Install from GitHub (requires the remotes package)
remotes::install_github("JonSulc/filterRead")

Quick Start

Basic Usage

library(filterRead)

# Create a file interface
# This doesn't load any data, only checks the file formatting
finterface <- new_file_interface("gwas_summary_stats.txt.gz")

# Check how the processed table will look
head(finterface)

# Filter using R-style conditions and load into memory
significant_hits <- finterface[chr == 1 & pval < 5e-8]
genomic_regions <- finterface[
  (chr == 2 & 12345 < pos & pos < 23456) |
  (chr == 3 & 42 < pos & pos < 4242)
]

How It Works

File interface creation: Automatically detects file format, separator, column names, and structure
Condition parsing: R expressions are parsed and translated to awk-compatible filter conditions
Efficient filtering: awk processes the file and filters rows before data reaches R
Data loading: Only filtered rows are loaded using data.table::fread
Header reconstruction: Column names in the loaded table are replaced with the standard names
Missing data: Where possible, missing fields are reconstructed from other columns (e.g., extracting pos from MarkerName)
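
The condition-parsing step can be sketched in base R as follows. This is a simplified illustration, not filterRead's actual implementation: r_to_awk is a hypothetical function, it supports only comparisons, &, |, and parentheses, and it assumes the column order of the file is already known.

```r
# Hypothetical translator (not filterRead's API): turn a restricted R
# filter expression into an awk condition, mapping column names to $N.
r_to_awk <- function(expr, columns) {
  binary_ops <- c(
    "==" = "==", "!=" = "!=", "<" = "<", "<=" = "<=",
    ">" = ">", ">=" = ">=", "&" = "&&", "|" = "||"
  )
  # Literals are emitted as R would deparse them (e.g. 1, "X")
  if (is.numeric(expr) || is.character(expr)) {
    return(deparse(expr))
  }
  # Bare names are looked up in the file's column order
  if (is.name(expr)) {
    idx <- match(as.character(expr), columns)
    if (is.na(idx)) stop("Unknown column: ", as.character(expr))
    return(paste0("$", idx))
  }
  if (is.call(expr) && is.name(expr[[1]])) {
    op <- as.character(expr[[1]])
    # Parentheses written by the user are preserved in the output
    if (op == "(") {
      return(paste0("(", r_to_awk(expr[[2]], columns), ")"))
    }
    if (op %in% names(binary_ops) && length(expr) == 3) {
      return(paste(
        r_to_awk(expr[[2]], columns),
        binary_ops[[op]],
        r_to_awk(expr[[3]], columns)
      ))
    }
  }
  stop("Unsupported expression: ", paste(deparse(expr), collapse = " "))
}

columns <- c("chr", "pos", "pval")
condition <- r_to_awk(quote((chr == 2 & pos < 23456) | chr == 3), columns)
print(condition)
```

A string like this could then be embedded in an awk command and handed to data.table::fread through its cmd argument, so that only matching rows reach R. Header handling, quoting, and decompression are left out of this sketch.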

Standard column names

variant_id: Variant identifier
chr: Chromosome
pos: Position
ref: Reference (non-effect) allele
alt: Alternate (effect) allele
allele1, allele2: Unordered allele pair, resolved into ref/alt via dbSNP matching
rsid: dbSNP reference SNP id
effect: Estimated effect of the allele
effect_se: Standard error of the effect
pval: P-value
log10p: Negative log10 p-value
zscore: Z-score
odds_ratio: Odds ratio
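
The statistical columns above are related by standard formulas, sketched below in base R. The example assumes a two-sided test on a normally distributed statistic and, for odds_ratio, that effect is a log odds ratio. Whether and how filterRead derives one column from another is not stated here, so treat this as background rather than a description of the package.

```r
# Toy data; the values are made up for illustration only
sumstats <- data.frame(
  effect    = c(0.10, -0.25),
  effect_se = c(0.02, 0.10)
)

# Z-score: effect divided by its standard error
sumstats$zscore <- sumstats$effect / sumstats$effect_se

# Two-sided p-value under a standard normal null (assumption)
sumstats$pval <- 2 * pnorm(-abs(sumstats$zscore))

# Negative log10 p-value, computed on the log scale so that very small
# p-values do not underflow to zero
sumstats$log10p <- -(log(2) + pnorm(-abs(sumstats$zscore), log.p = TRUE)) / log(10)

# Odds ratio, ASSUMING effect is a log odds ratio
sumstats$odds_ratio <- exp(sumstats$effect)

print(sumstats)
```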

System Requirements

R (>= 4.1.0)
awk (available on most Unix-like systems)
tabix (for RSID matching functionality)
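
A quick way to check these requirements from within R is sketched below. It uses only base R; the variable names are illustrative, and the check is not something filterRead is documented to provide.

```r
# Check the R version against the stated minimum
r_ok <- getRversion() >= "4.1.0"

# Sys.which() returns "" for tools that are not found on the PATH
required_tools <- c("awk", "tabix")
tool_paths <- Sys.which(required_tools)
missing_tools <- names(tool_paths)[tool_paths == ""]

if (!r_ok) {
  message("R >= 4.1.0 is required; found ", as.character(getRversion()))
}
if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
} else {
  message("awk and tabix are both available")
}
```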

R Package Dependencies

curl
data.table (>= 1.16.0)
purrr
rlang
stringr

License

GPL (>= 3)

Author

Jonathan Sulc (jonsulc@gmail.com)

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.
