Adrià Setó Llorens, Predoctoral Researcher at the Barcelona Institute for Global Health (ISGlobal).
Augusto Anguita-Ruiz, Junior Leader Researcher at the Barcelona Institute for Global Health (ISGlobal).
The multi-omics approach aims to integrate diverse layers of biological information—such as genomics, transcriptomics, proteomics, metabolomics, and epigenomics—to achieve a more comprehensive understanding of biological systems and disease mechanisms. Each omic layer captures a distinct yet interconnected level of cellular regulation, and their integration enables the identification of molecular interactions that cannot be detected through single-omic analyses alone. The main advantage of multi-omics integration over traditional single-omic studies lies in its ability to uncover cross-level biological relationships and multi-factorial drivers of phenotypes, improving prediction accuracy and mechanistic insight. This systems-level perspective supports the discovery of key biomarkers, regulatory networks, and potential therapeutic targets.
Many multi-omics integration algorithms exist, each suited to different analytical goals. They can be classified by whether they are supervised or unsupervised and by whether they perform variable selection. In this session, we will focus on Regularized Generalized Canonical Correlation Analysis (RGCCA).
The objective of this session is to offer an introduction to multi-omics integration analysis using RGCCA. We will:
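For orientation, the optimization problem behind RGCCA is usually written as below. This is the general formulation from the RGCCA literature, not something specific to this tutorial, and the symbols are introduced here only for illustration.

$$
\max_{a_1,\dots,a_J} \; \sum_{\substack{j,k=1 \\ j \neq k}}^{J} c_{jk}\, g\!\left(\operatorname{cov}(X_j a_j,\, X_k a_k)\right)
\quad \text{subject to} \quad
(1-\tau_j)\,\operatorname{var}(X_j a_j) + \tau_j \lVert a_j \rVert^2 = 1, \qquad j = 1,\dots,J
$$

Here $X_j$ is the $j$-th data block (for example, one omic layer), $a_j$ is its weight vector, $c_{jk}$ is 1 if blocks $j$ and $k$ are connected in the design and 0 otherwise, $g$ is the scheme function (identity, absolute value, or square), and $\tau_j \in [0, 1]$ is the regularization (shrinkage) parameter of block $j$. Values of $\tau_j$ close to 1 favor covariance between block components, while values close to 0 favor correlation.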
- Load the data
- Preprocess the data
- Perform multi-omics integration
- Understand the results of multi-omics integration
- Evaluate the algorithm’s performance
We will integrate multi-omics data — including proteomics, urine and serum metabolomics, gene expression, and DNA methylation — using Regularized Generalized Canonical Correlation Analysis (RGCCA). The outcome variable will be standardized body mass index (zBMI) at 9 years old. The objective of this analysis is to identify multi-omic signatures predictive of BMI in later childhood while gaining a hands-on understanding of the application of RGCCA to multi-omics data integration.
For this practical tutorial, we will use data from the HELIX exposome study. The HELIX study is a collaborative project between six longitudinal, population-based birth cohort studies from six European countries (France, Greece, Lithuania, Norway, Spain and the UK).
Note: The data provided in this introductory course were simulated from the HELIX sub-cohort data. Details of the HELIX project and the origin of the data collected can be found in the following publication: BMJ Open - HELIX and on the project website. Additional details about the dataset can be found in the official repository at https://github.com/isglobal-exposomeHub/ExposomeDataChallenge2021.
The repository contains the following documents:
- WORKSHOP_MULTIOMICS_INTEGRATION.ipynb: the notebook for the practical tutorial, with the code needed to perform the multi-omic integration using RGCCA.
- WORKSHOP_MULTIOMICS_INTEGRATION.Rmd: the R Markdown version of the tutorial, with all the code needed to perform the multi-omic integration using RGCCA locally.
- WORKSHOP_MULTIOMICS_INTEGRATION.html: a rendered version of the tutorial showing the code and results of the multi-omic integration using RGCCA.
- Functions: This directory contains all the functions used in this session. These functions are stored in separate files to keep the notebook clean and easy to follow. For more details, you can consult the files in this directory.
- RGCCA modified package.
This is the dataset we will use:
- Exposome data (n=1301): Rdata file containing three objects:
  - 1 object for exposures: `exposome`
  - 1 object for covariates: `covariates`
  - 1 object for outcomes: `phenotype`

  The three tables can be linked using the ID variable. See the codebook for the variable descriptions (variable name, domain, type of variable, transformation, ...).
- Omic data: exposome and omic data can be linked using the ID variable.
  - Proteome: ExpressionSet called `proteome` of 1170 individuals and 39 proteins (log-transformed) that are annotated in the `ExpressionSet` object (use `fData(proteome)` after loading the `Biobase` Bioconductor package).
  - Gene expression: ExpressionSet called `genexpr` of 1007 individuals and 28,738 transcripts with annotated gene symbols.
  - Methylation: GenomicRatioSet called `methy` of 918 individuals and 386,518 CpGs.
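The ID-based linking described above can be sketched in base R. This is a minimal illustration on toy data frames, not the tutorial's own code: the values, the column names other than `ID`, and the `omic_ids` vector are all made up for the example.

```r
# Toy stand-ins for the three exposome tables (all values are invented).
exposome   <- data.frame(ID = c(1, 2, 3, 4), exposure_a = c(0.2, 1.5, 0.7, 0.9))
covariates <- data.frame(ID = c(1, 2, 3, 4), sex = c("male", "female", "female", "male"))
phenotype  <- data.frame(ID = c(1, 2, 3, 4), zbmi = c(-0.3, 1.1, 0.4, 0.0))

# Link the three tables on the shared ID column (merge() does an inner join by default).
linked <- Reduce(function(x, y) merge(x, y, by = "ID"),
                 list(exposome, covariates, phenotype))

# Hypothetical sample IDs available in one omic layer; only some overlap.
omic_ids <- c(2, 3, 4, 5)

# Keep only the individuals present in both the exposome tables and the omic layer.
common_ids  <- intersect(linked$ID, omic_ids)
linked_omic <- linked[linked$ID %in% common_ids, ]

print(linked_omic)
```

Because the omic layers listed above cover different numbers of individuals (1170, 1007 and 918), restricting every table to the IDs they share is a common first step before integration.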
The variables that are available in the metadata are:
- ID: identification number
- e3_sex: gender (male, female)
- age_sample_years: age (in years)
- h_ethnicity_cauc: Caucasian ethnicity (yes, no)
- ethn_PC1: first principal component used to address population stratification
- ethn_PC2: second principal component used to address population stratification
- Cell-type estimates (only for methylation): NK_6, Bcell_6, CD4T_6, CD8T_6, Gran_6, Mono_6
This notebook will guide you step by step, from loading a dataset to analyzing it.
Getting Started:
- Open multiomics_integration_tutorial.ipynb and click “Open in Colab” (sign in with your Google account if needed).
- Select “Open in draft mode” at the top left so you can run the code safely.
- If you see "Warning: This notebook was not created by Google.", don’t worry—just click Run anyway.
How to Use the Notebook:
- The notebook mixes text explanations and code cells for hands-on learning.
- Always run the cells in order, from top to bottom, to avoid errors.
- Click the play button next to a cell, or press Ctrl+Enter (Cmd+Enter on Mac).
- Lines starting with # are comments for guidance; they won’t affect the code.
- Outputs appear below each cell, showing results and any printed messages.