This Google Colaboratory notebook provides an introduction and practical guide to generating synthetic data using the MIT SynthDataVault library. It demonstrates how to leverage this powerful tool in Python to create realistic synthetic datasets whilst preserving the statistical properties and privacy of original data.
CITATION Dr. Genevieve Smith-Nunes. (2025). pegleggen/SynthDataSingularity: Synthetic Data Generation with Python. Zenodo. https://doi.org/10.5281/zenodo.20284499
- Overview
- Features
- Getting Started
- Usage
- Key Concepts
- Results
- Troubleshooting
- Contributing
- Licence
- Acknowledgements
Synthetic data generation is the process of creating artificial datasets that preserve the statistical properties of an original dataset while protecting the privacy of the records in it. This Colab focuses on generating synthetic data from a given dataset and comparing the result with the original. We'll be using the `synthdatavault` library to set up a generative model and sample synthetic data from it.
- Easy Setup: Directly runnable in Google Colab with minimal configuration.
- MIT SynthDataVault Integration: Demonstrates core functionalities of the `synthdatavault` library.
- Practical Examples: Includes code examples for generating synthetic data from a given dataset.
- Data Analysis: Shows how to compare the statistical properties of real and synthetic data.
- Privacy Considerations: Highlights the benefits of synthetic data in privacy-sensitive scenarios.
To get started with this Colab, you'll need a Google account to access Google Colab.
This notebook doesn't have specific hardware requirements and can run efficiently on a CPU runtime. You generally don't need a GPU for synthetic data generation with synthdatavault.
To check or change your runtime type:
- Go to `Runtime` in the Colab menu.
- Select `Change runtime type`.
- Ensure `None` (for CPU) is selected under `Hardware accelerator` if you wish to stick with CPU, or keep GPU if it's already selected and you prefer.
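To confirm from inside a cell which runtime you ended up with, a quick check like the following may help. It is a minimal sketch that assumes a GPU runtime exposes the `nvidia-smi` tool on the `PATH`; the helper name `gpu_visible` is hypothetical and not part of the notebook.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA tooling found: assume a CPU-only runtime.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU runtime" if gpu_visible() else "CPU runtime")
```

Either answer is fine for this notebook, since synthetic data generation here runs on a CPU.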
No local installation is required. Simply open the Colab notebook in your browser:
https://colab.research.google.com/drive/1_GPjenw4voPWL7STov81c3XeXISgTPnb?usp=sharing
Once opened, you can run the cells sequentially. The first few cells will handle the installation of necessary libraries.
The notebook is structured to guide you through the process of generating synthetic data. You will:
- Install necessary libraries: This will primarily involve installing `synthdatavault` and other data manipulation/visualisation libraries (e.g., `pandas`, `matplotlib`).
- Load a sample dataset: The Colab will use a publicly available or generated sample dataset for demonstration.
- Initialise and configure `synthdatavault`: Learn how to set up the synthetic data generation model.
- Generate synthetic data: Execute the generation process.
- Compare real vs. synthetic data: Visualise and analyse the statistical similarities and differences between the original and synthetic datasets.
Simply execute each cell in the notebook in order. Explanations are provided within the notebook to clarify each step and the underlying concepts.
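The load, fit, generate and compare steps above can be sketched end to end without the library itself. The example below is only an illustration under stated assumptions: it uses a generated two-column toy dataset and a deliberately simple stand-in model (an independent Gaussian fitted to each column), and it does not use the `synthdatavault` API. The column names and the helpers `fit` and `sample` are hypothetical.

```python
import random
import statistics

random.seed(0)

# Load step: a generated sample dataset with two hypothetical columns.
real = {
    "age": [random.gauss(40, 10) for _ in range(500)],
    "income": [random.gauss(30000, 8000) for _ in range(500)],
}


def fit(data):
    """Stand-in model: store the mean and standard deviation of each column."""
    return {
        col: (statistics.mean(values), statistics.stdev(values))
        for col, values in data.items()
    }


def sample(model, n_rows, rng):
    """Draw n_rows synthetic values per column from the fitted Gaussians."""
    return {
        col: [rng.gauss(mu, sigma) for _ in range(n_rows)]
        for col, (mu, sigma) in model.items()
    }


# Initialise and generate steps.
model = fit(real)
synthetic = sample(model, 500, random.Random(1))

# Compare step: summary statistics of real vs. synthetic columns.
for col in real:
    real_mean = statistics.mean(real[col])
    synth_mean = statistics.mean(synthetic[col])
    print(f"{col}: real mean {real_mean:.1f}, synthetic mean {synth_mean:.1f}")
```

Because each column is modelled independently, this stand-in discards any correlation between columns; a proper generative model is needed to preserve those relationships.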
The Colab will likely touch upon concepts fundamental to synthetic data generation, including:
- Differential Privacy: A strong, mathematically rigorous definition of privacy protection often implemented in synthetic data generation.
- Generative Models: The underlying machine learning models (e.g., GANs, VAEs, or statistical models) used by `synthdatavault` to learn data distributions.
- Data Utility: Metrics and methods to assess how well the synthetic data preserves the statistical properties and analytical utility of the original data.
- Privacy-Utility Trade-off: The inherent balance between preserving privacy and maintaining data utility.
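To make differential privacy and the privacy-utility trade-off concrete, here is a minimal sketch of the Laplace mechanism applied to a mean. It is a textbook illustration, not how `synthdatavault` implements privacy; the dataset, the bounds `[0, 100]`, and the helper names are assumptions, and the number of records is treated as public.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Epsilon-DP estimate of the mean of values clipped to [lower, upper]."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Replacing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    true_value = sum(clipped) / len(clipped)
    return true_value + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)
ages = [rng.uniform(18, 90) for _ in range(1000)]  # hypothetical data
true_mean = sum(ages) / len(ages)


def mean_abs_error(epsilon, trials=200):
    """Average absolute error of the private mean over repeated releases."""
    errors = [
        abs(private_mean(ages, 0, 100, epsilon, rng) - true_mean)
        for _ in range(trials)
    ]
    return sum(errors) / trials


errors = {eps: mean_abs_error(eps) for eps in (0.01, 0.1, 1.0)}
for eps, err in errors.items():
    print(f"epsilon={eps}: mean absolute error {err:.3f}")
```

A smaller `epsilon` gives a stronger privacy guarantee but adds more noise, so the error grows as `epsilon` shrinks.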
After running the notebook, you will observe:
- The generated synthetic dataset.
- Visualisations (e.g., histograms, scatter plots, correlation matrices) comparing the original and synthetic data distributions.
- Potentially, quantitative metrics to assess the similarity and utility of the synthetic data.
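One common quantitative similarity metric for a single numeric column is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical cumulative distribution functions of the real and synthetic values. The sketch below is a plain-Python illustration on generated data; whether the notebook reports this particular metric is an assumption, and the function name is hypothetical.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    largest_gap = 0.0
    # The gap between two step functions can only change at observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


rng = random.Random(7)
real_col = [rng.gauss(0, 1) for _ in range(1000)]
good_synth = [rng.gauss(0, 1) for _ in range(1000)]  # same distribution
poor_synth = [rng.gauss(1, 1) for _ in range(1000)]  # shifted distribution

ks_good = ks_statistic(real_col, good_synth)
ks_poor = ks_statistic(real_col, poor_synth)
print(f"KS (good synthetic): {ks_good:.3f}")
print(f"KS (poor synthetic): {ks_poor:.3f}")
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), so lower values indicate a closer match for that column.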
- GPU Not Available: This notebook runs on a CPU runtime, so a GPU is not required. If the runtime misbehaves, try restarting it (`Runtime -> Restart runtime`).
- Out of Memory Errors: If you're running into memory issues, try working with a smaller sample of the dataset, or reduce the `batch_size` if the model you are training exposes one.
- Installation Issues: If a library fails to install, check your internet connection or try restarting the Colab session.
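When diagnosing installation issues, it can help to check which packages the current runtime can actually import. The following is a small sketch using only the standard library; the package list is taken from this README, and it is an assumption that the import name of the library matches `synthdatavault`.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names, taken from the libraries mentioned in this README.
required = ["synthdatavault", "pandas", "matplotlib"]
missing = missing_packages(required)
print("Missing:", missing if missing else "none")
```

If a package is still listed as missing after the installation cells have run, re-run those cells or restart the session and try again.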
Contributions to this Colab are welcome! If you have suggestions for improvements, bug fixes, or new features, feel free to:
- Fork the original notebook.
- Make your changes.
- Share your improved version, perhaps with a brief explanation of your modifications.
This project is open-source and available under the MIT Licence.
- The MIT Lincoln Laboratory for developing and open-sourcing the `SynthDataVault` library.
- Google Colaboratory for providing a free and accessible platform for machine learning and data science.
- The broader data privacy and synthetic data communities for their ongoing research and development.