Skip to content

pegleggen/SynthDataSingularity

Repository files navigation

SynthData_Singularity

Colab: Synthetic Data Generation with MIT SynthDataVault

This Google Colaboratory notebook provides an introduction and practical guide to generating synthetic data using the MIT SynthDataVault library. It demonstrates how to leverage this powerful tool in Python to create realistic synthetic datasets whilst preserving the statistical properties and privacy of original data.

CITATION Dr. Genevieve Smith-Nunes. (2025). pegleggen/SynthDataSingularity: Synthetic Data Generation with Python. Zenodo. https://doi.org/10.5281/zenodo.20284499

Table of Contents

Overview

Automatic text summarisation is the process of creating a concise and coherent summary of a longer document. This Colab focuses on extractive summarisation, where the summary is formed by selecting important sentences directly from the original text. We'll be using the powerful 🤗 Transformers library, which provides thousands of pre-trained models, to fine-tune a model specifically for this purpose.

Features

  • Easy Setup: Directly runnable in Google Colab with minimal configuration.
  • MIT SynthDataVault Integration: Demonstrates core functionalities of the synthdatavault library.
  • Practical Examples: Includes code examples for generating synthetic data from a given dataset.
  • Data Analysis: Shows how to compare the statistical properties of real and synthetic data.
  • Privacy Considerations: Highlights the benefits of synthetic data in privacy-sensitive scenarios.

Getting Started

To get started with this Colab, you'll need a Google account to access Google Colab.

Hardware Requirements

This notebook doesn't have specific hardware requirements and can run efficiently on a CPU runtime. You generally don't need a GPU for synthetic data generation with synthdatavault.

To check or change your runtime type:

  1. Go to Runtime in the Colab menu.
  2. Select Change runtime type.
  3. Ensure None (for CPU) is selected under Hardware accelerator if you wish to stick with CPU, or keep GPU if it's already selected and you prefer.

Setup

No local installation is required. Simply open the Colab notebook in your browser:

https://colab.research.google.com/drive/1_GPjenw4voPWL7STov81c3XeXISgTPnb?usp=sharing

Once opened, you can run the cells sequentially. The first few cells will handle the installation of necessary libraries.

Usage

The notebook is structured to guide you through the process of generating synthetic data. You will:

  1. Install necessary libraries: This will primarily involve installing synthdatavault and other data manipulation/visualisation libraries (e.g., pandas, matplotlib).
  2. Load a sample dataset: The Colab will use a publicly available or generated sample dataset for demonstration.
  3. Initialise and configure synthdatavault: Learn how to set up the synthetic data generation model.
  4. Generate synthetic data: Execute the generation process.
  5. Compare real vs. synthetic data: Visualise and analyse the statistical similarities and differences between the original and synthetic datasets.

Simply execute each cell in the notebook in order. Explanations are provided within the notebook to clarify each step and the underlying concepts.

Key Concepts

The Colab will likely touch upon concepts fundamental to synthetic data generation, including:

  • Differential Privacy: A strong, mathematically rigorous definition of privacy protection often implemented in synthetic data generation.
  • Generative Models: The underlying machine learning models (e.g., GANs, VAEs, or statistical models) used by synthdatavault to learn data distributions.
  • Data Utility: Metrics and methods to assess how well the synthetic data preserves the statistical properties and analytical utility of the original data.
  • Privacy-Utility Trade-off: The inherent balance between preserving privacy and maintaining data utility.

Results

After running the notebook, you will observe:

  • The generated synthetic dataset.
  • Visualisations (e.g., histograms, scatter plots, correlation matrices) comparing the original and synthetic data distributions.
  • Potentially, quantitative metrics to assess the similarity and utility of the synthetic data.

Troubleshooting

  • GPU Not Available: Ensure you've correctly set the runtime type to GPU. If you encounter issues, try restarting the runtime (Runtime -> Restart runtime).
  • Out of Memory Errors: If you're running into memory issues, try reducing the batch_size in your training arguments. You might also consider using a smaller pre-trained model if available.
  • Installation Issues: If a library fails to install, check your internet connection or try restarting the Colab session.

Contributing

Contributions to this Colab are welcome! If you have suggestions for improvements, bug fixes, or new features, feel free to:

  1. Fork the original notebook.
  2. Make your changes.
  3. Share your improved version, perhaps with a brief explanation of your modifications.

Licence

This project is open-source and available under the MIT Licence.

Acknowledgements

  • The MIT Lincoln Laboratory for developing and open-sourcing the SynthDataVault library.
  • Google Colaboratory for providing a free and accessible platform for machine learning and data science.
  • The broader data privacy and synthetic data communities for their ongoing research and development.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors