IFCA-Advanced-Computing/trasgoDP

TrasgoDP implements different mechanisms for ε-differential privacy (numerical and categorical data), (ε, δ)-differential privacy (numerical data) and metric privacy (location-based data). The mechanisms are designed to be used under a local approach, adding noise directly to the raw data. Three types of mechanisms are implemented:

  • For numerical records: Laplace and Gaussian mechanisms. The implementation includes a final clipping step applied to the privatized data.
  • For categorical records: Exponential mechanism and Randomized Response (both for binary attributes and the k-ary version).
  • For location-based records: geo-indistinguishability mechanism for metric privacy.
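
To illustrate the idea behind the numerical mechanisms, here is a minimal sketch of a Laplace mechanism followed by clipping, using only the standard library. It is not trasgoDP's implementation: the helper name `laplace_with_clipping`, its parameters, and the choice of sensitivity (upper bound minus lower bound, a common choice in the local model) are assumptions made for this example.

```python
import random


def laplace_with_clipping(values, epsilon, lower, upper, seed=None):
    """Add Laplace noise to each value, then clip to [lower, upper].

    Hypothetical helper for illustration only; not part of trasgoDP.
    Assumes every input value already lies in [lower, upper].
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    rng = random.Random(seed)
    # Local model: two inputs in [lower, upper] differ by at most upper - lower.
    sensitivity = upper - lower
    scale = sensitivity / epsilon
    noisy = []
    for v in values:
        # The difference of two i.i.d. exponential draws is Laplace-distributed.
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        # Clipping is post-processing, so it consumes no extra privacy budget.
        noisy.append(min(max(v + noise, lower), upper))
    return noisy


ages = [23, 35, 47, 61]
private_ages = laplace_with_clipping(ages, epsilon=10, lower=17, upper=90, seed=0)
```

A smaller ε gives a larger noise scale, so more of the noisy values end up at the clipping bounds.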

This library provides dedicated functions designed to be applied to both pandas dataframes and lists/numpy arrays.

Installation

You can install trasgoDP using pip. We recommend using Python 3 with virtualenv:

virtualenv .venv -p python3
source .venv/bin/activate
pip install trasgoDP
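
To confirm that the installation succeeded, one option is to query the installed distribution from Python. The snippet below is a small sketch using the standard `importlib.metadata` module; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if trasgoDP is installed in the active environment.
print(installed_version("trasgoDP"))
```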

Mechanisms implemented

Mechanism                 | Attribute type       | Function in trasgoDP
Laplace                   | Numerical            | numerical.dp_clip_laplace()
Gaussian                  | Numerical            | numerical.dp_clip_gaussian()
Exponential               | Categorical          | categorical.dp_exponential()
Randomized response       | Categorical (binary) | categorical.dp_randomized_response_binary()
k-ary randomized response | Categorical          | categorical.dp_randomized_response_kary()
Geo-indistinguishability  | Location             | geoindis.metric_privacy()
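
For intuition about the randomized response entries in the table above, the following sketch shows the textbook k-ary randomized response: the true category is reported with probability e^ε / (e^ε + k − 1), and otherwise one of the other k − 1 categories is reported uniformly at random. The binary version is the special case k = 2. The function name and signature are assumptions for this example and may differ from what trasgoDP does internally.

```python
import math
import random


def kary_randomized_response(values, categories, epsilon, seed=None):
    """Hypothetical sketch of k-ary randomized response; not trasgoDP's code."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    rng = random.Random(seed)
    # Probability of reporting the true category.
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    result = []
    for v in values:
        if rng.random() < p_truth:
            result.append(v)
        else:
            # Otherwise report one of the other categories uniformly at random.
            result.append(rng.choice([c for c in categories if c != v]))
    return result


sex = ["Male", "Female", "Female", "Male"]
private_sex = kary_randomized_response(sex, ["Male", "Female"], epsilon=1.0, seed=0)
```

As ε grows, the probability of reporting the truth approaches 1 and the output converges to the original data.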

Getting started

To apply DP mechanisms to a column of a dataframe, you need to provide:

  • The pandas dataframe with the data.
  • The column in the dataframe to be privatized.
  • The privacy budget (ε).
  • The probability of exceeding the privacy budget (δ), only for numerical attributes with the Gaussian mechanism.
  • The upper and lower bounds for numerical attributes (optional).
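
To show how ε and δ interact in the Gaussian mechanism, the sketch below computes the classical noise scale σ = sqrt(2 ln(1.25/δ)) · Δ / ε, which is valid for 0 < ε < 1. This is the textbook calibration and only an assumption here: trasgoDP may calibrate its Gaussian noise differently. The helper name `gaussian_sigma` and the `sensitivity` parameter (Δ) are hypothetical.

```python
import math


def gaussian_sigma(epsilon, delta, sensitivity):
    """Classical noise scale for the Gaussian mechanism (hypothetical helper)."""
    if not 0 < epsilon < 1:
        raise ValueError("the classical bound requires 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon


# Noise scale for a unit-sensitivity attribute.
sigma = gaussian_sigma(epsilon=0.5, delta=1e-5, sensitivity=1.0)
```

Halving ε doubles σ, while lowering δ increases σ only logarithmically.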

Example: apply DP to the adult dataset with the Laplace mechanism for the column age and the Exponential mechanism for the column workclass:

import pandas as pd
from trasgodp.numerical import dp_clip_laplace
from trasgodp.categorical import dp_exponential

# Read and process the data
data = pd.read_csv("examples/adult.csv")
data.columns = data.columns.str.strip()
cols = [
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "sex",
    "native-country",
]
for col in cols:
    data[col] = data[col].str.strip()

# Apply DP for the attribute age:
column_num = "age"
epsilon1 = 10
df = dp_clip_laplace(data, column_num, epsilon1, new_column=True)

# Apply DP for the attribute workclass, reusing df so the privatized age is kept:
column_cat = "workclass"
epsilon2 = 5
df = dp_exponential(df, column_cat, epsilon2, new_column=True)

To apply metric privacy to location data, you need to provide:

  • The pandas dataframe with the data.
  • The name of the latitude column (column_lat) in the dataframe.
  • The name of the longitude column (column_lon) in the dataframe.
  • The privacy budget (ε).
  • Whether or not to create two new columns containing the privatized latitude and longitude coordinates.
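
For intuition about what a geo-indistinguishability mechanism does to a single point, here is a minimal sketch of the planar Laplace perturbation: a direction is drawn uniformly and a distance is drawn from a Gamma distribution with shape 2 and scale 1/ε. It is not trasgoDP's implementation. The helper name, the assumption that ε is expressed per metre, and the small-displacement conversion from metres to degrees are all choices made for this example.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def planar_laplace_point(lat, lon, epsilon, rng=None):
    """Hypothetical sketch of a planar Laplace perturbation; not trasgoDP's code.

    epsilon is assumed to be expressed per metre.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    # Direction: uniform angle. Distance: Gamma(shape=2, scale=1/epsilon),
    # which is the radial distribution of the planar Laplace density.
    theta = rng.uniform(0, 2 * math.pi)
    radius = rng.gammavariate(2, 1 / epsilon)
    # Convert the offset in metres to degrees (inaccurate close to the poles).
    d_lat = radius * math.sin(theta) / EARTH_RADIUS_M
    d_lon = radius * math.cos(theta) / (EARTH_RADIUS_M * math.cos(math.radians(lat)))
    return lat + math.degrees(d_lat), lon + math.degrees(d_lon)


noisy_lat, noisy_lon = planar_laplace_point(43.47, -3.80, epsilon=1e-3, rng=random.Random(0))
```

Under these assumptions the expected displacement is 2/ε, so ε = 1e-3 per metre moves a point by about 2 km on average.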

Example: apply metric privacy to the earthquake dataset and plot the map:

import pandas as pd
from trasgodp.geoindis import metric_privacy, plot_metric_dp_map

# Read the data
data = pd.read_csv("./examples/earthquake_data.csv")
column_lat = "latitude"
column_lon = "longitude"

# Apply metric privacy creating new columns for lat and lon:
epsilon = 1e-3
data_priv = metric_privacy(data, column_lat, column_lon, epsilon, new_cols=True)

# Plot and save the map:
plot_metric_dp_map(data_priv, column_lat, column_lon, save_file="example_map.html")

Example: resulting interactive map (image: map preview).

Warning

This project is under active development.

License

This project is licensed under the Apache 2.0 license.

Related work

If you are using trasgoDP, you may also be interested in:

  • pyCANON: a Python library for checking the level of anonymity of a dataset.
  • anjana: a Python library for anonymizing tabular datasets.

Funding and acknowledgments

This work is funded by the European Union through the SIESTA project (Horizon Europe) under Grant number 101131957.

About

TrasgoDP implements a set of mechanisms for Local Differential Privacy (LDP) for numerical and categorical records, and metric privacy for location-based ones. It is particularly well-suited for generating synthetic versions of a dataset using mechanisms that ensure differential privacy.
