Yash-Vekaria/pixel-config

PixelConfig: Longitudinal Measurement and Reverse-Engineering of Meta Pixel Configurations

This work has been accepted to the 26th ACM Internet Measurement Conference (IMC) 2026.

@article{ghani2026pixelconfig,
  title={PixelConfig: Longitudinal Measurement and Reverse-Engineering of Meta Pixel Configurations},
  author={Ghani, Abdullah and Vekaria, Yash and Shafiq, Zubair},
  journal={arXiv preprint arXiv:2603.09380},
  year={2026}
}

Project Summary

Tracking pixels are widely used to optimize online ad campaigns through personalization, re-targeting, and conversion tracking. While prior research has primarily focused on detecting the prevalence of tracking pixels, limited attention has been given to variations in their configurations across websites. A tracking pixel may be configured differently on different websites.

This project proposes a differential analysis framework to reverse-engineer tracking pixel configurations. Using this framework, we investigate three types of Meta Pixel configurations:

  1. Activity tracking: What a user is doing on a website.
  2. Identity tracking: Who a user is or what device they are associated with.
  3. Tracking restrictions: Mechanisms to limit the sharing of potentially sensitive information.

Using data from the Internet Archive’s Wayback Machine, we analyze and compare Meta Pixel configurations on approximately 18,000 health-related websites versus a control group of the top 10,000 websites from 2017 to 2024. This repository contains the scripts, core processed data, and analysis notebooks to reproduce our findings.

Repository Structure

.
├── Analysis/
│   ├── Plots/                                 # Directory intended for plots generated by analysis notebooks.
│   ├── comparisonPlots.ipynb                  # Jupyter notebook for generating feature adoption comparison plots.
│   ├── hashes.txt                             # Text file with SHA-256 hashes and their recovered plaintext values.
│   ├── key_categorizations.csv                # CSV for categorizing keys (e.g., PII, health-related).
│   ├── pixelPresence.ipynb                    # Notebook for analyzing Meta Pixel presence.
│   ├── processSnapshots.py                    # Script to parse/process raw Pixel configuration snapshots.
│   └── unwantedDataAnalysis.ipynb             # Notebook for analyzing blacklisted/sensitive keys.
├── Configurations/                            # Pre-processed Meta Pixel configuration data.
│   ├── health-configurations.csv
│   └── top-10k-configurations.csv
├── Data Collection/                           # Scripts for main data collection (snapshots, Pixel IDs, configurations).
│   ├── extractPixelIDs.py
│   ├── fetchConfigurationSnapshots.py
│   └── fetchWebsiteSnapshots.py
├── Pixel History/                             # Pre-processed data on Pixel ID presence over time.
│   ├── final_pixel_history_health.csv
│   └── final_pixel_history_top10k.csv
├── Websites/                                   # Contains final website lists and health curation scripts/data.
│   ├── health_websites.csv                     # Final curated list of health websites.
│   ├── tranco_top_10k.csv                      # Input list of Top 10k websites.
│   └── health-websites/                        # Scripts and intermediate files for generating health_websites.csv.
│       ├── google-search-health-organizations.py
│       ├── input-cms-hospitals.csv
│       ├── openai-map-health-organizations.py
│       └── output-cms-hospitals.csv
└── requirements.txt                            # Python package requirements.

Core Data Files

This repository includes pre-processed data files that are the direct output of our data collection and initial processing scripts. These files can be used to reproduce the analyses presented in the paper or for further investigation.

The primary data files are:

  1. top-10k-configurations.csv
  2. health-configurations.csv
  3. final_pixel_history_top10k.csv
  4. final_pixel_history_health.csv

Below is a detailed description of their structure and content.


1. Configuration Data Files

  • Files:
    • top-10k-configurations.csv: Processed Meta Pixel configuration data for the Top 10k website cohort.
    • health-configurations.csv: Processed Meta Pixel configuration data for the Health website cohort.
  • Source: These files are the primary output of the processSnapshots.py script (after being saved to CSV). Each row represents a unique Meta Pixel configuration script snapshot found on a website at a specific time.
  • Columns:
    • plugin_name: (List of strings) A list of plugin names loaded by the Pixel configuration (e.g., ['unwanteddata', 'inferredevents']). Normalized to lowercase without underscores.
    • opt_in_info: (List of tuples) Information about instance.optIn calls. Each tuple is ('config_name', 'enabled_status_as_string'), e.g., [('automaticsetup', 'true'), ('firstpartycookies', 'true')]. Config names are normalized.
    • config_set_info: (List of tuples) Information from config.set calls. Each tuple is ('config_name', dictionary_of_config_data), e.g., [('automaticmatching', {'selectedMatchKeys': ['em', 'ph']})]. Config names are normalized. The second element is a Python dictionary parsed from the original JSON.
    • fbq_set_info: (List of tuples) Information from fbq.set calls. Each tuple is ('config_name', list_of_config_data), e.g., [('estrules', [{'condition': {...}, 'derived_event_name': '...'}])]. Config names are normalized. The second element is a Python list parsed from the original JSON array.
    • timestamp: (String, ISO 8601 format or similar) The timestamp of the Wayback Machine snapshot for this configuration script (e.g., YYYY-MM-DD HH:MM:SS).
    • website: (String) The domain name of the website where this Pixel configuration was found.
    • pixel_id: (String) The Meta Pixel ID associated with this configuration.
    • year: (Integer) The year extracted from the timestamp.
  • Notes on Data Types when reading CSV:
    • The plugin_name, opt_in_info, config_set_info, and fbq_set_info columns store nested data structures (lists, and lists of tuples whose elements can be dictionaries or lists). When read from a CSV, these values arrive as strings, so parse them back into Python objects before analysis (e.g., with ast.literal_eval, guarding against empty or malformed cells).
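
The parsing step described in the note above can be sketched as follows. This is a minimal illustration using only the standard library; the helper names (parse_cell, read_configurations) and the sample row are hypothetical and are not part of the repository.

```python
import ast
import csv
import io

# Columns that hold stringified Python literals, per the column list above.
LIST_COLUMNS = ("plugin_name", "opt_in_info", "config_set_info", "fbq_set_info")


def parse_cell(value):
    """Convert a stringified literal back into a Python object.

    Empty, 'nan', or malformed cells become an empty list (an assumption
    made here for convenience, not documented repository behavior).
    """
    text = (value or "").strip()
    if not text or text.lower() == "nan":
        return []
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return []


def read_configurations(handle):
    """Read a configurations CSV and parse its list-like columns."""
    rows = []
    for row in csv.DictReader(handle):
        for column in LIST_COLUMNS:
            row[column] = parse_cell(row.get(column))
        rows.append(row)
    return rows


# Illustrative row only; real files live under Configurations/.
SAMPLE_CSV = '''plugin_name,opt_in_info,config_set_info,fbq_set_info,timestamp,website,pixel_id,year
"['unwanteddata', 'inferredevents']","[('automaticsetup', 'true')]","[('automaticmatching', {'selectedMatchKeys': ['em', 'ph']})]",[],2020-01-15 10:20:30,example.com,123456789012345,2020
'''

rows = read_configurations(io.StringIO(SAMPLE_CSV))
print(rows[0]["plugin_name"])
print(rows[0]["config_set_info"][0][1]["selectedMatchKeys"])
```

ast.literal_eval only evaluates literals (strings, numbers, tuples, lists, dicts, and similar), which makes it a safer choice than eval for data files. The same parse_cell helper could be applied to a pandas column with Series.apply.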

2. Pixel History Data Files

  • Files:
    • final_pixel_history_top10k.csv: Historical and live Meta Pixel ID presence for the Top 10k website cohort.
    • final_pixel_history_health.csv: Historical and live Meta Pixel ID presence for the Health website cohort.
  • Source: These files are the primary output of the extractPixelIDs.py script (specifically, the pixelHistoryComplete.csv file, renamed per cohort). Each row represents a unique website.
  • Columns:
    • website: (String) The domain name of the website.
    • Monthly Columns (e.g., 202502, 202501, ..., 201701):
      • A series of columns, one for each month from January 2017 up to the latest month covered by the data collection (e.g., February 2025).
      • The value in each monthly cell represents the Meta Pixel IDs found on that website during that specific month based on Wayback Machine snapshots.
      • Data Format: The cell value is a string representation of a Python list of strings (Pixel IDs), e.g., ['239130639218562']. An empty cell or a string like [] (or nan if read by pandas and then saved) indicates no Pixel IDs were found or no snapshot was available/processed for that month.
    • live:
      • The value in this cell represents the Meta Pixel IDs found on that website during the live crawl performed by extractPixelIDs.py.
      • Data Format: Similar to monthly columns, it's a string representation of a Python list of strings (Pixel IDs), e.g., ['12345', '67890'].
    • Unnamed: 0: (Integer) An index column, likely added by pandas when saving the CSV without index=False. Can typically be ignored or dropped.
  • Example Row Structure (for pixelHistory files):
    website,202403,202402,202401,...,live
    example.com,[],['123456789'],['123456789'],...,['123456789','987654321']
    hiramhealthandrehab.com,,['239130639218562'],,,...,[]
    
  • Notes on Data Types when reading CSV:
    • The monthly columns and the live column store string representations of lists. These need to be parsed (e.g., using ast.literal_eval) into actual Python lists of strings for analysis. Be cautious with empty strings or NaN values.
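
As a sketch of how the wide, one-column-per-month layout above might be flattened for analysis, the snippet below turns each non-empty cell into (website, period, pixel_id) records. The function names and sample rows are illustrative assumptions, and list cells are quoted in the sample because they contain commas.

```python
import ast
import csv
import io


def parse_pixel_ids(cell):
    """Return the Pixel IDs in a cell, treating empty/NaN/malformed cells as no IDs."""
    text = (cell or "").strip()
    if not text or text.lower() == "nan":
        return []
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return []
    return [str(item) for item in parsed] if isinstance(parsed, (list, tuple)) else []


def is_period(column):
    """True for 'live' or a YYYYMM column; skips 'website' and 'Unnamed: 0'."""
    return column == "live" or (
        column is not None and len(column) == 6 and column.isdigit()
    )


def to_long_format(handle):
    """Flatten the wide history table into (website, period, pixel_id) tuples."""
    records = []
    for row in csv.DictReader(handle):
        for column, cell in row.items():
            if is_period(column):
                for pixel_id in parse_pixel_ids(cell):
                    records.append((row["website"], column, pixel_id))
    return records


# Illustrative rows only.
SAMPLE_CSV = '''Unnamed: 0,website,202402,202401,live
0,example.com,"['123456789']",[],"['123456789', '987654321']"
1,hiramhealthandrehab.com,,"['239130639218562']",[]
'''

records = to_long_format(io.StringIO(SAMPLE_CSV))
for record in records:
    print(record)
```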

Data Collection

This section outlines the methodology and scripts used to collect the data for this research, including historical website snapshots, Meta Pixel IDs, and Meta Pixel configuration script snapshots. The data collection process is divided into three main steps, executed sequentially.

Prerequisites

Before you begin the data collection process, please ensure you have the following:

  1. Python 3.11: The scripts are written in Python 3.

  2. Required Python Packages: Install all necessary packages by running:

    pip install -r requirements.txt
  3. Google Chrome: A recent version of Google Chrome browser must be installed. The scripts use webdriver-manager to automatically download the appropriate ChromeDriver.

  4. Input Website Lists:

    • tranco_top_10k.csv: A CSV file containing a list of top 10,000 websites. This file must have a header row with a column named website listing the domain names (e.g., example.com).
    • health_websites.csv: A CSV file containing a list of health-related websites. This file must also have a header row with a column named website.

    Place these CSV files in the same directory as the scripts or update the paths within the scripts accordingly.


Step 1: Fetching Historical Website Snapshots

  • Script: fetchWebsiteSnapshots.py
  • Purpose: To download historical HTML snapshots of websites from the Internet Archive's Wayback Machine. The script targets two snapshots per year (January and July, or the closest available) for each website, starting from 2017.
  • Input:
    • A CSV file listing websites (e.g., tranco_top_10k.csv or health_websites.csv).
  • Key Operations:
    1. Reads the list of websites from the specified CSV.
    2. For each website, queries the Wayback Machine CDX Server API for available snapshots between 2017 and the current date.
    3. Filters these snapshots to select approximately two per year (one for the first half, one for the second half).
    4. Uses Selenium with a headless Chrome browser to navigate to each selected Wayback Machine URL.
    5. Waits for 30 seconds after the page loads to allow for dynamic content rendering (including potential tracking pixels).
    6. Saves the full HTML source of the page.
    7. Tracks progress in a separate CSV file, allowing the script to be stopped and resumed.
  • Configuration (within fetchWebsiteSnapshots.py):
    • WEBSITES_FILE: (Line ~165) Set this variable to the path of your input CSV file.
      • Example for Tranco Top 10k: WEBSITES_FILE = 'tranco_top_10k.csv'
      • Example for Health websites: WEBSITES_FILE = 'health_websites.csv'
    • mount_path: (Line ~91) Defines the base output directory. The script creates subdirectories for each website.
      • Important: If processing tranco_top_10k.csv, change this path (for example, mount_path = "./final-snapshots/top10k-snapshots/") to keep outputs separate.
    • progress_file: (Line ~166) Name of the CSV file for tracking progress (e.g., "final-progress.csv"). It's advisable to use different progress file names if running for different website lists (e.g., top10k-progress.csv, health-progress.csv).
  • How to Run:
    1. Modify the WEBSITES_FILE, mount_path (if necessary), and progress_file variables within fetchWebsiteSnapshots.py.
    2. Execute the script from your terminal:
      python fetchWebsiteSnapshots.py
  • Expected Output:
    • A directory structure containing HTML snapshots:
      • Example: ./final-snapshots/cms-snapshots/healthdomain.com/20200101123456.html
      • Example: ./final-snapshots/top10k-snapshots/example.com/20200701000000.html
    • A progress CSV file (e.g., final-progress.csv) logging successfully processed websites.
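
The snapshot-selection logic in Step 1 can be sketched roughly as below. This is not the code in fetchWebsiteSnapshots.py: the function names are hypothetical, the CDX query parameters are one plausible choice, and picking the earliest capture in each half-year is an assumption about how "January and July, or closest available" is resolved.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(website, start_year=2017):
    """Build a Wayback Machine CDX Server query URL for one website."""
    params = {
        "url": website,
        "from": str(start_year),
        "output": "json",
        "fl": "timestamp,original,statuscode",
        "filter": "statuscode:200",  # assumption: keep only successful captures
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def pick_semiannual(timestamps):
    """Keep the earliest capture in each (year, half-year) bucket.

    Timestamps are 14-digit Wayback strings (YYYYMMDDhhmmss), so sorting
    them as strings is also chronological.
    """
    chosen = {}
    for ts in sorted(timestamps):
        year, month = int(ts[:4]), int(ts[4:6])
        half = 1 if month <= 6 else 2
        chosen.setdefault((year, half), ts)
    return [chosen[key] for key in sorted(chosen)]


def wayback_url(timestamp, website):
    """URL of an archived page, which a browser such as headless Chrome can load."""
    return f"https://web.archive.org/web/{timestamp}/http://{website}/"


captures = [
    "20200315000000",
    "20200102123456",
    "20200701000000",
    "20201130000000",
    "20210808080808",
]
selected = pick_semiannual(captures)
print(build_cdx_query("example.com"))
print([wayback_url(ts, "example.com") for ts in selected])
```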

Step 2: Extracting Meta Pixel IDs (Historical and Live)

  • Script: extractPixelIDs.py
  • Purpose: To extract Meta Pixel IDs from the historical HTML snapshots (collected in Step 1) and from a fresh, live crawl of the websites. It then consolidates this information.
  • Input:
    • The directory containing historical website snapshots (output from Step 1).
    • A CSV file listing websites for the live crawl (e.g., tranco_top_10k.csv or health_websites.csv).
  • Key Operations:
    1. Wayback Snapshots Processing:
      • Scans the directory of historical snapshots.
      • For each HTML file, uses regular expressions to find Pixel IDs (patterns: <script src=".../config/PIXEL_ID"> and fbq("init", "PIXEL_ID");).
      • Outputs a pixelHistory.csv file mapping websites to Pixel IDs found per month.
    2. Live Website Crawling:
      • Crawls the live version of websites listed in the specified CSV.
      • Saves the HTML of the live pages.
      • Extracts Pixel IDs from these live pages using the same regex patterns.
      • Outputs results to pixelHistoryLive.csv.
      • Logs progress in downloaded_websites.txt.
    3. Merging:
      • Combines the data from pixelHistory.csv (Wayback) and pixelHistoryLive.csv (live) into a single pixelHistoryComplete.csv.
  • Configuration (within extractPixelIDs.py):
    • BASE_FOLDER: (Line ~146) Path to the directory containing the historical snapshots from Step 1.
      • Example for Tranco Top 10k snapshots: BASE_FOLDER = "./final-snapshots/top10k-snapshots"
      • Example for health snapshots: BASE_FOLDER = "./final-snapshots/cms-snapshots"
    • WEBSITES_PATH: (Line ~148) Path to the input CSV file for the live crawl.
      • Example: WEBSITES_PATH = 'tranco_top_10k.csv'
    • live_folder_path: (Line ~144) Directory to save HTML of live crawled websites (e.g., 'live_websites'). Consider using distinct names if running for different lists (e.g., live_websites_top10k, live_websites_health).
    • Output CSV files are named: pixelHistory.csv, pixelHistoryLive.csv, pixelHistoryComplete.csv. If running for different datasets, you might need to rename these outputs after each run to avoid overwriting.
  • How to Run:
    1. Ensure the BASE_FOLDER points to the correct output from Step 1.
    2. Set WEBSITES_PATH to the desired list for live crawling.
    3. Modify live_folder_path if desired.
    4. Execute the script:
      python extractPixelIDs.py
  • Expected Output:
    • pixelHistory.csv: Historical Pixel ID data.
    • live_websites/ (or your live_folder_path): HTML files from live crawls.
    • pixelHistoryLive.csv: Live Pixel ID data.
    • downloaded_websites.txt: Log for live crawl.
    • pixelHistoryComplete.csv: Merged historical and live Pixel ID data.
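
The two extraction patterns named in Step 2 could look something like the sketch below. The exact regular expressions in extractPixelIDs.py are not shown in this README, so the ones here are approximations written for illustration, and the helper names are hypothetical.

```python
import re

# Approximation of: <script src=".../config/PIXEL_ID">
CONFIG_SRC_RE = re.compile(r"signals/config/(\d+)")
# Approximation of: fbq("init", "PIXEL_ID"); with either quote style
FBQ_INIT_RE = re.compile(r"""fbq\(\s*['"]init['"]\s*,\s*['"](\d+)['"]""")


def extract_pixel_ids(html):
    """Return the sorted, de-duplicated Pixel IDs found in an HTML document."""
    found = set(CONFIG_SRC_RE.findall(html)) | set(FBQ_INIT_RE.findall(html))
    return sorted(found)


def month_key(snapshot_filename):
    """Map a snapshot file name such as 20200101123456.html to its YYYYMM column."""
    return snapshot_filename[:6]


sample_html = """
<script src="https://connect.facebook.net/signals/config/239130639218562?v=2.9"></script>
<script>
  fbq('init', '123456789');
  fbq("init", "239130639218562");
</script>
"""

pixel_ids = extract_pixel_ids(sample_html)
print(month_key("20200101123456.html"), pixel_ids)
```

Collecting matches into a set before sorting means that an ID referenced by both patterns is reported once per snapshot.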

Step 3: Fetching Historical Meta Pixel Configuration Snapshots

  • Script: fetchConfigurationSnapshots.py
  • Purpose: To download historical snapshots of Meta Pixel configuration scripts from the Wayback Machine, using the Pixel IDs identified in Step 2.
  • Input:
    • The pixelHistoryComplete.csv file generated in Step 2.
  • Key Operations:
    1. Reads pixelHistoryComplete.csv.
    2. For each website and each unique Pixel ID associated with it:
      • Constructs the URL for the Meta Pixel configuration script (e.g., https://connect.facebook.net/signals/config/PIXEL_ID).
      • Queries the Wayback Machine CDX Server API for all archived versions of this script.
      • Filters to select the earliest configuration script snapshot for each month.
      • Downloads and saves these configuration scripts. (Note: They are saved with a .html extension but typically contain JavaScript code.)
      • Uses a temporary checkpoint.txt for resumable CDX record fetching per Pixel ID.
  • Configuration (within fetchConfigurationSnapshots.py):
    • PIXEL_HISTORY_PATH: (Line ~128) Path to the pixelHistoryComplete.csv file.
      • Example: PIXEL_HISTORY_PATH = "pixelHistoryComplete.csv" (If you renamed outputs from Step 2, adjust accordingly).
    • OUTPUT_FOLDER: (Line ~129) Base directory to save the downloaded Pixel configuration scripts.
      • Example: OUTPUT_FOLDER = "allPixelConfigs" (Consider allPixelConfigs_top10k, allPixelConfigs_health if running for different datasets).
  • How to Run:
    1. Ensure PIXEL_HISTORY_PATH points to the correct pixelHistoryComplete.csv from Step 2.
    2. Set OUTPUT_FOLDER as desired.
    3. Execute the script:
      python fetchConfigurationSnapshots.py
  • Expected Output:
    • A directory structure containing Pixel configuration scripts:
      • Example: allPixelConfigs/example.com/123456789012345/20200115102030.html
    • A temporary checkpoint.txt will be created and deleted during processing for each Pixel ID.
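
The per-Pixel steps in Step 3 (build the configuration URL, keep the earliest capture per month, derive the output path) can be sketched as follows. The function names are hypothetical and the snippet performs no network requests; it only illustrates the filtering and naming described above.

```python
from pathlib import PurePosixPath


def config_script_url(pixel_id):
    """URL of the Meta Pixel configuration script for one Pixel ID."""
    return f"https://connect.facebook.net/signals/config/{pixel_id}"


def earliest_per_month(timestamps):
    """Keep the earliest 14-digit Wayback timestamp within each YYYYMM month."""
    earliest = {}
    for ts in timestamps:
        month = ts[:6]
        if month not in earliest or ts < earliest[month]:
            earliest[month] = ts
    return [earliest[month] for month in sorted(earliest)]


def output_path(output_folder, website, pixel_id, timestamp):
    """Mirror the documented layout: <folder>/<website>/<pixel_id>/<timestamp>.html"""
    return PurePosixPath(output_folder) / website / pixel_id / f"{timestamp}.html"


captures = ["20200120000000", "20200115102030", "20200203040506", "20200228235959"]
kept = earliest_per_month(captures)
paths = [
    str(output_path("allPixelConfigs", "example.com", "123456789012345", ts))
    for ts in kept
]
print(config_script_url("123456789012345"))
print(paths)
```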

Notes on Processing Multiple Website Lists

If you are processing both tranco_top_10k.csv and health_websites.csv (or other lists):

  • It is highly recommended to run the entire 3-step pipeline separately for each list.
  • Carefully manage your output directories and intermediate file names to prevent data from one run overwriting another.
    • For fetchWebsiteSnapshots.py: Modify mount_path and progress_file.
    • For extractPixelIDs.py: Modify BASE_FOLDER (to point to the correct Step 1 output), live_folder_path, and manually rename the output CSVs (pixelHistory.csv, pixelHistoryLive.csv, pixelHistoryComplete.csv) after each run.
    • For fetchConfigurationSnapshots.py: Modify PIXEL_HISTORY_PATH (to point to the correct pixelHistoryComplete.csv) and OUTPUT_FOLDER.
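
One way to automate the manual renaming suggested above is a small helper that tags the Step 2 outputs with a cohort suffix after each run. This is only a sketch: tag_outputs and the suffix scheme are assumptions, not something the repository provides.

```python
import tempfile
from pathlib import Path

# Output names produced by extractPixelIDs.py, as listed in Step 2.
STEP2_OUTPUTS = ("pixelHistory.csv", "pixelHistoryLive.csv", "pixelHistoryComplete.csv")


def tag_outputs(directory, cohort):
    """Rename each existing Step 2 output to <stem>_<cohort>.csv and return the new paths."""
    renamed = []
    for name in STEP2_OUTPUTS:
        source = Path(directory) / name
        if source.exists():
            target = source.with_name(f"{source.stem}_{cohort}{source.suffix}")
            source.rename(target)
            renamed.append(target)
    return renamed


# Demonstration in a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as workdir:
    for name in STEP2_OUTPUTS:
        (Path(workdir) / name).write_text("website,live\n")
    renamed_names = [path.name for path in tag_outputs(workdir, "top10k")]

print(renamed_names)
```

After tagging, PIXEL_HISTORY_PATH in fetchConfigurationSnapshots.py would need to point at the renamed pixelHistoryComplete file for that cohort.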

Data Analysis Scripts

This section describes the scripts used to process the collected data and perform the analyses presented in the paper.

1. processSnapshots.py

  • Purpose: This script is the primary data processing engine. It parses the downloaded Meta Pixel configuration script snapshots (from fetchConfigurationSnapshots.py) to extract structured information about their settings and features. The output is a Pandas DataFrame that serves as the basis for subsequent analyses.
  • Input:
    • The directory containing the downloaded Pixel configuration script snapshots (e.g., allPixelConfigs/). This directory should have a structure like: <base_folder>/<website_domain>/<pixel_id>/<timestamp>.html.
  • Key Operations:
    1. File Iteration: Traverses the input directory, processing each HTML file (which contains a Pixel configuration script). It extracts the website, pixel_id, and timestamp from the file/folder path.
    2. Configuration Code Extraction (extractConfigurationCode):
      • Reads the content of each HTML file.
      • Isolates the relevant JavaScript code block that defines the Pixel's configuration, typically starting with fbq.registerPlugin and ending before any comments or subsequent code.
    3. Core Parsing Logic (parse_pixel_code):
      • If configuration code is found, it uses regular expressions to parse different types of Pixel configuration statements:
        • fbq.loadPlugin("PLUGIN_NAME");: Identifies loaded plugins.
        • instance.optIn("PIXEL_ID", "CONFIG_NAME", true/false);: Captures opt-in settings for specific configurations (e.g., 'UnwantedData', 'AutomaticMatching').
        • config.set("PIXEL_ID" or null, "CONFIG_NAME", {JSON_DATA});: Extracts detailed JSON-formatted configuration data for features.
        • fbq.set("CONFIG_NAME", "PIXEL_ID", [LIST_DATA]);: Parses settings like 'estRules' which are defined as lists.
      • Normalizes configuration names (lowercase, remove underscores).
      • Stores parsed elements (plugins, opt-ins, config.set, fbq.set) into separate lists.
    4. DataFrame Creation & Matching (makeConfigDataframe, parse_dataframe, manualMatch, returnMatch):
      • The initial parsed data is structured into a temporary DataFrame.
      • The matching logic (manualMatch, returnMatch) attempts to align related configuration parts (e.g., a plugin load with its corresponding opt-in or config.set). It appears to reconcile variations in naming conventions (e.g., "jsonldmicrodata" vs "microdatajsonld", "cookie" vs "firstpartycookies").
      • The goal is to create a more unified representation of each configuration setting.
    5. Aggregation (aggregate_source_code_info):
      • For each processed configuration script, the detailed parsed and matched DataFrame is aggregated into a single row. This row summarizes:
        • plugin_name: A list of unique plugin names found.
        • opt_in_info: A list of (config_name, enabled_status) tuples from instance.optIn.
        • config_set_info: A list of (config_name, json_data) tuples from config.set.
        • fbq_set_info: A list of (config_name, list_data) tuples from fbq.set.
      • The timestamp, website, and pixel_id are added to this aggregated row.
    6. Final DataFrame Construction:
      • All aggregated rows (one per successfully parsed configuration script) are collected into a final Pandas DataFrame.
      • This DataFrame is sorted by timestamp.
  • Configuration (within processSnapshots.py):
    • folder_path: (Line ~229) Path to the input directory containing the configuration script snapshots (e.g., "allPixelConfigs").
  • How to Run:
    1. Ensure the folder_path variable points to the correct directory containing the output from fetchConfigurationSnapshots.py.
    2. Execute the script:
      python processSnapshots.py
  • Expected Output:
    • The script will print progress using tqdm as it processes websites, Pixel IDs, and HTML files.
    • The primary output is a Pandas DataFrame named final_aggregated_df (in memory at the end of the script). Note: The provided script does not explicitly save this DataFrame to a file (e.g., a CSV or pickle). You would typically add a line like final_aggregated_df.to_csv('processed_pixel_configs.csv', index=False) or final_aggregated_df.to_pickle('processed_pixel_configs.pkl') at the end of the script to persist the results for further analysis.
    • The DataFrame final_aggregated_df will have the following columns:
      • plugin_name: List of plugin names (e.g., ['unwanteddata', 'inferredevents']).
      • opt_in_info: List of tuples, e.g., [('automaticsetup', 'true'), ('firstpartycookies', 'true')].
      • config_set_info: List of tuples, where the second element is a dictionary (parsed JSON), e.g., [('automaticmatching', {'selectedMatchKeys': ['em', 'ph']})].
      • fbq_set_info: List of tuples, where the second element is a list (parsed JSON array), e.g., [('estrules', [{'condition': {...}, 'derived_event_name': '...'}])].
      • timestamp: Pandas datetime object representing the snapshot time.
      • website: String, the domain name of the website.
      • pixel_id: String, the Meta Pixel ID.
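
To make the parsing stage more concrete, here is a simplified sketch of how three of the statement types listed above might be extracted. It is not the parse_pixel_code implementation: the regular expressions are approximations, the sample script is invented, and the config.set handling assumes the third argument is strict JSON.

```python
import json
import re

LOAD_PLUGIN_RE = re.compile(r'fbq\.loadPlugin\("([^"]+)"\)')
OPT_IN_RE = re.compile(r'instance\.optIn\("(\d+)",\s*"([^"]+)",\s*(true|false)\)')
# Matches up to (but not including) the JSON argument of config.set(...).
CONFIG_SET_RE = re.compile(r'config\.set\((?:"\d+"|null),\s*"([^"]+)",\s*')

_decoder = json.JSONDecoder()


def normalize(name):
    """Normalize a configuration name: lowercase, underscores removed."""
    return name.lower().replace("_", "")


def parse_config_sets(code):
    """Extract (config_name, parsed_json) pairs from config.set calls."""
    results = []
    for match in CONFIG_SET_RE.finditer(code):
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever follows it (here, the closing ');').
            data, _ = _decoder.raw_decode(code, match.end())
        except json.JSONDecodeError:
            continue
        results.append((normalize(match.group(1)), data))
    return results


def parse_pixel_statements(code):
    """Return a dict shaped like one aggregated row (without fbq_set_info)."""
    plugins = sorted({normalize(name) for name in LOAD_PLUGIN_RE.findall(code)})
    opt_ins = [(normalize(name), flag) for _, name, flag in OPT_IN_RE.findall(code)]
    return {
        "plugin_name": plugins,
        "opt_in_info": opt_ins,
        "config_set_info": parse_config_sets(code),
    }


sample_code = '''
fbq.loadPlugin("inferredEvents");
fbq.loadPlugin("unwanted_data");
instance.optIn("123456789", "AutomaticSetup", true);
config.set("123456789", "automaticMatching", {"selectedMatchKeys":["em","ph"]});
'''

parsed = parse_pixel_statements(sample_code)
print(parsed)
```

Using a JSON decoder for the config.set payload avoids writing a regular expression that has to balance nested braces.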

Data Analysis Scripts (Continued)

This section details the Jupyter Notebooks used for specific analyses and plot generation, building upon the processed data from processSnapshots.py.

General Instructions for Jupyter Notebooks:

  • Ensure you have Jupyter Notebook or JupyterLab installed (pip install notebook jupyterlab).
  • These notebooks are designed to be run cell by cell, in sequential order from top to bottom.
  • Path Configuration: Before running, carefully review the initial cells of each notebook to set the correct paths to your input data files (e.g., the processed configuration DataFrames from processSnapshots.py, Pixel ID history files).
  • These notebooks can be run locally or potentially on platforms like Google Colaboratory (you might need to upload data files or connect to Google Drive).

2. comparisonPlots.ipynb

  • Purpose: This Jupyter Notebook generates various plots comparing the adoption rates of different Meta Pixel features and configurations over time, contrasting between the Top 10k websites and Health websites. Many of the website adoption graphs presented in the research paper are produced by this notebook.
  • Input:
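    • The processed Pixel configuration DataFrame for Top 10k websites (output from processSnapshots.py, e.g., processed_pixel_configs_top10k.csv or .pkl; the pre-processed version in this repository is Configurations/top-10k-configurations.csv).
    • The processed Pixel configuration DataFrame for Health websites (output from processSnapshots.py, e.g., processed_pixel_configs_health.csv or .pkl; the pre-processed version in this repository is Configurations/health-configurations.csv).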
  • Key Analyses & Outputs:
    • Temporal plots showing adoption trends of features like:
      • Automatic Events (e.g., AutomaticSetup, InferredEvents)
      • Identity Tracking features (e.g., FirstPartyCookies, AutomaticMatching)
      • Tracking Restriction features (e.g., ProtectedDataMode / Core Setup)
    • Comparisons of these trends between the Top 10k and Health website cohorts.
    • The notebook will display plots inline and may include cells to save these plots to image files.
  • How to Run:
    1. Open comparisonPlots.ipynb in Jupyter Notebook or JupyterLab.
    2. In the initial cells, carefully update the file paths to point to your processed configuration DataFrames for both Top 10k and Health websites.
    3. Run all cells sequentially from top to bottom.
  • Dependencies (beyond requirements.txt): Plotting libraries such as matplotlib and seaborn. Most data science environments already include them; otherwise, install them with pip.
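
As a rough illustration of the kind of quantity such adoption plots are built from, the sketch below computes, per year, the share of websites with at least one configuration snapshot that loads a given plugin. This definition of adoption is an assumption made for the example and may differ from the one used in the notebook and the paper; the rows are invented.

```python
from collections import defaultdict


def adoption_by_year(rows, feature):
    """Fraction of websites per year whose configurations load `feature`.

    Each row is a dict with 'website', 'year', and a parsed 'plugin_name' list.
    """
    seen = defaultdict(set)      # websites with any configuration that year
    adopted = defaultdict(set)   # websites with the feature that year
    for row in rows:
        year = int(row["year"])
        seen[year].add(row["website"])
        if feature in row["plugin_name"]:
            adopted[year].add(row["website"])
    return {year: len(adopted[year]) / len(seen[year]) for year in sorted(seen)}


# Invented rows for illustration only.
rows = [
    {"website": "a.com", "year": 2020, "plugin_name": ["inferredevents"]},
    {"website": "b.com", "year": 2020, "plugin_name": ["unwanteddata"]},
    {"website": "a.com", "year": 2021, "plugin_name": ["inferredevents", "unwanteddata"]},
    {"website": "a.com", "year": 2021, "plugin_name": ["inferredevents"]},
]

rates = adoption_by_year(rows, "inferredevents")
print(rates)
```

Running the same function on the Top 10k and Health rows separately yields two series that can be plotted against each other.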

3. unwantedDataAnalysis.ipynb (and hashes.txt)

  • Purpose: This notebook focuses on the analysis of Meta Pixel's "Unwanted Data" filtering mechanism, specifically examining blacklisted_keys and sensitive_keys. It also investigates the nature of data being filtered, including examples of sensitive health-related information.
  • Input:
    • The processed Pixel configuration DataFrame for Top 10k websites.
    • The processed Pixel configuration DataFrame for Health websites.
    • hashes.txt: A text file containing SHA-256 hashes (presumably of sensitive_keys) and the plaintext values recovered for them where a match was found (e.g., via CrackStation). SHA-256 is a one-way hash, so these values are recovered by lookup rather than decrypted. This file is crucial for understanding the nature of hashed sensitive keys.
  • Key Analyses & Outputs:
    • Identification and quantification of blacklisted_keys and sensitive_keys across websites.
    • Analysis of the hashes.txt file to understand what plaintext parameters are being hashed as sensitive_keys.
    • Examples of potentially sensitive parameters or event data identified (e.g., related to "OCD").
    • Plots showing the adoption of these filtering mechanisms over time for both Top 10k and Health websites.
    • Analysis of website overlap in the usage of common blacklisted_keys or sensitive_keys.
    • Plots corresponding to these analyses as presented in the paper.
  • How to Run:
    1. Open unwantedDataAnalysis.ipynb.
    2. Ensure hashes.txt is in the expected location or update the path in the notebook if it's loaded from a specific path.
    3. In the initial cells, update the file paths to your processed configuration DataFrames.
    4. Run all cells sequentially.
  • Note on hashes.txt: The quality of analysis for sensitive_keys depends significantly on the completeness and accuracy of hashes.txt.
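
Because SHA-256 cannot be reversed, recovering plaintext keys amounts to hashing candidate strings and checking for matches. The sketch below shows that lookup idea with a tiny, invented candidate list; it does not read hashes.txt (whose exact layout is not described here) and is not the procedure used by CrackStation or the notebook.

```python
import hashlib


def sha256_hex(text):
    """Hex-encoded SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def recover_plaintexts(hashes, candidates):
    """Map each hash to a candidate plaintext that produces it, when one exists."""
    lookup = {sha256_hex(word): word for word in candidates}
    return {digest: lookup[digest] for digest in hashes if digest in lookup}


# Invented example: one hash of a known word, one unknown hash.
observed_hashes = [sha256_hex("email"), "0" * 64]
candidate_keys = ["email", "phone", "first_name"]

recovered = recover_plaintexts(observed_hashes, candidate_keys)
print(recovered)
```

Hashes with no matching candidate simply stay unresolved, which is why the completeness of the candidate list bounds what the analysis can say about sensitive_keys.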

4. pixelPresence.ipynb

  • Purpose: This notebook analyzes the prevalence of Meta Pixels on websites. It determines how many websites have at least one Meta Pixel installed and, of those, for how many the corresponding configuration scripts were successfully fetched and archived. It generates the Pixel presence plot shown in the paper.
  • Input:
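    • The final Pixel ID history file for Top 10k websites (e.g., pixelHistoryComplete_top10k.csv from extractPixelIDs.py; provided in this repository as Pixel History/final_pixel_history_top10k.csv).
    • The final Pixel ID history file for Health websites (e.g., pixelHistoryComplete_health.csv from extractPixelIDs.py; provided in this repository as Pixel History/final_pixel_history_health.csv).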
    • The directory containing the downloaded Pixel configuration script snapshots for Top 10k websites (e.g., allPixelConfigs_top10k/).
    • The directory containing the downloaded Pixel configuration script snapshots for Health websites (e.g., allPixelConfigs_health/).
  • Key Analyses & Outputs:
    • Calculation of the number of websites with at least one Pixel ID detected over time.
    • Calculation of the number of websites for which at least one configuration script was found and archived by the Wayback Machine.
    • Generation of a plot (similar to Figure 3 in the paper) showing these trends for both Top 10k and Health websites.
  • How to Run:
    1. Open pixelPresence.ipynb.
    2. In the initial cells, update the file paths to your pixelHistoryComplete.csv files and the base directories for the allPixelConfigs data for both Top 10k and Health cohorts.
    3. Run all cells sequentially.
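
The first calculation above (websites with at least one Pixel ID over time) can be sketched from the history rows as follows. Aggregating monthly columns to years is an assumption chosen for brevity, the helper names are hypothetical, and the rows are invented.

```python
import ast
from collections import defaultdict


def parse_ids(cell):
    """Parse a stringified list of Pixel IDs; empty/NaN/malformed cells yield []."""
    text = str(cell).strip() if cell is not None else ""
    if not text or text.lower() == "nan":
        return []
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


def websites_with_pixel_by_year(rows):
    """Count websites with at least one Pixel ID in any month of each year."""
    by_year = defaultdict(set)
    for row in rows:
        for column, cell in row.items():
            is_month = column is not None and len(column) == 6 and column.isdigit()
            if is_month and parse_ids(cell):
                by_year[int(column[:4])].add(row["website"])
    return {year: len(sites) for year, sites in sorted(by_year.items())}


# Invented rows shaped like the Pixel History files.
history_rows = [
    {"website": "a.com", "202001": "['1']", "202002": "", "202101": "[]", "live": "['1']"},
    {"website": "b.com", "202001": "[]", "202002": "['2']", "202101": "['2']", "live": "[]"},
]

presence = websites_with_pixel_by_year(history_rows)
print(presence)
```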
