constructing_accountability

This repo contains the scripts used to clean, analyse, and process data from the Mexican federal program Youth Building the Future (Jóvenes Construyendo el Futuro, its spanish name). The scripts were used for the story 'El doble blindaje de Jóvenes Construyendo el Futuro' [spanish only] published by the Mexican outlet Meganoticias.

🗂️Data Source & Original Structure

Data obtain from Padrón Único de Beneficiarios del Programa Jóvenes Construyendo el Futuro

35 .xlsx files were downloaded:

2019-2020: Annual files, each year split in two .xlsx files due to the amount of data

These files lacked information about the SEX, AGE, and REGISTRATION DATE

2021: One file for the full year
2022-2023: Monthly files

Across all files the names of the beneficiaries were split across columns Surname, Second Surname, Name and states were coded under CVE_ENTIDAD

🧼Normalization & Standarization

Annual files:

Manually added the columns SEX, AGE, and REGISTRATION DATE as empty values to match the structure of the later datasets.

Monthly files:

Ensured consistency of the column names across all files and were standarized: NUM, CVE_ENTIDAD, CVE_MUN, PRIMER APELLIDO, SEGUNDO APELLIDO, NOMBRE, SEXO, EDAD, FECHA_ALTA, IMPORTE_BENEFICIO

Normalization

Two different methods were used according to the script requirements and knowledge at the time:

if "SEXO" in df.columns and df["SEXO"].notna().any():
    sexos = df["SEXO"].astype(str).str.upper()

if "SEXO" in df.columns:
    df["SEXO"] = df["SEXO"].astype(str).str.upper().str.strip()
else:
    df["SEXO"] = None

Name and State Standardization

To clean and simplify structure the names were merged into a single NAMEcolumn, which kept the full name in a single cell:

def generate_full_name(row):
    return f"{row['NOMBRE']} {row['PRIMER APELLIDO']} {row['SEGUNDO APELLIDO']}"
df['NOMBRE'] = df.apply(generate_full_name, axis=1)

As for the states, instead of keeping the numeric codes, a dictionary was used to map each one to the corresponding name:

state_mapping = {
    '01': 'Aguascalientes', '02': 'Baja California', '03': 'Baja California Sur', '04': 'Campeche',
    '05': 'Coahuila', '06': 'Colima', '07': 'Chiapas', '08': 'Chihuahua', '09': 'Ciudad de México',
    '10': 'Durango', '11': 'Guanajuato', '12': 'Guerrero', '13': 'Hidalgo', '14': 'Jalisco',
    '15': 'México', '16': 'Michoacán', '17': 'Morelos', '18': 'Nayarit', '19': 'Nuevo León',
    '20': 'Oaxaca', '21': 'Puebla', '22': 'Querétaro', '23': 'Quintana Roo', '24': 'San Luis Potosí',
    '25': 'Sinaloa', '26': 'Sonora', '27': 'Tabasco', '28': 'Tamaulipas', '29': 'Tlaxcala',
    '30': 'Veracruz', '31': 'Yucatán', '32': 'Zacatecas'
}

The final columns across all files were standarized as: NUM, CVE_ENTIDAD, CVE_MUN, PRIMER APELLIDO, SEGUNDO APELLIDO, NOMBRE, SEXO, EDAD, FECHA_ALTA, IMPORTE_BENEFICIO

🧬UID Creation & De-duplication

To handle beneficiaries as unique individuals based on basic attributes, a simple concatenated string was used:

UID = NOMBRE + "_" + CVE_ENTIDAD + "_" + CVE_MUNICIPIO

To make identifiers cleaner and consistent across dozens of large files, a hashed_UID was created using the SHA-256 algorithm (this version wasn't used given editorial decisions):

def generate__hashed_uid(row):
    nombre = str(row.get('NOMBRE', '')).strip().upper()
    entidad = str(row.get('CVE_ENTIDAD', 'UNKNOWN')).zfill(2)
    municipio = str(row.get('CVE_MUNICIPIO', 'UNKNOWN')).zfill(3)
    uid_str = f"{nombre}_{entidad}_{municipio}"
    return hashlib.sha256(uid_str.encode()).hexdigest()

📜Scripts

mapping_states
- Replaces numeric state codes (CVE_ENTIDAD) with readable state names (e.g., 15 → México). Outputs the cleaned files into a /mapped_states folder for downstream use.
summarize_by_year
- Groups files by year (based on filenames), calculates the total number of records, total benefit amounts (IMPORTE_BENEFICIO), and—if available—counts of male and female beneficiaries (SEXO).
concat_uid_jcf
- Processes all files to generate a unique identifier (UID) per row using a combination of NOMBRE, CVE_ENTIDAD, and CVE_MUNICIPIO.
hashed_uid_jcf
- Similar to concat_uid_jcf, but creates a SHA-256 hashed UID per row from NOMBRE, CVE_ENTIDAD, and CVE_MUNICIPIO values. Useful for anonymization or deduplication.
jcf_data_wrangler
- Comprehensive summarizer that processes all files and generates different summaries. Counts beneficiaries, breaks them down by gender, and totals the benefits paid. Also logs errors in a separate file if key columns are missing.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
Scripts		Scripts
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

constructing_accountability

🗂️Data Source & Original Structure

🧼Normalization & Standarization

Annual files:

Monthly files:

Normalization

Name and State Standardization

🧬UID Creation & De-duplication

📜Scripts

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

constructing_accountability

🗂️Data Source & Original Structure

🧼Normalization & Standarization

Annual files:

Monthly files:

Normalization

Name and State Standardization

🧬UID Creation & De-duplication

📜Scripts

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages