This repo contains the scripts used to clean, analyse, and process data from the Mexican federal program Youth Building the Future (Jóvenes Construyendo el Futuro, its spanish name). The scripts were used for the story 'El doble blindaje de Jóvenes Construyendo el Futuro' [spanish only] published by the Mexican outlet Meganoticias.
Data obtain from Padrón Único de Beneficiarios del Programa Jóvenes Construyendo el Futuro
35 .xlsx files were downloaded:
- 2019-2020: Annual files, each year split in two
.xlsxfiles due to the amount of data
These files lacked information about the SEX, AGE, and REGISTRATION DATE
-
2021: One file for the full year
-
2022-2023: Monthly files
Across all files the names of the beneficiaries were split across columns Surname, Second Surname, Name and states were coded under CVE_ENTIDAD
Manually added the columns SEX, AGE, and REGISTRATION DATE as empty values to match the structure of the later datasets.
Ensured consistency of the column names across all files and were standarized:
NUM, CVE_ENTIDAD, CVE_MUN, PRIMER APELLIDO, SEGUNDO APELLIDO, NOMBRE, SEXO, EDAD, FECHA_ALTA, IMPORTE_BENEFICIO
Two different methods were used according to the script requirements and knowledge at the time:
if "SEXO" in df.columns and df["SEXO"].notna().any():
sexos = df["SEXO"].astype(str).str.upper()if "SEXO" in df.columns:
df["SEXO"] = df["SEXO"].astype(str).str.upper().str.strip()
else:
df["SEXO"] = NoneTo clean and simplify structure the names were merged into a single NAMEcolumn, which kept the full name in a single cell:
def generate_full_name(row):
return f"{row['NOMBRE']} {row['PRIMER APELLIDO']} {row['SEGUNDO APELLIDO']}"
df['NOMBRE'] = df.apply(generate_full_name, axis=1)As for the states, instead of keeping the numeric codes, a dictionary was used to map each one to the corresponding name:
state_mapping = {
'01': 'Aguascalientes', '02': 'Baja California', '03': 'Baja California Sur', '04': 'Campeche',
'05': 'Coahuila', '06': 'Colima', '07': 'Chiapas', '08': 'Chihuahua', '09': 'Ciudad de México',
'10': 'Durango', '11': 'Guanajuato', '12': 'Guerrero', '13': 'Hidalgo', '14': 'Jalisco',
'15': 'México', '16': 'Michoacán', '17': 'Morelos', '18': 'Nayarit', '19': 'Nuevo León',
'20': 'Oaxaca', '21': 'Puebla', '22': 'Querétaro', '23': 'Quintana Roo', '24': 'San Luis Potosí',
'25': 'Sinaloa', '26': 'Sonora', '27': 'Tabasco', '28': 'Tamaulipas', '29': 'Tlaxcala',
'30': 'Veracruz', '31': 'Yucatán', '32': 'Zacatecas'
}The final columns across all files were standarized as:
NUM, CVE_ENTIDAD, CVE_MUN, PRIMER APELLIDO, SEGUNDO APELLIDO, NOMBRE, SEXO, EDAD, FECHA_ALTA, IMPORTE_BENEFICIO
To handle beneficiaries as unique individuals based on basic attributes, a simple concatenated string was used:
UID = NOMBRE + "_" + CVE_ENTIDAD + "_" + CVE_MUNICIPIOTo make identifiers cleaner and consistent across dozens of large files, a hashed_UID was created using the SHA-256 algorithm (this version wasn't used given editorial decisions):
def generate__hashed_uid(row):
nombre = str(row.get('NOMBRE', '')).strip().upper()
entidad = str(row.get('CVE_ENTIDAD', 'UNKNOWN')).zfill(2)
municipio = str(row.get('CVE_MUNICIPIO', 'UNKNOWN')).zfill(3)
uid_str = f"{nombre}_{entidad}_{municipio}"
return hashlib.sha256(uid_str.encode()).hexdigest()-
- Replaces numeric state codes (
CVE_ENTIDAD) with readable state names (e.g.,15→México). Outputs the cleaned files into a/mapped_statesfolder for downstream use.
- Replaces numeric state codes (
-
- Groups files by year (based on filenames), calculates the total number of records, total benefit amounts (
IMPORTE_BENEFICIO), and—if available—counts of male and female beneficiaries (SEXO).
- Groups files by year (based on filenames), calculates the total number of records, total benefit amounts (
-
- Processes all files to generate a unique identifier (UID) per row using a combination of
NOMBRE,CVE_ENTIDAD, andCVE_MUNICIPIO.
- Processes all files to generate a unique identifier (UID) per row using a combination of
-
- Similar to
concat_uid_jcf, but creates a SHA-256 hashed UID per row fromNOMBRE,CVE_ENTIDAD, andCVE_MUNICIPIOvalues. Useful for anonymization or deduplication.
- Similar to
-
- Comprehensive summarizer that processes all files and generates different summaries. Counts beneficiaries, breaks them down by gender, and totals the benefits paid. Also logs errors in a separate file if key columns are missing.