This directory contains scripts for building and analyzing two related datasets: MCP servers/tools and Skills repositories. The pipeline moves from raw data collection → cleaning/normalization → metadata enrichment → similarity analysis → statistical and semantic visualization. Each script begins with a header comment describing its purpose and where it fits in the pipeline.
Scripts that gather raw data from external sources.
- Skills API scraper
  - Collects skill entries from SkillsMP
  - Handles pagination + deduplication
  - Produces skills_data.json
- GitHub language enrichment
  - Queries GitHub API for repository languages
  - Produces servers_clean.json
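The pagination and deduplication steps of the scraper can be sketched roughly as below. This is an illustration, not the actual script: `fetch_page`, the `id` key, and the "empty page means done" convention are all assumptions, since the SkillsMP response format is not described here.

```python
from typing import Callable, Dict, List


def collect_skills(fetch_page: Callable[[int], List[dict]],
                   key: str = "id",
                   max_pages: int = 1000) -> List[dict]:
    """Walk pages until an empty one comes back, keeping the first copy of each entry."""
    seen: Dict[str, dict] = {}
    for page in range(1, max_pages + 1):
        entries = fetch_page(page)  # hypothetical callable wrapping the HTTP request
        if not entries:             # assumed end-of-data signal
            break
        for entry in entries:
            # setdefault keeps the first occurrence and ignores later duplicates
            seen.setdefault(entry[key], entry)
    return list(seen.values())
```

Passing the page fetcher in as a callable keeps the loop testable without network access; the real scraper would supply a function that performs the request and returns the parsed entries.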
Scripts that transform raw code into structured representations.
- Repo cloning script
  - Groups skills by GitHub repository
  - Clones repositories locally
  - Outputs skills_repo/ + metadata mappings
- Code normalizer
  - Traverses repositories
  - Removes comments / boilerplate
  - Tokenizes and filters content
  - Produces normalized.json
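As a rough sketch of the normalizer's comment removal, tokenization, and filtering steps, assuming only `#` and `//` line comments and `/* ... */` block comments need handling, and that tokens are identifier-like words (the actual rules used by the script may differ):

```python
import re
from typing import List

# Assumption: these three comment styles cover the languages of interest.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
LINE_COMMENT = re.compile(r"(#|//).*")
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def normalize(source: str, min_len: int = 2) -> List[str]:
    """Strip comments, then return lowercased identifier-like tokens."""
    text = BLOCK_COMMENT.sub(" ", source)   # remove /* ... */ spans, even multi-line
    text = LINE_COMMENT.sub("", text)       # remove from '#' or '//' to end of line
    return [t.lower() for t in TOKEN.findall(text) if len(t) >= min_len]
```

This regex approach is deliberately naive: a `#` or `//` inside a string literal (for example a URL) is treated as the start of a comment. A language-aware lexer would avoid that at the cost of more code.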
Scripts that enrich datasets with structured attributes.
- Primary language selector
  - Filters noisy languages (HTML, JSON, configs)
  - Assigns primary language per repo
- Tool/functionality classifier (LLM-based)
  - Classifies tools into functional taxonomy
  - Categories include:
    - data retrieval
    - API interaction
    - file manipulation
    - code execution
    - system operations
    - developer tooling
    - etc.
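The primary language selection can be illustrated with a short sketch. It assumes the input is a language-to-byte-count mapping (the shape GitHub's repository languages endpoint returns) and that "primary" means the largest remaining byte count; the `NOISY` set is an illustrative guess, not the script's actual filter list.

```python
from typing import Dict, Optional

# Assumption: the exact noise list is not given in this README; these are examples.
NOISY = {"HTML", "CSS", "JSON", "YAML", "TOML", "Dockerfile", "Makefile"}


def primary_language(byte_counts: Dict[str, int]) -> Optional[str]:
    """Return the language with the most bytes after dropping noisy ones."""
    candidates = {lang: n for lang, n in byte_counts.items() if lang not in NOISY}
    if not candidates:
        return None  # repository contains only filtered languages (or none)
    return max(candidates, key=candidates.get)
```

For example, `primary_language({"HTML": 9000, "Python": 1200, "Shell": 300})` picks `"Python"` even though HTML has more bytes, because HTML is filtered first.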
Scripts that measure overlap between repositories.
- Pairwise similarity engine
  - Computes Jaccard similarity (token overlap)
  - Computes SSDeep fuzzy similarity
  - Outputs ranked similarity pairs
- Cross-dataset comparison
  - Analyzes similarity distributions
  - Produces histograms and summary stats
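The Jaccard part of the similarity engine can be sketched as follows, using the standard definition J(A, B) = |A ∩ B| / |A ∪ B| over token sets. The function names and the repository-to-token-list input shape are assumptions for illustration; SSDeep comparison is omitted because it needs a third-party library.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Size of the token intersection divided by the size of the union."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0  # convention chosen here for two empty token sets
    return len(sa & sb) / len(union)


def ranked_pairs(tokens: Dict[str, List[str]]) -> List[Tuple[str, str, float]]:
    """Score every unordered pair of repositories, most similar first."""
    pairs = [(x, y, jaccard(tokens[x], tokens[y]))
             for x, y in combinations(sorted(tokens), 2)]
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Comparing all pairs is quadratic in the number of repositories, which is fine for a sketch but may call for blocking or MinHash-style approximation on large datasets.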
Scripts that study contribution concentration.
- MCP author aggregation
  - Tools per developer
  - Top-K contribution concentration
  - Rank-frequency distributions
- Skills author aggregation
  - Skills per contributor
  - Inequality / concentration plots
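A minimal sketch of the top-K concentration and rank-frequency computations, assuming the input is one author name per tool or skill (the real scripts may read a different structure):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_k_share(authors: Iterable[str], k: int) -> float:
    """Fraction of all items contributed by the k most prolific authors."""
    counts = Counter(authors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


def rank_frequency(authors: Iterable[str]) -> List[Tuple[int, int]]:
    """(rank, count) pairs, with rank 1 being the most prolific author."""
    counts = Counter(authors)
    return [(rank, n) for rank, (_, n) in enumerate(counts.most_common(), start=1)]
```

Plotting the `rank_frequency` output on log-log axes is the usual way to see whether contributions follow a heavy-tailed distribution.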
Scripts focused on dataset properties.
- Token length analysis
  - Measures description length distributions
  - Histograms for MCP and Skills datasets
- Dataset summary tables
  - MCP vs Skills comparisons:
    - total entities
    - total tools/skills
    - averages per entity
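The summary-table quantities can be computed along these lines. The sketch assumes an "entity" is an MCP server or a Skills repository and that each maps to a list of its tools or skills; both the interpretation and the `summarize` name are assumptions.

```python
from typing import Dict, List


def summarize(entities: Dict[str, List[str]]) -> Dict[str, float]:
    """Totals and per-entity average for one dataset."""
    total_entities = len(entities)
    total_items = sum(len(items) for items in entities.values())
    avg = total_items / total_entities if total_entities else 0.0
    return {"entities": total_entities, "items": total_items, "avg_per_entity": avg}
```

Running `summarize` once per dataset yields the two rows of an MCP vs Skills comparison table.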
Scripts that embed and visualize content structure.
- MCP t-SNE visualization
  - Category-colored scatter plot
- Skills t-SNE visualization
  - Same embedding pipeline applied to skills dataset
Scripts for interpreting similarity results.
- Histogram generation for:
  - Jaccard similarity
  - SSDeep similarity
- Log-scale frequency plots
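The binning behind a log-scale similarity histogram can be sketched without a plotting library. This assumes scores lie in [0, 1] (SSDeep scores, which range from 0 to 100, would be divided by 100 first) and uses log10(count + 1) so that empty bins map to 0; both choices are illustrative rather than taken from the scripts.

```python
import math
from typing import List, Sequence, Tuple


def log_histogram(scores: Sequence[float],
                  bins: int = 10) -> List[Tuple[float, int, float]]:
    """Bin scores in [0, 1]; return (bin_start, count, log10(count + 1)) per bin."""
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[idx] += 1
    return [(i / bins, c, math.log10(c + 1)) for i, c in enumerate(counts)]
```

A log-scaled frequency axis matters here because most repository pairs tend to have near-zero similarity, which would otherwise flatten the rest of the distribution.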