Engineering Leader | AI and Cloud Infrastructure, Platform Engineering, Performance and Reliability
Email: hello@martinspier.io
LinkedIn: linkedin.com/in/martinspier | GitHub: github.com/spiermar
Phone: +1 (425) 351-8800
Engineering leader with 15+ years of experience building and scaling platform, infrastructure, and performance engineering organizations across AI, fintech, and consumer internet. Scaled high-performing remote teams and 200+ engineer organizations, aligning technical innovation with business goals and delivering measurable gains in engineering velocity, efficiency, performance, and reliability.
Combines senior engineering leadership with deep technical judgment in distributed systems, cloud and AI infrastructure, observability, developer platforms, and performance engineering. Set technical direction across platform, infrastructure, and developer experience to improve engineering effectiveness, product velocity, reliability, and performance for products serving hundreds of millions of users. Authored widely used open-source performance tools and advised startups and investors on engineering strategy, infrastructure, and scale.
Platform Engineering & Developer Experience • Cloud, AI Infrastructure & Scalability • Performance, Reliability & Observability
Strategic Technical Leadership • Team Leadership, Development & Remote Management • Talent Acquisition & Organizational Growth
OpenAI | San Francisco, CA
ChatGPT Performance.
- Leading OpenAI’s ChatGPT Performance team to improve performance, efficiency, reliability, and end-user experience across the ChatGPT stack, partnering with product and infrastructure teams.
Parasail AI | San Mateo, CA
Parasail AI provides fully managed AI inference at scale, enabling builders to deploy any model on fast, flexible, and cost-efficient infrastructure that scales infinitely.
- Structured engineering planning, development, hiring, and onboarding processes, improving delivery predictability and growing the engineering team by 50% in 6 months.
- Improved inference performance and efficiency, including Time to First Token (TTFT) and tokens per second (TPS), increasing platform competitiveness and supporting 5x+ ARR growth in 9 months.
- Shipped low-latency audio workflows built on composable STT, LLM, and TTS models, and launched multiple open-source models on day one of release.
- Launched a fully managed, consumption-based inference service that became the company’s primary growth driver, while supporting the executive team through Series A fundraising.
- Led the company’s SOC 2 compliance program, strengthening enterprise readiness while scaling the platform and engineering organization.
PicPay | Remote
PicPay, the largest digital wallet in Latin America, offers a wide range of financial services, including peer-to-peer payments, credit cards, personal loans, insurance, and investments, to more than 60 million users.
- Led a cross-functional organization of 200+ engineers across cloud infrastructure, backend, mobile, data and ML platforms, observability, and developer experience; aligned technical strategy with business goals across a complex fintech environment.
- Defined and executed a platform engineering strategy that reduced application and infrastructure complexity, cut Change Lead Time by 20%, and increased Deployment Frequency.
- Prioritized the Internal Developer Portal and consolidated developer tooling across CI/CD, observability, and source code management, improving developer satisfaction (CSAT) from 22% to 87%.
- Established a data-driven operating model with clear KPIs, quarterly planning, and OKRs, improving alignment between engineering execution and business priorities.
- Reorganized shared engineering functions around business outcomes, aligned product and technology teams more closely, and revamped hiring and interview processes to improve talent density and delivery focus.
- Established a FinOps practice across 16 business units and renegotiated major vendor contracts, delivering more than $30M in annualized savings, reducing technology costs by 22%, and improving cost-to-serve.
- Led the decommissioning of large legacy systems and the transition to a multi-account cloud architecture using Infrastructure as Code (IaC), improving scalability, security, and maintainability.
- Secured executive buy-in for a hybrid and Server-Driven UI framework, reducing mobile time-to-market by 94% and significantly improving developer productivity.
- Improved end-user retention by orienting teams around crashes, ANRs, UI hangs, and slow starts, reducing ANRs by 88% and crashes by 80% through refactoring, optimization, and automated triage.
Snowflake | San Mateo, CA
- Improved Snowflake’s cloud data platform through architectural changes, performance analysis of core components, and service-level optimizations, while refining deployment models to reduce change failure rate and improve overall availability and stability.
- Developed observability and performance analysis prototypes that improved issue detection and alerting, enabled self-healing capabilities, and strengthened end-user stability and efficiency.
Netflix | Los Gatos, CA
- Performance Optimization: Designed scalable architectures and implemented optimizations for Netflix’s global streaming platform, serving over 200 million subscribers. Conducted performance analysis and tuning across all system layers, improving reliability and performance of real-time, batch, big data, and ML workloads.
- Technical Thought Leadership: Provided performance, reliability, scalability, and efficiency expertise to guide major changes and projects, and represented Netflix at conferences as a subject matter expert.
- Performance Tools Development:
  - FlameScope: Open-source tool for visualizing flame graphs across time ranges, widely adopted in the industry.
  - FlameCommander: Cloud profiling tool for capturing and analyzing CPU, memory, and heap dump profiles on any cloud instance or container.
  - Visualizations: Developed open-source plugins for flame graphs and heatmaps to enhance performance data analysis. These plugins are widely used in the industry and are part of major projects such as Apache Flink, Google’s pprof profiler, and Oracle’s Java Flight Recorder.
  - FlameCloud: Continuous profiling solution collecting thousands of software profiles and millions of stacks daily.
  - End-to-End Tracing: Framework linking user device tracing with backend service tracing (Zipkin-based) for improved observability.
  - Icarus: Real user performance monitoring solution processing 180B+ events daily, with a GUI for analysis, alerting, and anomaly detection.
  - Vector: Open-source, on-host, high-resolution performance monitoring framework.
  - Mogul and Slalom: System demand and bottleneck analysis tools for visualizing data flows and dependencies in large-scale systems.
- Created the Netflix User Performance Score, a Lighthouse-inspired compound metric that represents a user’s experience and serves as the basis for modeling the impact of performance bottlenecks on user retention.
- Cloud Architecture Leadership: Defined and implemented cloud architecture standards and best practices during Netflix’s migration to the cloud, ensuring scalability, reliability, and efficiency across the platform.
- Mentorship & Culture Building: Mentored engineers and onboarded new hires on performance tools and practices, raising Netflix’s bar for engineering excellence.
Expedia | Bellevue, WA
Dell | Porto Alegre, Brazil
Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Brazil
- Project Management (2009 - 2010)
- B.Sc. Computer Science (2002 - 2008)