TypeSafe is a frontier model lab. We build reliable and general AI systems to power economically valuable automation. Our mission is to usher in a new era of Transformative Artificial Intelligence (TAI): technology with the power to drive a societal shift on the scale of the agricultural and industrial revolutions.
While others chase benchmarks and academic puzzles, we’ve been quietly rethinking the LLM stack from first principles — building a new kind of general frontier model designed for real-world reliability, decision-making, and autonomy in production.
We’re a small, fast-moving team from OpenAI, Google Brain, and Meta/FAIR, backed by top-tier investors. Since mid-2024, we’ve been engineering the foundation for what comes after the current “state-of-the-art” — a model that actually gets things done.
About the Role
We're looking for an Infrastructure Engineer to build and operate the infrastructure behind TypeSafe AI's products at global scale. You'll own the systems that serve millions of users across regions — from provisioning Kubernetes clusters across multiple clouds to optimizing networking for low-latency AI inference.
This is a high-impact role on a small, fast-moving team. You'll work across the full infrastructure stack: cloud primitives, container orchestration, networking, observability, and the specialized infra that makes large-scale model inference efficient.
What You'll Do
Design, deploy, and operate Kubernetes clusters across multiple regions and clouds
Build and maintain infrastructure for the platform that powers LLM inference workloads globally
Own networking, including VPCs, peering, load balancing, DNS, service mesh, and CNI
Manage GPU infrastructure and autoscaling for ML workloads
Write and maintain infrastructure as code (Pulumi / Python)
Operate and improve observability: monitoring, alerting, tracing, logging
Requirements
Deep experience with Kubernetes in production at scale: networking, storage, scheduling, upgrades
Strong background in AWS
Hands-on experience with infrastructure as code (Pulumi, Terraform, or similar)
Solid understanding of Linux networking
Track record with high-traffic production ML systems
Programming fluency, Python preferred
Nice to Have
Experience with large-scale LLM / ML inference infrastructure (GPU scheduling, model serving, vLLM, KubeRay, Kubernetes-native tooling)
Kubernetes networking depth with Cilium or other CNI plugins; service mesh (Istio, Envoy)
Multi-cloud infrastructure
Background in site reliability engineering including SLOs, incident response, capacity planning
Life at TypeSafe
We’re a small, flat, close-knit team dedicated to real-world impact and to preparing the world for Transformative AI. Our team works fully in-person in our San Francisco office near Embarcadero station. We love what we do and care about our work a lot.
We strive for excellence and craftsmanship and won't stop until we get there. When the team wins, we all win, and we enjoy collaborating and inspiring each other to grow as a team and as individuals.
We also value emotional honesty, kindness, and bringing your whole self to work. We build machines; we don't try to be machines.
We want you to be able to do the most impactful work of your career at TypeSafe and help define our future as a company.
We provide
- Base salary of $180k-280k plus equity, based on leveling
- 100% covered health insurance
- Daily lunch and dinner
- Visa sponsorship
- 401(k) plan