Employees can work remotely
Job Description
Team
Join the Global Cloud Services organization's FinOps Tools team, which is building ServiceNow's next-generation analytics and financial governance platform. Our team owns the full modern data stack: Trino for distributed queries, dbt for transformations, Iceberg for lakehouse architecture, Lightdash for business intelligence, and Argo Workflows for orchestration. You will own the Forecast Engine, the system that turns ServiceNow's cloud capacity and cost actuals into forward-looking forecasts, then automatically tracks those forecasts against plan and budget and alerts the right people when reality diverges. The Forecast Engine also feeds directly into our Future Capacity Reservation (FCR) automation: its forecast of fleet growth and workload migration timing determines how much hyperscaler capacity to reserve, with which providers and in which regions, and when to reserve it, given the lead-time windows that FinOps and Cloud Operations plan around.
Role
The Forecast Engine is the simulation and automation core behind FinOps capacity and cost planning. It reads capacity and cost actuals from the lakehouse and runs a deterministic multi-period simulation of fleet growth, workload migration, placement, and sizing. It validates each result against hard invariants and publishes forecasts that data scientists, analysts, and FinOps engineers consume in Lightdash. Today it is a fast, single-binary Rust core with a streaming Trino read path and an Iceberg publish path. The next chapter is to turn that engine into an automated, always-on forecasting service.
As our Staff Software Engineer for the Forecast Engine, you will design and build the automation layer around the engine: scheduled forecast runs; variance and budget tracking against plan; anomaly and threshold alerting; first-class integration with planning systems, Splunk, and the broader observability stack; and the handoff that turns forecasts into Future Capacity Reservation (FCR) recommendations. You will make the forecast a living signal: recomputed on a cadence, reconciled against actuals, and translated into the capacity reservations that keep hyperscaler supply ahead of demand.
This role demands speed. You will take a proven simulation core and rapidly turn it into a dependable, observable, self-monitoring product that the organization plans against, shipping working increments fast and iterating in tight loops. The automation layer around the engine is greenfield: you will build it from the ground up. The team and the wider department operate like a small startup: we move quickly, deliver early, and keep process light.
What You'll Do: Core Responsibilities
- Design and develop scalable, maintainable, and reusable software components with a strong emphasis on performance, determinism, and reliability.
- Collaborate with product managers and FinOps partners to translate planning and budgeting requirements into well-architected solutions, owning features from design through delivery.
- Build intuitive and extensible interfaces for forecast consumption (Lightdash models, alert payloads, and APIs) that stay flexible across finance and capacity-planning use cases.
- Contribute to the design and implementation of new Forecast Engine capabilities while enhancing existing simulation, validation, and publish paths.
- Integrate automated testing into development workflows to ensure consistent quality across releases, including determinism (byte-identical output) and forecast-accuracy regression checks.
- Participate in design and code reviews to uphold best practices in performance, maintainability, and testability.
- Develop comprehensive test strategies covering functional, regression, integration, and accuracy aspects (period-over-period identity, backtest grading against actuals).
- Foster a culture of engineering craftsmanship and continuous improvement by sharing best practices in engineering and quality across the team.
Technical Leadership & Architecture
- Own the architecture of the Forecast Engine and the automation layer around it: scheduled runs, variance/budget tracking, and alerting.
- Lead technical decision-making on forecast cadence, reconciliation against actuals, alert routing, and the contract between the simulation core and downstream consumers.
- Establish best practices for forecast automation: idempotent scheduled runs, deterministic reproducibility, fail-loud data contracts, and no silent fallbacks.
- Define how forecast signals (variance, budget breach, capacity headroom, migration drift) are computed, thresholded, and surfaced.
- Drive innovation in forecasting and planning automation, including the responsible use of AI/ML tooling to accelerate development and analysis.
Hands-On Development
- Build the automation that runs the Forecast Engine on a schedule via Argo Workflows, with retries, alerting on failure, and run-to-run reproducibility.
- Develop variance and budget tracking: reconcile each forecast against plan and against the latest actuals, compute deltas at the grains that matter (provider, region, pod, workload), and persist a queryable variance history.
- Implement alerting that fires on budget breach, forecast drift, capacity thresholds, and pipeline health, routed to Splunk and the team's notification channels.
- Integrate with planning systems so plan/budget targets flow into the engine and forecast outputs flow back out to the planning surface.
- Drive the Future Capacity Reservation (FCR) handoff: translate the forecast of fleet growth and migration timing into reservation recommendations (how much capacity, which providers/regions/pods, and by when), aligned to hyperscaler procurement lead-time windows and reconciled with Cloud Operations so the same capacity is never reserved twice.
- Build and extend the Rust simulation core (period loop, growth, migration, routing, packing, sizing, validation) and its streaming Trino read and Iceberg publish paths.
- Create and maintain the Lightdash forecast and variance marts (standard dbt models on the published tables) that finance and capacity partners consume.
Platform Foundation
- Design the forecast data contract (the upstream view the engine reads) so data-quality problems halt loudly and are fixed at the source, never papered over downstream.
- Implement scheduled, observable forecast runs with full run lineage: inputs, seed, config, output location, and metrics for every run.
- Build observability and monitoring for the Forecast Engine: run success rates, forecast latency, memory ceilings, accuracy drift, and alert-delivery health, emitted to Splunk and the observability stack.
- Establish an automation foundation that scales from a handful of scheduled scenarios to a broad, multi-scenario forecasting program.
Forecast Automation & Alerting
- Create scheduled, parameterized forecast scenarios with opinionated structure: pinned config, deterministic seeds, validated inputs, and published outputs.
- Build tooling for one-command scenario runs and for promoting a scenario from ad-hoc to scheduled with minimal manual intervention.
- Establish guardrails: input data contracts, resource/memory ceilings, and loud halts that surface real problems instead of producing wrong-but-quiet numbers.
- Collaborate closely with FinOps analysts and capacity planners to rapidly iterate on variance definitions, alert thresholds, and the signals that matter, without over-engineering.
- Prioritize forecast reliability, accuracy tracking, and clear alerting over feature breadth.
AI-Augmented Development
- Use modern AI development tools (e.g., Claude Code, Cursor, GitHub Copilot) to accelerate development, testing, and analysis, and help the team adopt effective, well-validated AI-assisted practices.
Collaboration & Integration
- Work autonomously, with direction on priorities from Engineering and FinOps leadership.
- Collaborate with DevOps and platform teams on scheduling infrastructure, CI/CD pipelines, and Splunk/observability integration.
- Partner with FinOps Tools team members working on Trino, dbt, Lightdash, and Iceberg to ensure seamless integrations.
- Partner with finance and capacity-planning stakeholders to ensure forecasts, variance, and alerts map to how they actually plan and budget.