























In April 2025, Amplitude officially started its FinOps org. I joined as the first and only FinOps Engineer, and my first big task was to pick our FinOps tool.
Traditionally, this is a no-brainer: for a company of Amplitude’s size and scale, you buy your FinOps platform.
However, April 2025 is also when Amplitude began a strategic pivot into becoming AI-first. That mandate showed up everywhere: in how we build and ship code, how we improve features through the sense/decide/act loop, and even in how we handle day-to-day work questions.
I wanted to rethink how an AI-native company would approach FinOps. Using the FinOps Foundation Framework as a guiding principle, I built my own tools and solved problems the AI way. After only a year, AI has already had a significant impact on efficiency.
Here are some of my experiences and what I built during a year of AI FinOps at Amplitude.
In FinOps, you need the right tools to ingest billing data from many sources, normalize it, and expose reports and dashboards on top. That’s how you’re able to justify the business value of your company’s tech stack and find ways to optimize it.
A traditional FinOps playbook might encourage us to scale FinOps by procuring a foundational FinOps tool. But with the advent of AI coding assistants, I decided to see if I could build our own economically.
In April 2025, Amplitude hosted an AI Hackathon Week where we learned how to use the new AI coding assistants and saw what we could accomplish in a week. I was blown away by what they were capable of. The tab completions in GitHub Copilot were magical, but this was on another level.
Just as important as the coding assistants are the internal agents I built. Together, they changed FinOps from manually answering every question to a system that encodes and reuses knowledge. By my estimate, this saved 50% of my time, allowing me to focus on cost optimization rather than answering questions or researching issues.
(Note: Agent names are borrowed from my favorite K-pop group. I’ll let you figure out which one.)
The first decision was where to store and normalize our billing data. I needed something that could query AWS Cost and Usage Reports directly in S3 (without building and maintaining a full ETL pipeline) and also let me define a single normalization layer that stays fresh without manual rebuilds.
Redshift checked both boxes: External Schemas (via Spectrum) let me query CUR data in place from S3 via the Glue Catalog, and Materialized Views gave me one easy-to-maintain table where all normalization logic lives as a single source of truth. When we needed non-AWS vendor data, I added lightweight Lambda functions to fetch and insert it into Redshift, with no new pipeline architecture required. We considered other options, such as Snowflake, RDS, and BigQuery, but ultimately Redshift was the cheapest option that met both requirements.
To set this up, I worked with the AI coding assistants to write all the infrastructure-as-code required to maintain the Redshift cluster, the Materialized Views, and the refresh schedule.
During our AI Hackathon Week, I decided to build our first “AI Slack Bot,” Agent YA, to help answer AWS cost-related questions. Modeled on how I’d do it by hand, the first iteration of YA simply took the user’s question, generated a SQL query, ran it, and returned the results.
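To make that question-to-SQL-to-results loop concrete, here is a minimal sketch, not YA's actual code. It assumes a hypothetical `generate_sql` callable that wraps the model call, and it uses an in-memory SQLite connection as a stand-in for Redshift so the example runs anywhere. The read-only guardrail is my own assumption and is not described above.

```python
import sqlite3
from typing import Callable


def answer_cost_question(
    question: str,
    generate_sql: Callable[[str], str],
    conn: sqlite3.Connection,
) -> str:
    """Turn a natural-language question into SQL, run it, and format the rows."""
    # Step 1: ask the model for a query (generate_sql is a hypothetical LLM wrapper).
    sql = generate_sql(question).strip()

    # Guardrail (an assumption, not from the post): only allow read-only queries.
    if not sql.lower().startswith("select"):
        raise ValueError(f"refusing to run non-SELECT statement: {sql!r}")

    # Step 2: run the query against the warehouse (SQLite stands in for Redshift).
    cursor = conn.execute(sql)
    headers = [column[0] for column in cursor.description]

    # Step 3: return a plain-text table suitable for a Slack message.
    lines = [" | ".join(headers)]
    for row in cursor.fetchall():
        lines.append(" | ".join(str(value) for value in row))
    return "\n".join(lines)
```

Passing the SQL generator in as a parameter keeps the sketch testable with a canned query; a real bot would swap in a model call and a Redshift connection.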
YA was designed with:
Ask a question in Slack, get an instant breakdown of your top AWS RDS cost drivers.
Initially, YA did some things very well:
But it also had some drawbacks:
Cost anomalies were one of the most time-consuming parts of my day. I’d manually check dashboards and try to eyeball what looked off. If something looked fishy, I would have to spend hours digging through multiple data sources to figure out what changed and why. Agent TY was created to automate that entire loop.
TY was designed with:
The agent would identify the anomalous service and try to determine which usage, resource, or other factor caused the anomaly. A report would be created in Slack, structured around five sections: Who, What, Where, When, and How.
AI catches a DynamoDB cost spike, explains what changed, and recommends what to look at next.
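The post does not say how TY decides that a cost looks anomalous. As one possible approach, assumed here purely for illustration, the sketch below flags a service when its latest daily cost exceeds its trailing average by a margin. The function name, the thresholds, and the input shape are all hypothetical.

```python
from statistics import mean, pstdev


def find_anomalies(
    daily_costs: dict[str, list[float]],
    threshold: float = 3.0,
    min_increase: float = 10.0,
    min_history: int = 7,
) -> list[dict]:
    """Flag services whose latest daily cost is well above their trailing baseline."""
    anomalies = []
    for service, series in daily_costs.items():
        history, latest = series[:-1], series[-1]
        if len(history) < min_history:
            continue  # not enough data to form a baseline

        baseline = mean(history)
        spread = pstdev(history)
        # Require both a statistical margin and an absolute floor, so a very
        # flat history does not turn small fluctuations into alerts.
        limit = baseline + max(threshold * spread, min_increase)
        if latest > limit:
            anomalies.append(
                {
                    "service": service,
                    "baseline": baseline,
                    "latest": latest,
                    "increase": latest - baseline,
                }
            )
    # Largest increases first, so the report leads with the biggest spike.
    return sorted(anomalies, key=lambda item: item["increase"], reverse=True)
```

Each flagged entry would then be the starting point for the deeper investigation that fills in the Who, What, Where, When, and How sections of the report.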
Reservations and Savings Plans are among the biggest levers for cost optimization, but tracking utilization, coverage, and expirations across multiple AWS services is tedious spreadsheet work. Agent YR was built to automate that analysis and surface actionable recommendations weekly.
Using the AWS SDK, YR pulls utilization and coverage data for Compute Savings Plans, ElastiCache, and RDS alongside existing reservation inventory. Then, it normalizes all instance data to a common unit (xlarge) so we can compare coverage across different instance sizes within the same family. Without that normalization, a mix of large, 2xlarge, and 4xlarge instances makes apples-to-apples analysis impossible.
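The xlarge normalization can be sketched as follows. This is an illustration rather than YR's implementation: the size ratios follow AWS's published normalization factors rescaled so that xlarge equals 1, while the function names and the coverage formula (reserved units capped at running units, divided by running units) are my assumptions.

```python
import re
from collections import defaultdict

# Sizes below xlarge, expressed in xlarge units (xlarge = 1.0).
_FRACTIONAL_SIZES = {
    "nano": 0.03125,
    "micro": 0.0625,
    "small": 0.125,
    "medium": 0.25,
    "large": 0.5,
    "xlarge": 1.0,
}


def xlarge_units(size: str) -> float:
    """Convert an instance size such as 'large' or '4xlarge' to xlarge units."""
    if size in _FRACTIONAL_SIZES:
        return _FRACTIONAL_SIZES[size]
    match = re.fullmatch(r"(\d+)xlarge", size)
    if match:
        return float(match.group(1))  # e.g. '4xlarge' counts as 4 xlarge units
    raise ValueError(f"unknown instance size: {size!r}")


def family_units(counts: dict[str, int]) -> dict[str, float]:
    """Sum instance counts per family (e.g. 'cache.r6g') in xlarge units."""
    totals: dict[str, float] = defaultdict(float)
    for instance_type, count in counts.items():
        family, _, size = instance_type.rpartition(".")
        if not family:
            raise ValueError(f"unexpected instance type: {instance_type!r}")
        totals[family] += count * xlarge_units(size)
    return dict(totals)


def coverage_by_family(
    running: dict[str, int], reserved: dict[str, int]
) -> dict[str, float]:
    """Fraction of running capacity covered by reservations, per family."""
    demand = family_units(running)
    supply = family_units(reserved)
    return {
        family: min(supply.get(family, 0.0), units) / units
        for family, units in demand.items()
        if units > 0
    }
```

For example, four `cache.r6g.large`, two `cache.r6g.2xlarge`, and one `cache.r6g.4xlarge` nodes add up to 10 xlarge units, so five reserved `cache.r6g.xlarge` nodes would cover half of that family.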
Each week, YR sends a Slack report with current coverage and utilization numbers, flags upcoming expirations, and recommends net new reservations.
AWS reservation health includes wasted spend, expiring reservations, and savings opportunities.
The v1 agents worked, but they didn’t scale. Each agent had its own bloated system prompt with baked-in schema definitions, and none could share data or tools. If I added a new table, I had to update every agent individually.
For v2, I refactored around a single idea: turn data access into shared tools rather than embedded knowledge.
This had several advantages:
In this example, Agent YA reminded me why I had a task to migrate our ElastiCache cluster from r6g. It was able to better recommend what I should do over the following months.
AI-powered ElastiCache analysis explains reservation utilization, renewal risks, and the reasons for creating migration tasks.
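The v2 idea of sharing data access as tools, instead of embedding schema knowledge in every prompt, can be sketched like this. All class and tool names here are hypothetical, and the sketch leaves out the model call entirely. It only shows how several agents can draw on one registry, so that adding a data source is a single registration rather than an edit to every agent.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    """A named capability that any agent can call."""
    name: str
    description: str
    run: Callable[..., Any]


class ToolRegistry:
    """Single place where data-access tools are defined and described."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"tool already registered: {tool.name}")
        self._tools[tool.name] = tool

    def describe(self) -> str:
        # Rendered into each agent's prompt in place of baked-in schema text.
        return "\n".join(f"- {t.name}: {t.description}" for t in self._tools.values())

    def call(self, name: str, **kwargs: Any) -> Any:
        return self._tools[name].run(**kwargs)


@dataclass
class Agent:
    """An agent keeps only its role; tool knowledge comes from the registry."""
    name: str
    role_prompt: str
    registry: ToolRegistry

    def system_prompt(self) -> str:
        return f"{self.role_prompt}\n\nAvailable tools:\n{self.registry.describe()}"
```

Because every agent builds its prompt from the same registry, registering a new tool once makes it visible to all of them the next time their prompts are built.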
At this point, it was becoming increasingly difficult to maintain all the separate Lambda functions that would pull from various vendors. So I consolidated all processes into a single service called data-orchestrator:
My goal was to democratize data access for everyone, at any time. With these Agents, I was no longer the bottleneck for analysis or insights. This has freed up at least 50% of my time, allowing me to focus on high-leverage cost optimization.
AI coding assistants let me build internal tools at a pace that wouldn’t have been possible two years ago. And because everything is in-house, iteration is fast; what used to take days now takes less than an hour. During the first couple of months, I was pushing out code changes daily.
Right now, our agents can sense issues and make recommendations, but they still require manual changes from engineers. The next step is to move them toward action, enabling agents to pinpoint required changes, submit them as pull requests, and eventually detect and resolve their own errors.