AI optimization: How we cut energy costs in social media recommendation systems

When you scroll through Instagram Reels or browse YouTube, the seamless flow of content feels like magic. But behind that curtain lies a massive, energy-hungry machine. As a software engineer working on recommendation systems at Meta and now Google, I’ve seen firsthand how the quest for better AI models often collides with the physical limits of computing power and energy consumption.

We often talk about “accuracy” and “engagement” as the north stars of AI. But recently, a new metric has become just as critical: efficiency.

At Meta, I worked on the infrastructure powering Instagram Reels recommendations. We were dealing with a platform serving over a billion daily active users. At that scale, even a minor inefficiency in how data is processed or stored snowballs into megawatts of wasted energy and millions of dollars in unnecessary costs. We faced a challenge that is becoming increasingly common in the age of generative AI: how do we make our models smarter without making our data centers hotter?

The answer wasn’t in building a smaller model. It was in rethinking the plumbing — specifically, how we computed, fetched and stored the training data that fueled those models. By optimizing this “invisible” layer of the stack, we achieved megawatt-scale energy savings and reduced annual operating expenses by eight figures. Here is how we did it.

The hidden cost of the recommendation funnel

To understand the optimization, you have to understand the architecture. Modern recommendation systems generally function like a funnel.

At the top, you have retrieval, where we select thousands of potential candidates from a pool of billions of media items. Next comes early-stage ranking, a high-efficiency phase that filters this large pool down to a smaller set. Finally, we reach late-stage ranking. This is where the heavy lifting happens. We use complex deep learning models — often two-tower architectures that combine user and item embeddings — to precisely order a curated set of 50 to 100 items to maximize user engagement.
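The funnel can be sketched in a few lines of Python. The scoring functions below are hypothetical stand-ins for the real models (retrieval in production uses approximate nearest-neighbor search, not random sampling), but the shape of the pipeline is the same:

```python
import random

def cheap_score(user_id, item_id):
    # Stand-in for a lightweight early-stage ranking model.
    return hash((user_id, item_id)) % 1000

def precise_score(user_id, item_id):
    # Stand-in for an expensive late-stage deep model
    # (e.g. a two-tower network over user and item embeddings).
    return hash((user_id, item_id, "deep")) % 1000

def recommend(user_id, pool):
    # Stage 1 -- retrieval: billions of items down to thousands of candidates.
    candidates = random.sample(pool, min(5000, len(pool)))
    # Stage 2 -- early-stage ranking: thousands down to a few hundred.
    shortlist = sorted(candidates, key=lambda i: cheap_score(user_id, i),
                       reverse=True)[:500]
    # Stage 3 -- late-stage ranking: precisely order the final 50-100 items.
    return sorted(shortlist, key=lambda i: precise_score(user_id, i),
                  reverse=True)[:100]

feed = recommend(user_id=42, pool=list(range(1_000_000)))
```

Each stage trades precision for throughput: the cheap model touches thousands of items, while the expensive model only ever sees a few hundred.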

This final stage is incredibly feature-dense. To rank a single Reel, the model might look at hundreds of “features.” Some are dense features (like the time a user has spent on the app today) and others are sparse features (like the specific IDs of the last 20 videos watched).
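As an illustration, the payload for a single (user, item) pair might look like the following. The names and values here are invented for clarity, not an actual production schema:

```python
# A hypothetical feature payload for one (user, item) candidate pair.
candidate_features = {
    "dense": {                       # continuous values, fixed-width
        "session_time_sec": 1240.0,  # time spent on the app today
        "video_age_hours": 6.5,
    },
    "sparse": {                      # variable-length ID lists
        "last_watched_ids": [901, 317, 455, 88],  # recent watch history
        "followed_creator_ids": [12, 77],
    },
}
```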

The system doesn’t just use these features to rank content; it also has to log them. Why? Because today’s inference is tomorrow’s training data. If we serve you a video and you “like” it, we need to join that positive label with the exact features the model saw at that moment to retrain and improve the system.

This logging process — writing feature values to a transitive key-value (KV) store to wait for user interaction — was our bottleneck.

The challenge of transitive feature logging

To understand why this bottleneck existed, we have to look at the microscopic lifecycle of a single training example.

In a typical serving path, the inference service fetches features from a low-latency feature store to rank a candidate set. However, for a recommendation system to learn, it needs a feedback loop. We must capture the exact state of the world (the features) at the moment of inference and later join them with the user’s future action (the label), such as a “like” or a “click.”

This creates a massive distributed systems challenge: stateful label joining.

We cannot simply query the feature store again when the user clicks, because features are mutable — a user’s follower count or a video’s popularity changes by the second. Using fresh features with stale labels introduces “online-offline skew,” effectively poisoning the training data.

To solve this, we use a transitive key-value (KV) store. Immediately after ranking, we serialize the feature vector used for inference and write it to a high-throughput KV store with a short time-to-live (TTL). This data sits there, “in transit,” waiting for a client-side signal.

  • If the user interacts: The client fires an event, which acts as a key lookup. We retrieve the frozen feature vector from the KV store, join it with the interaction label and flush it to our offline training warehouse (e.g., Hive/Data Lake) as a “source-of-truth” training example.
  • If the user does not interact: The TTL expires, and the data is dropped to save costs.
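The lifecycle above can be sketched with a toy in-memory store. This is a minimal illustration, assuming invented feature names and a simplified TTL check; the real system is a distributed, high-throughput KV service:

```python
import time
import uuid

class TransitiveKVStore:
    """Toy in-memory stand-in for a high-throughput KV store with TTL."""

    def __init__(self, ttl_sec):
        self.ttl_sec = ttl_sec
        self._data = {}  # key -> (expiry_timestamp, frozen_features)

    def put(self, key, features):
        self._data[key] = (time.time() + self.ttl_sec, features)

    def pop(self, key):
        entry = self._data.pop(key, None)
        if entry is None or entry[0] < time.time():
            return None  # never logged, or the TTL already expired
        return entry[1]

store = TransitiveKVStore(ttl_sec=3600)

# At inference time: freeze exactly the features the model scored with.
impression_id = str(uuid.uuid4())
store.put(impression_id, {"follower_count": 1200, "video_id": 7781})

# Later, the client reports a "like" for that impression: join the
# interaction label with the frozen features, not with fresh ones.
frozen = store.pop(impression_id)
if frozen is not None:
    training_example = {**frozen, "label": 1}
    # ...flush training_example to the offline warehouse (e.g. Hive).
```

The key design point is that `pop` returns the features as they were at inference time, which is what prevents online-offline skew.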

This architecture, while robust for data consistency, is incredibly expensive. We were essentially continuously writing petabytes of high-dimensional feature vectors to a distributed KV store, consuming massive network bandwidth and serialization CPU cycles.

Optimizing the “head load”

We realized that our “write amplification” was out of control. In the late-stage ranking phase, we typically rank a deep buffer of items — say, the top 100 candidates — to ensure the client has enough content cached for a smooth scroll.

The default behavior was eager logging: We would serialize and write the feature vectors for all 100 ranked items into the transitive KV store immediately.

However, user behavior follows a steep decay curve. A user might only view the first 5–6 items (the “head load”) before closing the app or refreshing the feed. This meant we were paying the serialization and I/O cost to store features for items 7 through 100, which had a near-zero probability of generating a positive label. We were effectively DDoS-ing our own infrastructure with “ghost data.”

We shifted to a “lazy logging” architecture.

  1. Selective persistence: We reconfigured the serving pipeline to initially persist features only for the head load (e.g., the top 6 items) into the KV store.
  2. Client-triggered pagination: As the user scrolls past the head load, the client fires a lightweight “pagination” signal. Only then do we asynchronously serialize and log the features for the next batch (items 7–15).
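A minimal sketch of the two steps, assuming a plain dict as the KV store and invented constants for the head load and page size:

```python
HEAD_LOAD = 6    # items persisted eagerly at serve time
PAGE_SIZE = 9    # items logged per pagination signal (hypothetical)

def persist(store, items):
    # Stand-in for the serialize-and-write path to the transitive KV store.
    for item in items:
        store[item["impression_id"]] = item["features"]

def serve(ranked, store):
    # Eager write only for the head load the user will almost surely see.
    persist(store, ranked[:HEAD_LOAD])
    return HEAD_LOAD  # index of the next unlogged item

def on_pagination(ranked, store, next_idx):
    # Client scrolled past what was logged: lazily persist the next page.
    page = ranked[next_idx:next_idx + PAGE_SIZE]
    persist(store, page)
    return next_idx + len(page)

ranked = [{"impression_id": i, "features": {"rank": i}} for i in range(100)]
store = {}
cursor = serve(ranked, store)                   # KV writes: 6, not 100
cursor = on_pagination(ranked, store, cursor)   # user keeps scrolling: +9
```

If the user never scrolls, 94 of the 100 serialization-and-write operations simply never happen.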

This change decoupled our ranking depth from our storage costs. We could still rank 100 items to find the absolute best content, but we only paid the “storage tax” for the content that actually had a chance of being seen. This reduced our write throughput (QPS) to the KV store significantly, saving megawatts of power previously wasted on serializing data that was destined to expire untouched.

Rethinking storage schemas

Once we reduced what we stored, we looked at how we stored it.

In a standard feature store architecture, data is often stored in a tabular format where every row represents an impression (a specific user seeing a specific item). If we served a batch of 15 items to one user, the logging system would write 15 rows.

Each row contained the item features (which are unique to the video) and the user features (which are identical for all 15 rows). We were effectively writing the user’s age, location and follower count 15 separate times for a single request.

We moved to a batched storage schema. Instead of treating every impression as an isolated event, we separated the data structures. We stored the user features once for the request and stored a list of item features associated with that request.
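The effect is easy to demonstrate with a toy request serialized as JSON. The feature names and sizes here are invented, so the exact savings will differ from production, but the duplication pattern is the same:

```python
import json

# Hypothetical per-request data: one user, 15 ranked items.
user_features = {"age": 29, "country": "US", "follower_count": 1200,
                 "session_time_sec": 1240.0}
item_features = [{"video_id": i, "popularity": 3 * i} for i in range(15)]

# Impression-per-row schema: user features repeated in all 15 rows.
flat_rows = [{"user": user_features, "item": item} for item in item_features]

# Batched schema: user features stored once, items as a list.
batched = {"user": user_features, "items": item_features}

flat_size = len(json.dumps(flat_rows))
batched_size = len(json.dumps(batched))
print(f"flat={flat_size}B batched={batched_size}B "
      f"saved={1 - batched_size / flat_size:.0%}")
```

The larger the user feature block relative to the item features, the bigger the win from storing it once per request.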

This simple de-duplication reduced our storage requirement by more than 40%. In distributed systems like the ones powering Instagram or YouTube, storage isn’t passive; it requires CPU to manage, compress and replicate. By slashing the storage footprint, we improved bandwidth availability for the distributed workers fetching data for training, creating a virtuous cycle of efficiency throughout the stack.

Auditing the feature usage

The final piece of the puzzle was spring cleaning. In a system as old and complex as a major social network’s recommendation engine, digital hoarding is a real problem. We had over 100,000 distinct features registered in our system.

However, not all features are created equal. A user’s “age” might carry very little weight in the model compared to “recently liked content.” Yet, both cost resources to compute, fetch and log.

We initiated a large-scale feature auditing program. We analyzed the weights assigned to features by the model and identified thousands that were adding statistically insignificant value to our predictions. Removing these features didn’t just save storage; it reduced the latency of the inference request itself because the model had fewer inputs to process.
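Conceptually, the audit reduces to ranking features by a learned importance score and pruning the tail. The scores and threshold below are invented for illustration; in practice the importance signal might come from aggregate model weights or permutation tests:

```python
# Hypothetical aggregate importance per feature -- not real production data.
feature_importance = {
    "recently_liked_topics": 0.41,
    "watch_time_today": 0.22,
    "creator_affinity": 0.18,
    "user_age": 0.003,
    "account_creation_weekday": 0.0007,
}

PRUNE_THRESHOLD = 0.01  # assumption: below this, a feature is not worth its cost

kept = {name: score for name, score in feature_importance.items()
        if score >= PRUNE_THRESHOLD}
pruned = sorted(set(feature_importance) - set(kept))
```

Every pruned feature saves cost three times over: it is no longer computed at serving time, no longer logged to the KV store, and no longer fetched during training.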

The energy imperative

As the industry races toward larger generative AI models, the conversation often focuses on the massive energy cost of training GPUs. Reports indicate that AI energy demand is poised to skyrocket in the coming years.

But for engineers on the ground, the lesson from my time at Meta is that efficiency often comes from the unsexy work of plumbing. It comes from questioning why we move data, how we store it and whether we need it at all.

By optimizing our data flow — lazy logging, schema de-duplication and feature auditing — we proved that you can cut costs and carbon footprints without compromising the user experience. In fact, by freeing up system resources, we often made the application faster and more responsive. Sustainable AI isn’t just about better hardware; it’s about smarter engineering.

This article is published as part of the Foundry Expert Contributor Network.