Every great product evolves through iteration, and at the heart of iteration is experimentation. A/B testing isn't just a growth hack; it's how high-performing product teams validate ideas, refine UX, and balance innovation with user trust. Done well, experimentation creates a safety net for bold moves and a lens for product truth. Done poorly, it produces vanity metrics, false positives, and misleading learnings.
This blog post explores how companies like Netflix, Booking.com, and Pinterest build experimentation systems at scale. We’ll deconstruct real-world tests, highlight infrastructure design, and offer a framework for PMs to run meaningful experiments.
Key Definitions
A/B Testing: A randomized experiment comparing two (or more) variants to determine which performs better on a defined metric.
Experimentation Infrastructure: The tools and processes (feature flags, data pipelines, dashboards) that allow fast, safe, and repeatable testing.
Feature Flag: A technique for enabling/disabling features for subsets of users in real time.
Guardrail Metrics: Non-primary metrics monitored to ensure the test doesn’t cause harm (e.g. performance, error rate).
Statistical Significance: A measure of confidence that an observed difference reflects a real effect rather than random chance (conventionally, a p-value below 0.05).
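To make the feature-flag and assignment ideas above concrete, here is a minimal sketch of deterministic variant bucketing, the building block most experimentation platforms share. The function name and experiment key are illustrative, not from any specific platform; the core idea is that hashing the user ID together with the experiment name keeps assignment stable across sessions while de-correlating buckets between experiments.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically assign a user to a variant.

    Hashing user_id together with the experiment name means the same
    user always sees the same variant of this experiment, but lands
    in an independent bucket for other experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same user always lands in the same bucket for a given experiment.
assert assign_variant("user-42", "save-button-copy") == \
       assign_variant("user-42", "save-button-copy")
```

Because assignment is a pure function of (user, experiment), no lookup table is needed at serve time, which is what makes rollouts and kill switches cheap.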
Case Study 1 — Pinterest: From "Pin It" to "Save"
One of Pinterest’s most famous A/B tests was deceptively simple: changing the call-to-action text on Pins from "Pin It" to "Save". While "Pin It" aligned with brand identity, Pinterest’s team hypothesized that "Save" might resonate more with new users, especially non-core users who were unfamiliar with Pinterest lingo.
The A/B test was set up across new and existing users, measuring engagement (Pin saves), activation (first board creation), and retention. Results showed a measurable lift in all three, particularly among first-time users and those outside the U.S.
This wasn’t just a UI tweak; it reflected a deeper lesson: language clarity drives action. The test also proved the power of infrastructure: Pinterest could ship, flag, segment, and measure a global change at scale.
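How would a team decide that a lift like Pinterest's is real and not noise? A standard tool is the two-proportion z-test. The sketch below uses entirely hypothetical counts (Pinterest has not published its raw numbers); it shows the mechanics of testing whether a difference in save rates between control and treatment is statistically significant.

```python
from math import sqrt, erf

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates.

    conv_*: number of converting users; n_*: users in each arm.
    Returns (z statistic, two-sided p-value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via erf; two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical: 5,000 users per arm, 11.0% vs 12.5% save rate.
z, p = two_proportion_z(550, 5000, 625, 5000)
```

With these made-up numbers the p-value comes in below 0.05, so the lift would clear the conventional significance bar; at smaller sample sizes the same 1.5-point difference would not.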
Case Study 2 — Netflix: Personalization Variants at Scale
Netflix runs thousands of concurrent A/B tests, from signup flows to artwork thumbnails. One classic experiment involved changing which artwork was shown for a movie or series, based on what a user might find most appealing.
For example, a user who frequently watches comedies might see a light-hearted scene from a movie, while a drama fan might see an intense close-up from the same film. The infrastructure supporting this test required real-time targeting, dynamic asset serving, and streaming-friendly latency. The result? Higher engagement (click-to-watch), longer session times, and more effective content discovery. Netflix didn’t just test content; it tested how it was framed.
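The serving logic behind such a test can be sketched very simply. The catalog shape, file names, and genre keys below are hypothetical, not Netflix's actual system; the point is the pattern: pick the artwork variant matching the user's strongest affinity, with a safe default.

```python
# Hypothetical asset catalog: each title has artwork variants
# tagged by the mood or genre they emphasize.
ARTWORK = {
    "title-123": {
        "comedy": "lighthearted_scene.jpg",
        "drama": "intense_closeup.jpg",
        "default": "key_art.jpg",
    },
}

def pick_artwork(title_id: str, user_top_genre: str) -> str:
    """Return the artwork variant matching the user's top genre,
    falling back to the title's default key art."""
    variants = ARTWORK.get(title_id, {})
    return variants.get(user_top_genre, variants.get("default", ""))
```

In a real system this lookup would be backed by a model scoring each asset per user, but even this toy version makes the measurement question clear: does personalized framing beat the default key art on click-to-watch?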
Case Study 3 — Booking.com: Operationalizing Experimentation
Booking.com is legendary for its experimentation culture, at one point running over 1,000 concurrent A/B tests. But the magic isn't just in quantity; it's in infrastructure and discipline.
Their system includes:
A central experimentation platform with automated assignment, tracking, and significance calculation.
Guardrail monitoring to prevent regressions in key metrics (e.g. cancellations, load speed).
Cultural practices: PMs are expected to test assumptions early, kill weak ideas quickly, and avoid "HIPPO" decisions (Highest Paid Person’s Opinion).
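Guardrail monitoring like Booking.com's can be reduced to a simple contract: every experiment is checked against a shared set of metric thresholds, and any breach blocks or rolls back the test. The metric names and thresholds below are illustrative assumptions, not Booking.com's actual guardrails.

```python
# Hypothetical guardrail config: metric name -> (direction, threshold).
GUARDRAILS = {
    "p95_latency_ms":    ("max", 800),    # page load must not regress
    "error_rate":        ("max", 0.01),
    "cancellation_rate": ("max", 0.15),
}

def check_guardrails(treatment_metrics: dict) -> list:
    """Return the names of guardrails breached by the treatment arm."""
    breaches = []
    for name, (direction, threshold) in GUARDRAILS.items():
        value = treatment_metrics.get(name)
        if value is None:
            continue  # metric not reported for this experiment
        if direction == "max" and value > threshold:
            breaches.append(name)
        elif direction == "min" and value < threshold:
            breaches.append(name)
    return breaches

# A variant that lifts conversion but raises errors would still be flagged:
# check_guardrails({"p95_latency_ms": 620, "error_rate": 0.02})
# returns ["error_rate"]
```

Centralizing these checks is what lets a thousand experiments run concurrently without each team re-deciding what "harm" means.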
One well-known experiment: changing the phrasing of urgency cues (e.g., "Only 2 rooms left!"). While some variants increased conversion, others hurt trust or triggered backlash. Booking.com now monitors long-term brand trust alongside short-term lift.
Final Words
Experimentation isn’t just a technical system; it’s a product philosophy. Pinterest used a one-word change to boost retention. Netflix tuned content perception through visuals. Booking.com built a test-first culture that scaled. As PMs, our role is to design for learning at speed. The right A/B test can reveal what no brainstorming session ever could. But to unlock that power, we need good infrastructure, clean metrics, and the discipline to trust the data.
A well-designed experimentation system empowers teams to move faster without breaking things. It de-risks bold decisions and builds a shared language of evidence. PMs should treat experimentation not as a checkbox before launch, but as a continuous feedback mechanism, an embedded practice in product discovery, iteration, and optimization.
In a world where user expectations shift rapidly, the ability to test, learn, and adapt is a core product competency. Every release is a chance to learn. Let's make it count.