Apache Iceberg Drives Netflix’s Data Infrastructure for Movies

Apache Iceberg's table format is ideal for large data lakes and integrates easily with Spark, Flink, Hive, Presto, and more.
Illustration by Nalini Nirad

Iceberg is at the heart of Netflix: without it, the data infrastructure behind both the company and the streaming platform would not function as it does today. Apache Iceberg is a table format that helps store and manage big data in data lakes efficiently.

In an exclusive interaction with AIM, Sreyashi Das, a data engineer at Netflix Studios, revealed that the company uses Iceberg extensively and explained how a complex data ecosystem fuels everything from content recommendations to production planning.  “We use the open-source Iceberg, and we have a customised version of it at Netflix,” she mentioned.

Apache Iceberg was created by Ryan Blue during his time at Netflix. Notably, he later left the company and founded Tabular, which Databricks recently acquired.

At Netflix Studios, Das works on new data products that provide foundational metrics and insights for the Studio and creative production teams. She has also worked on one of Netflix’s data architectures called Data Mesh. 

“Data Mesh is a real-time streaming data movement pipeline for all the studio data,” Das explained. This includes a wide range of information, from movie genre and shooting locations to casting details. 

She explained that when a movie is in the early production phase, several crucial resource allocation decisions are made. These include selecting the actors, booking shoot locations, and planning marketing strategies. Much like ordering a product on Amazon, where customers receive a delivery date, a movie is expected to be released on a specific day. 

“There is continuous monitoring with all the data scientists and algorithm engineers working on this data,” said Das, adding that a lot of this data is available in the data warehouse, as well as unstructured data coming from different data sources. “All this data together gives a holistic view of how our production is performing over time.”

She said that Apache Iceberg’s table format is suitable for large-scale data lakes and offers easy integrations with popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, Presto, and more.

“If someone is looking for large-scale data analysis, they can use Trino since Iceberg is compatible with it. If someone is focused on complex ETL processes, they can use Spark, which also works with Iceberg. Essentially, Iceberg serves as a common tabular format that allows multiple query engines to interact with the same data,” she added. 
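The multi-engine setup Das describes starts with registering an Iceberg catalog with each engine. Below is a minimal, hypothetical PySpark configuration sketch; the catalog name (`demo`), warehouse path, and table names are placeholders, not Netflix's actual setup, and running it requires the Iceberg runtime JAR on the Spark classpath.

```python
# Sketch: registering an Iceberg catalog in Spark so that the same
# tables are queryable from Spark SQL (and, separately, from Trino).
# Catalog name "demo" and the warehouse path are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    # Register an Iceberg catalog backed by a Hadoop warehouse directory
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# The same table is then addressable from Spark SQL for ETL...
spark.sql("SELECT title, genre FROM demo.studio.productions").show()
# ...while Trino would read it for interactive analysis, e.g.:
#   SELECT title, genre FROM iceberg.studio.productions
```

Because both engines resolve the table through Iceberg metadata rather than engine-specific storage, no data is copied between the ETL and analytics stacks.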

She further said that Iceberg eliminates the need to migrate data between systems by acting as a shared table format. It also offers benefits like hidden partitioning, which simplifies data management compared to older systems like Hive.
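Hidden partitioning means the table derives partition values from data columns via declared transforms, so readers filter on the original column and never manage a separate partition column, as Hive required. The sketch below imitates two of Iceberg's transform names, `days()` and `bucket()`; the bucket hash is a simplification (real Iceberg uses Murmur3 per its spec), used here only to show the idea.

```python
# Sketch of Iceberg-style partition transforms: a partition value is
# computed FROM a data column, rather than stored as its own column.
from datetime import datetime, timezone
import zlib

def days_transform(ts: datetime) -> int:
    """Iceberg's days() transform: whole days since the Unix epoch."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (ts - epoch).days

def bucket_transform(value: str, num_buckets: int) -> int:
    """Simplified bucket() transform (Iceberg actually uses Murmur3)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets

# A query filtering on event_time is automatically pruned to the
# matching day partitions -- the user never names the partition column.
row = {"title": "The Crown", "event_time": datetime(2024, 3, 15, tzinfo=timezone.utc)}
print(days_transform(row["event_time"]), bucket_transform(row["title"], 16))
```

The practical win is that queries cannot silently miss partition pruning by filtering on the "wrong" column, a common failure mode with Hive-style explicit partitions.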

She said that Iceberg is useful in machine learning workflows. “When dealing with unstructured data, users can convert it into vector embeddings and store it in a tabular format using Iceberg. This makes Iceberg versatile, with multiple use cases,” she said. 
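The workflow she outlines, unstructured input to embedding vectors to a tabular layout, can be sketched as follows. The hashed bag-of-words "embedding" is a deterministic toy stand-in for a real model, and the list-of-dicts "table" stands in for an Iceberg table with an array-of-floats column; both are assumptions for illustration.

```python
# Toy sketch: convert unstructured text into fixed-width vectors and
# store them row-wise next to structured fields. A real pipeline would
# use an embedding model and write to an Iceberg table.
import zlib

DIM = 8  # embedding width (illustrative)

def embed(text: str) -> list[float]:
    """Hashed bag-of-words: a deterministic placeholder for a real model."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    return vec

# "Table": one row per document, embedding stored alongside metadata.
table = [
    {"doc_id": i, "text": t, "embedding": embed(t)}
    for i, t in enumerate(["script draft one", "casting notes", "script draft two"])
]
print(len(table), len(table[0]["embedding"]))
```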

However, the initial Data Mesh implementation, based on Apache Flink, presented challenges. As Das describes it, “These pipelines are just growing in number, and there is no proper maintenance plan as such.” 

This led to a new initiative, a proof-of-concept project using Spark Streaming. The goal, according to Das, was to create a “compact application where all the transformations are done in one place”.  
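The "all transformations in one place" idea can be illustrated with a minimal stand-in: a single entry point that chains every transform over a stream of events, instead of spreading logic across many independent pipelines. Event shapes and field names below are invented for illustration; the real proof of concept uses Spark Streaming.

```python
# Minimal sketch of a "compact application": one pipeline() function
# owns every transformation, applied in order over an event stream.
from typing import Iterable, Iterator

def parse(events: Iterable[str]) -> Iterator[dict]:
    # Transform 1: split raw "kind:detail" records into structured rows.
    for raw in events:
        kind, _, detail = raw.partition(":")
        yield {"kind": kind, "detail": detail}

def enrich(rows: Iterator[dict]) -> Iterator[dict]:
    # Transform 2: derive flags downstream consumers need.
    for row in rows:
        row["is_casting"] = row["kind"] == "casting"
        yield row

def pipeline(events: Iterable[str]) -> list[dict]:
    # Single place where the whole transformation chain is assembled.
    return list(enrich(parse(events)))

out = pipeline(["casting:lead actor signed", "location:booked stage 4"])
print(out[0]["is_casting"], out[1]["is_casting"])
```

Centralising the chain like this makes maintenance a property of one application rather than of a growing fleet of pipelines.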

Das’s Journey 

Das is responsible for designing and implementing both streaming and batch data movement solutions, as well as developing analytical solutions. Her expertise lies in data warehousing and self-serve business intelligence.

In the past, Das worked on an animation project where she managed the data and budget for a movie involving multiple production houses. “One of my past projects, about a year ago, involved animation. When Netflix collaborates with multiple production partners, each partner provides a different type of progress report,” she said. 

She described how small changes in the storyline, like a character’s hairstyle, can have significant downstream impacts on budget. “When we did not capture all these small, little changes that were happening in the story, we had a huge leap in the cost of producing a movie,” she explained. 

The new framework provides real-time visibility into these changes, allowing production managers to control costs effectively. “Once you give that visibility, you know things are more in control,” Das said. This is a concrete example of how data engineering contributes directly to cost savings and efficient resource allocation.

She further stressed the importance of good data quality. Das shared that Netflix uses a pattern called WAP (Write-Audit-Publish).

“The idea is that you first write the data to a temporary table and then run audits. These audits can be simple sanity checks or complex SQL queries to detect errors. If everything is within the acceptable threshold, you publish the data to the original table,” she said. 

She advises aspiring data engineers to be curious. According to her, Python, SQL, and data warehousing are some of the key skills to master. Das also emphasises the need to understand the business impact of data engineering work. 

“It’s mostly like collaborating, understanding and meeting the business impact rather than writing advanced code,” she concluded.




Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.