Tuesday, May 21, 2024

How Netflix Scales its API with GraphQL Federation (Part 1)

Netflix is known for its loosely coupled and highly scalable microservice architecture. Independent services allow for evolving at different paces and scaling independently. Yet they add complexity for use cases that span multiple services. Rather than exposing hundreds of microservices to UI developers, Netflix offers a unified API aggregation layer at the edge.

UI developers love the simplicity of working with one conceptual API for a large domain. Back-end developers love the decoupling and resilience offered by the API layer. But as our business has scaled, our ability to innovate rapidly has approached an invisible asymptote. As we’ve grown the number of developers and increased our domain complexity, developing the API aggregation layer has become increasingly difficult.

In order to address this rising problem, we’ve developed a federated GraphQL platform to power the API layer. This solves many of the consistency and development velocity challenges with minimal tradeoffs on dimensions like scalability and operability. We’ve successfully deployed this approach for Netflix’s studio ecosystem and are exploring patterns and adaptations that could work in other domains. We’re sharing our story to inspire others and encourage conversations around applicability elsewhere.

Case Study: Studio Edge

Intro to Studio Ecosystem

Netflix is producing original content at an accelerated pace. From the time a TV show or a movie is pitched to when it’s available on Netflix, a lot happens behind the scenes. This includes but is not limited to talent scouting and casting, deal and contract negotiations, production and post-production, visual effects and animations, subtitling and dubbing, and much more. Studio Engineering is building hundreds of applications and tools that power these workflows.

Figure: Netflix Studio Content Lifecycle

Studio API

Looking back a few years, one of the pains in the studio space was the growing complexity of the data and its relationships. The workflows depicted above are inherently connected, but the data and its relationships were disparate and spread across a myriad of microservices. The product teams solved for this with two architectural patterns.

1) Single-use aggregation layers — Due to the loose coupling, we observed that many teams spent considerable effort building duplicative data-fetching code and aggregation layers to support their product needs. This was either done by UI teams via BFF (Backend For Frontend) or by a backend team in a mid-tier service.

2) Materialized views for data from other teams — some teams used a pattern of building a materialized view of another service’s data for their specific system needs. Materialized views had performance benefits, but data consistency lagged by varying degrees. This was not acceptable for the most important workflows in the Studio. Inconsistent data across different Studio applications was the top support issue in Studio Engineering in 2018.

Graph API: To better address the underlying needs, our team started building a curated graph API called “Studio API”. Its goal was to provide a unified abstraction on top of data and relationships. Studio API used GraphQL as its underlying API technology and created significant leverage for accessing core shared data. Consumers of Studio API were able to explore the graph and build new features more quickly. We also observed fewer instances of data inconsistency across different UI applications, as every field in GraphQL resolves to a single piece of data-fetching code.

Figure: Studio API Graph

Figure: Studio API Architecture

Bottlenecks of Studio API

The One Graph exposed by Studio API was a runaway success; product teams loved the reusability and easy, consistent data access. But new bottlenecks emerged as the number of consumers and amount of data in the graph increased.

First, the Studio API team was disconnected from the domain expertise and the product needs, which negatively impacted the schema’s health. Second, connecting new elements from a back-end into the graph API was manual and ran counter to the rapid evolution promised by a microservice architecture. Finally, it was hard for one small team to handle the increasing operational and support burden for the expanding graph.

We knew that there had to be a better way — unified but decoupled, curated but fast moving.

Returning to Core Principles

To address these bottlenecks, we leaned into our rich history of microservices and breaking monoliths apart. We still wanted to keep the unified GraphQL schema of Studio API but decentralize the implementation of the resolvers to their respective domain teams.

As we were brainstorming the new architecture back in early 2019, Apollo released the GraphQL Federation Specification. This promised the benefits of a unified schema with distributed ownership and implementation. We ran a test implementation of the spec with promising results, and reached out to collaborate with Apollo on the future of GraphQL Federation. Our next generation architecture, “Studio Edge”, emerged with federation as a critical element.

GraphQL Federation Primer

The goal of GraphQL Federation is two-fold: provide a unified API for consumers while also giving backend developers flexibility and service isolation. To achieve this, schemas need to be created and annotated to indicate how ownership is distributed. Let’s look at an example with three core entities:

  1. Movie: At Netflix, we make titles (shows, films, shorts etc.). For simplicity, let’s assume each title is a Movie object.
  2. Production: Each Movie is associated with a Studio Production. A Production object tracks everything needed to make a Movie including shooting location, vendors, and more.
  3. Talent: the people working on a Movie are the Talent, including actors, directors, and so on.

These three domains are owned by three separate engineering teams responsible for their own data sources, business logic, and corresponding microservices. In an unfederated implementation, we would have this simple Schema and Resolvers owned and implemented by the Studio API team. The GraphQL Framework would take in queries from clients and orchestrate the calls to the resolvers in a breadth-first traversal.

Figure: Schema & Resolvers for Studio API

To transition to a federated architecture, we need to transfer ownership of these resolvers to their respective domains without sacrificing the unified schema. To achieve this, we need to extend the Movie type across GraphQL service boundaries:

Figure: Federating the Movie type

This ability to extend a Movie type across GraphQL service boundaries makes Movie a Federated Type. Resolving a given field requires delegation by a gateway layer down to the owning domain services.

Studio Edge Architecture

Using the ability to federate a type, we envisioned the following architecture:

Figure: Studio Edge Architecture

Key Architectural Components

Domain Graph Service (DGS) is a standalone spec-compliant GraphQL service. Developers define their own federated GraphQL schema in a DGS. A DGS is owned and operated by a domain team responsible for that subsection of the API. A DGS developer has the freedom to decide if they want to convert their existing microservice to a DGS or spin up a brand new service.

Schema Registry is a stateful component that stores all the schemas and schema changes for every DGS. It exposes CRUD APIs for schemas, which are used by developer tools and CI/CD pipelines. It is responsible for schema validation, both for the individual DGS schemas and for the combined schema. Finally, the registry composes the unified schema and provides it to the gateway.

GraphQL Gateway is primarily responsible for serving GraphQL queries to the consumers. It takes a query from a client, breaks it into smaller sub-queries (a query plan), and executes that plan by proxying calls to the appropriate downstream DGSs.

Implementation Details

There are 3 main business logic components that power GraphQL Federation.

Schema Composition

Composition is the phase that takes all of the federated DGS schemas and aggregates them into a single unified schema. This composed schema is exposed by the Gateway to the consumers of the graph.

Figure: Schema Composition Phases

Whenever a new schema is pushed by a DGS, the Schema Registry validates that:

  1. The new schema is a valid GraphQL schema
  2. The new schema composes cleanly with the other DGSs' schemas to create a valid composed schema
  3. The new schema is backwards compatible

If all of the above conditions are met, then the schema is checked into the Schema Registry.
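The composition and compatibility checks above can be illustrated with a small sketch. It is not the Schema Registry's actual implementation: schemas are modelled as plain dictionaries rather than parsed GraphQL, the names `compose`, `is_backwards_compatible`, and `can_check_in` are hypothetical, and check 1 (GraphQL validity) is assumed to be handled by a real GraphQL parser. For simplicity, each field may be defined by only one DGS, so key fields that an extending DGS would normally redeclare are left out.

```python
from typing import Dict, Tuple

# A DGS schema is modelled as {type_name: {field_name: field_type}}.
Schema = Dict[str, Dict[str, str]]


class CompositionError(Exception):
    """Raised when DGS schemas cannot be merged into one graph."""


def compose(dgs_schemas: Dict[str, Schema]) -> Schema:
    """Merge every DGS schema into one unified schema.

    Several DGSs may contribute fields to the same type (a federated type),
    but in this simplified model a field may be defined by only one DGS.
    """
    unified: Schema = {}
    owners: Dict[Tuple[str, str], str] = {}
    for dgs_name, schema in dgs_schemas.items():
        for type_name, fields in schema.items():
            merged = unified.setdefault(type_name, {})
            for field_name, field_type in fields.items():
                key = (type_name, field_name)
                if key in owners:
                    raise CompositionError(
                        f"{type_name}.{field_name} is defined by both "
                        f"{owners[key]} and {dgs_name}"
                    )
                owners[key] = dgs_name
                merged[field_name] = field_type
    return unified


def is_backwards_compatible(old: Schema, new: Schema) -> bool:
    """True if every type and field in `old` still exists, unchanged, in `new`."""
    for type_name, old_fields in old.items():
        new_fields = new.get(type_name)
        if new_fields is None:
            return False
        for field_name, field_type in old_fields.items():
            if new_fields.get(field_name) != field_type:
                return False
    return True


def can_check_in(dgs_name: str, new_schema: Schema,
                 registry: Dict[str, Schema]) -> bool:
    """Apply checks 2 and 3 to a schema pushed by `dgs_name`."""
    candidate = dict(registry)
    candidate[dgs_name] = new_schema
    try:
        compose(candidate)
    except CompositionError:
        return False
    return is_backwards_compatible(registry.get(dgs_name, {}), new_schema)
```

In this model, a Talent DGS that adds an `actors` field to `Movie` would be accepted, while a push that redefines `Movie.title` or drops an existing field would be rejected.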

Query Planning and Execution

The federation config consists of all the individual DGS schemas and the composed schema. The Gateway uses the federation config and the client query to generate a query plan. The query plan breaks down the client query into smaller sub-queries that are then sent to the downstream DGSs for execution, along with an execution ordering that includes what needs to be done in sequence versus run in parallel.

Figure: Query Plan Inputs

Let’s build a simple query from the schema referenced above and see what the query plan might look like.

Figure: Simplified Query Plan

For this query, the gateway knows which fields are owned by which DGS based on the federation config. Using that information, it breaks the client query into three separate queries to three DGSs. The first query is sent to Movie DGS since the root field movies is owned by that DGS. This results in retrieving the movieId and title fields for the first 10 movies in the dataset. Then using the movieIds it got from the previous request, the gateway executes two parallel requests to Production DGS and Talent DGS to fetch the production and actors fields for those 10 movies. Upon completion, the sub-query responses are merged together and the combined data response is returned to the caller.
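The flow just described (one root fetch, then two parallel fetches joined on movieId) can be sketched as follows. This is an illustration only, not the gateway's real planner: `FIELD_OWNERS`, `build_plan`, `execute`, and the three stand-in DGS functions with their sample data are all hypothetical, and real sub-queries would be GraphQL requests over the network rather than local function calls.

```python
from concurrent.futures import ThreadPoolExecutor

# Which DGS owns which Movie field (what the federation config tells the gateway).
FIELD_OWNERS = {
    "movieId": "movie",
    "title": "movie",
    "production": "production",
    "actors": "talent",
}


def build_plan(requested_fields, root_dgs="movie", key="movieId"):
    """Group the requested fields by owning DGS.

    The root DGS is queried first; every other DGS is queried afterwards,
    in parallel, using the key returned by the root query.
    """
    fields_by_dgs = {}
    for field in requested_fields:
        fields_by_dgs.setdefault(FIELD_OWNERS[field], []).append(field)
    root_fields = fields_by_dgs.pop(root_dgs, [])
    if key not in root_fields:
        root_fields.insert(0, key)  # the key is needed to join the results
    return {"root": (root_dgs, root_fields),
            "parallel": sorted(fields_by_dgs.items())}


def execute(plan, clients, first):
    """Run the root sub-query, fan out in parallel, then merge by position."""
    root_dgs, root_fields = plan["root"]
    movies = clients[root_dgs](root_fields, first)
    movie_ids = [movie["movieId"] for movie in movies]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(clients[dgs], fields, movie_ids)
                   for dgs, fields in plan["parallel"]]
        for future in futures:
            for movie, extra in zip(movies, future.result()):
                movie.update(extra)
    return movies


# Stand-ins for the three DGSs, returning made-up data.
MOVIES = [{"movieId": str(i), "title": f"Movie {i}"} for i in range(1, 4)]


def movie_dgs(fields, first):
    return [{f: m[f] for f in fields} for m in MOVIES[:first]]


def production_dgs(fields, movie_ids):
    return [{"production": {"location": f"Stage {mid}"}} for mid in movie_ids]


def talent_dgs(fields, movie_ids):
    return [{"actors": [f"Actor for {mid}"]} for mid in movie_ids]
```

Calling `execute(build_plan(["movieId", "title", "production", "actors"]), clients, first=10)` with the three stand-ins registered under `"movie"`, `"production"`, and `"talent"` follows the same three-request shape as the plan in the text.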

A note on performance: Query Planning and Execution adds a ~10ms overhead in the worst case. This includes the compute for building the query plan, as well as the deserialization of DGS responses and the serialization of the merged gateway response.

Entity Resolver

Now you might be wondering, how do the parallel sub-queries to Production and Talent DGS actually work? That’s not something that the DGS supports. This is the final piece of the puzzle.

Let’s go back to our federated type Movie. In order for the gateway to join Movie seamlessly across DGSs, all the DGSs that define and extend the Movie need to agree on one or more fields that define the primary key (e.g. movieId). To make this work, Apollo introduced the @key directive in the Federation Spec. Second, DGSs have to implement a resolver for a generic Query field, _entities. The _entities query returns a union type of all the federated types in that DGS. The gateway uses the _entities query to look up Movie by movieId.

Let’s take a look at what the query plan actually looks like:

Figure: Detailed Federated Query Plan

The representation object consists of the movieId and is generated from the response of the first request to Movie DGS. Since we requested the first 10 movies, we would have 10 representation objects to send to Production and Talent DGS.

This is similar to Relay’s Object Identification with a few differences. _Entity is a union type, while Relay’s Node is an interface. Also, with @key, there is support for variable key names and types as well as composite keys while in Relay, the id is a single opaque ID field.
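To make the representation and _entities mechanics more concrete, here is a minimal sketch of how a Production DGS might resolve them. It assumes each representation carries a `__typename` plus the key field, as in the Federation spec; the registry `ENTITY_FETCHERS`, the `entity` decorator, and the sample `PRODUCTIONS` data are hypothetical and do not come from any framework.

```python
# Hypothetical data owned by the Production DGS, keyed by movieId.
PRODUCTIONS = {
    "1": {"location": "Stage 1"},
    "2": {"location": "Stage 2"},
}

# Maps a federated type name to the function that loads it by key.
ENTITY_FETCHERS = {}


def entity(typename):
    """Register a fetcher for one federated type in this DGS."""
    def register(fetcher):
        ENTITY_FETCHERS[typename] = fetcher
        return fetcher
    return register


def build_representations(movies):
    """Gateway side: turn the Movie DGS response into representations."""
    return [{"__typename": "Movie", "movieId": m["movieId"]} for m in movies]


@entity("Movie")
def fetch_movie(representation):
    """DGS side: resolve this DGS's Movie fields from the primary key."""
    movie_id = representation["movieId"]
    return {
        "__typename": "Movie",
        "movieId": movie_id,
        "production": PRODUCTIONS.get(movie_id),
    }


def resolve_entities(representations):
    """Generic _entities resolver: dispatch each representation by type."""
    return [ENTITY_FETCHERS[rep["__typename"]](rep) for rep in representations]
```

Because dispatch happens on `__typename`, the same resolver can serve every federated type a DGS contributes to, which is why _entities returns a union type.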

Combined together, these are the ingredients that power the core of a federated API architecture.

The Journey, Summarized

Our Studio Ecosystem architecture has evolved in distinct phases, all motivated by reducing the time between idea and implementation, improving the developer experience, and streamlining operations. The architectural phases look like:

Figure: Evolution of an API Architecture

Stay Tuned

Over the past year we’ve implemented the federated API architecture components in our Studio Edge. Getting here required rapid iteration, lots of cross-functional collaborations, a few pivots, and ongoing investment. We’re live with 70 DGSes and hundreds of developers contributing to and using the Studio Edge architecture. In our next Netflix Tech Blog post, we’ll share what we learned along the way, including the cross-cutting concerns necessary to build a holistic solution.

We want to thank the entire GraphQL open-source community for all the generous contributions and paving the path towards the promise of GraphQL. If you’d like to be a part of solving complex and interesting problems like this at Netflix scale, check out our jobs page or reach out to us directly.

Tuesday, May 14, 2024

Securing and accessing APIs with Azure Active Directory (No secrets!)

Introduction

It’s always a best practice to secure APIs (or any other resource) once they are deployed to the cloud.

Here, let’s build a set of APIs which will be deployed to Azure, and make them securely accessible to each other.

Scenario which we’ll be designing

API design

A popular API design pattern is Backend for Frontend (BFF). The pattern defines a service (the BFF) which communicates with one or more services to provide a response, rather than having clients call multiple services. This benefits the client, which would otherwise need to maintain separate settings for each service and know which endpoints to call and how. There is also a security benefit: what if one or more of these APIs should not be directly accessible to clients at all? The BFF abstracts all these services and provides a single point of entry to the client.
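As a rough illustration of the pattern, the sketch below shows a BFF-style function that calls two backing services and returns one combined response. Everything in it is hypothetical: the function name `get_order_view`, the `productIds` field on an order, and the injected `fetch_order` and `fetch_product` callables stand in for real HTTP calls to the internal APIs.

```python
def get_order_view(order_id, fetch_order, fetch_product):
    """Combine data from two internal services into one client response.

    `fetch_order` and `fetch_product` are placeholders for calls to the
    internal APIs; the client only ever talks to this function's endpoint.
    """
    order = fetch_order(order_id)
    products = [fetch_product(pid) for pid in order["productIds"]]
    return {"orderId": order["id"], "products": products}
```

The client needs to know only the BFF endpoint; how many internal calls sit behind it can change without the client noticing.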

In this scenario, we do not want to expose our internal APIs (Orders API and Products API) directly; they should be accessible only through the BFF.

Apps and roles in Azure Active Directory (AAD)

Since we have three different APIs (BFF, Orders and Products), we need to create three different app registrations in AAD. After the registrations are done, we can define roles individually for each app.

  • Let’s first register an app to represent the BFF

In our case we only need one role to be defined in the BFF

  • Search for a specific order by id

There are two ways in which you can create roles: using the App roles UI (in preview at the time of writing) or using the app manifest editor. Let’s use the App roles UI to define the roles. You can read more on this here.

Note that we want the BFF to be accessible by both Users and Services

Creating roles for BFF.

Then set accessTokenAcceptedVersion to 2, because we would like to use v2 tokens for more fine-grained security and control.

Once done, you’ll be able to see the roles for the BFF. Notice that the Allowed member types has been set to both Users/Groups and Applications.

Setting application IDs

Let’s try to avoid creating client IDs and secrets in the first place. Client IDs and secrets are an extra burden to maintain, so let’s avoid them whenever we can.

As the first step towards that, we need to set up an Application ID URI for each of the apps we created. You can do that as shown below.

Make sure you provide a meaningful name for each Application ID URI.

Getting a token to access the BFF securely

Now let’s try to get a token to access the BFF securely. Here I am using the Azure CLI:

az account get-access-token --resource api://app.secure.sales.bff.api

But once executed, we get the error below.

So what’s wrong here? We are logged in to a client application (in this case the Azure CLI) and are using it to get a token for the BFF. Remember that the BFF is supposed to be accessible to users. The problem is that if an app needs to be accessible to users, it must expose one or more scopes, and the client application must be authorized for a scope.

As you can see from the above error message, the application (Azure CLI) with ID "04b07795-8ddb-461a-bbee-02f9e1bf7b46" does not have rights to get a token. So let’s create a default scope called access.bff and assign it to the Azure CLI application.

Adding a scope to the BFF.

Then add the client application as shown below,

Adding a client application to access the BFF.

Now let’s try to get a token for the user again, using the same Azure CLI command as before.

This time it’s a success! But let’s see what is inside the token. Browse to jwt.ms or jwt.io and paste the token to see what’s inside.

The token obtained through AZ CLI.

As you can see, the token contains the correct audience for the BFF and the scope. But there are no roles in it, because we haven’t assigned any to the user yet.
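If you would rather not paste a token into a website, the payload can also be inspected locally. The sketch below only base64url-decodes the middle segment of the JWT; it does not verify the signature, so it is suitable for inspection only, never for authorization decisions. The function name `decode_jwt_payload` is hypothetical.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Return the claims of a JWT without verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # JWT segments drop base64 padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
```

For a token obtained as above, the interesting claims are `aud` (the audience), `scp` (the scope granted to the client application), and `roles` (absent until a role is assigned).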

User groups and users

Rather than assigning roles to individual users, it makes sense to create a user group, add the users to the group, and then assign the required permissions to the group. Depending on your Azure subscription, you might not have rights to create user groups (sadly like me 😞), but in an enterprise setup you will most likely be able to do so.

That said, I am going to assign the role directly to my user to access the BFF. Go to Enterprise applications in AAD and add the user as shown below.

Adding a user with role.

Once done, you’ll see that the user now has the required role to access the BFF.

Now let’s get an access token for the user and inspect what’s inside it.

As you can see, the token now contains the user’s role as well as the scope from the client application (Azure CLI) that was used to get the token for the BFF.

Building and securing the BFF API

  • Create an ASP.NET Core Web API project and add a controller called SalesController as shown below

As you can see, the action method that gets orders has been secured with the Authorize attribute, restricted to the role sales.bff.orders.search. This means the action method can be called only with a token that carries the sales.bff.orders.search role.
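Conceptually, a role-restricted action method behaves like the sketch below: the request is rejected unless the validated token's `roles` claim contains the required role. This is a plain Python illustration of the idea, not how ASP.NET Core implements the Authorize attribute; `require_role`, `Forbidden`, and `search_order` are hypothetical names.

```python
from functools import wraps


class Forbidden(Exception):
    """Raised when the caller's token lacks the required role."""


def require_role(role):
    """Allow the wrapped handler to run only if `role` is in the token's roles."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(claims, *args, **kwargs):
            if role not in claims.get("roles", []):
                raise Forbidden(f"missing role: {role}")
            return handler(claims, *args, **kwargs)
        return wrapper
    return decorator


@require_role("sales.bff.orders.search")
def search_order(claims, order_id):
    # Placeholder body; a real handler would look the order up.
    return {"orderId": order_id}
```

A token that carries only a scope and no roles, like the first one obtained through the Azure CLI, would be rejected by this check.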

But this is authorization. How can we make sure the caller is authenticated and the token sent is valid? For that we use the latest MSAL library from Microsoft.

Install the NuGet package Microsoft.Identity.Web and then add the below code in Startup.cs.

You will need to set up the configuration inside the appsettings.json file as shown below. You can read more about this in the Microsoft documentation here.

Here the Audience and the ClientId are both set to the ClientId of the BFF app registration as shown below, because we want to validate that the token was issued for the BFF and not for something else.
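The checks driven by that configuration can be pictured with the sketch below. It covers only three claim checks (audience, issuer, expiry) and deliberately leaves out signature verification, which a real library performs first. It assumes v2 tokens, where the issuer has the form `https://login.microsoftonline.com/{tenant}/v2.0` and the audience is the API's client ID; `validate_claims` and the sample values are hypothetical.

```python
import time


def validate_claims(claims, client_id, tenant_id, now=None):
    """Return a list of problems found in already-decoded token claims.

    An empty list means the audience, issuer, and expiry checks passed.
    Signature verification is out of scope for this sketch.
    """
    now = time.time() if now is None else now
    expected_issuer = f"https://login.microsoftonline.com/{tenant_id}/v2.0"
    problems = []
    if claims.get("aud") != client_id:
        problems.append("audience mismatch")
    if claims.get("iss") != expected_issuer:
        problems.append("issuer mismatch")
    if claims.get("exp", 0) <= now:
        problems.append("token expired")
    return problems
```

A token issued for the Orders API would fail the audience check here, which is the point of configuring the BFF's own ClientId.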

In the Configure method inside the Startup.cs register the middleware to authenticate.

Now get a token and access the web API. As you can see you’ll be able to securely access the BFF endpoint now.

Deploy BFF and verify it can be accessed securely

It would be great to create separate deployment pipelines for these APIs. But to keep things focused on the security aspects, let’s deploy to Azure through Rider or Visual Studio.

Once deployed check if you can access the BFF API securely.

So far we have created an AAD app which represents the BFF and its roles, deployed the BFF API to Azure, and obtained an access token for a user to access the BFF securely.

Azure Active Directory setup for Orders and Products

As explained earlier, the BFF needs to access both the Orders and Products APIs securely to provide a response to the callers.

First let’s create the respective roles for the Orders and Products APIs.

Orders setup

  • Create the role
Creating roles for orders app.

Products Setup

  • Create the role

Make sure that you have set accessTokenAcceptedVersion to 2 in the manifest of both the Orders and Products app registrations, and select Applications for the Allowed member types of each role.

Let’s build and deploy Orders and Products APIs

Now, as we did for the BFF, let’s build and deploy the Orders and Products APIs.

The action methods in both Orders and in Products API are very simple.

  • Orders API

The action method is secured through the role orders.search, which we defined during the app registration. Below is the configuration you’ll need to set up to validate the token sent to it.

Configuration for orders API to validate token

  • Products API

The action method is secured through the role products.search . Also the configuration to validate the token is as shown below,

Once deployed you will not be able to access the Orders or the Products APIs since they are secured and are accessible only through applications.

Enabling BFF to access Orders and Products API through managed identity

In the recent past we had to create client id and client secret to access Azure services securely. But with the introduction of the awesome managed identity that is no longer the case.

Let’s assign BFF the required roles to access the orders and products APIs.

Browse to the BFF API in your Azure resource group and set the managed identity as shown below. Note that I have created a system-assigned managed identity here. Feel free to create a user-assigned identity if you prefer; the concepts are the same. This can be easily automated through ARM templates in your deployment pipeline.

This creates an Enterprise application in your AAD. You will not see any permissions assigned there yet.

In enterprise applications.

Do the same for the Orders and Products APIs.

Assigning the role to BFF to access the Orders API securely

Let’s use PowerShell to do this easily.

  • Connect to Azure
Connect-AzureAD -TenantId [your tenant id]
  • Get the client object id
$clientObjectId = "b3aa2988-a695-45f1-ad4b-6da6e3dec4b4";
  • Now we’ll need the object id in the Enterprise applications
$enterpriseObjectId = "ee942677-0ed1-4971-b934-9571c2bd7ee2";
  • The role which we are going to assign,
$appRoleName = "orders.search";
  • Get the client using the object id
# Get the client api
$client = Get-AzureADServicePrincipal -ObjectId $([GUID]$clientObjectId);
  • Get the target application
# Get the resource
$resource = Get-AzureADServicePrincipal -ObjectId $([GUID]$enterpriseObjectId)
  • Get the app role information,
# Get the app role information
$appRole = $resource.AppRoles | Where-Object { $_.DisplayName -eq $appRoleName }
  • Now assign the app role to the BFF API to access the Orders API
# Assign the approle
New-AzureADServiceAppRoleAssignment -ObjectId $client.ObjectId -Id $appRole.Id -PrincipalId $client.ObjectId -ResourceId $resource.ObjectId
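The same steps can be expressed independently of PowerShell. The sketch below works on plain dictionaries shaped like Microsoft Graph service principal objects (`id`, and `appRoles` entries with `id`, `value`, `displayName`) and produces the three fields an app role assignment needs: `principalId`, `resourceId`, and `appRoleId`. The function names are hypothetical, and the sketch builds the assignment data only; it does not call any Azure API.

```python
def find_app_role(resource_sp, role_name):
    """Find an app role on the target (resource) service principal.

    Matches on either the role's value or its display name, since the two
    are often, but not necessarily, the same string.
    """
    for role in resource_sp.get("appRoles", []):
        if role_name in (role.get("value"), role.get("displayName")):
            return role
    raise LookupError(f"app role not found: {role_name}")


def build_app_role_assignment(client_sp, resource_sp, role_name):
    """Describe 'client may call resource with this role' as plain data."""
    role = find_app_role(resource_sp, role_name)
    return {
        "principalId": client_sp["id"],    # the BFF's managed identity
        "resourceId": resource_sp["id"],   # the Orders API enterprise app
        "appRoleId": role["id"],
    }
```

Note that the PowerShell above filters on DisplayName; if the role's display name differs from its value, the lookup needs to use whichever one actually holds `orders.search`.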

Once done, you’ll be able to see that the BFF has now been assigned the orders.search role to access the Orders API.

Now let’s see if we can access the Orders API from the BFF API. Let’s do some coding!

Accessing the Orders API securely from BFF API

  • Install the Microsoft.Extensions.Http package. Let’s create a typed HTTP client to access the Orders API securely.
  • Create a configuration entry to access the orders api
  • Create a configuration class to map these
  • Create a typed HTTP client class

Notice that inside the GetAccessTokenAsync method we are using ManagedIdentityCredential. You could use the DefaultAzureCredential class as well, which tries to get the access token from several different sources. But here we don’t want that, because we specifically want to authenticate through the managed identity.
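For readers who think in Python rather than C#, the same idea can be sketched as follows. The helper accepts any credential object exposing `get_token(scope)` that returns an object with a `token` attribute, which is the shape used by the `azure-identity` package. The Orders Application ID URI in the usage note is an assumption (the article only shows the BFF's), and `bearer_headers` and `managed_identity_headers` are hypothetical names.

```python
def bearer_headers(credential, app_id_uri):
    """Build the Authorization header for calling a protected API.

    `credential` is anything with get_token(scope) returning an object that
    has a `.token` attribute; the scope is the target API's Application ID
    URI followed by "/.default".
    """
    access_token = credential.get_token(f"{app_id_uri}/.default")
    return {"Authorization": f"Bearer {access_token.token}"}


def managed_identity_headers(app_id_uri):
    """Same as above, using the app's managed identity.

    Requires the azure-identity package and only works when running on an
    Azure resource that has a managed identity enabled.
    """
    from azure.identity import ManagedIdentityCredential
    return bearer_headers(ManagedIdentityCredential(), app_id_uri)
```

Under these assumptions, the BFF would call something like `managed_identity_headers("api://app.secure.sales.orders.api")` and attach the result to its request to the Orders API. No client ID or secret appears anywhere in the code or configuration.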

  • Inject the service to the controller and modify the action method.

To show what the token looks like when obtained through the managed identity, I am returning the token as part of the response as well. (This is purely for demonstration purposes! 😆)

  • Deploy the BFF and access the endpoint using an access token,

If you inspect the token, you’ll see that it has the value orders.search under roles.

Accessing the Products API securely from BFF API

Follow the same steps to assign the role products.search to the BFF, and make the same kind of code changes to access the Products API as we did for the Orders API.

To demonstrate the different tokens obtained, I changed the action method in the BFF to return both tokens along with the data from the two APIs.

Once deployed, you’ll be able to access both the Orders API and the Products API securely from the BFF API.

After analyzing the different tokens obtained by the BFF to access the Orders API and the Products API you’ll be able to see that they have the respective roles assigned in the tokens.
