Hands Free Kafka Streams Operations with Responsive Control Plane

How The Responsive Control Plane Simplifies Kafka Streams Operations

Background

Responsive is the platform for stateful reactive backend applications, starting with those built on Kafka Streams. We’ve previously discussed what a reactive backend is, why we're focusing on them, and why we think Kafka Streams is so well suited to building them. In our last post, I wrote about how to size, tune, and monitor Kafka Streams. If you read that entry, you know how daunting this can actually be. That's why we took it on ourselves to build Responsive Control Plane. In the rest of this post, I’ll dig into how Responsive Control Plane automatically manages Kafka Streams so you don't have to.

Managing Reactive Applications

So what does it mean for Responsive to "manage" an application? Essentially, we're automatically solving the sizing, tuning, and monitoring problems that we've previously discussed:

  • Sizing: You need to figure out what resources your application requires to keep up with peak loads. This typically requires understanding Kafka Streams internals, and many time-consuming iterations.
  • Tuning: You need to make sure that your application is configured so that it's stable and uses resources efficiently. Again, this requires a deep understanding of Kafka Streams and many iterations.
  • Monitoring: Your application that’s perfectly tuned and sized for your load today may be under-provisioned and misconfigured for the load it has to handle tomorrow. This means you have to collect and monitor all your application’s telemetry.
  • Reacting: Finally, when things go wrong you need to let someone know (often at inconvenient hours) so they can correct it.

Typically, an SRE or DevOps team would have to solve all this themselves. No longer. Responsive's Control Plane takes care of all of this for you, and frees you up to focus on your infra or product. Furthermore, Responsive does this:

  • Automatically: Responsive's Control Plane collects the relevant metrics, detects problems, and takes action by tuning, scaling, and rebalancing load. You no longer have to spend long cycles sizing and tuning - just “fire-and-forget”.
  • Quickly: Because it's automatic, Responsive reacts instantly when it knows a corrective action needs to be taken, which saves your applications expensive downtime.
  • On-Demand: You only need to run whatever capacity is needed to handle your workload at any given time. Responsive scales you up when load increases - reducing your infra costs by provisioning for the current rather than the max anticipated load.

Why now?

Before digging into the details of Responsive Control Plane, I want to first talk about why it’s the right time to focus on this problem. In short, many if not most applications are running in the public cloud, which has also become more mature. This gives us some foundational building blocks for building truly elastic reactive backends:

First, Responsive's architecture decouples your state from your compute which means you don't need to tune your store and can scale your compute at minimal cost. Our founder Almog has written previously about all the challenges you'll face operating stateful Kafka Streams apps - long restores after adding compute, running out of memory, and the like. Our last post also discusses at length how to tune RocksDB to ensure stability and optimize performance. These issues don't even apply with Responsive because we store state separately in a distributed store that doesn't rely on Kafka Streams changelogs.

Second, in the modern cloud, compute itself is also truly on-demand. These days VMs boot in seconds and are billed in per-second increments. Further, many Kafka Streams applications run on Kubernetes, which lets you keep resources around that you can run new containers on instantaneously, while amortizing the cost over many services.

Finally, cloud providers have made it really easy for vendors to support BYOC so your application can run in your environment while Responsive manages it. Responsive can scale your application automatically by granting it fine-grained access to the Kubernetes API or the provider’s API. It’s also easy to support connecting to Responsive securely and cheaply using PrivateLink (and equivalent capabilities in other clouds). We believe this is critical for reactive backends because they should run in your cloud where they can easily access your organization's services.

A Control Plane for Reactive Backends

This brings us to Responsive Control Plane. We've built it around a few key ideas about what a control plane for reactive backends should look like:

Specialized: Responsive Control Plane is purpose-built for managing Kafka Streams and fully takes advantage of its event-driven architecture. Our last post detailed how you need to carefully consider many different signals specific to Kafka Streams to navigate sizing and tuning. As we’ll see, Responsive Control Plane collects these signals to execute specialized control tailored to Kafka Streams.

For example, it observes processing lag and topic append rates and quickly scales your application up to drain a long backlog or intense load spike. Contrast this with the traditional approaches to this problem - a generic autoscaler like the k8s autoscaler or a human operator. A generic autoscaler can at best keep some metric like CPU utilization in range, but can't make complex decisions like over-provisioning to drain a backlog. A human operator would have to do a lot of work by hand to do the same.

Control specialized for managing Reactive Applications on Kafka Streams will always be able to make better decisions faster.

Similarly, there are benefits to using a company like Responsive focused on this specific problem and solving it across multiple different users. We learn and fine-tune our approach across many deployments, improving the autoscaling algorithms much faster than an individual team can.

Comprehensive: Responsive Control Plane’s goal is to keep your Kafka Streams application(s) healthy. It does this by monitoring your applications and automatically applying remediations when it detects problems. A remediation could be to scale your application up or down. But it could also be to change the task assignment to correct an imbalance, tune configuration, or even alert you to possible problems that it’s not capable of addressing automatically. The point being that Responsive Controller is more than “just” an autoscaler - it takes a comprehensive approach to monitoring your application and keeping it healthy.

Driven By SLO-Based Policy: We keep using the word healthy, but what does that really mean? Responsive Control Plane lets you define health using metrics we’ve developed that capture health as you, the user, perceive it. For example, you can set bounds on the expected latency between a record arriving at an application and being processed to constrain how "real-time" your reactive backend is. You plug these SLOs into a user-defined Policy. The Policy defines the north star that Responsive Control Plane steers your application towards using algorithms we’ve fine-tuned over years of experience operating Kafka Streams.

Responsive Control Plane

Finally, let's dig into the details of the Responsive Control Plane, which is available today to manage your Kafka Streams applications running on Kubernetes.

Architecture

What we’ve been referring to so far as the Responsive Control Plane is really made up of a few components:

Application Plugins: These lightweight plugins run embedded within your application JVM and perform work on behalf of Responsive. Today, the plugins collect metrics and send them to the Controller running in Responsive Cloud using the OpenTelemetry Java Agent. The Controller runs a server that implements the OpenTelemetry metrics protocol to receive metrics. In the future, we may add additional plugins to support more advanced functionality. For example, when we add support for task balancing, we will provide a plugin that can be passed to the Kafka Consumer to assign tasks.

Operator: Responsive provides a Kubernetes Operator that manages a Kubernetes Custom Resource (CR) that specifies a Responsive Policy. The Responsive Policy references an application’s Kubernetes Deployment or StatefulSet, defines some constraints (e.g. a max number of replicas to scale to) and specifies Diagnosers, each with a set of goals that define a healthy application. The Operator monitors the set of Responsive Policies and the applications they point to, syncs this information to the Controller, and listens for instruction from the Controller on actions to take to keep your application healthy.

Controller: Finally, the Controller runs in Responsive Cloud and acts as the “brains” of the operation. It serves the OpenTelemetry Metrics API to accept metrics from applications running in your environment, and serves its own API to accept updates about changes to Responsive Policy and Application Deployments/StatefulSets. The Controller’s API also exposes an endpoint called by the Operator to receive updates about any actions that should be executed against the cluster.

Observe that all communication to Responsive Cloud is initiated by the components running in your environment. You never have to open up your network to allow Responsive to connect in. Today, Responsive’s Cloud services are all served over the public internet. But we plan to support private networking via PrivateLink (and similar offerings in GCP/Azure).

Internally, the Controller continuously monitors the health of your applications by combining the received metrics and state updates and evaluating them against the policy you’ve configured in your CR. When it detects that the policy is being violated, it initiates a corrective action to bring the application back in line with the policy. At present, corrective actions are limited to scaling the number of replicas. We plan to extend this to include changing task assignment, tuning performance-related configuration, and alerting you when Responsive detects a problem that’s likely caused by something else in the environment - like hitting a Kafka quota or repeated errors from another service.

Responsive Policy

Let’s take a closer look at the Responsive Policy and the inner workings of the available Diagnosers. Responsive’s documentation details the full Responsive Policy CRD. We’ll walk through the most important concepts here. As we go along, we’ll refer back to the following example policy:

apiVersion: "application.responsive.dev/v1"
kind: "ResponsivePolicy"
metadata:
  name: example-policy
  namespace: responsive
spec:
  applicationNamespace: responsive
  applicationName: example
  status: POLICY_STATUS_MANAGED
  policyType: KAFKA_STREAMS
  kafkaStreamsPolicy:
    maxReplicas: 10
    minReplicas: 3
    diagnosers:
      - type: EXPECTED_LATENCY
        expectedLatency:
          maxExpectedLatencySeconds: 60
          scaleUpStrategy:
            type: RATE_BASED
      - type: THREAD_SATURATION
        threadSaturation:
          threshold: .65

Each Responsive Policy instance defines the policy for a given application type. Today, the only supported application type is Kafka Streams, so your Responsive Policy always specifies a Kafka Streams Policy for managing a specific Kafka Streams application. The Kafka Streams Policy specifies Constraints and enumerates a list of specifications for pluggable Diagnosers.

Constraints

Constraints define limits on what the Controller can execute. Looking at our example policy, it’s constraining the Controller to scale the application between 3 and 10 replicas. Over time our vision is to extend this set of constraints to let you set even more meaningful limits on the Controller’s actions. For example, you should be able to set cost based constraints to ensure that the Controller keeps your application’s resource costs under some threshold.

Diagnosers

Diagnosers are the meat of the policy specification. Diagnosers are modules that run in the Controller and try to meet an SLO-based health goal. The Controller runs Diagnosers to automatically detect when your application is violating or will likely violate a goal, and to take corrective action. A Diagnoser’s specification governs its behavior: it defines the goal and includes configuration that tunes how the Diagnoser ensures your application meets that goal.

Diagnosers are pluggable. This means that you can decide which goals make sense for your application, and you only specify the Diagnosers that track those goals in your policy.

Let’s look at two of the more interesting Diagnosers in more detail.

Expected Latency

The Expected Latency Diagnoser lets you set a goal for expected latency. Expected latency is the time it would take a newly arriving record to be processed once it’s enqueued at a source topic of a sub-topology. The Diagnoser computes the expected latency from measurements of the committed offset and end offset of each source topic partition. From the raw measurements, the Diagnoser computes the current lag and commit rate, and combines them to estimate how long a newly arriving record will take to be processed.
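
The estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not Responsive's implementation: the `OffsetSample` type, the function name, and the simple lag-divided-by-commit-rate formula are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class OffsetSample:
    """One measurement of a source topic partition (hypothetical structure)."""
    timestamp_s: float      # when the sample was taken, in seconds
    committed_offset: int   # last offset committed by the application
    end_offset: int         # latest offset appended to the partition


def expected_latency_seconds(prev: OffsetSample, curr: OffsetSample) -> float:
    """Estimate how long a newly arriving record waits before it is processed."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    # Lag: records appended but not yet committed.
    lag = curr.end_offset - curr.committed_offset
    # Commit rate: records committed per second between the two samples.
    commit_rate = (curr.committed_offset - prev.committed_offset) / elapsed
    if lag <= 0:
        return 0.0
    if commit_rate <= 0:
        return float("inf")  # nothing is being processed, so the wait is unbounded
    return lag / commit_rate
```

For instance, if the committed offset advanced from 1,000 to 2,000 over 10 seconds (100 records per second) and the current lag is 3,000 records, this sketch estimates 30 seconds.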

Normalize Lag

In our experience, most users want to set some kind of bound on consumer lag. However, lag is detached from the actual consequence that you don’t want - the delay that lag causes. For example, a consumer lag of 1000 could be noise for one application that processes each record in microseconds, but it could mean unacceptably late results for an application that takes hundreds of milliseconds to process each record. You can think of expected latency as setting a threshold on consumer lag, but expressed in a unit (time) that is actually meaningful to you.
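
To make the comparison concrete, the arithmetic below converts the same lag into time for two hypothetical applications. The per-record processing times (50 microseconds and 200 milliseconds) are illustrative values, and the calculation assumes records are processed one at a time.

```python
def delay_seconds(lag_records: int, per_record_micros: int) -> float:
    """Time to work through `lag_records` when each record takes `per_record_micros`."""
    return lag_records * per_record_micros / 1_000_000


LAG = 1000  # the same consumer lag for both applications

fast_app = delay_seconds(LAG, per_record_micros=50)        # 50 microseconds per record
slow_app = delay_seconds(LAG, per_record_micros=200_000)   # 200 milliseconds per record

print(f"fast app: {fast_app:.2f} s, slow app: {slow_app:.0f} s")
```

Under these assumptions, a lag of 1000 records is about 0.05 seconds of delay for the first application and 200 seconds for the second.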

When the expected latency violates the configured threshold, the Diagnoser initiates a scale up. By default, it scales based on the relationship between the append rate, commit rate, and the observed lag. If your application is just over the threshold, the Diagnoser scales it so that its processing rate slightly exceeds the current append rate. On the other hand, if there is a long built-up backlog (for example, if your application is bootstrapping or restarting after an incident), the Diagnoser scales up further so that it can drain the backlog faster.
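
One way to picture a rate-based scale-up is the sketch below. It is an illustration under stated assumptions rather than the actual algorithm: it assumes throughput scales linearly with the number of replicas, and the function name and the `drain_target_seconds` parameter are hypothetical.

```python
import math


def replicas_for_rates(
    current_replicas: int,
    append_rate: float,           # records/second arriving at the source topics
    commit_rate: float,           # records/second currently being processed
    lag: int,                     # records waiting to be processed
    drain_target_seconds: float,  # hypothetical: how quickly to clear the backlog
) -> int:
    """Replica count that keeps up with appends and drains the backlog in time."""
    if current_replicas <= 0 or commit_rate <= 0 or drain_target_seconds <= 0:
        raise ValueError("replica count, commit rate, and drain target must be positive")
    # Assumption: each replica contributes an equal share of the throughput.
    per_replica_rate = commit_rate / current_replicas
    # Keep up with new records, plus extra throughput to work off the backlog.
    required_rate = append_rate + lag / drain_target_seconds
    # This sketch only ever scales up, so never return fewer than the current count.
    return max(current_replicas, math.ceil(required_rate / per_replica_rate))
```

With 3 replicas committing 900 records per second against an append rate of 1,000 and a 300-second drain target, a lag of 300 records suggests 4 replicas in this sketch, while a backlog of 600,000 records suggests 10.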

We have lots of exciting plans for this Diagnoser. In future updates, we plan to support projecting the expected latency in various ways so that the Diagnoser can detect goal violations earlier based on trends or our scaling decisions. For example, we can use the trends in observed append rates to try and predict when expected latency is going to grow. Along similar lines, when making a scaling decision from the Rate Based strategy, we can use the approach taken by the DS2 Algorithm (also used by Flink’s autoscaler) to project the results of the scaling decision to downstream topologies or applications, and factor that additional load into the Diagnoser’s decisions. It’s worth noting that this is much less critical for Kafka Streams applications on Responsive than it is for other systems, because it takes us just seconds to execute a scaling action. So it’s not as bad if it takes a few iterations for a rate change to propagate down the graph of sub-topologies.

Thread Saturation

To complement the Expected Latency Diagnoser, the Thread Saturation Diagnoser sets a goal for how utilized your application’s threads are and scales down when the threads are not being utilized. You want your threads to have some reasonable minimal utilization so that you’re not wasting expensive compute resources that you don’t need.

Scale Down

The Diagnoser collects measurements of the total time every Kafka Streams processing thread is blocked waiting for new records to arrive when it calls Consumer.poll. This is time when the thread is just waiting around and is not actively doing work or blocked on some other system that is doing work on its behalf. The Diagnoser takes the total blocked time over a time window (which defaults to 5 minutes), and computes the percentage of time that it spent blocked as the blocked ratio. If all the application’s threads are reporting a blocked ratio over the configured threshold, then that means they are spending too much time doing nothing, and it should be safe to remove a replica because the remaining threads should have ample capacity to take on the removed threads’ work. So the Diagnoser then scales your application down to achieve better utilization and lower your costs.
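
The scale-down check described above can be sketched as follows. The function names and the dictionary of per-thread blocked times are hypothetical; the 5-minute window and the all-threads-over-threshold rule come from the description, and 0.65 is the threshold from the example policy.

```python
def blocked_ratio(blocked_seconds: float, window_seconds: float = 300.0) -> float:
    """Fraction of the window a thread spent blocked in poll, waiting for records."""
    return blocked_seconds / window_seconds


def should_scale_down(
    blocked_seconds_by_thread: dict[str, float],
    threshold: float = 0.65,
    window_seconds: float = 300.0,
) -> bool:
    """True only if every thread's blocked ratio exceeds the threshold."""
    if not blocked_seconds_by_thread:
        return False  # no measurements, so make no scaling decision
    return all(
        blocked_ratio(blocked, window_seconds) > threshold
        for blocked in blocked_seconds_by_thread.values()
    )
```

Requiring every thread to be over the threshold is the conservative choice: a single busy thread is enough to block the scale-down.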

Coming back to our example, we can now see how it all fits together to manage an application. The Expected Latency diagnoser scales the application up to ensure that it never takes more than a minute to process a record appended to any sub-topology. And the Thread Saturation diagnoser is watching its back and scales down when load drops enough that the threads are 65% idle. Finally, the min and max constraints ensure there’s always some floor on compute capacity (3 replicas) and ceiling on cost (10 replicas), respectively.
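
Putting the pieces together, the decision for the example policy might look roughly like the sketch below. The function and its inputs are hypothetical, and the scale-up is simplified to a caller-supplied target; only the thresholds (60 seconds and 0.65) and the replica bounds (3 and 10) come from the example policy.

```python
MIN_REPLICAS = 3
MAX_REPLICAS = 10
MAX_EXPECTED_LATENCY_SECONDS = 60.0
BLOCKED_RATIO_THRESHOLD = 0.65


def next_replica_count(
    current: int,
    expected_latency_seconds: float,
    min_blocked_ratio: float,  # lowest blocked ratio across all threads
    scale_up_target: int,      # hypothetical output of the scale-up strategy
) -> int:
    """Pick the next replica count, then clamp it to the policy's constraints."""
    if expected_latency_seconds > MAX_EXPECTED_LATENCY_SECONDS:
        # Latency goal violated: scale up.
        desired = max(current, scale_up_target)
    elif min_blocked_ratio > BLOCKED_RATIO_THRESHOLD:
        # Every thread is mostly idle: remove one replica.
        desired = current - 1
    else:
        desired = current
    # Constraints always win, whatever the Diagnosers ask for.
    return min(MAX_REPLICAS, max(MIN_REPLICAS, desired))
```

In this sketch, a scale-up request for 14 replicas is capped at 10, and an idle application already at 3 replicas stays at 3.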

Summary

In this post we’ve gone over what it means to manage your Kafka Streams applications, why it’s so valuable, and introduced you to Responsive Control Plane, which we’ve built to do just that. If you’d like to dig deeper into the challenges you might face operating Kafka Streams, our earlier blog posts on state management and sizing dive into many of the gory details. If you’d like to learn more about Responsive Control Plane, please visit Responsive’s docs site or come ask us directly on our Discord. And finally, if you’d like to give Responsive a try, please sign up on our waitlist and we’ll reach out!


Rohan Desai

Co-Founder
