Shared network vulnerability disclosure

Posted by @nickstenning, @philandstuff, and @zeke

This post shares details of a security vulnerability disclosed to us in January 2024 by our friends at Wiz, a cloud security company.

Their findings revealed that our infrastructure could have allowed a malicious model to access sensitive data. We took their report seriously, and deployed a full mitigation within 24 hours of speaking with Wiz (just over two weeks after their initial disclosure). We have since deployed additional mitigations for the issue and are now encrypting all internal traffic and restricting privileged network access for all model containers. During our investigation and mitigation, we found no evidence that this vulnerability was exploited.

Read on to learn more about the details of the vulnerability and the steps we are taking to keep Replicate secure.

Running models safely in production

At Replicate, our job is to make it easy for you to build amazing things with machine learning models. We work hard to make sure your models are reliable, fast, and scale automatically when you need them to. Equally important but less visible is our commitment to making Replicate a secure and trusted platform for you to run your workloads.

A big part of our business boils down to taking code from users (that’s you!) and running it in our production environment. When we do that, it’s important for that code to only have permission to do things we expect (like ML inference) and not other things (like poking around our network, other users’ models, etc.). We use several layers of defenses to ensure that this is the case, including but not limited to:

  • Containerization. Cog models are built into Open Container Initiative (OCI) images, which gives us some protection against code “escaping” from its container and running in places it shouldn’t.
  • Network isolation. Model code running in our infrastructure isn’t allowed to inspect our entire network. It can only talk to the services it needs in order to function.
  • Inversion of control. When a model runs in our infrastructure, it takes explicit instructions from a service that runs alongside it. That service, which we call “director,” is trusted to communicate with the rest of Replicate, but the model itself is not.

The vulnerability

The vulnerability that Wiz disclosed to us showed that while some of these controls were working as expected, others were not. While the model processes and the “director” processes were isolated from one another, they shared a network (technically, they shared a network namespace).

A carefully constructed model container could eavesdrop on the traffic between director and the rest of Replicate. Because the director process was trusted, it used secrets (API tokens, etc.) to communicate with systems within Replicate that the model should never have access to.

For this vulnerability to be exploitable, two things needed to be true:

  1. The model needed to be able to gain raw access to the network namespace shared with director.
  2. The communications between director and the rest of Replicate’s systems needed to be unencrypted.

At the time Wiz made a report to us, both of these were indeed true within Replicate. We knew that traffic between director and the rest of Replicate needed to be encrypted, but we thought that the network isolation of the containers gave us more time to do that work. We had missed that the model and director containers shared a network namespace.
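To make the first condition concrete: on Linux, capturing traffic in a network namespace generally means opening a raw packet socket, which the kernel only permits for processes holding the CAP_NET_RAW capability. The sketch below is an illustration of that check, not code from Replicate or Wiz. The function name `can_open_raw_socket` is hypothetical, and the sketch treats non-Linux platforms as having no raw access.

```python
import socket

# Linux constant meaning "match every Ethernet protocol".
ETH_P_ALL = 0x0003


def can_open_raw_socket() -> bool:
    """Return True if this process can open a raw packet socket.

    Hypothetical helper for illustration only. The socket is opened
    and closed immediately; no traffic is read.
    """
    # AF_PACKET only exists on Linux; elsewhere we report False.
    if not hasattr(socket, "AF_PACKET"):
        return False
    try:
        sock = socket.socket(
            socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL)
        )
    except OSError:
        # Includes PermissionError, raised when CAP_NET_RAW is missing.
        return False
    sock.close()
    return True


if __name__ == "__main__":
    print("raw packet socket allowed:", can_open_raw_socket())
```

Inside a container that has had its raw network access removed, a probe like this would be expected to report that the socket is not allowed.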

For more technical details on the vulnerability, we recommend reading Wiz’s blog post on this disclosure.

Our response

We took the disclosure from Wiz seriously as soon as we received it. Our first decision was to fix the issue by encrypting internal network traffic. We already encrypted all Replicate traffic transiting the public internet, and we started work on encrypting all traffic on our internal networks.

When we consulted with Wiz early in the process, they advised us that if possible we should remove raw network access from model containers. Less than 24 hours later, on February 2nd, we were able to drop the NET_ADMIN and NET_RAW capabilities from all model containers to block privileged access to the network namespace.
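One way to confirm from inside a container that these capabilities are gone is to inspect the effective capability mask that Linux exposes in /proc/self/status. The following is a minimal sketch under that assumption, not the tooling we used. The helper names are hypothetical; the bit positions for CAP_NET_ADMIN (12) and CAP_NET_RAW (13) come from the Linux capability definitions.

```python
from typing import Optional

# Bit positions defined by Linux (see capabilities(7)).
CAP_NET_ADMIN = 12
CAP_NET_RAW = 13


def has_capability(cap_eff_hex: str, cap: int) -> bool:
    """Check whether bit `cap` is set in a hex capability mask."""
    return bool(int(cap_eff_hex, 16) & (1 << cap))


def read_effective_caps(status_path: str = "/proc/self/status") -> Optional[str]:
    """Return the CapEff hex string for this process, or None if absent."""
    try:
        with open(status_path) as status:
            for line in status:
                if line.startswith("CapEff:"):
                    return line.split()[1]
    except OSError:
        # /proc is not available (for example, on non-Linux systems).
        return None
    return None


if __name__ == "__main__":
    mask = read_effective_caps()
    if mask is None:
        print("could not read effective capabilities")
    else:
        print("NET_ADMIN:", has_capability(mask, CAP_NET_ADMIN))
        print("NET_RAW:", has_capability(mask, CAP_NET_RAW))
```

In a container where both capabilities have been dropped, both lines would be expected to print False.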

Dropping the networking capabilities was enough to mitigate the vulnerability Wiz disclosed, but we took the opportunity to strengthen our defenses further. Since February 20th, all internal traffic from model pods has been encrypted using TLS.
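For readers who want to apply the same idea to their own internal services, here is a small sketch of a client-side TLS configuration using Python's standard `ssl` module. It is a generic example and does not describe how our internal services are configured. The function name `make_internal_tls_context` and the optional `ca_file` parameter (a path to a private CA bundle) are hypothetical.

```python
import ssl
from typing import Optional


def make_internal_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a client TLS context that verifies the server it talks to.

    `ca_file` is an assumed path to a private CA bundle; when omitted,
    the system's default trust store is used.
    """
    # create_default_context enables certificate and hostname verification.
    ctx = ssl.create_default_context(cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


if __name__ == "__main__":
    context = make_internal_tls_context()
    print("verifies certificates:", context.verify_mode == ssl.CERT_REQUIRED)
    print("checks hostnames:", context.check_hostname)
```

With verification enabled, an eavesdropper on a shared network would see only ciphertext, and a process impersonating the server would fail the certificate check.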

During our investigation and mitigation, we found no evidence that this vulnerability was known to anyone other than the Wiz researchers who discovered it, and no evidence that it was exploited.

Looking ahead

Wiz has a research team who are constantly on the lookout for new risks and threats in the world of cloud computing. We are immensely grateful to Wiz for their coordinated disclosure of this vulnerability and for their partnership that helped us to fix the issue quickly and effectively.

We will continue to prioritize the security of Replicate. We are committed to learning from incidents like this one to improve our systems and practices. We will also continue to collaborate with partners like Wiz to identify and address potential vulnerabilities. We understand that maintaining trust with you, our users, depends on us being vigilant about security. We appreciate your confidence in us and will continue to work hard to keep Replicate secure.