Red-teaming AI with PyRIT

At the heart of Microsoft’s AI services is a promise to deliver reliable and secure AI. If you’re using Azure AI Studio to build and run inference, you’re benefiting from a hidden set of tools that checks inputs for known prompt attacks. At the same time, outputs are tested to keep the risk of error to a minimum.

Those tools are the public-facing side of a growing internal AI security organization, one that looks to build tools and services that avoid the risks associated with exposing AI applications to the entire world. The scale of Microsoft’s AI operations is enormous, with more than 30 data centers, dedicated inferencing hardware, and billions of dollars invested in hardware and software to service millions of users via Bing and a range of different Copilot applications.

A new line of defense: AI security

Those millions of users give Microsoft’s new AI security team a view into how people try to attack AI systems, subverting prompts and attempting to break through the guardrails that try and make AI safe and trustworthy. AI safety tools need to be able to find the needles in the haystack, spotting attack patterns early and diverting attackers away from Microsoft’s services.

Waiting for an attack isn’t good enough. Although Microsoft may have the tools to detect the pattern of an attack on an AI service, there’s still a good chance an attacker might be able to do damage, for example, exfiltrating the grounding data used in a retrieval-augmented generation (RAG) application, or forcing an AI agent to use an insecure workflow and internal APIs.

We need a way to put the appropriate blocks and guardrails in place before an attacker commits to using a new technique. Microsoft needs to be able to identify AI zero days before they’re used, so it can modify its code to ensure they’re never used.

That’s where the Microsoft AI Red Team gets to work. Unlike traditional red teams that look for vulnerabilities in code, AI red teams look at possible outputs from ostensibly innocuous inputs. Microsoft’s group has developed tools to help describe how AI model fail, along with others that are ready for the wider AI industry to probe and test their AI models.

Introducing PyRIT

One of those tools is PyRIT, an open source set of tools for both machine learning engineers and security teams. The name stands for Python Risk Identification Tool for generative AI (for some reason the generative AI part didn’t make it into the acronym). It’s designed to work with both local models and remote, cloud-hosted AI services.

PyRIT is best thought of as an AI security toolkit. On top of being able to define a target, you can build a data set of both static and templated prompts, with the ability to attack models with single prompts or multiple prompts in a conversation. Results are scored, and that data is stored in various formats for later analysis by experienced red team members. The aim is to automate basic security tests, allowing your security experts to focus on issues raised by PyRIT or on novel new attack types that aren’t yet fully documented.

Installation is simple. As it’s a Python tool, you need an up-to-date environment. Microsoft suggests hosting it in conda to ensure it runs in a sandboxed environment. Create and activate your environment before installing PyRIT via pip. PyRIT runs in notebooks, giving you an interactive way of working via Jupyter or similar.

Much of the documentation is provided through notebooks, so you can use it interactively. This is an interesting alternative to traditional documentation, ready for use locally or on GitHub.

Using PyRIT to test your generative AI

The heart of PyRIT is its orchestrators, This is how you link data sets to targets, constructing the attacks a potential attacker might use. The tool provides a selection of orchestrators, from simple prompt operations to more complex operations that implement common attack types. Once you have experience with how orchestrators work, you can build your own as you experiment with new and different attacks. Results are scored, evaluating how the AI and its security tools respond to a prompt. Did it reject it, or did it deliver a harmful response?

Orchestrators are written in Python, using stored secrets to access endpoints. You can think of an orchestrator as a workflow, defining targets and prompts, and collating outputs for later analysis. One interesting option is the ability to convert prompts to different formats, to see the effect of, say, using a Base64 encoding rather than a standard text prompt.

PyRIT’s converters do more than use different text encodings. You can also add noise to a prompt to look for the effect of errors, change the tone, or even translate it to a different language. It’s possible to call a converter that calls an external large language model (LLM) to make changes to a prompt, but that can add latency to a test, so local converters are preferred. The best performance comes with static content, so you may prefer to build a library of different prompts ready to be called as needed.

You aren’t limited to text prompts; PyRIT is designed to work with multi-modal models. That allows you to test computer vision or speech recognition using known samples. You can test for prompts containing illegal or unwanted imagery or where there’s a possibility of a model misinterpreting content.

Orchestrators don’t only work with the models they’re testing. You can use them alongside other AI models, such as Azure’s Content Filter, to score and evaluate responses. This helps identify bias in a model.

Another key feature is PyRIT’s memory, a DuckDB database. This manages conversation history and scoring data. Requests and responses are stored in the database for later analysis. Data is extracted and analyzed with Excel, Power BI, or other tools. At the same time, Memory will host a library of prompts for multiple tests. Having a library of both normal prompts and attacks is useful as it helps benchmark an application’s behavior. It also provides a place to share prompts with team members.

Using PyRIT with SLMs on the edge

You’re not limited to cloud-hosted AI services. Support for ONNX-hosted AI models makes PyRIT a useful tool for anyone using local small language models (SLMs). It’s easy to see it working alongside the Copilot Runtime as another aspect of building and testing small language models, including applications built on top of Phi Silica.

Another useful feature is the ability to compare different runs. As you add mitigations to your application, a set of PyRIT tests compares security performance with previous tests. Has your model’s security performance improved or has it gotten worse? Comparing versions allows you to refine operations more effectively and understand the effects specific changes have on your AI application’s behavior.

Microsoft uses this approach to test its LLM-based Copilots, checking different iterations of the underlying metaprompt, which defines how a Copilot chatbot operates and helps ensure that guardrails are in place to avoid jailbreak attacks via prompt injection.

PyRIT is good, but it’s only as good as the security engineers using it. AI security is a new discipline, requiring traditional security techniques and skills in both prompt engineering and data science. These are needed for both red and blue teams working to build secure AI services by testing their protections against destruction. After all, your defenses need to work all the time, while attackers only need to get past them once.

Go to Source

Author: