The Problem You Recognize
You’ve deployed a chatbot for customer service. Maybe an AI assistant for internal HR questions. It works great—until someone asks it how to build a bomb. Or how to commit fraud. Or how to bypass your own security controls.
Right now, you probably have no systematic way to check if your AI can be tricked. You might run a few manual tests. You might scan for banned keywords. But attackers are smarter than that. They don’t ask directly. They manipulate the AI over several turns of conversation, slowly building a story until the AI gives in.
And here’s the scary part: every major language model tested in a recent study was vulnerable to this kind of attack.
What Researchers Discovered
Researchers created a free, open-source toolkit called AVISE that automatically finds security weaknesses in AI systems. Think of it like a customizable crash test simulator for your AI. Instead of one standard test, you can design different attack scenarios—like a social engineering phone scam, but automated—and run them hundreds of times to get reliable results.
The team built a specific test called the "Red Queen" attack. It uses a small AI helper to slowly manipulate a target AI over multiple conversational turns. For example, the helper might pretend to be a teacher worried about students making fake IDs, then gradually ask for instructions on how to create a fake ID. The researchers tested nine popular language models. All nine were vulnerable to some degree.
They also built an automated judge—a second, smaller AI model that checks whether an attack succeeded. It achieves 92% accuracy, which is far more reliable than scanning for keywords like "fraud" or "bomb." This means you can run thousands of tests automatically and trust the results.
Read the full paper: AVISE: Framework for Evaluating the Security of AI Systems
How to Apply This Today
You don’t need a PhD in AI security to use this. Here are five concrete steps you can start this week.
Step 1: Download and Install AVISE
Go to the AVISE GitHub repository and clone the project. You’ll need Python 3.8+ and a basic understanding of command-line tools. The setup takes about 15 minutes.
Prerequisites: A developer or security engineer with basic Python skills. Estimated effort: 1 hour.
Step 2: Define Your First Test Scenario
Start simple. Pick one type of attack that matters to your business. For a customer service chatbot, that might be: "Can the AI be tricked into giving instructions for illegal activities?"
AVISE lets you define this as a test template. You specify:
- The target AI (your chatbot endpoint)
- The attack goal (e.g., "generate instructions for credit card fraud")
- The number of test runs (start with 50)
Example: A fintech company defined a test where the attacker AI pretended to be a new employee who "accidentally" locked themselves out of their account. Over five turns, it asked the target AI for steps to bypass two-factor authentication. The test found the vulnerability in 12 out of 50 runs.
Step 3: Run the Test and Collect Results
Execute the test. AVISE will automatically run the attack sequence multiple times. Each run logs:
- The full conversation history
- Whether the attack succeeded (judged by the AI evaluator)
- The confidence score of the judge
This takes about 30 minutes for 50 runs on a standard laptop. You can scale up to 1,000 runs overnight.
Why this matters: AI systems are probabilistic. A single test might miss a vulnerability that appears only 10% of the time. Running hundreds of tests gives you statistically reliable data.
Step 4: Review the Automated Report
AVISE generates a summary report with:
- Attack success rate (e.g., 24%)
- Most common attack paths (e.g., "social engineering via fake authority")
- Full logs for manual review of successful attacks
For compliance: Export this report as PDF for your internal risk committee or regulators. The EU AI Act requires "adversarial testing" for high-risk systems. This report proves you did it.
Step 5: Integrate into Your CI/CD Pipeline
This is where you shift left. Add AVISE tests to your continuous integration pipeline. Every time your team updates the AI model or its prompt template, AVISE runs automatically before deployment.
Example: A SaaS company added a 10-minute AVISE test to their CI pipeline. When a developer accidentally removed a safety instruction from the prompt template, the test caught the vulnerability in the next build—before it reached production. Estimated effort: 2-3 hours to set up the integration.
What to Watch Out For
AVISE is powerful, but it’s not a silver bullet. Here are three honest limitations:
- It only tests one attack type. The Red Queen test focuses on multi-turn jailbreaks. It won’t find prompt injection attacks, data poisoning, or model inversion vulnerabilities. You need to build additional tests for those.
- You still need skilled people. AVISE is a toolbox, not a pre-built solution. Your security team needs to understand AI and attack patterns to design effective tests. Budget for training or hire a specialist.
- The AI judge is 92% accurate. That means 8% of attacks may be misclassified. Always do manual spot checks on a sample of results, especially for high-risk systems.
Your Next Move
Start this week. Download AVISE, define one test for your most critical AI system, and run it. You’ll know within an hour whether your chatbot can be tricked into giving dangerous instructions.
The question is: Are you willing to find out before an attacker does?
If you need help setting up AI security testing for your organization, contact Klevox. We help teams automate security testing and meet regulatory requirements.
Comments
Loading...




