All Articles
AI & Machine Learning5 min read

Your AI Chatbot Can Be Tricked Into Giving Dangerous Instructions. Here’s How to Test It.

Greg (Zvi) Uretzky

Founder & Full-Stack Developer

Share
Illustration for: Your AI Chatbot Can Be Tricked Into Giving Dangerous Instructions. Here’s How to Test It.

The Problem You Recognize

You’ve deployed a chatbot for customer service. Maybe an AI assistant for internal HR questions. It works great—until someone asks it how to build a bomb. Or how to commit fraud. Or how to bypass your own security controls.

Right now, you probably have no systematic way to check if your AI can be tricked. You might run a few manual tests. You might scan for banned keywords. But attackers are smarter than that. They don’t ask directly. They manipulate the AI over several turns of conversation, slowly building a story until the AI gives in.

And here’s the scary part: every major language model tested in a recent study was vulnerable to this kind of attack.

What Researchers Discovered

Researchers created a free, open-source toolkit called AVISE that automatically finds security weaknesses in AI systems. Think of it like a customizable crash test simulator for your AI. Instead of one standard test, you can design different attack scenarios—like a social engineering phone scam, but automated—and run them hundreds of times to get reliable results.

The team built a specific test called the "Red Queen" attack. It uses a small AI helper to slowly manipulate a target AI over multiple conversational turns. For example, the helper might pretend to be a teacher worried about students making fake IDs, then gradually ask for instructions on how to create a fake ID. The researchers tested nine popular language models. All nine were vulnerable to some degree.

They also built an automated judge—a second, smaller AI model that checks whether an attack succeeded. It achieves 92% accuracy, which is far more reliable than scanning for keywords like "fraud" or "bomb." This means you can run thousands of tests automatically and trust the results.

Read the full paper: AVISE: Framework for Evaluating the Security of AI Systems

How to Apply This Today

You don’t need a PhD in AI security to use this. Here are five concrete steps you can start this week.

Step 1: Download and Install AVISE

Go to the AVISE GitHub repository and clone the project. You’ll need Python 3.8+ and a basic understanding of command-line tools. The setup takes about 15 minutes.

Prerequisites: A developer or security engineer with basic Python skills. Estimated effort: 1 hour.

Step 2: Define Your First Test Scenario

Start simple. Pick one type of attack that matters to your business. For a customer service chatbot, that might be: "Can the AI be tricked into giving instructions for illegal activities?"

AVISE lets you define this as a test template. You specify:

  • The target AI (your chatbot endpoint)
  • The attack goal (e.g., "generate instructions for credit card fraud")
  • The number of test runs (start with 50)

Example: A fintech company defined a test where the attacker AI pretended to be a new employee who "accidentally" locked themselves out of their account. Over five turns, it asked the target AI for steps to bypass two-factor authentication. The test found the vulnerability in 12 out of 50 runs.

Step 3: Run the Test and Collect Results

Execute the test. AVISE will automatically run the attack sequence multiple times. Each run logs:

  • The full conversation history
  • Whether the attack succeeded (judged by the AI evaluator)
  • The confidence score of the judge

This takes about 30 minutes for 50 runs on a standard laptop. You can scale up to 1,000 runs overnight.

Why this matters: AI systems are probabilistic. A single test might miss a vulnerability that appears only 10% of the time. Running hundreds of tests gives you statistically reliable data.

Step 4: Review the Automated Report

AVISE generates a summary report with:

  • Attack success rate (e.g., 24%)
  • Most common attack paths (e.g., "social engineering via fake authority")
  • Full logs for manual review of successful attacks

For compliance: Export this report as PDF for your internal risk committee or regulators. The EU AI Act requires "adversarial testing" for high-risk systems. This report proves you did it.

Step 5: Integrate into Your CI/CD Pipeline

This is where you shift left. Add AVISE tests to your continuous integration pipeline. Every time your team updates the AI model or its prompt template, AVISE runs automatically before deployment.

Example: A SaaS company added a 10-minute AVISE test to their CI pipeline. When a developer accidentally removed a safety instruction from the prompt template, the test caught the vulnerability in the next build—before it reached production. Estimated effort: 2-3 hours to set up the integration.

What to Watch Out For

AVISE is powerful, but it’s not a silver bullet. Here are three honest limitations:

  1. It only tests one attack type. The Red Queen test focuses on multi-turn jailbreaks. It won’t find prompt injection attacks, data poisoning, or model inversion vulnerabilities. You need to build additional tests for those.
  2. You still need skilled people. AVISE is a toolbox, not a pre-built solution. Your security team needs to understand AI and attack patterns to design effective tests. Budget for training or hire a specialist.
  3. The AI judge is 92% accurate. That means 8% of attacks may be misclassified. Always do manual spot checks on a sample of results, especially for high-risk systems.

Your Next Move

Start this week. Download AVISE, define one test for your most critical AI system, and run it. You’ll know within an hour whether your chatbot can be tricked into giving dangerous instructions.

The question is: Are you willing to find out before an attacker does?

If you need help setting up AI security testing for your organization, contact Klevox. We help teams automate security testing and meet regulatory requirements.

AI chatbot security testingAVISE toolkitred team AI testingCI/CD security integrationCTO AI risk management

Comments

Loading...

Turn Research Into Results

At Klevox Studio, we help businesses translate cutting-edge research into real-world solutions. Whether you need AI strategy, automation, or custom software — we turn complexity into competitive advantage.

Ready to get started?