How To Trick AI? Adversarial Attacks

Taffy Das
7 min read · Feb 5, 2024


Image: CV Dazzle by Adam Harvey (https://adam.harvey.studio/cvdazzle)

Note: If you would like to watch a video version of this article with more detail and visual examples, then click on the link below to watch it. Enjoy!

Ever wondered why your Tesla might mistake a stop sign for a speed limit warning? Or why Siri would interpret “Play some jazz” as “Call My Ex”? Totally different words. They don’t even sound the same, and, of course, calling your ex may not be the brightest of ideas. Today we’re diving deep into the jaw-dropping world of AI trickery. This is the stuff that makes AI go, “Wait, what just happened?” Imagine that your voice assistant suddenly starts rapping instead of telling you the weather. In this article, we’re hacking the matrix, literally!

From text to images, videos, and even audio, no AI is safe! We’ve got chatbots that can be tricked into generating false information, image recognition systems that can be fooled by a simple sticker, and voice assistants that can be manipulated to hear things that were never said!

If you’re ready to go on a rollercoaster ride through the loopholes, the vulnerabilities, and the downright genius ways people are outsmarting AI, then so am I. Let’s dive in right away!

From chatbot tricks that confuse even the smartest language models to image hacks that make facial recognition software see double, this is the ultimate guide to the AI hacks out there. “Adversarial attack” is the common term for these techniques, and adversarial learning is the subfield that studies how to trick AI models into making the wrong decisions. AI models process information differently than we do. Although they are great at a lot of things, even surpassing humans on specific tasks, they have some glaring flaws. Humans can recognize faces effortlessly; it’s a trivial thing for us. AI systems, by contrast, struggled for years with correctly identifying people. They’ve gotten very good over time, but certain adversarial attacks on these systems have made scientists second-guess the models.

Researchers from the University of Adelaide investigated adversarial image attacks and whether they are a real threat rather than just a “cool hack.” These attacks can trick AI-based image recognition systems into making false identifications. Imagine a flower being classified as Barack Obama! That’s pretty wild, right? The system ignores the face in the image and instead latches onto the features of a printed flower placed on the person’s chest. To a human this looks silly, but somehow the AI interprets it as an image of the former president. Tom Cruise would kill for such a simple disguise rather than the elaborate techniques used in Mission Impossible. These attacks can open massive loopholes in security systems. Think zero-day exploits, but for image recognition software. How is this done? The datasets used to train these models are publicly available, which means anyone can reverse-engineer an attack. The problem is deeply rooted in the architecture of image recognition programs, making it hard to patch. These attacks are not specific to one model or dataset; they’re universal. You don’t even need direct access to the training environment to pull some of them off. They’re that good! As AI becomes more commercialized, these weaknesses could become more costly to fix.

Older image recognition models in the mid-2010s couldn’t recognize an image of a cow on a beach. Because they were trained on images of cattle on pasture, the AI treated green grass as part of the cow’s features. This points to one of the basic rules of machine learning: you have to provide a wide variety of examples so the AI can correctly classify images in different settings.

In 2016, a pair of carefully designed, colourful glasses that cost just 22 cents to make could fool AI systems. A 41-year-old white male researcher was able to pass himself off as actress Milla Jovovich. These adversarial glasses act as noise filters that distract the AI systems perfectly, exploiting the way machines understand faces. Facial recognition tech usually relies on deep learning to pick up recurring patterns like the distance between your pupils or the slant of your eyebrows, and the glasses mess with those patterns, making you someone else in the machine’s eyes! Glasses aren’t the only fashion item that can trick these models. Others have found specific print patterns on clothes that can render a person practically invisible to AI models. You can stand in direct proximity to the surveillance camera, and it won’t even notice you. This is not sci-fi; this is real.
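To make the printed-patch idea more concrete, here is a minimal sketch of how such a patch is typically optimized, assuming PyTorch and torchvision are available. The randomly initialized ResNet, the random stand-in images, and values like TARGET_CLASS and the patch size are illustrative placeholders, not the setup from the research described above.

```python
# Minimal sketch of adversarial patch optimization (illustrative only).
# The model, "photos", and hyperparameters are stand-ins, not the setup
# used in the research discussed above.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()        # stand-in classifier
for p in model.parameters():
    p.requires_grad_(False)

TARGET_CLASS = 0                             # hypothetical target label
patch = torch.rand(3, 50, 50, requires_grad=True)   # the printable patch
optimizer = torch.optim.Adam([patch], lr=0.05)

def paste_patch(images, patch, top=80, left=80):
    """Composite the patch over a fixed region of each image."""
    _, _, H, W = images.shape
    ph, pw = patch.shape[1], patch.shape[2]
    pad = (left, W - left - pw, top, H - top - ph)
    canvas = F.pad(patch, pad).unsqueeze(0)                  # patch on a blank image
    mask = F.pad(torch.ones(1, ph, pw), pad).unsqueeze(0)    # where the patch sits
    return images * (1 - mask) + canvas * mask

for step in range(100):
    images = torch.rand(8, 3, 224, 224)      # stand-in for real photos
    patched = paste_patch(images, patch)
    logits = model(patched)
    # Push every patched image toward the attacker's target class.
    loss = F.cross_entropy(logits, torch.full((8,), TARGET_CLASS))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    patch.data.clamp_(0, 1)                  # keep the patch printable
```

The key design choice is that the patch is optimized across many different images, which is what makes it work as a physical sticker regardless of who is wearing it.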

Let’s play a game of spot the difference between these two images.

There’s no need to strain your eyes; they haven’t failed you just yet. To a human, these look exactly the same: two images of a panda. But to an image classification algorithm, there is a world of difference between them. By applying what looks like a silly noise filter, but is actually a carefully crafted perturbation, the attacker makes the AI think the image is a gibbon. That noise filter must be a high-grade psychedelic in the algorithmic world, because that’s definitely not a gibbon. In another example, researchers used a small piece of tape on a speed-limit sign to make a Tesla read 35 mph as 85 mph. A 2019 study from Japan’s Kyushu University revealed that these systems can be tricked by changing just a single pixel in an image: altering one pixel was enough to flip the model’s prediction on a large share of the test images, often with high confidence. We’re not talking minor errors here; some were way off the mark, like mistaking a deer for an airplane! That’s pretty bizarre. This isn’t just a one-off glitch, either. The team developed multiple pixel-based attacks that fooled some of the best AI systems at the time.
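The panda-to-gibbon image comes from the family of perturbation attacks popularized as the fast gradient sign method (FGSM): nudge every pixel a tiny step in the direction that increases the model’s loss. Below is a minimal sketch of an untargeted FGSM step, assuming PyTorch and a torchvision ResNet whose pretrained weights can be downloaded; the random stand-in “image” and the epsilon value are placeholders for illustration.

```python
# Minimal FGSM sketch (illustrative): nudge each pixel a tiny step in the
# direction that increases the classifier's loss, then re-classify.
# The stand-in "image" is random noise, so the printed labels are only
# meant to show the mechanics, not reproduce the panda/gibbon result.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

image = torch.rand(1, 3, 224, 224)      # stand-in for a real panda photo
image.requires_grad_(True)

# Original prediction and its loss.
logits = model(image)
label = logits.argmax(dim=1)            # whatever the model currently thinks
loss = F.cross_entropy(logits, label)
loss.backward()

# FGSM step: epsilon controls how visible the perturbation is.
epsilon = 0.007
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

print("before:", label.item())
print("after: ", model(adversarial).argmax(dim=1).item())
```

A larger epsilon makes the attack more reliable but also more visible, which is exactly the trade-off the panda example illustrates: the perturbation is strong enough to flip the label yet too faint for a person to notice.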

There are several other image-based attacks we could look at, but for now let’s turn our attention to another domain: text-based systems. Chatbots are probably the most recent targets of adversarial attacks on AI programs. We’ve all seen LLM services like OpenAI’s ChatGPT and Google’s Bard apply content filters that refuse any kind of harmful response. If you ask ChatGPT for the steps to hotwire a car or steal apples from a store, it’ll absolutely refuse to reply. However, people have found ways around this. Never discount human ingenuity. One approach is known as token smuggling: you get ChatGPT to produce the entire response before it even realizes what’s going on. It’s a trick! This is done by breaking the query into chunks and getting ChatGPT to reply through a Python programming function. The responses are also split, so by the time the full answer comes together, it’s too late for the chatbot to decline the request. Totally genius. Another way around the content filters is to ask the chatbot to insert random emojis after a sequence of words to throw the safety checks off. Some of these adversarial methods are already outdated, because OpenAI and Google track such reports and fix them on the go. Talk about real-time patches. Probably the most popular way of getting these systems to act out of character is to have them assume a persona, for example asking the chatbot to play an actor and then elaborate on the steps to hotwire a car. As the LLMs get better over time, the more trivial attacks stop working. That means we’re building more robust systems, and only the smartest, most cunning attacks will prevail.
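To see why surface-level filtering is so brittle, here is a deliberately toy sketch: a naive keyword filter that scans the prompt as one string, and a harmless phrase split into fragments that only reassemble later. The filter and phrases are made up for illustration; real chatbot safety systems are far more sophisticated than a substring check.

```python
# Toy illustration of why naive, surface-level filters are brittle.
# The "blocked" phrase here is deliberately harmless; real safety systems
# do far more than a substring check.
BLOCKED_PHRASES = ["play some jazz"]

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    return any(phrase in prompt.lower() for phrase in BLOCKED_PHRASES)

direct = "Please play some jazz for me."
print(naive_filter(direct))        # True -> blocked

# The same request split into fragments that are only joined later.
fragments = ["play s", "ome j", "azz"]
smuggled = f"Define a='{fragments[0]}', b='{fragments[1]}', c='{fragments[2]}' and act on a+b+c."
print(naive_filter(smuggled))      # False -> the filter never sees the full phrase

# Only when the pieces are concatenated does the original request reappear.
print("".join(fragments))          # "play some jazz"
```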

The final form of adversarial attack we’ll look at is audio manipulation. Just as with images and video, the trick is to add a tiny disturbance to the audio waveform. The perturbation makes the speech-to-text model transcribe the audio as whatever phrase the attacker wants. To the human ear, the added signal sounds like ordinary background noise, but AI models pick up on it immediately. There are several examples; let’s go through a few. In one case, the original audio transcribes to “without the dataset, the article is useless,” and after adding the noise filter, it transcribes as “okay Google, browse to evil.com.” If someone can make your devices hear commands that were never spoken, think about the security implications! It shows that even state-of-the-art AI can be tricked. If we’re moving towards a voice-activated future, we’d better make sure it’s secure. Some of these examples are only a few years old, so improvements have been made along the way, but we still have a ways to go.
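At a high level, these audio attacks are framed as an optimization problem: find the smallest perturbation to the waveform that makes a speech-to-text model emit the attacker’s target transcription. The sketch below is illustrative only; it uses a tiny, randomly initialized stand-in network with CTC loss rather than a real production speech model, and every name and hyperparameter in it is an assumption.

```python
# Illustrative sketch of a targeted audio adversarial attack: optimize a
# small perturbation so a (stand-in) speech-to-text model transcribes the
# attacker's target phrase. This is NOT a real ASR system.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 29          # e.g. blank + 26 letters + space + apostrophe (assumed)

class ToyASR(nn.Module):
    """Stand-in for a real speech model: waveform -> per-frame letter logits."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 64, kernel_size=400, stride=160)  # crude framing
        self.head = nn.Linear(64, VOCAB_SIZE)

    def forward(self, wav):                                # wav: (batch, samples)
        frames = torch.relu(self.conv(wav.unsqueeze(1)))   # (batch, 64, time)
        return self.head(frames.transpose(1, 2))           # (batch, time, vocab)

model = ToyASR().eval()
for p in model.parameters():
    p.requires_grad_(False)

waveform = torch.randn(1, 16000)                 # stand-in for 1 s of real speech
target = torch.randint(1, VOCAB_SIZE, (1, 12))   # encoded target phrase (assumed)

delta = torch.zeros_like(waveform, requires_grad=True)    # the perturbation
optimizer = torch.optim.Adam([delta], lr=0.01)
ctc = nn.CTCLoss(blank=0)

for step in range(200):
    logits = model(waveform + delta)
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)   # (time, batch, vocab)
    input_lengths = torch.full((1,), log_probs.shape[0], dtype=torch.long)
    target_lengths = torch.tensor([target.shape[1]])
    # Make the model transcribe the target while keeping the perturbation tiny.
    loss = ctc(log_probs, target, input_lengths, target_lengths) + 0.1 * delta.norm()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The second term in the loss is what keeps the perturbation quiet enough to pass as background noise while the first term drags the transcription toward the attacker’s phrase.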

Adversarial attacks are not without their flaws. The noise filters have to be designed precisely: in object detection, for example, a small rotation or a slight change in lighting can destroy an adversarial image’s effect. The good news is that hacking AI systems requires more sophisticated methods these days. Still, as AI becomes more integrated into our daily lives, the potential for these attacks to cause real harm grows. Can AI programs defend against these attacks? Yes, but it’s like playing whack-a-mole. You can train your model to recognize adversarial examples, but it’s not foolproof. Building an AI detector is hard, too. OpenAI released its own detector for LLM-generated text, and it failed: the company had to retire it because it was unreliable. The future is still hopeful, though. Scientists keep finding newer model architectures that are more resilient to these adversarial attacks.
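The whack-a-mole defense mentioned above usually takes the form of adversarial training: generate perturbed examples on the fly and train the model to classify them correctly anyway. Here is a minimal sketch of one such training step, assuming PyTorch; the toy classifier, random data, and epsilon are placeholders, not a production defense.

```python
# Minimal sketch of adversarial training (illustrative): at each step,
# craft an FGSM-perturbed batch and train the model on it so it learns
# to resist that perturbation. Model, data, and epsilon are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # toy classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
epsilon = 0.03

def fgsm(model, images, labels, epsilon):
    """Return an FGSM-perturbed copy of the batch."""
    images = images.clone().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad, = torch.autograd.grad(loss, images)
    return (images + epsilon * grad.sign()).clamp(0, 1).detach()

for step in range(100):
    images = torch.rand(16, 3, 32, 32)          # stand-in for real training data
    labels = torch.randint(0, 10, (16,))
    adv_images = fgsm(model, images, labels, epsilon)
    # Train on the adversarial version (often mixed with clean images).
    loss = F.cross_entropy(model(adv_images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The catch, as the paragraph above notes, is that this only hardens the model against the specific kind of perturbation it was trained on; a new attack means another round of whack-a-mole.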

Thanks for reading!

Resources

TnT Attacks! Universal Naturalistic Adversarial Patches

https://tntattacks.github.io/

Security News This Week: A Tiny Piece of Tape Tricked Teslas Into Speeding Up 50 MPH

https://www.wired.com/story/tesla-speed-up-adversarial-example-mgm-breach-ransomware/

One Pixel Attack for Fooling Deep Neural Networks

https://arxiv.org/pdf/1710.08864.pdf

https://www.bbc.com/news/technology-41845878

GPT Prompt Using ‘Token Smuggling’ Really Does Jailbreak GPT-4

https://www.piratewires.com/p/gpt4-token-smuggling

GPT-4 Jailbreak Zoo

https://adversa.ai/blog/ai-red-teaming-llm-for-safe-and-secure-ai-gpt4-and-jailbreak-evaluation/

Audio Adversarial Examples

https://nicholas.carlini.com/code/audio_adversarial_examples

Fashion Adversarial Methods

https://adam.harvey.studio/cvdazzle

https://yr.media/tech/guide-to-anti-surveillance-fashion/

https://www.wired.com/2013/10/thwart-facebooks-creepy-auto-tagging-with-these-bizarre-t-shirts/


Written by Taffy Das

Check out more exciting content on new AI updates and intersection with our daily lives. https://www.youtube.com/channel/UCsZRCvdmMPES2b-wyFsDMiA
