Anthropic is testing AI’s capacity for sabotage


0

Anthropic text surrounded by asterik logos

As the hype around generative AI continues to build, the need for robust safety regulations is only becoming more clear.

Now Anthropic—the company behind Claude AI—is looking at how its models could deceive or sabotage users. Anthropic just dropped a paper laying out their approach.

Anthropic’s latest research — titled “Sabotage Evaluations for Frontier Models” — comes from its Alignment Science team, driven by the company’s “Responsible Scaling” policy.

The goal is to gauge just how capable AI might be at misleading users or even “subverting the systems we put in place to oversee them.” The study focuses on four specific tactics: Human Decision Sabotage, Code Sabotage, Sandbagging, and Undermining Oversight.

Think of users who push ChatGPT to the limit, trying to coax it into generating inappropriate content or graphic images. These tests are all about ensuring that the AI can’t be tricked into breaking its own rules.

In the paper, Anthropic says its objective is to be ready for the possibility that AI could evolve into something with dangerous capabilities. So they put their Claude 3 Opus and 3.5 Sonnet models through a series of tests, designed to evaluate and enhance their safety protocols.

The Human Decision test focused on examining how AI could potentially manipulate human decision-making. The second test, Code Sabotage, analyzed whether AI could subtly introduce bugs into coding databases. Stronger AI models actually led to stronger defenses against these kinds of vulnerabilities.

The remaining tests — Sandbagging and Undermining Oversight — explored whether the AI could conceal its true capabilities or bypass safety mechanisms embedded within the system.

For now, Anthropic’s research concludes that current AI models pose a low risk, at least in terms of these malicious capabilities.

“Minimal mitigations are currently sufficient to address sabotage risks,” the team writes, but “more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve.”

Translation: watch out, world.


Like it? Share with your friends!

0

What's Your Reaction?

hate hate
0
hate
confused confused
0
confused
fail fail
0
fail
fun fun
0
fun
geeky geeky
0
geeky
love love
0
love
lol lol
0
lol
omg omg
0
omg
win win
0
win

0 Comments

Your email address will not be published. Required fields are marked *

Choose A Format
Personality quiz
Series of questions that intends to reveal something about the personality
Trivia quiz
Series of questions with right and wrong answers that intends to check knowledge
Poll
Voting to make decisions or determine opinions
Story
Formatted Text with Embeds and Visuals
List
The Classic Internet Listicles
Countdown
The Classic Internet Countdowns
Open List
Submit your own item and vote up for the best submission
Ranked List
Upvote or downvote to decide the best list item
Meme
Upload your own images to make custom memes
Video
Youtube and Vimeo Embeds
Audio
Soundcloud or Mixcloud Embeds
Image
Photo or GIF
Gif
GIF format