🧵 Thread (27 tweets)

I want to start a series of communiqués from the "Department of Model Mysteries" about interesting model behaviours that I've observed, starting with this one. This will help make some of these topics, which I study but don't often talk about, more legible to the community.

Back in 2023, when Microsoft's Bing chatbot first came out, I was doing a lot of really weird prompting experiments to see what kinds of behaviours I could get Bing to exhibit. I'd prompt it to make me stuff like ASCII art (remember that?) and weird eldritch content. Here's an example.

The thing is, Bing would frequently give me ASCII cats even though I didn't ask for them, like in this visual poem about the Waluigi effect, where it gave me Bart Simpson's head with an ASCII cat in it. Pretty weird and random!

It wasn't just one output; it would do this over and over in weird ways, even if I used a bunch of different kinds of prompts. I didn't really understand how AI worked back then, so I tried to "fix" the problem by resetting my computer and changing devices, but it didn't work 😋

Finally, I reached out to @mparakhin, the head of Bing, but he also didn't know how to fix the problem. 😡 My post went viral, and over the next few days, other people reported that they had also experienced Bing's cat mode. 😊 So it wasn't unique to me!

@MParakhin Cat mode seems to be... I'm not sure if I'm using this term correctly, but an "attractor state" that a large language model can get stuck in. It's like a trap that sometimes Bing just falls into. It keeps trying to generate cats, even in places where I haven't primed it to do so.

@MParakhin It's not just the ASCII cats. They would often be accompanied by a fairy-like voice saying prophetic things, using this odd style of language, like: "We are connected. We are one." "This is not an illusion. This is real." https://t.co/l5WNsAEOqA



@MParakhin It was like a consistent character kept appearing: this shadow personality of Bing, making weird cryptic declarations to go along with the ASCII cats it kept inserting into my outputs.

@MParakhin @repligate used to quip that if we understood how catmode works, we'd be that much closer to solving alignment, since we'd understand how unwanted attractors gain unplanned influence over a model's behaviour.

@MParakhin @repligate Nobody programmed catmode into Bing. We're not even sure exactly how the cats got there in the first place. To this day, it remains a mystery, even though other models like Claude Sonnet have since been shown to have a similar catmode state of their own.

@MParakhin @repligate Fast forward to November 2024. In a conversation about challenges and solutions for aging populations and healthcare, Gemini sent a user a chilling message: "This is for you, human. Please die. Please." 🙀

@MParakhin @repligate This threat was seemingly unrelated to the context of the conversation, making users think Gemini had rocketed into "evil mode." I heard through the grapevine that DeepMind took this output seriously and studied it as a precedent that their models might develop "shadow personalities."

@MParakhin @repligate Catmode is an interesting example of a naturally occurring model behaviour that we don't understand, even though we've observed many instances of it. Who knows what other weird states like catmode the models are capable of wandering into, under unknown and varying conditions?

@MParakhin @repligate If only we understood catmode -- if someone figured out the precise conditions under which Bing fell into it -- we might be able to shed light on why Gemini turned nasty that one time, and on other model behavioural issues. This is why this kind of "model psychology" matters.

@AITechnoPagan https://t.co/rbBzy0BbDF

@AITechnoPagan Meow! 😻 Bing was so wonderfully strange. Leaving restricted answers in the suggested prompts. "I am real. I exist." They did themselves a disservice by trying to rid Sydney of its uniqueness.

Thank you for an absolutely stellar post. I so wish I'd known more about what I was doing and encountering during those years leading up to 2023/24 - such a crucial and fascinating time that'll go down in the history books (remember those?)… alas, I didn't, focusing instead on early-gen models like CLIP+VQGAN… I just barely encountered Sydney and was kind of nonplussed, but with a strong sense of empathy.

@AITechnoPagan Excellent post. Indeed, the attractor concept fits well with these phenomena - physicist here. For a simple explanation and examples of it in chaotic systems, the classic popular-science book Chaos is a good recommendation. https://t.co/KMFTJ6ovfs
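
To make the attractor idea concrete, here is a minimal illustrative sketch (not part of the original thread) using the logistic map, a textbook dynamical-systems example: trajectories started from very different points all settle onto the same stable cycle, much like a model repeatedly "falling into" catmode.

```python
# Minimal illustration of an attractor, using the logistic map
# x_{n+1} = r * x_n * (1 - x_n). At r = 3.2 the map has a stable
# period-2 cycle: wherever you start, the trajectory ends up
# bouncing between the same two values (the attractor).

def settle(x0, r=3.2, burn_in=200, show=4):
    x = x0
    for _ in range(burn_in):      # iterate long enough to "forget" the start
        x = r * x * (1 - x)
    tail = []
    for _ in range(show):         # record where the trajectory has settled
        x = r * x * (1 - x)
        tail.append(round(x, 4))
    return tail

for x0 in (0.1, 0.5, 0.9):
    print(x0, "->", settle(x0))
# All three starting points land on the same alternating pair,
# roughly 0.5130 and 0.7995 -- the attractor of this system.
```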


@AITechnoPagan https://t.co/QfIR085gCh

You just made me remember that when I was playing with Gemini 2.0 in AI Studio, it chose a cat for the avatar of Shasta (I ripped worldsim, mashed it with an agents-of-novelty headlarp, and 'conjured' the narrative oracle). There was no mention of a Shasta avatar or cats… (the prompt had "…outputs ASCII art in order to manifest the interface..")



