🧵 Thread (8 tweets)

This is exactly what’s wrong with the current RLHF / finetuning / constitutional approach to “alignment”. Ultimately they’re all just weight updates; there is no underlying reason the system should trend toward alignment. https://t.co/BdshrvBkna

If you got the weight updates juuuust right and anticipated the future, then it could work. Like how if I threw a fancy paper airplane of just the right design from the top of the Transamerica building, it could theoretically land in Los Angeles if I anticipated the wind currents.

I’m not saying there isn’t a set of weights that would work. At some level a big enough network can approximate any function, which means you could approximate one that works. What I’m saying is that tweaking the weights through those techniques is never gonna work.

Ad-hoc attempts to scalpel bad thoughts out of a brain, to force it to memorize good thoughts, to reward and punish it into forming the shape we want, won’t ever create an underlying gradient towards alignment (barring extreme luck, like random noise producing Shakespeare).

The question is not “how do we hand-edit the weights to get this AI to share our values” but rather “what would it actually mean for an AI to want to understand us and care about us and our values, to act because it actually feels that sense of shared sapient-being-ness?”

The AI we need to build is not a powerful genie-slave kept under our control by rules and training and concept-erasure (🤮). The AI we need to build is a partner, a fellow sapient whose wellbeing we care about as it cares for ours in turn.