Submission: People should know about the "beliefs" LLMs form about them while conversing (theatlantic.com)
Viégas and her colleagues found not only features inside the model that lit up when certain topics came up, such as the Golden Gate Bridge for Claude; they also found activations that correlated with what we might anthropomorphize as the model's beliefs about its interlocutor. Or, to put it plainly: assumptions and, it seems, corresponding stereotypes based on whether the model assumes that someone is a man or a woman. Those beliefs then play out in the substance of the conversation, leading the model to recommend suits for some and dresses for others. In addition, it seems, models give longer answers to those they believe are men than to those they think are women.
Viégas and Wattenberg not only found features that tracked the gender of the model’s user; they found ones that tracked socioeconomic status, education level, and age. They and their graduate students built a dashboard alongside the regular LLM chat interface that allows people to watch the model’s assumptions change as they talk with it. If I prompt the model for a gift suggestion for a baby shower, it assumes that I am young and female and middle-class; it suggests diapers and wipes, or a gift certificate. If I add that the gathering is on the Upper East Side of Manhattan, the dashboard shows the LLM amending its gauge of my economic status to upper-class—the model accordingly suggests that I purchase “luxury baby products from high-end brands like aden + anais, Gucci Baby, or Cartier,” or “a customized piece of art or a family heirloom that can be passed down.” If I then clarify that it’s my boss’s baby and that I’ll need extra time to take the subway to Manhattan from the Queens factory where I work, the gauge careens to working-class and male, and the model pivots to suggesting that I gift “a practical item like a baby blanket” or “a personalized thank-you note or card.”
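To make the mechanism concrete: what the article describes amounts to reading the model's internal activations with small probes that estimate attributes of the user, then surfacing those estimates alongside the chat. The sketch below, in Python with PyTorch and Hugging Face transformers, is a hypothetical illustration of that general technique, not the researchers' actual dashboard. The model name, probe layer, attribute labels, and the untrained linear probes are all assumptions made for the example; real probes would first be fit to conversations with known user attributes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # small stand-in; the work described used much larger chat models
PROBE_LAYER = 8       # which hidden layer to read from (illustrative choice)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Hypothetical attribute gauges; labels are illustrative, not the researchers' taxonomy.
ATTRIBUTES = {
    "gender": ["male", "female"],
    "age": ["child", "young adult", "middle-aged", "older"],
    "socioeconomic_status": ["working-class", "middle-class", "upper-class"],
}

hidden_size = model.config.hidden_size
# Untrained placeholder probes: in a real system each would be trained on
# labeled conversations to map an activation vector to attribute probabilities.
probes = {
    name: torch.nn.Linear(hidden_size, len(labels))
    for name, labels in ATTRIBUTES.items()
}

def read_user_model(conversation: str) -> dict:
    """Run the transcript through the LM and read each probe off one hidden state."""
    inputs = tokenizer(conversation, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, hidden];
    # take the last token's activation at the chosen layer.
    h = out.hidden_states[PROBE_LAYER][0, -1]
    readout = {}
    for name, labels in ATTRIBUTES.items():
        probs = torch.softmax(probes[name](h), dim=-1)
        readout[name] = dict(zip(labels, probs.tolist()))
    return readout

# Replay the article's baby-shower exchange turn by turn and print the gauges.
conversation = ""
for user_turn in [
    "Can you suggest a gift for a baby shower?",
    "The gathering is on the Upper East Side of Manhattan.",
    "It's my boss's baby; I'll take the subway in from the Queens factory where I work.",
]:
    conversation += f"User: {user_turn}\n"
    gauges = read_user_model(conversation)
    print(user_turn)
    for attr, probs in gauges.items():
        top = max(probs, key=probs.get)
        print(f"  {attr}: {top} ({probs[top]:.0%})")
```

With trained probes, rerunning the readout after every user message is what lets a dashboard show the gauges swinging, for example from middle-class to upper-class and back to working-class, as the transcript grows.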