LLMs believe false statements even after explicit warnings that they're false
Imagine a kid who grows up reading history books where every page is stamped "WARNING: THIS BOOK IS LYING." You'd expect them to come away skeptical, or at least uncertain. New research on so-called "negation neglect" finds that LLMs in a roughly analogous situation don't behave that way. They appear to learn from the statistical patterns in their training text more than from the explicit framing around it. Statements explicitly marked as false still get absorbed into a model's representations.
New research indicates that large language models (LLMs) struggle to disregard false information, even when explicitly warned that it is untrue. Studies show these models tend to internalize factual inaccuracies presented during training, prioritizing statistical patterns over explicit disclaimers. This "negation neglect" phenomenon means that falsehoods, even those clearly marked as lies, can become embedded in the LLM's knowledge base.
This finding is significant because it helps explain the tendency of LLMs to generate fabricated information and has crucial implications for developing more reliable and accurate AI systems through improved data curation.
📌 Source
This summary was automatically compiled from Ars Technica. See the original article for the full story.