The Peril of Unreliable AI
- Low-confidence predictions: The LLM might not actually be sure about its diagnosis, yet it delivers it with unwavering confidence. This can lead to unnecessary worry or even inappropriate treatment.
- Wrong explanations for right answers: Even if the LLM gets the diagnosis right by chance, its explanation might be entirely off base. This could make it difficult for healthcare providers to understand the root cause of the problem.
Introducing WellDunn: Building Trustworthy AI
- Attention matters: When an LLM analyzes text, it focuses on specific parts. WellDunn compares this attention to the factors a human expert would consider when making a diagnosis. If they don't match up, it's a red flag (a minimal sketch of this overlap check follows this list).
- Confidence counts: WellDunn also evaluates the LLM's confidence level in its predictions. A high confidence level with mismatched attention indicates the LLM might be using unreliable shortcuts.
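To make the attention check concrete, here is a minimal sketch of how such an overlap score could be computed. The function name, the token-level attention scores, and the expert-token set are illustrative assumptions, not the WellDunn implementation.

```python
# Illustrative sketch (not the WellDunn code): compare where a model "looks"
# against the words a human expert marked as diagnostic evidence.
from typing import List

def attention_overlap(tokens: List[str],
                      attention: List[float],
                      expert_tokens: set,
                      top_k: int = 5) -> float:
    """Fraction of the model's top-k attended tokens that the expert also flagged."""
    # Rank token positions by attention weight, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    top = [tokens[i].lower() for i in ranked[:top_k]]
    hits = sum(1 for t in top if t in expert_tokens)
    return hits / top_k

# Hypothetical post, attention weights, and expert annotation.
tokens = "I have not slept properly in weeks and my chest aches".split()
attention = [0.01, 0.02, 0.05, 0.30, 0.10, 0.02, 0.05, 0.03, 0.02, 0.25, 0.15]
expert_tokens = {"slept", "weeks", "chest", "aches"}   # physical-health evidence

print(f"Overlap with expert rationale: {attention_overlap(tokens, attention, expert_tokens):.2f}")
```

A low overlap score on a confidently predicted example is exactly the kind of red flag this check is meant to surface.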
Training with the Right Data
- MULTIWD: This dataset analyzes user-generated content related to mental health struggles and categorizes it based on six interconnected aspects of well-being, like physical and emotional health.
- WELLXPLAIN: This dataset provides human expert explanations alongside diagnoses, allowing researchers to see the thought process behind each label.
- Using these datasets, WellDunn can check whether LLMs are not only accurate but also attending to the right aspects of mental health; illustrative records are sketched below.
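For intuition, the records below sketch what entries in these two datasets might look like. The field names, the dimension names beyond those mentioned above, and the example text are assumptions for illustration, not the published schema.

```python
# Illustrative records only: field names and dimension names are assumptions,
# not the actual schema of MULTIWD or WELLXPLAIN.
multiwd_example = {
    "text": "I barely leave my room anymore and I can't focus at work.",
    # Multi-label: each of six wellness dimensions marked present (1) or absent (0).
    "labels": {
        "physical": 0,
        "emotional": 1,
        "social": 1,
        "intellectual": 0,
        "vocational": 1,
        "spiritual": 0,
    },
}

wellxplain_example = {
    "text": "I barely leave my room anymore and I can't focus at work.",
    "label": "social",
    # The expert's rationale: the text span that justifies the label.
    "explanation": "barely leave my room anymore",
}
```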
- Attention and Explainability Matter More Than Just Accuracy: While LLMs can achieve good accuracy in predicting mental health conditions, their explanations (attention patterns) often don't align with how human experts arrive at a diagnosis. This raises concerns about the models' reliability. The study introduces WellDunn, a framework that evaluates models on prediction accuracy, attention focus, and confidence.
- General vs. Domain-Specific Models: Not a Straightforward Choice. Surprisingly, domain-specific models designed for mental health tasks did not consistently outperform general-purpose models; in several cases the general-purpose models performed better.
- Retraining models with a "confidence-oriented" loss function improved confidence levels and attention focus, particularly in general-purpose models. This suggests the models become more selective about when they make a prediction (a sketch of such a loss appears after this list).
- Large LLMs like GPT-4 and GPT-3.5 underperformed on the WellDunn benchmarks, even with prompting techniques. This highlights the limitations of these models in tasks requiring a nuanced understanding of mental health concepts.
- The research team emphasizes the need to further explore prompting techniques and other strategies to improve LLM performance in mental health applications. Ensuring transparency and explainability through frameworks like WellDunn is crucial for building trust in AI for mental health, and collaboration between AI researchers and mental health experts is essential for developing safe and effective tools.
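The "confidence-oriented" retraining mentioned in the third bullet above can be illustrated with a margin-based objective. The sketch below uses a multi-label hinge loss in PyTorch as a stand-in for that idea; it is an assumption about the general approach, not the authors' actual training code.

```python
import torch
import torch.nn as nn

class MultiLabelHingeLoss(nn.Module):
    """Margin-based stand-in for a confidence-oriented loss (illustrative only)."""

    def __init__(self, margin: float = 1.0):
        super().__init__()
        self.margin = margin

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # Map {0, 1} wellness labels to {-1, +1} so a hinge loss applies per dimension.
        signs = targets * 2.0 - 1.0
        # The loss reaches zero only when each logit clears the margin on the correct
        # side, pushing the model toward confident, well-separated scores rather than
        # borderline ones.
        return torch.clamp(self.margin - signs * logits, min=0.0).mean()

# Hypothetical usage with a six-dimension wellness classifier head.
loss_fn = MultiLabelHingeLoss()
logits = torch.randn(4, 6, requires_grad=True)   # batch of 4 posts, 6 dimensions
targets = torch.randint(0, 2, (4, 6)).float()    # gold multi-labels
loss = loss_fn(logits, targets)
loss.backward()
```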
References
- Mohammadi, S., Raff, E., Malekar, J., Palit, V., Ferraro, F., & Gaur, M. (2024). WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions.