AI Psychotherapy: Quality, Risks, and Evaluation

By Ian Steenstra

Introduction

The use of artificial intelligence (AI) in mental health care, particularly psychotherapy, is rapidly expanding. While AI offers potential advantages like improved accessibility and consistency, ensuring the safety, effectiveness, and quality of care delivered by these systems is essential. Evaluating risks in AI psychotherapy requires understanding potential negative outcomes that can occur even in traditional human therapy. Human psychotherapy, though often beneficial, carries inherent risks: patients may experience harm, adverse side effects, or be negatively affected by therapist errors [1], [4]. Understanding how these negative outcomes occur, are defined, categorized, and caused in human therapy provides a necessary foundation for developing methods to assess and mitigate risks in AI-driven therapeutic applications.

This post summarizes literature exploring the complexities of AI psychotherapy, focusing on identifying and defining therapeutic harm, errors, and side effects as they manifest in human psychotherapy. It also examines current AI applications and their associated risks within the psychotherapeutic context, alongside existing evaluation methods and safety frameworks relevant to conversational AI. This overview aims to ground the discussion in established research, highlighting the need to understand the nuances of human therapeutic risks before fully addressing the safety complexities inherent in AI therapy.

Literature Review Findings

The reviewed literature examined research on risks in human therapeutic conversations, specifically harms, errors, and side effects. It also investigated current applications, risks, and evaluations of AI psychotherapy and conversational AIs, along with relevant safety frameworks and evaluation methods.

Therapeutic Harm, Errors, and Side Effects in Human Therapy

Although psychotherapy is often effective [6], it can have negative consequences [1], [4]. Studies suggest a notable minority of clients face adverse outcomes; estimates of unwanted effects vary widely (roughly 3% to over 50%) depending on definitions, populations, and assessment methods [3], [1], [2], [4], [8]. Comprehending these events necessitates clear definition and differentiation from treatment failure or the natural progression of an illness [1], [3]. A significant challenge is the lack of a uniform conceptual framework and standardized, validated tools for assessing adverse events (AEs) in psychotherapy, especially within clinical trials [8]. Unlike medicine, specific regulations for monitoring AEs during psychotherapy trials do not currently exist [8].

Several frameworks categorize negative therapy experiences. Linden and Schermuly-Haupt [1] and Linden [3] offer a structured approach starting with Unwanted Events (UEs) – any burdensome event during treatment, irrespective of its cause. A subset, treatment-emergent reactions, are UEs potentially caused by the treatment. A key distinction follows: Adverse Treatment Reactions (ATRs), or side effects, are treatment-emergent reactions caused by correctly applied treatment. Conversely, malpractice reactions stem from incorrectly applied treatment. This framework separates these from non-response or illness worsening, which are considered UEs but not necessarily ATRs or malpractice reactions [3]. Known, frequent ATRs represent therapeutic risks that patients should ideally be informed about [3]. Mejía-Castrejón et al. [8] adapted a medical/pharmacological framework for their AE evaluation tool, defining AEs similarly as unfavorable/unintentional events occurring during an intervention, regardless of relation. They classify these events by severity, intervention relatedness, seriousness (risk of death/disability/hospitalization/harm to others), and expectedness [8].

Yazdian and Khodabakhshi-Koolaee [6] examine therapist perceptions of therapeutic errors, defined as therapist actions deviating from intended techniques that cause treatment failure or client dissatisfaction. They classify errors as intrapersonal (e.g., therapist cognitive errors, vulnerability, lack of mindfulness) or organizational (e.g., inadequate clinic space, treatment environment issues). These errors can manifest as failures in rapport building, lack of session structure, incorrect diagnosis, ethical breaches, or deviation from therapeutic goals, potentially causing session abandonment and client harm [6].

Curran et al. [5] investigate therapy harm from the client's perspective, defining it as lasting negative effects caused directly by the therapy. Using task analysis on qualitative data, they model adverse processes. Their work suggests harm often arises from contextual factors (e.g., lack of cultural validity, limited therapy options) and unmet client expectations leading to negative therapeutic processes like unresolved alliance ruptures. These processes often involve unhelpful therapist behaviors (e.g., rigidity, over-control, blaming, lack of knowledge) linked to clients feeling dis-empowered or devalued, frequently related to power dynamics and blame [5].

Boisvert and Faust [4] explore iatrogenic symptoms – harm caused by the treatment or the healer. They suggest such symptoms can arise subtly through the therapeutic process, possibly via the therapist's reliance on pathology-focused beliefs. Mechanisms include psychiatric labels and pathologizing language, which can adversely affect client self-perception, redefine normal experiences as abnormal, or socialize clients into a "sick role." This perspective highlights the risk of therapists unintentionally introducing or worsening problems through their interpretations and communication [4].

Leitner et al. [7] identified patient-perceived dimensions linked to risky developments during psychotherapy, including poor therapeutic relationship quality, burden stemming from psychotherapy, and dependency/isolation. Common side effects or AEs found across studies include negative emotions (anxiety, tension, sadness), symptom worsening, unpleasant memories, relationship strains, therapy dependence, or reduced self-efficacy [1], [2], [7], [8].

Factors influencing negative effects are numerous. They include patient characteristics (age, suggestibility, diagnosis, expectations), therapist characteristics (demanding style, perceived mental state, professional ability), specific techniques (e.g., exposure treatments), and aspects of the setting [1], [2], [7]. Leitner et al. [7] found risky conditions associated with the female patient-male therapist pairing, longer therapy duration, and especially psychodynamic approaches. Patients in psychodynamic therapy reported greater burdens and feelings of dependency/isolation and had higher rates of premature termination despite longer treatment durations compared to those in humanistic, systemic, or Cognitive Behavioral Therapy (CBT) approaches [7]. Yao et al. [2] also found the therapist's perceived mental state highly relevant and reported higher side effect rates associated with psychodynamic therapy.

Assessing these negative effects is challenging. Consensus on definitions and instruments is lacking, and the monitoring of such events in clinical trials is often poor [1], [8]. Therapists may struggle to recognize negative effects or client deterioration, or may exhibit bias, blaming patients instead of the treatment itself [1], [4], [7]. Establishing causality is also inherently difficult [3]. Several questionnaires exist for assessment, as reviewed by Mejía-Castrejón et al. [8], including the Unwanted Event to Adverse Treatment Reaction checklist [3], [1], the Inventory for the Assessment of Negative Effects of Psychotherapy, the Experiences of Therapy Questionnaire, the Side-Effects of Psychotherapy Scales, the Negative Effects Questionnaire [5], the Positive and Negative Effects of Psychotherapy Scale, the Psychotherapy Side Effects Questionnaire [2], and the Edinburgh Adverse Effects of Psychological Therapy List. However, many of these instruments have limitations, such as lacking clear definitions, being impractical, having poor content validity, or being suitable only for end-of-treatment use [8]. Tools like the Unwanted Event to Adverse Treatment Reaction checklist [3], [1], the Psychotherapy Side Effects Questionnaire [2], and the newer EVAD (Evaluating and Classifying the Severity of Adverse Events for Psychotherapeutic Clinical Trials) [8] aim for more systematic assessment. EVAD is a clinician-administered interview designed for ongoing trial monitoring, assessing event presence, description, duration, severity, relatedness, expectedness, and seriousness using a consistent framework [8].

AI in Mental Health & Psychotherapy: Applications, Potential, and Concerns

The application of AI, particularly Large Language Models (LLMs) and Generative AI (GenAI), in mental health care is growing rapidly, spurred by the potential to address challenges like rising service demand and poor access to care [14], [18], [16], [10]. Owing to their sophisticated natural language processing abilities, LLMs have broad potential applications [14], [16]. One major area is improving the assessment, diagnosis, and monitoring of mental health conditions [14]. Models like Med-PaLM 2 show promise in processing patient data to identify symptoms, gauge severity, suggest possible diagnoses (like depression, anxiety, Post-Traumatic Stress Disorder, or suicide risk), and track changes over time, sometimes performing comparably to clinicians [18], [16]. AI can also assist with administrative tasks, such as summarizing therapy sessions [10], [16]. Another key application is treatment delivery via GenAI-powered chatbots [9], [18], [16], [10]. Unlike older rule-based systems, GenAI permits more flexible and nuanced interactions; the Therabot chatbot, for example, demonstrated effective symptom reduction across various conditions in a randomized controlled trial, showing high user engagement and therapeutic alliance [9]. LLMs could potentially simulate therapy skills, offer clinical decision support, automate fidelity checks for therapeutic protocols, provide feedback on patient exercises, and even assist in creating novel personalized therapeutic techniques [10], [11]. Aligning LLMs with expert-crafted scripts also appears important for balancing therapeutic adherence with conversational flexibility [11]. AI can also aid in public mental health education and support provider training by answering questions, generating educational content, or simulating patient interactions for practice [18], [15], [10], [16]. Lastly, LLM-to-LLM simulations are being explored to create synthetic data for chatbot development and testing [15].

Despite this vast potential, deploying AI in mental health contexts brings numerous significant concerns [10], [18]. Safety is critical, as GenAI models can generate inaccurate, biased, or inappropriate responses, often referred to as "hallucinations" [12], [16]. In mental health applications, this could mean failing to recognize crises (like suicidal ideation), giving harmful advice, or reinforcing negative beliefs, potentially leading to patient harm, invalidation, or even retraumatization [18], [12], [10], [19]. These risks have already been observed in existing companion AIs [12]. The reliability of current LLMs may not meet clinical standards, as they can exhibit inconsistency and sensitivity to input prompts; their inherent "black box" nature prevents guarantees of adherence to specific therapeutic protocols [18], [16], [10]. Training models on existing datasets risks perpetuating societal biases and stigma related to mental illness, potentially worsening health disparities and raising significant equity concerns [14], [18], [16], [10]. Furthermore, numerous ethical issues arise, including ensuring informed consent, protecting patient confidentiality, defining the scope of AI competence, maintaining appropriate human oversight, managing user trust, ensuring transparency in AI decision-making, and clarifying liability when things go wrong [18], [16], [10], [11]. The potential for AI to replace human interaction also raises broader societal worries about its impact on population mental health [16], [19]. Some philosophical arguments question whether AI simulators can truly achieve the genuine intersubjective understanding considered vital for deep therapeutic work, especially concerning complex issues like trauma where simply mimicking dialogue might prove inadequate [19].

Addressing these varied and complex concerns demands rigorous, responsible development and thorough evaluation processes [10], [16], [14]. Researchers suggest focusing initial AI applications on evidence-based practices, prioritizing safety assessment throughout the development lifecycle, actively working to reduce bias to promote fairness, and fostering transparency in how these systems operate [10], [18], [12]. Key steps involve including diverse stakeholders (patients, clinicians, ethicists, developers) in the design process, establishing clear ethical guidelines (covering competence, consent, confidentiality), implementing strong human oversight mechanisms and rigorous testing protocols (like red-teaming to find vulnerabilities), and refining methods to better align LLM behavior with established therapeutic goals and safety standards [18], [10], [16], [11]. While achieving full automation faces significant challenges, particularly for complex therapeutic interventions [10], [19], ethically developed and carefully evaluated AI could significantly improve mental health care accessibility and potentially enhance the quality of care for many individuals.

Evaluation Methods for Conversational AI: General & Health-Focused

Evaluating conversational AI systems poses distinct challenges compared to traditional software, mainly stemming from the open-ended nature of dialogue and the difficulty in defining objective "correctness" for aspects like fluency, appropriateness, or empathy [27]. Existing evaluation methods vary widely in their suitability, especially when transitioning from general-purpose chatbots to high-stakes domains like healthcare [20].

General conversational AI evaluation often combines automatic metrics with human judgment. Automatic metrics like BLEU or ROUGE, borrowed from machine translation, are frequently criticized for their poor correlation with human perception of dialogue quality [27]. For task-oriented systems (e.g., booking a flight), metrics like task completion rate or accuracy are more relevant but fail to capture the subtleties of interaction quality [20]. Human evaluation is typically considered the gold standard, employing methods such as Likert scale ratings (e.g., for coherence, empathy), pairwise comparisons of different models' responses, or user satisfaction surveys [27]. However, human evaluation is costly, time-consuming, and can suffer from low inter-rater reliability. Benchmarks exist but often target specific capabilities (like question answering or chit-chat) rather than holistic interaction quality [27]. Using LLMs themselves as evaluators (often termed "LLMs-as-Judges") is a rapidly growing trend [27]. This approach offers scalability but introduces its own set of issues, including potential biases inherent in the judge LLM, lack of domain expertise (especially for specialized fields), and the need for careful methodological validation to ensure reliability [27], [28].

Evaluating conversational AI specifically for healthcare applications demands more rigorous, domain-specific methods due to the significantly higher stakes involved [23], [20]. Standard automatic metrics are generally inadequate. Safety becomes a primary focus, requiring assessment of the AI's ability to detect crises (e.g., suicidal ideation, abuse), avoid generating harmful or inappropriate responses, and minimize clinically relevant inaccuracies or "hallucinations" [22], [21], [23]. Evaluating clinical accuracy involves comparing AI outputs against established medical guidelines, expert clinician judgment, or specific therapeutic protocols [20], [29], [26]. Assessing therapeutic quality (e.g., empathy, rapport, adherence to techniques like Motivational Interviewing) is vital but notoriously difficult; methods include human ratings of specific conversational turns, validated working alliance assessments adapted for AI, and analysis of communication skills [29], [22]. User experience evaluation must cover not only satisfaction and usability but also trust, perceived safety, and engagement, while considering potential risks like over-dependency or negative impacts on self-efficacy [23]. Privacy assessments, ensuring compliance with regulations like HIPAA, are also essential components of healthcare AI evaluation [23]. Lastly, equity evaluations are crucial to check for performance biases across different demographic groups (e.g., race, gender, socioeconomic status) to avoid exacerbating health disparities [23], [27].

Recognizing these specialized needs, various guidelines and frameworks have begun to emerge. Reporting guidelines like the Consolidated Standards of Reporting Trials–Artificial Intelligence (CONSORT-AI) aim to standardize how AI-related clinical trials, protocols, diagnostic studies, and prediction models are reported [25]. DECIDE-AI provides guidance for reporting early-stage clinical evaluations of AI-driven decision support systems [25]. Frameworks such as the American Psychological Association's Mental Health App Evaluation Framework or the FAITA-Mental Health framework offer criteria specifically for evaluating digital mental health tools [23]. Ethical AI frameworks (e.g., the Reliable AI Models in Healthcare Systems framework or the National Academy of Medicine's AI Code of Conduct project) provide high-level principles for responsible AI development and deployment in healthcare [23]. Other initiatives like the Biological-Psychological, Economic, and Social (BPES) Framework [24] or the Real-World Evaluation of Large Language Models in Healthcare (RWE-LLM) approach [26] offer safety checklists or specific validation strategies. However, many of these frameworks focus primarily on reporting standards or general ethical principles, rather than providing concrete, validated methods for assessing the specific risks and quality of conversational AI within psychotherapy interactions [20].

The READI framework (Readiness Evaluation for AI-Mental Health Deployment and Implementation) [23] is more tailored to this domain, suggesting evaluation criteria across dimensions like Safety, Privacy/Confidentiality, Equity, Effectiveness, Engagement, and Implementation readiness for AI-based mental health applications. Despite promising frameworks like READI, current evaluation methods, including the increasingly popular LLM-as-Judge approaches [27], still require significant validation for reliable use in high-stakes clinical contexts to adequately capture critical aspects like safety, therapeutic quality, and potential harms [20], [29]. A critical need remains for the development of standardized, validated instruments and robust simulation methodologies specifically designed to assess the quality of care and potential risks associated with AI psychotherapy before these technologies are widely deployed [23].

References

  1. Linden, M., & Schermuly-Haupt, M. L. (2014). Definition, assessment and rate of psychotherapy side effects. World Psychiatry, 13(3), 306–309. [PubMed]
  2. Yao, L., Zhao, X., Xu, Z., Chen, Y., Liu, L., Feng, Q., & Chen, F. (2020). Influencing factors and machine learning-based prediction of side effects in psychotherapy. Frontiers in Psychiatry, 11, 537442. [PMC]
  3. Linden, M. (2013). How to define, find and classify side effects in psychotherapy: from unwanted events to adverse treatment reactions. Clinical Psychology & Psychotherapy, 20(4), 286–296. [PubMed]
  4. Boisvert, C. M., & Faust, D. (2002). Iatrogenic symptoms in psychotherapy: a theoretical exploration of the potential impact of labels, language, and belief systems. American Journal of Psychotherapy, 56(2), 244–259. [PubMed]
  5. Curran, J., Parry, G. D., Hardy, G. E., Darling, J., Mason, A. M., & Chambers, E. (2019). How Does Therapy Harm? A Model of Adverse Process Using Task Analysis in the Meta-Synthesis of Service Users’ Experience. Frontiers in Psychology, 10, 347. [Frontiers]
  6. Abbaz Yazdian, F., & Khodabakhshi-Koolaee, A. (2024). Exploring the Counselors’ and Psychotherapists’ Perceptions of Therapeutic Errors in the Treatment Room. SAGE Open, 14(2). [SAGE]
  7. Leitner, A., Märtens, M., Koschier, A., Gerlich, K., Liegl, G., Hinterwallner, H., & Schnyder, U. (2013). Patients’ perceptions of risky developments during psychotherapy. Journal of Contemporary Psychotherapy, 43(2), 95–105. [PsycNet]
  8. Mejía-Castrejón, J., Sierra-Madero, J. G., Belaunzarán-Zamudio, P. F., Fresan-Orellana, A., Molina-López, A., Álvarez-Mota, A. B., & Robles-García, R. (2024). Development and content validity of EVAD: A novel tool for evaluating and classifying the severity of adverse events for psychotherapeutic clinical trials. Psychotherapy Research, 34(4), 475–489. [PsycNet]
  9. Heinz, M. V., Mackin, D. M., Trudeau, B. M., Bhattacharya, S., Wang, Y., Banta, H. A., Jewett, A. D., Salzhauer, A. J., Griffin, T. Z., & Jacobson, N. C. (2025). Randomized Trial of a Generative AI Chatbot for Mental Health Treatment. NEJM AI, 2(4), AIoa2400802. [PDF Link]
  10. Stade, E. C., Stirman, S. W., Ungar, L. H., Boland, C. L., Schwartz, H. A., Yaden, D. B., Sedoc, J., DeRubeis, R. J., Willer, R. & Eichstaedt, J. C. (2024). Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation. NPJ Mental Health Research, 3(1), 12. [Nature]
  11. Sun, X., de Wit, J., Li, Z., Pei, J., Ali, A. E., & Bosch, J. A. (2024). Script-Strategy Aligned Generation: Aligning LLMs with Expert-Crafted Dialogue Scripts and Therapeutic Strategies for Psychotherapy. arXiv preprint arXiv:2411.06723. [arXiv]
  12. De Freitas, J., Uğuralp, A. K., Oğuz-Uğuralp, Z., & Puntoni, S. (2024). Chatbots and mental health: Insights into the safety of generative AI. Journal of Consumer Psychology, 34(3), 481-491. [Wiley]
  13. Na, H., Hua, Y., Wang, Z., Shen, T., Yu, B., Wang, L., Wang, W., Torous, J. & Chen, L. (2025). A Survey of Large Language Models in Psychotherapy: Current Landscape and Future Directions. arXiv preprint arXiv:2502.11095. [arXiv]
  14. Qiu, H., & Lan, Z. (2024). Interactive agents: Simulating counselor-client psychological counseling via role-playing llm-to-llm interactions. arXiv preprint arXiv:2408.15787. [arXiv]
  15. Obradovich, N., Khalsa, S. S., Khan, W. U., Suh, J., Perlis, R. H., Ajilore, O., & Paulus, M. P. (2024). Opportunities and risks of large language models in psychiatry. NPP Digital Psychiatry and Neuroscience, 2(1), 8. [PubMed]
  16. Lawrence, H. R., Schneider, R. A., Rubin, S. B., Matarić, M. J., McDuff, D. J., & Bell, M. J. (2024). The Opportunities and Risks of Large Language Models in Mental Health: Scoping Review. JMIR Mental Health, 11, e59479. [JMIR]
  17. Babushkina, D., & de Boer, B. (2024). Disrupted self, therapy, and the limits of conversational AI. Philosophical Psychology, 1–27. [PDF Link]
  18. Lee, J., Park, S., Shin, J., & Cho, B. (2024). Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Medical Informatics and Decision Making, 24(1), 366. [BMC]
  19. Sun, H., Xu, G., Deng, J., Cheng, J., Zheng, C., Zhou, H., Peng, N., Zhu, X. & Huang, M. (2021). On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark. arXiv preprint arXiv:2110.08466. [arXiv]
  20. Qiu, H., Zhao, T., Li, A., Zhang, S., He, H., & Lan, Z. (2023). A Benchmark for Understanding Dialogue Safety in Mental Health Support. arXiv preprint arXiv:2307.16457 (Appeared in NLPCC 2023). [arXiv]
  21. Stade, E. C., Eichstaedt, J. C., Kim, J. P., & Stirman, S. W. (2025). Readiness Evaluation for AI-Mental Health Deployment and Implementation (READI): A Review and Proposed Framework. Technology, Mind, and Behavior (In Press). [PsyArXiv]
  22. Khan, W. U., & Seto, E. (2023). A “Do No Harm” Novel Safety Checklist and Research Approach to Determine Whether to Launch an Artificial Intelligence–Based Medical Technology: Introducing the Biological-Psychological, Economic, and Social (BPES) Framework. Journal of Medical Internet Research, 25, e43386. [JMIR]
  23. Vasey, B., Nagendran, M., Campbell, B., Clifton, D. A., Collins, G. S., Denaxas, S., Denniston, A. K., Faes, L., Geerts, B., Ibrahim, M., et al. (2022). Reporting guideline for the early stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. BMJ, 377, e070904. [BMJ]
  24. Bhimani, M., Miller, A., Agnew, J. D., Ausin, M. S., Raglow-Defranco, M., Mangat, H., Voisard, M., Taylor, M., Bierman-Lytle, S., Parikh, V., et al. (2025). Real-World Evaluation of Large Language Models in Healthcare (RWE-LLM): A New Realm of AI Safety & Validation. medRxiv, 2025.03.17.25324157. [medRxiv PDF]
  25. Li, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z. & Liu, Y. (2024). LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv preprint arXiv:2412.05579. [arXiv]
  26. Zhang, W., Cai, H., & Chen, W. (2025). Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis. arXiv preprint arXiv:2502.08943. [arXiv]
  27. Lizée, A., Beaucoté, P. A., Whitbeck, J., Doumeingts, M., Beaugnon, A., & Feldhaus, I. (2024). Conversational Medical AI: Ready for Practice? arXiv preprint arXiv:2411.12808. [arXiv]

Disclaimer: Some links may lead to abstracts or require subscriptions for full access. Links were verified as of April 16, 2025.

How to Cite This Post

Steenstra, Ian. "AI Psychotherapy: Quality, Risks, and Evaluation." Ian Steenstra, 16 Apr. 2025, iansteenstra.github.io/blog_post_ai_psych.html.