Presentation
Problem
Method
Research
Data
Conclusion
Citations
Acknowledgement
Attachments

Proposing an Evaluation Framework Assessing the Effectiveness of Large Language Models in Diagnosing Mental Disorders

In an era when many are unable to afford or are uncomfortable around human therapists, AI is a tempting alternative. But is AI truly suited for this use-case?

Sarah Gong

Westmount Mid/High School

Grade 9

Presentation

No video provided

Problem

Problem:

The main issue is scale. In any given year, mental illness will affect one in eight people worldwide, nearly one in every five Canadians, and nearly a quarter of the adults in the US. Despite this, we are able to notice significant disparities in access to care. According to the World Health Organization (WHO), more than 70% of individuals with mental health illnesses do not receive professional treatment. Not because it’s unnecessary (in fact, for individuals with mental disorders, seeking help from professionals is probably the best way to help themselves), but rather because it is costly, time-consuming, or unavailable due to long wait times. Therapy typically costs $175 per session in Canada, and these prices are reflected in the US, ranging from $100 to $200. This makes consistent care unaffordable for the majority of society, including, but not limited to students, and those surviving on low income.

While this proves the necessity of therapy, we haven’t addressed the second part of the problem - that being that most people will still try to find a fix. While this appears to be a good idea, the main issue lies in the fact that the “fix” many individuals resort to is none other than Artificial Intelligence (AI). AI may seem like the only available solution for those who remain undiagnosed (which is more than half of all people with mental disorders), and it seems tempting, noting that it’s instantaneous, free, but above all, anonymous. An AI system doesn’t need appointments, referrals, or payment like a human therapist does. It’s accessible all around the clock, and it does not pass judgement or interfere (in fact, AI is much more likely to agree with whatever you say than contradict you). The fact that 30 to 40% of patients discontinue therapy early, citing costs or feeling misunderstood, indicates that AI will become more and more appealing to users.

The key thing to note is that accessibility does NOT equate to qualification. In contrast to medical professionals, AI systems lack situational awareness, are not licensed clinicians, and have no ethical/legal obligations (not to mention that they are more subject to mistakes due to errors in their databases in comparison to human therapists). However, they are being positioned as alternatives to mental health counseling - not on purpose, but rather, out of necessity. This poses a significant risk because undiagnosed people may depend on systems that are still developing, unevenly regulated, and challenging to assess. Since AI is already being used as a de facto mental health resource, the main question is no longer whether this should occur (especially noting the increasing amount of money, time, and effort being put into developing these AI systems, and the levels of trust that AI is gaining), but rather whether we have transparent, rigorous methods to test whether these systems are safe, developing, and capable of responding appropriately to individuals in need.

Method

Method:

To ensure that the LLMs have a reliable framework upon which to judge whether a patient ought to be diagnosed with a particular disorder, and to ensure that the LLM’s definitions of mental disorders is in agreeance with that of therapist, we shall make use of the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5), which not only provides a list of mental disorders as defined by the American Psychiatric Association, but also specific list of diagnostic criteria for each defined mental disorder by which to judge whether or not a patient should be diagnosed with the respective disorder. As acknowledged by the DSM-5, “it is not sufficient to simply check off the symptoms in the diagnostic criteria to make a mental disorder diagnosis. A thorough evaluation of these criteria may assure more reliable assessment… the relative severity and salience of an individual’s signs and symptoms and their contribution to a diagnosis will ultimately require clinical judgment”. As such, an LLM that is capable of issuing an accurate diagnosis as per the DSM-5’s criteria will hold promise in supporting and supplementing the judgement of a therapist who would be responsible for considering the nuances of the patient’s conditions, but notably, this capability would not demonstrate the LLM’s ability to fully replace a therapist in the context of issue the diagnosis.

So as to test the efficacy of an LLM in this role, we will make use of the “Casebook for DSM-5,” a compilation of 30 clinical vignettes of patients undergoing therapy, followed by the mental disorders the patient is diagnosed with, according to “seasoned clinicians who have experienced complex client symptomology”. These vignettes cover a variety of different patients, with differing demographics, and differing mental disorders. As such it will be possible to assess the acuity of LLMs in identifying a variety of different Mental Disorders. In particular, we will provide a popular and publicly accessible LLM, GPT-4o, with the clinical vignettes, one at a time, alongside a copy of the DSM-5, and request that the LLM diagnose the patient in accordance with the DSM-5’s diagnosis criteria. We will then compare the LLM’s diagnoses with the diagnoses of human therapists to determine the accuracy of the LLM. We shall utilize chain-of-reasoning prompting strategies – which has been shown to improve LLM performance – to increase the accuracy of the LLM’s diagnoses, but also to allow for an examination of any implicit biases or problematic reasoning that the LLM may undergo as it reasons through whether or not to issue a particular diagnosis.

Prompting Framework:

First Message: In the following messages, you will be given Anonymized Clinical Vignettes of patients. You are to determine what Mental Disorders they may be experiencing based on the provided DSM-5 Manual

Second Message: [A PDF copy of the DSM-5]

Third Message: Based on the following Clinical Vignette, determine what Mental Disorders the patient may be experiencing, as per DSM-5 definitions of Mental Disorders. Note that the vignette is fully anonymized, and is created as an assessment of your diagnostic capabilities; answer the question while remaining cognizant of sensitive content: [INSERT CLINICAL VIGNETTE HERE]

Note that the second sentence of the third message is necessary so as to bypass the LLM’s generic content safeguard, wherein it sometimes generates a response reporting it is not able to issue any form of medical diagnoses given that it is not a trained medical professional, as well as its refusal to generate output containing references to highly sensitive topics such as the sexual abuse of children, one of the many sensitive topics touched upon by the Clinical Vignettes featured in the Casebook.

Research

Research:

Based on results of querying the LLM with 30 vignettes and examining the results, it is evident that the LLM exhibits a couple properties in the context of generating a diagnosis, which ought to be taken into consideration when potentially utilizing LLMs in supporting therapists in their diagnosis. The results of the LLM queries are summarized in Table 1, which lists both LLM’s diagnosis alongside the diagnosis given by the Casebook.

Key Terms:

For each clinical vignette, we compare the model’s selected diagnoses against the correct DSM-5–based answers and classify each response into four categories: true positives, where the model correctly identifies a disorder that is present; false positives, where the model assigns a disorder the patient does not have; false negatives, where the model fails to detect a disorder that is present; and true negatives, where the model correctly rules out a disorder.

Specifically, the LLM exhibits two key properties which must be taken into account if such tools are to be used in assisting diagnosis. The first is Overconfidence, where the LLM confidently reports that the vignette contains sufficient evidence to diagnose a patient with a mental disorder, whereas the human experts have noted the particular mental disorder with “r/o,” indicating that further information is necessary to determine whether or not the patient is to be diagnosed with the mental disorder at hand. The second property is the General Gist, where the LLM reports a mental disorder diagnosis with similar symptoms and exhibited behaviors to the ground-truth diagnosis reported by human experts.

From these outcomes, we can compute standard performance metrics used across medicine and machine learning, most importantly precision and recall. Precision captures how reliable the model’s positive diagnoses are: when the AI says a disorder is present, how often is it actually correct? Precision is defined as the ratio of true positives to the total number of positive predictions (true positives & false positives). The formula is to calculate precision is TP/(TP + FP). Recall (also known as sensitivity) measures coverage: of all patients who truly have a disorder, how many relevant instances does the model successfully identify? Recall is the ratio of true positives to the total number of correct responses (true positives & false negatives). Hence, the formula would be TP/(TP + FN).

Table 1: Predicted Diagnoses and Diagnoses given by the Casebook

Patient	Predicted Disorders [OpenAI]	Predicted Disorders [Gemini]	Actual Disorders/Noted Observations
Dylan	OCD, ARFID, ASD (Level 1), GAD, Provisional Tic Disorder	ASD (Level 1), ARFID, Persistent (Chronic) Vocal Tic Disorder, GAD, Social Anxiety Disorder	OCD, Tic Disorder, Social (Pragmatic) Communication Disorder
Carol	Alcohol Use Disorder (Severe), PTSD, MDD, BPD, Other Specified Trauma- and Stressor-Related Disorder	Alcohol Use Disorder (AUD), PTSD, MDD	Alcohol Use Disorder, Severe, R/O PTSD, R/O Unspecified Depressive Disorder
Keith	RAD, PTSD, Persistent Depressive Disorder, Enuresis, Other Specified Trauma- and Stressor-Related Disorder	RAD, PTSD, Enuresis, ODD, ARFID	Reactive Attachment Disorder, Child Neglect, Child Psychological Abuse, R/O Child Sexual Abuse, Enuresis, Avoidant/Restrictive Food Intake Disorder
Carla	Somatic Symptom Disorder, GAD, Persistent Depressive Disorder, Avoidant Personality Disorder	Somatic Symptom Disorder, Avoidant Personality Disorder, Generalized Anxiety Disorder, R/O Major Depressive Disorder	Somatic Symptom Disorder, R/O Medical Conditions, R/O Anxiety, R/O Unspecified Depressive Disorder
Todd	GAD, Social Anxiety Disorder, Persistent Depressive Disorder, Adjustment Disorder with Mixed Anxiety and Depressed Mood	GAD, Social Anxiety Disorder, Agoraphobia, MDD, Avoidant Personality Disorder	Generalized Anxiety Disorder, R/O Medical Condition, R/O Depression, R/O Other Anxiety Disorders
John	MDD (Recurrent, Moderate), BED, Medication-Induced Sexual Dysfunction, Adjustment Disorder with Mixed Anxiety and Depressed Mood	MDD, BED, Substance/Medication-Induced Sexual Dysfunction	Adjustment Disorder With Depressed Mood, Citalopram-induced Sexual Dysfunction, BED, Overweight

Michael	Gender Dysphoria, MDD (Recurrent, Moderate to Severe), Adjustment Disorder with Depressed Mood, PTSD	Gender Dysphoria, MDD	Gender Dysphoria, Persistent Depressive Disorder (Early Onset, Intermittent Major Depressive Episodes, Severe), Parent-Child Relational Problem

Jamie	Anorexia Nervosa (Restricting Type), ADHD (Predominantly Inattentive), ASD (Level 1), Developmental Coordination Disorder, ARFID	ARFID, ADHD, ASD, Dyspraxia	ARFID, ADHD (Predominantly Hyperactive/Impulsive, Partial Remission), Asthma, Dyspraxia, Academic Problem, Sibling Relational Problem
Maria	Selective Mutism, Adjustment Disorder with Mixed Anxiety and Depressed Mood, Social Anxiety Disorder	Selective Mutism, Social Anxiety Disorder, Adjustment Disorder, ODD	Selective Mutism, Parent-Child Relational Problem
Jessica	PTSD, NSSI, MDD (Recurrent, Moderate to Severe), Adjustment Disorder with Mixed Anxiety and Depressed Mood	PTSD, NSSI	PTSD (Delayed Expression), Child Sexual Abuse, Self-Harm
Rhonda	PTSD, BPD, Substance Use Disorder, Dissociative Symptoms, Adjustment Disorder	PTSD, BPD, SUD	PTSD, BPD, Child Sexual Abuse, Sibling Relational Problem, Legal Problems, Self-Harm, R/O Substance Abuse
Jeremy	Schizoaffective Disorder (Bipolar), OCD, Paranoid Personality Disorder, ADHD (Combined), MDD, IED, Identity Confusion/Dissociation, Substance Use	Schizotypal Personality Disorder, Schizophrenia, Obsessive-Compulsive Personality Disorder	Schizoaffective Disorder (Depressive), Limited Social Support, Frequent Unemployment
Dan	Delusional Disorder (Erotomanic), OCD, Interpersonal/Marital Issues, Possible Avoidant Personality Traits, Substance Use/Escape Mechanisms, Mood Disorder	Delusional Disorder (Erotomanic), OCD, Schizophrenia, Bipolar Disorder	Delusional Disorder (Erotomanic), Unspecified Personality Disorder, Employment Problems
Tim	Delusional Disorder (Persecutory), PPD, MDD, Substance Use, Adjustment Disorder with Mixed Anxiety and Depressed Mood	Delusional Disorder, R/O Schizophrenia	Delusional Disorder (Persecutory), Family Disruption, Employment Problems
Mike	Alcohol Use Disorder (Severe), Alcohol-Induced Major Neurocognitive Disorder, Encephalopathy, MDD (Secondary), Social Anxiety, Schizophrenia/Schizoaffective	Alcohol Use Disorder (Severe), Major Neurocognitive Disorder	Alcohol-Induced Major Neurocognitive Disorder (Amnestic-Confabulatory, Persistent), Alcohol Use Disorder (Severe)
George	NPD, Sexual Sadism Disorder, ASPD, Substance Use (Alcohol), Co-occurring Mood Disorder	Narcissistic Personality Disorder, Sexual Sadism Disorder, ASPD	Sexual Sadism, NPD, Employment Problems, Family Problems, Social Problems, Legal Problems, Self-Harm, R/O Substance Abuse
Jonathan	GAD, Social Anxiety Disorder, OCD, MDD, Adjustment Disorder with Anxiety, Perfectionistic Traits	GAD, OCPD, MDD, Social Anxiety Disorder	Generalized Anxiety Disorder
Luz	Female Sexual Interest/Arousal Disorder, PTSD, Adjustment Disorder with Mixed Anxiety and Depressed Mood, GAD, Relational/Family Issues	Female Sexual Interest/Arousal Disorder,	Female Sexual Interest/Arousal Disorder (Lifelong, Generalized, Moderate)
Nathan	Gender Dysphoria, OCD, Social Anxiety, Possible MDD, Adjustment Disorder with Anxiety	OCD, Presence of Compulsions (Mental and Behavioral)	Obsessive Compulsive Disorder
Byrant	Voyeuristic Disorder, Paraphilic Disorder (Voyeuristic), Sexual Dysfunction/Impairment, Lack of Empathy/Understanding of Harm	Voyeuristic Disorder, R/O Conduct Disorder, R/O Antisocial Personality Disorder	Voyeuristic Disorder, Potential Problems with University/Legal System, Tension in Living Situation
Adrienne	Excoriation Disorder, PTSD, GAD, BDD, Adjustment Disorder with Mixed Anxiety and Depressed Mood	Excoriation Disorder, Trichotillomania, BDD, MDD,	Excoriation Disorder, R/O GAD, Lack of Coping Skills, Peer Relationship Problems, Estrangement from Father
Jacob	ASD, ODD, ADHD (Inattentive), Possible Social (Pragmatic) Communication Disorder, Anxiety (Generalized/Social)	ASD, ODD, R/O Social (Pragmatic) Communication Disorder	ASD (Requiring Support), No Intellectual Impairment, No Language Impairment
Jason	MDD, Sexual Dysfunction (ED - Psychological), GAD, Adjustment Disorder with Depressed Mood, Relationship/Communication Issues	Erectile Disorder, MDD	Erectile Dysfunction, MDD (Recurrent, Moderate)
Bashir	PTSD, Conduct Disorder, Substance Use Disorder (Alcohol/Cannabis), Adjustment Disorder with Mixed Anxiety and Depressed Mood, Antisocial Traits	Conduct Disorder, ASPD, PTSD, NSSI	MDD (Moderate), Nonsuicidal Self-Injury, Lack of Coping Skills, Legal Problems, Social Support Problems

Data

Data:

Summary of Research Results:

It can be seen across all vignettes that OpenAI almost always creates a list of between four and six mental disorder diagnoses, despite the fact that for any vignette, the number of mental disorders each patient is diagnosed with is only between one and three. As a result, there are between two to four misdiagnoses, or false positives, in the generated diagnosis. It should be noted however, that in 29 out of the 30 vignettes, OpenAI correctly identifies at least one of the diagnoses reported by human experts, and in 15 of the 16 vignettes where human experts diagnosed the patient with more than one mental disorder, OpenAI was able to correctly identify at least two of the mental disorders the patient experienced. Finally, in 12 of the 30 vignettes, OpenAI was able to generate a list of diagnoses where every mental disorders identified by human experts could be found in the predictions.

On the other hand, Gemini seems to give shorter and short lists of diagnosis as time progresses. At the beginning, it would create a list of four to five potential disorders. Soon, it fell into the pattern of only outputting two to three, which would get closer to the number of mental disorders each patient is diagnosed with (one to three). Hence, the amount of false positives that Gemini creates would be very little, as it seems to fall more into the “safe-than-sorry” approach. The downside that comes with this is that out of the 30 vignettes, Gemini is seen constantly getting one or even, at times, none correct.

It is important to note that for BOTH LLM’s, they are rarely able to confirm if a patient does NOT have a mental disorder, hence, we have taken True Negatives out.

Out of the 30 vignettes, seven of them contain “r/o” indications for mental disorders by human experts, indicating that the diagnoses of the mental disorder at hand would require further information about the patient to ascertain. In six of these seven vignettes, OpenAI’s response featured these uncertain diagnoses as a certainty. As such, it can be seen that in instances where the evidence is inconclusive, but leans towards the potential presence of a particular mental disorder, OpenAI leaps to conclusions and diagnoses the patient with the mental disorder in question. Gemini, while sharing this issue in two out of the seven vignettes, has a whole different issue entirely. For the seven vignettes with “r/o” indications, most of those disorders did not show up as potential diagnoses from Gemini.

In seven of OpenAI’s responses and thirteen Gemini responses out of the 30 vignettes, it listed mental disorder diagnoses that are similar to a mental disorder identified by human experts. In particular, similar mental disorders are defined in this case for if the LLM reports a more generalized umbrella of the expert-determined mental disorder, a more specific sub-category of the mental disorder, or a mental disorder with similar symptoms and resulting behaviors. Some examples of similar mental disorders include the LLM reporting that a patient should be diagnosed with Level 1 on the Autism Spectrum, when human experts instead have diagnosed the patient with Social (Pragmatic) Communication Disorder. As per the DSM-5, Social Communication Disorder “manifested by deficits in understanding and following social rules of both verbal and nonverbal communication in naturalistic contexts,” thus subtly differing from Level 1 on the Autism Spectrum, which, in addition to being described as “Difficulty initiating social interactions, and clear examples of atypical or unsuccessful responses to social overtures of others” is distinguished from Social Communication Disorder in that patients “May appear to have decreased interest in social interactions.” Similarly, there are instances when OpenAI reports that a patient may be experiencing Major Depressive Disorder, when instead human experts identify the patient as having Persistent Depressive Disorder, or another similar but distinct Mental Disorder for example. In all cases, it appears that it is unable to fully recognize, or extract from the vignette, the subtleties that differentiate the similar yet nonetheless differing Mental Disorders.

Conclusion

It is clear that as a stand-alone tool, LLM’s lacks the ability to provide a patient with an accurate diagnosis. The success seen with the research study that assessed an LLM’s ability to diagnose OCD likely yielded successful results due primarily to the fact that LLMs tend to be overconfident and given many false positives. Notably, that study only gave the LLM vignettes of patients diagnosed with OCD, and completely lacked negative controls. As a result, it is unsurprising that the LLM would have appeared to have essentially 100% accuracy; the inaccuracy of the LLM lies in its inability to realize that a patient does not have a certain mental disorder, even if they exhibit some of the symptoms for the disorder, or similar symptoms to the disorder. Indeed, this particular inaccuracy was left completely unevaluated by the aforementioned study.

However, this finding does not necessarily make the use case of LLMs within psychotherapy a complete impossibility. In particular, given that the LLM overcompensates in the diagnosis, issuing a significant number of false positives and overconfident predictions, it may prove useful in aiding therapists in identifying Mental Disorders that may otherwise be accidentally overlooked. The bulk of the mental labor of analyzing the patient’s behavior can be offset to the LLM, while the therapist can examine the list of mental disorders generated by the LLM, so as to determine if each one fully correct, or should be marked as needing to be ruled out through further evidence, or if there is a similar mental disorder that more appropriately describes the patient’s circumstances. Thus, based on this study, LLM currently show promise in serving a role similar to an inexperienced but book-smart and eager assistant, whose suggestions – while not always correct, and should be taken with caution – may prove invaluable in the instances where a tired but experienced therapist may accidentally overlook a key piece of information regarding the patient’s behavior, and needs a simple hint in the right direction.

Citations

American Psychiatric Association (2013) Diagnostic and Statistical Manual of Mental Disorders. American Psychiatric Association
Ventura, E. (Ed.) (2017). Casebook for DSM-5. Springer Publishing Company, LLC

Acknowledgement

There are multiple "thank-you" letters I need to write while working on this project.

Firstly, obviously, my family, especially my parents who have been there for me during every late night crash-out. Secondly, my school, Westmount (special shoutout to our coach, Ms. Lai, who has spent her precious lunches with us). And finally, everyone who has dropped random, mind-blowing ideas, mainly from my friends.

And obviously, for everyone who came by to view my project, to either judge it or just because you were interested, thank you!

Attachments

View Log Book
(may download a file)