How Good Are Professors at Telling ChatGPT-Generated Essays from the Ones Produced by University Students?

This survey-based study explored the ability of professors to differentiate ChatGPT-generated text from student-written essays.
Ilyas Iskander
Grade 6

Problem

Research problem

ChatGPT is increasingly used in all spheres of human life. It has huge potential to be applied in higher education (Heaven, 2023). However, one challenge with its use is the possibility of plagiarism (Francke & Bennett, 2019). To address the issue of plagiarism and fairness in evaluating students' work, it is important for university professors to be able to differentiate ChatGPT-generated essays from student-written ones. However, there is limited research exploring whether professors are able to do this successfully.

Purpose of the study

The purpose of this study is to explore the ability of professors to distinguish ChatGPT-generated texts (essays) from those written by university students. In this study, I also investigate whether the level of English writing ability of the student who wrote the essay affects the results.

Research question

  • How good are professors at distinguishing ChatGPT-generated essays from the ones written by university students with different levels of English writing ability?

Method

Method for collecting data

To answer the research question, I asked ChatGPT and two university students to write a summary of the same research article. The first student had a lower writing ability and the second one had a higher writing ability. Then I asked several professors to read the three essays and complete a survey, in which they were asked to determine who wrote each essay and to answer other questions pertaining to my variables of interest.

I found the article on Google Scholar. I intentionally chose a shorter article so as not to take up too much of the students' time. Then I asked my scientific supervisor to identify two students with different writing abilities from among her graduate students. The students and ChatGPT were asked to write an essay summarizing this article, consisting of three paragraphs of seven sentences each.

I then emailed the essays and a link to the survey to university professors. In the email, I explained the purpose of my study, described the risks and benefits of participation, and included the consent form.

I created the survey on Google Forms. The survey consisted of 8 questions assessing the following (an illustrative response record is sketched after this list):

  • the reader's field of specialization
  • whether the reader was a native English speaker
  • whether the reader had ever used ChatGPT before
  • the reader's level of confidence in using ChatGPT
  • how experienced the reader was in advising/helping students on research
  • the reader's judgment of whether each essay was generated by ChatGPT or written by a student
  • the reader's evaluation of the level of research writing for each essay
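
To illustrate how each completed survey can be organized for analysis, the sketch below shows one hypothetical response record in Python. The field names and answer codes are my own inventions and only approximate the wording of the actual Google Forms questions.

    # One hypothetical survey response, with one record per professor.
    # Field names and codes are illustrative, not the exact Google Forms wording.
    response = {
        "specialization": "Psychology",
        "native_english_speaker": True,
        "used_chatgpt_before": True,
        "chatgpt_confidence": 3,              # assumed scale of 1 (low) to 5 (high)
        "advises_students_on_research": True,
        "guess_essay_1": "ChatGPT",           # who the reader thinks wrote essay 1
        "guess_essay_2": "Student",
        "guess_essay_3": "Student",
        "writing_level_essay_1": "average",   # low / average / high
        "writing_level_essay_2": "average",
        "writing_level_essay_3": "high",
    }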

Participant selection

My participants were university professors. At the time of my research, all of them worked at the University of Calgary. I selected only professors from the Faculty of Arts and the Faculty of Sciences. I found their email addresses on the university's website and invited them to participate in the study by email. I sent 475 emails to professors at the Faculty of Sciences and the Faculty of Arts. By February 16th, 2024, 46 surveys had been completed, which corresponds to a response rate of approximately 9.7%.

Data analysis

The data were analyzed statistically using descriptive statistics (means, standard deviations, counts, proportions) and an inferential test (the binomial test). I first entered the data into Excel. I then removed all surveys in which some questions were not answered and created all graphs in the program. I also ran the binomial statistical test in Excel. Assuming that the proportions of correctly vs. incorrectly identified essays would be 50% vs. 50% if the faculty responded purely by chance, I tested the following three hypotheses with the binomial test:

Hypothesis 1: The proportion of correctly vs. incorrectly identified ChatGPT essays is 50% vs. 50%.

Hypothesis 2: The proportion of correctly vs. incorrectly identified essays written by a student with a lower level of English is 50% vs. 50%.

Hypothesis 3: The proportion of correctly vs. incorrectly identified essays written by a student with a stronger level of English is 50% vs. 50%.

 

Research

Review of prior literature

Artificial intelligence has been defined by many authors in different ways. One widely recognized definition is as follows:

"Artificial intelligence (AI) systems are software (and possibly also hardware) systems designed by humans(2) that, given a complex goal, act in the physical or digital dimension by perceiving their environment through data acquisition, interpreting the collected structured or unstructured data, reasoning on the knowledge, or processing the information, derived from this data and deciding the best action(s) to take to achieve the given goal. AI systems can either use symbolic rules or learn a numeric model, and they can also adapt their behaviour by analysing how the environment is affected by their previous actions."  (HLEG, 2019)

ChatGPT is an AI chatbot created by OpenAI, available at openai.com. Like other AI models, it holds several promises for education. Heaven (2023) indicated the following benefits: 1) ChatGPT could be used for personalized tutoring; 2) it can also be used for student assessment; 3) it can assist in creating educational content; 4) it can facilitate group discussions; 5) it can help students with special needs; 6) at universities, it can be used as a research tool.

One concern with using artificial intelligence in education is connected with plagiarism (Francke & Bennett, 2019). Students can use AI to do their work for them (King & ChatGPT, 2023). Professors want to make sure that their students gain solid knowledge and earn their grades fairly (Abd-Elaal et al., 2019). Therefore, they should be able to tell student work from AI-generated text.

Several studies have been conducted on ChatGPT use in higher education (Abd-Elaal et al., 2019; Francke & Bennett, 2019; Heaven, 2023; King & ChatGPT, 2023). Some studies explored the ability of instructors to tell apart ChatGPT-generated text from human-written text. For example, Waltzer et al. (2023) tested the ability of high school teachers and students to differentiate between essays generated by ChatGPT and by high school students. De Winter et al. (2023) used statistical analysis to show that ChatGPT use can be detected through the use of different keywords. Nweke et al. (2023) explored whether academic staff in Biochemical Engineering could identify the difference between assessments generated by ChatGPT and previous student assessments (short-answer responses). However, this research is still limited, and more studies should be conducted. For example, no studies have explored whether university professors can tell ChatGPT-generated essays from student-written ones.

Data

Sample characteristics

On January 13th, I received a total of 46 responses (a 9.7% response rate) to my ChatGPT survey. Below is a description of the key features of the sample. Most faculty represented economics (15.6%), psychology (13.3%), and political science (6.7%). Only about 25% of respondents were from the sciences, and 4.4% did not answer the question; there were more people from the social sciences than from the sciences. Most respondents were native speakers of English (80%), and most had used ChatGPT before (72.7%). In terms of confidence in using ChatGPT, the responses were roughly normally distributed, with most people falling in the middle level of confidence and fewer feeling very confident or very unconfident. The majority of participants were experienced in advising students on research (88.8%). The overall composition of the sample implies that the results of my analysis best characterize faculty from the social sciences who are native speakers of English, have some experience using ChatGPT, and have advised students on research.

Results

To answer the research question, I used descriptive statistics and a binomial inferential test. Two-thirds of the 45 participants who answered the question correctly identified the essay generated by ChatGPT (30 individuals, a 66.7% correct response rate). About 77.8% of the 45 participants who responded to the question correctly identified the essay written by the student with lower English writing ability. Approximately 61.4% of the 44 participants who answered the question correctly identified the essay written by the student with higher writing ability. In general, the majority of the participants correctly identified both the ChatGPT-written essay and the human-written essays. However, the participants found it easiest to identify the essay written by the student with lower English writing ability, and found it more difficult to tell which essay was written by ChatGPT and which was written by the student with higher English writing ability.

To test whether these findings apply to the whole population, I conducted a binomial inferential test. I assumed that if the participants answered the question by chance, the expected proportion of correct vs. incorrect answers would be 50% to 50%, and I tested the observed proportions against the expected ones. For the first test, I received a p-value of 0.018 (<0.05), which allows me to reject the first hypothesis: the proportion of correctly vs. incorrectly identified ChatGPT essays is 50% vs. 50%. For the second test, I received a p-value of 0.0001 (<0.05), which allows me to reject the second hypothesis: the proportion of correctly vs. incorrectly identified essays written by a student with a lower level of English is 50% vs. 50%. For the third test, I received a p-value of 0.12 (>0.05), which does not allow me to reject the third hypothesis: the proportion of correctly vs. incorrectly identified essays written by a student with a stronger level of English is 50% vs. 50%. This means that the faculty identified the ChatGPT essay and the essay written by the student with a lower level of English at rates better than chance. However, the difference between correct and incorrect identifications of the essay written by the student with a stronger level of English could be due to chance alone.
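
As a minimal sketch of how these three tests could be reproduced outside Excel, the Python code below runs the same one-proportion binomial test with scipy. The counts for the second and third essays are reconstructed from the reported percentages, and the exact p-values may differ slightly from the Excel output depending on whether a one-sided or two-sided test is used; either way, the decisions (reject the 50/50 hypothesis for the first two essays, fail to reject it for the third) should match those reported above.

    # Minimal sketch of the three binomial tests against a 50% chance rate.
    # Counts come from the Results text; those for essays 2 and 3 are
    # reconstructed from the reported percentages, so they are approximate.
    from scipy.stats import binomtest

    tests = {
        "ChatGPT essay": (30, 45),                   # 66.7% correct
        "Student with lower English": (35, 45),      # ~77.8% correct
        "Student with stronger English": (27, 44),   # ~61.4% correct
    }

    for essay, (correct, n) in tests.items():
        # alternative="greater" asks whether identification is better than chance
        result = binomtest(correct, n, p=0.5, alternative="greater")
        print(f"{essay}: {correct}/{n} correct, p = {result.pvalue:.4f}")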

To check whether the faculty assessed the English ability of the students and ChatGPT in the same way as I did, I asked them to evaluate each essay in terms of its level of writing (low, average, or high). Most of the faculty rated the ChatGPT-generated essay at the average level of writing (64.4%). Most of the faculty also found the essay written by the student with lower English ability to be at the average level of writing (57.1%). Meanwhile, the majority of the faculty found the essay written by the student with higher English ability to be at the high level of writing (51.1%). This implies that I correctly identified the third essay as the one written by the student with the higher level of writing ability.

Conclusion

Conclusion

In general, my study shows that university faculty can correctly detect ChatGPT-generated essays. Professors can easily tell a ChatGPT-generated essay from one written by a student with lower English writing skills. However, it is more difficult for faculty to tell apart an essay written by a student with a higher level of writing from an essay generated by ChatGPT.

These findings are similar to those of a previous study conducted with high school teachers (Waltzer et al., 2023). That study also found that teachers had difficulty differentiating essays written by students with a higher level of writing from essays generated by ChatGPT.

Limitations

Given the nature of my sample, this conclusion applies mostly to professors from the social sciences rather than the natural sciences. It also applies more to people who are native English speakers and who have used ChatGPT before. Finally, the findings are most relevant to professors with experience in advising students on research.

Recommendations

My suggestion for future research is to conduct a survey with a greater variety of faculty at other universities. This would make my findings more generalizable.

My suggestion for practice is to train faculty to better differentiate student-written text from ChatGPT-generated text, paying special attention to detecting the difference between ChatGPT-generated essays and well-written student essays.

Citations

Abd-Elaal, E. S., Gamage, S. H., & Mills, J. E. (2019). Artificial intelligence is a tool for cheating academic integrity. In 30th Annual Conference for the Australasian Association for Engineering Education (AAEE 2019): Educators becoming agents of change: Innovate, integrate, motivate (pp. 397-403).

Francke, E., & Bennett, A. (2019). The potential influence of artificial intelligence on plagiarism: A higher education perspective. In European Conference on the Impact of Artificial Intelligence and Robotics (ECIAIR 2019) (pp. 131-140).

Heaven, W. D. (2023). ChatGPT is going to change education, not destroy it. MIT Technology Review, April 6, 2023.

HLEG: High-Level Expert Group on Artificial Intelligence. (2019). A definition of AI: Main capabilities and disciplines.

King, M. R., & ChatGPT. (2023). A conversation on artificial intelligence, chatbots, and plagiarism in higher education. Cellular and Molecular Bioengineering, 16(1), 1-2.

Nweke, M. C., Banner, M., & Chaib, M. (2023, November). An Investigation Into ChatGPT Generated Assessments: Can We Tell the Difference? In The Barcelona Conference on Education 2023: Official Conference Proceedings (pp. 1-6). The International Academic Forum (IAFOR).

Waltzer, T., Cox, R. L., & Heyman, G. D. (2023). Testing the Ability of Teachers and Students to Differentiate between Essays Generated by ChatGPT and High School Students. Human Behavior and Emerging Technologies, 2023.

 

 

 

 

Acknowledgement

I would like to give credit to my parents, to all the university professors who participated in the survey, and to my sister, who wrote one of the essays.