CYSF


Using Machine Learning to Diagnose Respiratory Illness

Using a machine learning model, I will interpret sound files from a database of patient coughs, wheezing, and crackles. From the interpretation, I will utilize the model to create a diagnosis for the patient.

Eveleen Khattra

Grade 9

Problem

Although the World Health Organization has declared that COVID-19 is no longer a pandemic, 2,800 cases of SARS-CoV-2 (COVID-19) had been reported globally as of February 2025 [5] [11] [28] [29]. The virus continues to pose a significant risk to vulnerable populations, such as the elderly and those at risk of severe respiratory illness. My grandfather, for example, contracted the virus and had a more strenuous recovery because of a delay in diagnosis.

Canada is also in the midst of a healthcare crisis, with a significant shortage of medical professionals, in particular family physicians [2] [4] [8] [13] [17] [23] [25]. In Alberta, for example, 16% of the population lacks access to a primary care physician [13] [14]. Advanced diagnostic tools that can assist in diagnosing conditions without a direct patient-doctor interaction could substantially ease the strain on healthcare providers [12]. Patients frequently visit doctors for conditions that may not require intensive medical attention, such as the common cold, and coughing ranks among the most prevalent symptoms prompting visits to family physicians [15]. Technological solutions that facilitate preliminary diagnoses could reduce unnecessary clinic visits and so improve the efficiency and effectiveness of healthcare delivery across Canada. In countries with far fewer medical resources, the need for solutions that improve diagnosis and expand patients' access to diagnostic tools is even clearer.
The conventional diagnostic process for respiratory illnesses, comprising detailed questionnaires, stethoscope auscultation, physical examinations, and chest X-rays, is largely impractical in these settings because these tools are inaccessible [1] [2] [3] [21] [22] [26]. Although rapid testing kits were integral to the initial global response because they delivered immediate results [16] [19], they are no longer widely available to the public [20].

This project proposes the reintroduction of a rapid screening method through a sophisticated machine learning algorithm trained on thousands of cough recordings from COVID-19 patients available online through the COSWARA and COUGHVID databases. This tool has the potential to revolutionize early detection and expand healthcare accessibility on a global scale. It could serve as a useful tool in the future, potentially alleviating some of the strain on our overburdened healthcare system, and simultaneously helping improve access to healthcare in other countries around the world.

Objective: Design a machine learning algorithm capable of detecting the presence or absence of a respiratory illness such as COVID-19, based on the coughing sounds made by a patient, with an accuracy of 80% or higher.

Method

In order for the COVID-19 cough detection process to work effectively, a few critical steps were taken. First, a training model was created using audio data consisting of coughs recorded from various individuals. The idea was to build an algorithm that could distinguish between COVID-positive and COVID-negative coughs based on subtle acoustic features in the recordings.

This task falls under the umbrella of audio classification using deep learning, specifically a Convolutional Neural Network (CNN). The goal was to train a neural network model that could take in audio files, extract meaningful features, and classify them accurately into one of three categories: COVID (recordings of people with COVID-19), non-COVID (recordings of people who are not infected), or symptomatic (recordings of people who show symptoms but have no confirmed COVID-19 diagnosis).

This machine learning model uses the COSWARA dataset, which records each participant's gender, previous respiratory conditions, fever or muscle pain, diagnosis (healthy, positive, or negative), cough type, dyspnea, wheezing, stridor, choking, congestion, and cough severity. Before training, the audio recordings must be converted into a format that a neural network can understand. This is done by extracting Mel-frequency cepstral coefficients (MFCCs), a common feature in audio recognition tasks. MFCCs are a compact representation of the spectral properties of audio and mimic how the human ear perceives sound. These features capture the pitch, tone, and texture of the cough.

This function loads the audio file, computes 40 MFCC features, and averages them over time to produce a single 1D feature vector for each file.
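The function itself is not reproduced in this write-up. A minimal sketch of what it might look like is shown below, assuming the librosa library (which the base model also uses). The names `extract_features` and `average_over_time` are illustrative, not the project's actual identifiers.

```python
def average_over_time(frames):
    """Average each coefficient (one row) across all time frames (columns)."""
    return [float(sum(row)) / len(row) for row in frames]


def extract_features(path, n_mfcc=40):
    """Load one recording and return a 1D list of n_mfcc averaged MFCCs."""
    # Assumption: librosa is installed. It is imported here so that the
    # averaging helper above can be used without it.
    import librosa

    audio, sample_rate = librosa.load(path, sr=None)  # keep the native sample rate
    mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc)
    # mfcc has shape (n_mfcc, n_frames); collapse the time axis.
    return average_over_time(mfcc)
```

Averaging over time discards how the cough evolves from start to finish, but it gives every recording a feature vector of the same length regardless of its duration.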

The dataset is organized into folders by label. The script loops through the folders, reads the audio files, and applies the MFCC extraction function.
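The loop might be sketched as follows, assuming one sub-folder per label (for example `covid/`, `non_covid/`, `symptomatic/`) that contains .wav files. The function name `load_dataset` and the folder layout are assumptions, and the feature extractor is passed in as an argument so the loop can be tried with any extraction function.

```python
import os


def load_dataset(root_dir, extract_features, extensions=(".wav",)):
    """Walk root_dir/<label>/<file> and return (X, y) as parallel lists."""
    X, y = [], []
    for label in sorted(os.listdir(root_dir)):
        label_dir = os.path.join(root_dir, label)
        if not os.path.isdir(label_dir):
            continue  # skip stray files next to the label folders
        for name in sorted(os.listdir(label_dir)):
            if not name.lower().endswith(extensions):
                continue  # ignore anything that is not an audio file
            X.append(extract_features(os.path.join(label_dir, name)))
            y.append(label)  # the folder name is the class label
    return X, y
```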

Here, X contains the extracted MFCC features, and y holds the corresponding class labels.

The string labels ('covid', 'non_covid', 'symptomatic') are converted into numeric form using TensorFlow’s utility:
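The original snippet is not shown here. The sketch below reproduces the idea in plain Python so that it runs without TensorFlow: each string label is mapped to an integer index and then one-hot encoded. In the actual script this was more likely done with scikit-learn's `LabelEncoder` followed by `tensorflow.keras.utils.to_categorical`, but that pairing is an assumption, as is the name `encode_labels`.

```python
def encode_labels(labels):
    """Return (one_hot_rows, class_names) for a list of string labels."""
    class_names = sorted(set(labels))
    index_of = {name: i for i, name in enumerate(class_names)}
    one_hot_rows = [
        [1.0 if index_of[label] == i else 0.0 for i in range(len(class_names))]
        for label in labels
    ]
    return one_hot_rows, class_names


y_encoded, classes = encode_labels(["covid", "symptomatic", "non_covid", "covid"])
# classes   -> ['covid', 'non_covid', 'symptomatic']
# y_encoded -> [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
```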

Then, the dataset is split into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(np.array(X), np.array(y_encoded), test_size=0.2, random_state=42)

The model used here is a basic fully connected neural network built using TensorFlow’s Keras API:

The input layer has 40 units (equal to the number of MFCCs).

Two hidden layers with 100 neurons each use the ReLU activation function.

The output layer uses softmax with 3 units, corresponding to the 3 classes.
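As a rough sketch of the architecture just described, the layers could be declared as follows. The function and variable names are illustrative assumptions, and TensorFlow is imported inside `build_model` so that the parameter-count helper can run on its own.

```python
LAYER_SIZES = [40, 100, 100, 3]  # input, two hidden layers, output


def count_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def build_model():
    """Declare the dense network with the Keras Sequential API."""
    # Assumption: TensorFlow 2.x is installed.
    import tensorflow as tf

    return tf.keras.Sequential([
        tf.keras.Input(shape=(LAYER_SIZES[0],)),
        tf.keras.layers.Dense(LAYER_SIZES[1], activation="relu"),
        tf.keras.layers.Dense(LAYER_SIZES[2], activation="relu"),
        tf.keras.layers.Dense(LAYER_SIZES[3], activation="softmax"),
    ])
```

With these layer sizes, `count_parameters(LAYER_SIZES)` gives 14,503 trainable weights and biases.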

The model is compiled with the Adam optimizer and trained using categorical cross-entropy loss, since it is a multi-class classification task. Training runs for 50 epochs. The validation data makes it possible to monitor performance during training and to spot overfitting.
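To make the loss concrete, here is a small worked example in plain Python of what softmax and categorical cross-entropy compute for a single recording. It is an illustration of the formulas only, not the training code, and the function names are assumptions.

```python
import math


def softmax(logits):
    """Turn raw output scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probabilities):
    """Negative log of the probability assigned to the true class."""
    return -sum(
        t * math.log(p)
        for t, p in zip(one_hot_target, probabilities)
        if t > 0
    )


# A network that has learned nothing yet gives equal scores to all 3 classes,
# so its loss is ln(3), roughly 1.10. Training tries to push this toward 0.
uniform = softmax([0.0, 0.0, 0.0])
untrained_loss = categorical_cross_entropy([1.0, 0.0, 0.0], uniform)
```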

After training, the model is evaluated on the test data. The reported accuracy indicates how well the model can generalize to new, unseen cough recordings.
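In Keras this number would normally come from `model.evaluate`. As a sketch of what the accuracy figure means, the plain-Python version below picks the most probable class for each test recording and counts how often it matches the true label. The function names are illustrative.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(predicted_probabilities, one_hot_targets):
    """Fraction of recordings whose most probable class is the true class."""
    correct = sum(
        argmax(probs) == argmax(target)
        for probs, target in zip(predicted_probabilities, one_hot_targets)
    )
    return correct / len(one_hot_targets)
```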

This project demonstrates how a simple deep learning model, when combined with audio feature extraction techniques like MFCC, can be used to classify health-related audio signals. While this is a proof-of-concept, such models could form the basis for real-world screening tools, especially in remote or underserved regions.

Example Output:

Procedure

1. Run the algorithm, SCIA Premium Cough Classifier, and ensure that the model runs without issues on the computer.
2. Select audio files from the COUGHVID database: 50 COVID-positive files and 50 COVID-negative files. Ensure all audio files are .wav files.
3. Open the Windows Command Prompt and run it as administrator.
4. Run the machine learning model's prediction script: python predict.py <audio_path>.
5. Record the output generated: both the diagnosis (whether the file is positive for COVID-19) and the percent likelihood that the file is positive for COVID-19.
6. Repeat steps four and five an additional four times, so that five total trial runs are conducted.
7. Move on to the next file, repeating steps four, five, and six an additional 99 times to work through all 100 audio files.
8. Analyze the outputs to identify any potential trends.
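The repeated trials in this procedure were run by hand, but they could also be scripted. The sketch below is one possible way to do it, assuming that `predict.py` prints its result to standard output. The names `run_trials` and `all_trials_agree` are hypothetical.

```python
import subprocess
import sys


def run_trials(audio_paths, n_trials=5, run_once=None):
    """Return {path: [output_1, ..., output_n]} for every audio file."""
    if run_once is None:
        def run_once(path):
            # Assumption: predict.py takes the audio path as its only
            # argument and prints the diagnosis and likelihood.
            completed = subprocess.run(
                [sys.executable, "predict.py", path],
                capture_output=True, text=True, check=True,
            )
            return completed.stdout.strip()
    return {path: [run_once(path) for _ in range(n_trials)] for path in audio_paths}


def all_trials_agree(results):
    """True if every file produced the same output on all of its trials."""
    return all(len(set(outputs)) == 1 for outputs in results.values())
```

Passing a different `run_once` function makes it possible to try the bookkeeping without the model, for example `run_trials(["a.wav"], run_once=lambda path: "negative")`.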

Background Research

Between 800,000 and 950,000 Albertans lack basic healthcare, amid a shortage of physicians, general doctors, and emergency healthcare providers [24]. Since the SARS-CoV-2 (COVID-19) pandemic, the effects of understaffed and overwhelmed medical facilities have been exacerbated, with staffing shortages, increased wait times, and a deteriorating public health care system leaving as many as 950,000 Albertans without basic healthcare [13]. Rural communities, and communities that are far from a healthcare provider, are hit especially hard. They do not have access to basic healthcare, as the shortage of physicians has left the only hospital or medical center in the area understaffed and overwhelmed. Across Canada, 6.5 million people lack access to a healthcare provider [14].

Across Canada, this lack of access to healthcare means that many individuals cannot get a proper diagnosis. Those living in remote areas, or near medical facilities that are understaffed and overwhelmed, are affected most. At the same time, COVID-19 cases have been increasing. Since April 2024, there has been a slow increase in COVID-19 cases, with some periods of stabilization [6]. Overall, the number of outbreaks has been steadily increasing. With limited access to healthcare and the increasing prevalence of COVID-19, a solution has to be found. During the height of the COVID-19 pandemic, rapid testing kits were widely available and were used to test for the disease at home [16]. Now, although cases are still rising, rapid testing kits are harder to find, so an individual has to visit a doctor for a diagnosis. Coughing is one of the most prevalent symptoms and defining features of COVID-19 [10]. An algorithm that uses raw cough sounds to screen for COVID-19 could therefore help compensate for the shortage of healthcare resources needed to diagnose a patient. Raw cough sounds are readily available, and patients can use their own devices to record and upload a cough. No doctor or medical professional is needed, and because coughing is a defining feature of COVID-19, the accuracy of the diagnosis could be quite high.

This tool would allow users to screen for COVID-19 from the comfort of their own homes, helping to limit the rise in cases. There is no need to visit a doctor’s office in person, no doctor is required for the diagnosis, and the tool is widely applicable. Additionally, because it is a machine learning model, the screening process would be quite fast. Currently, it takes a few days to diagnose a respiratory disease, specifically COVID-19, and diagnosis often requires multiple samples from the nose and throat, obtained by visiting a doctor in person [12] [14] [15]. This tool removes those requirements, giving healthcare access to those who are deprived of it.

Using a machine learning algorithm to diagnose COVID-19 from raw cough sounds has been explored before.

https://dspace.mit.edu/handle/1721.1/128954

MIT has created an Open Voice Model, which diagnoses COVID-19 as positive or negative based on the raw cough sounds. The model has a COVID-19 sensitivity of 98.5% with a specificity of 94.2%. For subjects that are asymptomatic, the model achieves a sensitivity of 100%, with a specificity of 83.2%.

https://github.com/Klangio/covid-19-cough-classification. COVID-19 Cough Classification Model on GitHub. This is another COVID-19 cough classification model that uses raw cough sounds to diagnose patients. However, this model shows a relatively high false-negative rate, as 19 out of the 170 samples were positive.

https://www.covid-19-sounds.org/en/. University of Cambridge’s COVID-19 Sounds app.

https://ieeexplore.ieee.org/document/9911027 - “Use of Voluntary Cough Sounds and Deep Learning for Pulmonary Disease Screening in Low-Resource Areas”. This research paper explored the idea of utilizing cough sounds to diagnose COPD, Asthma, and respiratory infection.

The code I used as a skeleton to build my project on was the SCIA Premium Cough Classifier, found on GitHub. This machine learning model is a Convolutional Neural Network (CNN). CNNs are useful here because they are a deep learning architecture designed for classification and segmentation. Several factors in a raw cough sound help distinguish whether COVID-19 is present. A COVID-19-positive cough is usually dry and hacking, and most moderate cases have subtle crackles and dyspnea. COVID-19-negative coughs are wetter and have wheezing characteristics. From these characteristics, the CNN is able to detect the various segments of the cough and classify them as positive or negative.

Additionally, the base model uses the librosa library to extract mel-spectrograms from the raw cough audio. The mel-spectrograms are beneficial for diagnosing and differentiating between the coughs as they represent the short-term power spectrum of sound, which makes audio analysis easier.
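To illustrate what the "mel" in mel-spectrogram refers to, the sketch below converts between frequency in hertz and the mel scale using one common convention (the HTK-style formula). Libraries may use a slightly different variant, so this is an illustration of the idea rather than the exact calculation in the base model. The function names are assumptions.

```python
import math


def hz_to_mel(frequency_hz):
    """HTK-style conversion from hertz to mels."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of bands equally spaced on the mel scale."""
    low, high = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (high - low) / (n_bands + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_bands)]
```

Bands that are evenly spaced in mels end up closer together at low frequencies and further apart at high frequencies, which is roughly how human hearing resolves pitch.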

The torch library was used. Torch is the core package of PyTorch, another major deep learning framework. For this model, it was likely used as an alternate training framework. Torchaudio, the part of the PyTorch ecosystem for audio processing, was also used.

TensorFlow, Keras, and tensorflow-estimator are other libraries used in the model. These libraries are used to develop many machine learning models, as they help with training, inference, and image processing. In this case, the image would be the mel-spectrogram. They are the core libraries for building and training CNNs.

The libraries soundfile, audioread, resampy, and PyWavelets were likely used to support audio loading and to transform the audio into the spectrogram, making classification easier and more organized.

Numba and llvmlite were other libraries used, likely for faster numerical computations. For this model, it would likely be accelerating feature extraction for an increased efficiency of the model.

The libraries numpy, pandas, scipy, and scikit-learn were used in the model. These libraries were likely used for data manipulation, calculating the probability of the cough being positive or negative, and the machine learning preprocessing.

Matplotlib and seaborn were other libraries used in the skeleton model. The purpose of these libraries is for data visualization, for model accuracy plots and a confusion matrix.

Scikit-image and tifffile are libraries used for image processing when the images are derived from audio. For this model, they would be used for the spectrograms.

Joblib and threadpoolctl are libraries used for efficient parallel computing, and are often used in sklearn.

Tensorboard, tensorboard-plugin-wit, and tensorboard-data-server are all libraries used to visualize the training of the model. They visualize the loss curves, the accuracy, and others.

Tqdm is a library used for progress bars in loops, for training and data loading.

Markdown, werkzeug, and absl-py are libraries for logging, app building, or diagnostics.

Analysis

Results

Analysis

Overall, the algorithm was better at detecting files without COVID-19 than files with COVID-19: 49 of 50 files without COVID-19 were identified correctly, compared with 44 of 50 files with COVID-19. The algorithm also output a confidence measure, which was essentially how confident it was that the file had COVID-19. If the confidence was over 50%, the file was marked as having COVID-19; if it was under 50%, the file was marked as not having COVID-19. For files without COVID-19, the algorithm generated an average confidence of 3.97%, meaning it was correctly confident that these files did not have COVID-19. Put another way, the algorithm was on average 96.03% confident that these files were negative. For files which had COVID-19, the algorithm was on average 85.23% confident that they did indeed have COVID-19.

Put another way, the algorithm was on average 14.77% confident that positive files were actually negative. Another thing to note is that across all tests, the algorithm consistently generated the same result for the same file. While the accuracy of the algorithm varied, its precision, meaning its ability to consistently produce the same outputs given the same inputs, stood at 100% across the 500 total runs.

There are several possible reasons for the errors that did occur, and several ways to improve accuracy. Increasing the amount of data is essential: thousands, if not hundreds of thousands, more audio files would make this a far more accurate and applicable algorithm, and more training would improve its results. The additional data should also be more diverse. In medicine as a whole, failures to diversify studies have in the past made results less applicable to people of certain ethnicities or genders. Including a wider range of ethnic backgrounds, ages, genders, and health statuses would improve the algorithm's abilities overall. Diversity could also come from the way the sound is recorded. Not all microphones have the same quality, so a good mixture of recording qualities would help the algorithm handle audio recorded in less-than-optimal circumstances.

Overall, however, the algorithm performed quite well, with minimal errors that a more advanced algorithm addressing these issues should be able to correct.
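The counts reported in this analysis (44 of 50 positive files and 49 of 50 negative files identified correctly) can be turned into standard screening metrics. The short sketch below shows the arithmetic along with the 50% decision rule. The function names are illustrative, and treating a confidence of exactly 50% as negative is an assumption, since that case is not specified.

```python
def screening_metrics(true_pos, false_neg, true_neg, false_pos):
    """Sensitivity, specificity, and overall accuracy from the four counts."""
    total = true_pos + false_neg + true_neg + false_pos
    return {
        "sensitivity": true_pos / (true_pos + false_neg),  # positives caught
        "specificity": true_neg / (true_neg + false_pos),  # negatives cleared
        "accuracy": (true_pos + true_neg) / total,
    }


def classify(confidence_percent, threshold=50.0):
    """Mark a file positive when the confidence is above the threshold."""
    # Assumption: exactly 50% is treated as negative.
    return "positive" if confidence_percent > threshold else "negative"


metrics = screening_metrics(true_pos=44, false_neg=6, true_neg=49, false_pos=1)
```

For these counts the sensitivity is 88%, the specificity is 98%, and the overall accuracy is 93%.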

Conclusion

The initial objective of the project was achieved. I was able to create an algorithm capable of detecting a respiratory illness, in this case COVID-19, with an accuracy above 80%: 93 of the 100 test files were classified correctly. The algorithm effectively distinguished between positive and negative cases and showed high confidence levels, averaging 96.03% for negative identifications and 85.23% for positive identifications. However, certain issues remain, such as occasional misdiagnoses and a notable imbalance in accuracy, with higher reliability in identifying negative cases than positive cases. These limitations could likely be addressed by including additional, more diverse datasets in future iterations of the machine learning model.

Looking forward, I intend to expand the algorithm's diagnostic capabilities to identify additional respiratory illnesses, including Chronic Obstructive Pulmonary Disease (COPD), Upper Respiratory Tract Infections (URTI), and Lower Respiratory Tract Infections (LRTI), in preparation for the Calgary Youth Science Fair. Should the project advance to the Canada Wide Science Fair, a significant goal will be to enhance user accessibility through the development of a user-friendly interface, such as a website, to facilitate self-diagnosis.

Ultimately, continued innovation in this technology holds the potential to significantly improve medical diagnostics, support early identification and treatment of COVID-19, reduce strain on Canada's healthcare system, especially family physicians, and provide essential diagnostic tools to regions worldwide lacking robust medical infrastructure. I am optimistic about the future applications and the substantial impact this diagnostic technology can achieve.

Citations

References

[1]Aggarwal, A. N., Prasad, K. T., & Muthu, V. (2022, March 1). Obstructive lung diseases burden and covid-19 in developing countries: A perspective. Current opinion in pulmonary medicine. https://pmc.ncbi.nlm.nih.gov/articles/PMC8815642/

[2]Alberta Health Services. (n.d.). Emerging issues. https://www.albertahealthservices.ca/ipc/page10531.aspx#ncov

[3]Balogh, E. P. (2015, December 29). The diagnostic process. Improving Diagnosis in Health Care. https://www.ncbi.nlm.nih.gov/books/NBK338593/

[4]Bruch, T. (2024, November 20). More physicians registering in Alberta, but doctors say data “could be misleading.” CTVNews. https://www.ctvnews.ca/calgary/article/more-physicians-registering-in-alberta-but-doctors-say-data-could-be-misleading/

[5]Canada public health infobase covid-19 cases and deaths data 2020-2022. GHDx. (n.d.). https://ghdx.healthdata.org/record/canada-public-health-infobase-covid-19-cases-and-deaths-data-2020-2022

[6]Public Health Agency of Canada. (2025a, January 3). Canadian respiratory virus surveillance report: Summary. Canada.ca. https://health-infobase.canada.ca/respiratory-virus-surveillance/

[7]Public Health Agency of Canada. (2025b, February 11). COVID-19: Current situation. Government of Canada. https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection.html

[8]CBC/Radio Canada. (2024, July 18). Alberta’s ER Staffing Crunch getting worse in big cities and small, doctors warn | CBC news. CBCnews. https://www.cbc.ca/news/canada/calgary/emergency-room-staffing-1.7265038

[9]Mayo Clinic. (2024, December 11). The body’s response to throat or airway irritation. https://www.mayoclinic.org/symptoms/cough/basics/causes/sym-20050846

[10]Cough: Is this a sign of COVID-19?: Ada Health. Ada. (n.d.). https://ada.com/covid/covid-19-symptom-cough/

[11]Covid-19 deaths: Who covid-19 dashboard. datadot. (n.d.). https://data-who-int.translate.goog/dashboards/covid19/deaths?_x_tr_sl=en&_x_tr_tl=id&_x_tr_hl=id&_x_tr_pto=tc

[12]Department of Health and Social Care. (2021, September 8). Innovation and new technology to help reduce NHS waiting lists. GOV.UK. https://www.gov.uk/government/news/innovation-and-new-technology-to-help-reduce-nhs-waiting-lists

[13]Derworiz , C. (2024, January 23). Primary care in “critical condition,” Alberta Doctors Group Head says, citing survey | CBC News. CBCnews. https://www.cbc.ca/news/canada/calgary/alberta-doctors-survey-primary-care-1.7092494#:~:text=Alberta%2C%20like%20other%20provinces%2C%20is,drowning%20and%20need%20some%20stabilization.%22

[14]Duong, D., & Vogel, L. (2023, April 24). National survey highlights worsening Primary Care Access. CMAJ. https://www.cmaj.ca/content/195/16/E592

[15]Horvat, E. (2023, April 15). Canadian chronic cough initiative. Chronic Lung Diseases. https://chroniclungdiseases.com/en/news/canadian-chronic-cough-initiative/?utm_source=chatgpt.com

[16]How accurate is a patient’s home rapid test result? IDSA Home. (6AD). https://www.idsociety.org/covid-19-real-time-learning-network/diagnostics/how-accurate-is-a-patients-home-rapid-test-result/#/+/0/publishedDate_na_dt/desc/

[17]Job market forecasts. Alberta.ca. (n.d.). https://www.alberta.ca/job-market-forecasts#:~:text=Alberta’s%20Occupational%20Outlook,-Government%20releases%20the&text=The%20following%20are%20a%20few,computer%20systems%20developers%20and%20programmers

[18]Key features of influenza, SARS-COV-2 and other ... (n.d.-a). https://www.publichealthontario.ca/-/media/Documents/I/2023/influenza-sars-cov2-respiratory-viruses-key-features.pdf?rev=f7e4dab95a6e4c6e94069795243e341a&sc_lang=en

[19]Landry, F. (2024). Covid-19 rapid tests: How good are they?. McGill University Health Centre. https://muhc.ca/news-and-patient-stories/news/covid-19-rapid-tests-how-good-are-they

[20]Lee, J. (2024, October 23). Free rapid covid tests a thing of the past in Alberta, unless you’re really lucky | CBC news. CBCnews. https://www.cbc.ca/news/canada/calgary/no-more-free-rapid-covid-tests-alberta-1.7360021

[21]Levin, A. T., Owusu-Boaitey, N., Pugh, S., Fosdick, B. K., Zwi, A. B., Malani, A., Soman, S., Besançon, L., Kashnitsky, I., Ganesh, S., McLaughlin, A., Song, G., Uhm, R., Herrera-Esposito, D., Campos, G. de los, Antonio, A. C. P., Tadese, E. B., & Meyerowitz-Katz, G. (2022, May 26). Assessing the burden of COVID-19 in developing countries: Systematic Review, meta-analysis and public policy implications. BMJ Global Health. https://gh.bmj.com/content/7/5/e008477

[22]M;, C.-C. M.-A. J.-A. (n.d.). Respiratory tract infections in children in developing countries. Seminars in pediatric infectious diseases. https://pubmed.ncbi.nlm.nih.gov/15825139/#:~:text=Abstract,give%20benefits%20to%20their%20populations.

[23]Martin, G. (2024, November 17). What the family doctor shortage looks like in Canada | the star phoenix. The Star Phoenix. https://thestarphoenix.com/business/what-the-family-doctor-shortage-looks-like-in-canada

[24]Modernizing Alberta’s primary health care system (MAPS). (n.d.-b). https://open.alberta.ca/dataset/2b933143-39f4-45e4-aeb3-523f5bd3a7b8/resource/9f4d5ad7-cdb6-418a-b0d9-a04bb1dc467f/download/hlth-maps-strategic-advisory-panel-final-report.pdf

[25]Nadeem Esmail Senior Fellow. (2024, January 12). Alberta should learn from other higher-performing universal health-care systems. Fraser Institute. https://www.fraserinstitute.org/commentary/alberta-should-learn-other-higher-performing-universal-health-care-systems#:~:text=Indeed%2C%20AHS’s%20desire%20to%20%E2%80%9Ccontinue,half%20a%20year%20on%20average.

[26]Peters DH;Garg A;Bloom G;Walker DG;Brieger WR;Rahman MH; (n.d.). Poverty and access to health care in developing countries. Annals of the New York Academy of Sciences. https://pubmed.ncbi.nlm.nih.gov/17954679/

[27]What is tensorflow?. NVIDIA Data Science Glossary. (n.d.). https://www.nvidia.com/en-au/glossary/tensorflow/#:~:text=TensorFlow%20can%20be%20used%20to,such%20as%20partial%20differential%20equations.

[28]World Health Organization. (n.d.-a). COVID-19 epidemiological update – 13 february 2025. World Health Organization. https://www.who.int/publications/m/item/covid-19-epidemiological-update-edition-176

[29]World Health Organization. (n.d.-b). COVID-19 epidemiological update – 14 March 2025. World Health Organization. https://www.who.int/publications/m/item/covid-19-epidemiological-update-edition-177

Acknowledgement

OpenAI’s ChatGPT Version 3.0 Mini. ChatGPT was used to help resolve debugging issues that arose while I was coding.

Ms. Willoughby is my mentor for Science Fair, and she helped provide significant mentorship and insight on the written parts of my project.

Shrey Raval is my mom’s colleague. He introduced me to Artificial Intelligence and taught me enough about Artificial Intelligence to build this algorithm.

Angad Singh Khattra is my older brother. He has science fair experience himself and helped me learn more about Artificial Intelligence and the key elements of a successful project.

Attachments

View Log Book