
The Art of Predictability, with the Science of Uncertainty

Analyzing historic March Madness tournament data and regular-season game data from 1985 - 2017 to attempt to accurately predict the Sweet 16 and full brackets (2018 & 2019) using Monte Carlo simulation models
Josh
Grade 8

Hypothesis

I developed a hypothesis that if I gather the data and results from the past 34 years of March Madness tournaments (including statistics from regular-season games if necessary), going back to 1985 when the current 64-team format began, and run them through a probability-based Monte Carlo simulation, then it will be possible for me to accurately predict the Sweet 16 as well as the full bracket for the next March Madness tournament. If I build the model correctly and run the simulation enough times, it should provide the information I need to make those predictions accurately.

Research

Science fair background research

 

What is March Madness: 

The March Madness tournament is an annual college basketball tournament put on by the NCAA each March. It consists of 64 teams arranged in a single-elimination (knockout) bracket, with each team having some chance of winning, some higher than others. The knockout format means there is always a probability that a weaker team knocks off a stronger opponent, producing an upset. Through my research, I will figure out how statistics apply to March Madness, how to build computer models, and how understanding probability can help me play the odds.

Machine Learning Applied to March Madness:

  • Purpose: “Why? Because sports are a fascinating and interesting example to apply data science given their seeming unpredictability.”
  • The primary data gathered and used in this experiment was historic NCAA data used to predict if a team will win or lose a given matchup.
  • Machine learning: enabling software applications to become more accurate without being explicitly programmed, by learning from data instead. Essentially, it is giving a system the ability to learn automatically.
  • Monte Carlo simulation: a model used to predict the probability of different outcomes when random variables are present.

Through the analysis of data gathered from the past 33 years (1985-2017):

  • A significant drop in wins occurs from the fourth seeds to the fifth seeds.
  • By examining the relationship between data and visuals in terms of March Madness history it is evident that the 5 vs 12 seed matchup has produced the most upsets.
  • There is an average of 12 - 23 upsets per tournament.
  • Most championships have been won by top seeds; therefore, in the machine learning algorithm higher seeds should automatically be given a higher chance of winning. However, lower seeds such as four, five, and six can still win, and that possibility is a variable that must be accounted for.
  • Data and visuals: putting the data I have gathered into visuals, for example a plot chart, can help make the analysis clearer and more obvious.

Features: Advanced basketball statistics:

  • We also have predictive variables that can help me better predict future matchups, such as box score statistics. These can be derived from basketball statistics including field goals made, three-pointers made, free throws made, offensive and defensive rebounds, assists, turnovers, steals, blocks, and personal fouls, as well as the success rates for each of these variables. Advanced statistics help us understand how different aspects of a team's performance are correlated with win percentage.
  • There are also more advanced features that describe the strength of a team compared to all of the other teams.

 

Building and testing:

  • In machine learning, you typically divide your data set into two partitions - testing and training sets. 
  • The most accurate model in the research I reviewed predicted the correct full bracket outcome of the 2017 and 2019 NCAA tournaments 74% and 76% of the time. These machine learning models used the top ten variables found to be most effective in predicting an accurate win/loss percentage for each team in each matchup.


Applying machine learning to March Madness:

  • The odds of correctly picking the full March Madness bracket are one in 9.2 quintillion. That is because there are 63 games (64 teams), each with two possible outcomes, which equals 2^63 possible brackets (see the short calculation after this list).
  • What factors do you use in determining the outcome of a sporting event? This is the biggest and most difficult question to answer when determining the outcome of any uncertain event. 
  • Before the tournament, each one of us holds some bias even if we have all of the statistics to justify our prediction. This is because there is actually no clear-cut way to know how the tournament or a game will go.
  • Bias: in this context, we all have different reasoning to back up what we think is going to happen, based on knowledge we have or something we believe, even if it is not true.
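As a quick check on the 9.2 quintillion figure, here is a two-line Python calculation (Python is only used for illustration here; it is not part of my models):

```python
# 63 games, each with two possible outcomes, gives 2**63 possible brackets.
combinations = 2 ** 63
print(f"{combinations:,}")  # 9,223,372,036,854,775,808 -- about 9.2 quintillion
```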

Can probability-based models help us with these predictions:

  • Because we have all the data from the past 25+ years of NCAA basketball regular season we can try to use data analysis (machine learning, data visuals) to figure out which statistics most correlate with a team winning a specific matchup, and then this information can be used to predict the postseason March Madness tournament as well. 
  • Using all the statistics we have from each of these teams we can put the information about two different teams that matchup (data) into a predictor model and it will give us the probability each team has of winning the game.
  • Always try a model that is as simple as possible while still making the information, and the evidence behind it, as clear as possible.
  • When performing this we need to make sure that the training data we use is representative of all of the teams competing in the NCAA tournament, without discrimination. 

 

Build a winning March Madness bracket:

  1. Upset probability: there are six rounds in the March Madness tournament. The probability of an upset occurring in any given round is 25.6%, with an average of 10 - 16 upsets happening per tournament.
  2. The avengers: calling the right number of upsets at the right time is one of the keys to a successful bracket, as it is one of the most difficult things to predict. It helps to know which seed match-ups are the most likely to produce upsets, then the second most, and so on, because win likelihood cannot be captured by seeding alone a large amount of the time.
  3. Elite Eight insights: the Elite Eight produces the most upsets, about 2.1 upsets for every four games, which must be taken into account once we get deep into the bracket prediction.

 

Playing the odds:

Most of the research that I did showed that people were trying to predict the winning bracket correctly in a single attempt, and the best predictions were about 75% accurate. Understanding probability tells us that one way of improving the chances of succeeding is to understand how much you have improved the odds and to play as many times as the odds suggest. For example, when rolling two dice it is more likely that you will roll a 7, 6, or 8, so you will pick these numbers more often. However, if you rolled those dice say 100 times, you would still most commonly roll a 6, 7, or 8, but some 4s, 5s, and 9s would start showing up in the mix. For this reason, if we are able to determine the odds and test the model enough times, we would know how many unique bracket combinations can meet the criteria of our training data analysis. If we can provide enough important training data variables (that correlate to wins), then maybe we can reduce the unique bracket combinations down to a manageable size.

People around the world today who are developing vaccines, drilling oil wells, and investing in companies take this approach. Instead of drilling one well, if the odds are 1 in 4 then you make sure you have enough money to drill four wells to achieve at least one successful result. If the odds are the size of my basketball team, my class, or one of the offices of Warren Buffett's company, and the people coordinate and play the brackets together to cover as many of the unique combinations that match the data analysis and are predicted by a probability-based model, then the group of predictions should have a much higher chance of including a correct bracket. So maybe if Warren Buffett's company worked together and used a little bit of data they could all win some money. Personally, I would still take the cash even if I had to share it with some other people.
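To illustrate the dice example above, here is a small Monte Carlo sketch in Python. My actual models were built in @Risk spreadsheet software, so this is only a toy illustration of the idea of playing the odds many times:

```python
import random
from collections import Counter

# Roll two dice 100 times and count how often each total appears.
totals = Counter(random.randint(1, 6) + random.randint(1, 6) for _ in range(100))

for total in range(2, 13):
    print(total, totals.get(total, 0))
# 6, 7, and 8 come up most often, but some 4s, 5s, and 9s show up in the mix too.
```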

Variables

Variables:

Manipulated variable: 

  • types of input data
  • levels of data analysis 

Responding variable: 

  • Model prediction accuracy
  • Team win probability

Controlled variables:

  • Models used
  • Team seeds (by year)
  • Number of iterations when testing
  • Actual bracket my model results are compared to

 

Procedure

Methodology:

I completed many steps over the course of my science fair project; this section explains what I did and the order I did it in.

 

Raw data:

To start the experiment I gathered several sets of raw data. These included the Sports Reference college basketball data set of regular-season basic and advanced team data. Next, I went to USA Today and downloaded Sagarin rating data from 2000 - 2019. Finally, I downloaded NCAA Men’s Division 1 tournament data from 1985 - 2019, which included all the teams, matchups, and game results for each tournament.

Sports Reference’s data set of regular-season games with basic and advanced statistics included over 100,000 games and was too large and too difficult to work through. So, based on my research, I decided to use Sagarin data, which is an analysis of all Division I college basketball season data from 1985 - 2019. Jeff Sagarin uses a score-based method to go through all the college game data and comes up with a three-component unbiased analysis that makes up the final Sagarin ratings. He has compiled these ratings for USA Today since 1985, but I could only download data from 2000 - 2019 from the USA Today website. The Sagarin ratings and the NCAA tournament data are the two data sets that I used as a basis for all my experiments.

I took these data sets and inserted them into a spreadsheet where I cleaned them up.

 

Data cleaning: 

Once I had my two data sets I went on to clean them. The Sagarin ratings spanned 2000 - 2019 and the NCAA tournament data spanned 1985 - 2019. I cleaned the data so the computer would be able to understand and interpret it in the rest of my experiments. To do this, I pasted all the Sagarin data from the USA Today website into the spreadsheet, turned it from text into numbers, and got rid of the information I did not need. I did the same thing for the NCAA tournament data, which was easier to clean because I found a source that had already compiled and cleaned a lot of it for me. Also, the school names in the NCAA tournament data and the Sagarin ratings did not always match, so I had to go through all of these names and make sure they were correct and the same in both data sets. When this was done I could match teams from the tournament with the Sagarin ratings from their regular-season performance.
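Below is a minimal Python sketch of the name-matching step, assuming the two cleaned data sets have already been loaded; the team names and rating values shown are made up for illustration, since my actual cleaning was done by hand in a spreadsheet:

```python
# Hypothetical check: flag tournament teams whose school name has no Sagarin match,
# so the mismatched names can be fixed by hand (e.g. "Mount Saint Mary's" vs "Mount St. Mary's").
sagarin_ratings = {"Villanova": 97.1, "Mount St. Mary's": 68.4}   # made-up values
tournament_teams = ["Villanova", "Mount Saint Mary's"]

for team in tournament_teams:
    if team not in sagarin_ratings:
        print(f"No Sagarin rating found for: {team}")  # fix this name in one of the data sets
```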

Matching these two data sets allowed me to analyze the data and come up with ways to build my models.

 

Analysis part one: Splitting the data

The first step here was to split my data sets into learning data and experimental data. To make sure that my experiments were unbiased, I split the data into learning data (1985 - 2017) and experimental data (2018 and 2019). I used the learning data set (1985 - 2017) for data analysis and to build and test the models, and then I used the 2018 and 2019 data to run my experiments.
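A minimal sketch of this split, assuming the tournament data were in a table with a year column (my split was actually done in the spreadsheet, and the file and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical file with one row per tournament game and a "year" column (1985-2019).
games = pd.read_csv("ncaa_tournament_results.csv")

learning_data = games[games["year"] <= 2017]       # 1985-2017: analysis and model building
experimental_data = games[games["year"] >= 2018]   # 2018-2019: experiments only
```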

Analysis part two: Predicting winners through regular-season scoring performance

I got a lot of help and saved a lot of time here by using the Sagarin ratings. Sagarin uses advanced statistical models to go through all the raw regular-season team data and come up with an unbiased rating of each team’s performance in the NCAA Men’s Division 1 season. These ratings are updated every week throughout the whole season, so it is basically a pre-analyzed set of data. I used the overall Sagarin rating, which is a synthesis of three score-based ratings: Golden Mean and Predictor use scores in different ways but are completely score-based, and Recent is a score-based rating that weighs recent play more than early play. We can use Sagarin ratings, which usually correspond to the tournament seeding, to predict the likelihood of a win. This type of prediction usually favors stronger teams and only predicts upsets about 33% of the time.

Analysis part three: Predicting the right number of winners and upsets through seeding

In order to pick an accurate bracket, it’s not enough to know which are the strongest teams and pick the winners; as you can see from the graph above, you must also predict upsets accurately. When I looked at the data sets from 1985 - 2017, I found that each year had an average of 17.5 upsets per tournament, with a maximum of 23 upsets (1999) and a minimum of 12 (1993, 2007, 2015), and a certain number of upsets per round.

As you can observe from the visual above, when we look at the seeding of each team historically we can discover the win/loss probability of each seed over the past 33 years. For instance, if we take the 1 seed vs 16 seed match-ups from 1985 - 2017, a 1 seed never lost to a 16 seed in the first round; in other words, the 1 seeds won all four games in round one and the 16 seeds lost all four.

Another example could be an 8 seed vs 9 seed matchup. The win probabilities of this matchup are 51% for the 8th seed, to 49% for the 9th seed over 132 games played. This result suggests a very even matchup slightly in favor of the 8th seed as expected. The model that we created to represent these win/loss probabilities for each matchup helps us to choose upsets even if the other team is favored in terms of Sagarin rankings.
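A short sketch of how a seed-vs-seed win percentage like this could be computed from the historical tournament data, assuming a table of games with winner and loser seed columns (the file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical columns: "round", "winner_seed", "loser_seed" for each tournament game.
games = pd.read_csv("ncaa_tournament_results.csv")
round1 = games[games["round"] == 1]

# Win percentage of the 8 seed in 8-vs-9 first-round matchups.
eight_vs_nine = round1[round1["winner_seed"].isin([8, 9]) & round1["loser_seed"].isin([8, 9])]
eight_seed_win_pct = (eight_vs_nine["winner_seed"] == 8).mean()
print(f"8 seed wins {eight_seed_win_pct:.0%} of 8-vs-9 games")  # about 51% over 1985-2017
```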

The plot above shows that seeds 1-4 perform pretty predictably through the tournament. It’s pretty hard to bet against a 1 seed, as 1 seeds won 60% of all the tournaments over the 33 years in the learning data set (1985-2017). Once you get past the first four seeds and the first round, you can see that some of the poorer seeds outperform the higher seeds in a particular round. Modeling this behavior can help pick upsets.

Analysis part four: Predicting upsets by round

With the right number of upsets in hand through seeding, I can also predict the probability of upsets happening, and the number of upsets that have happened, in each of the six rounds based on historical data. I used the historical data to find the percentage of games per round that result in an upset (orange). From that, I took the total games per round and figured out the average number of upsets that are likely to happen (blue). For example, in the Elite Eight (round 4) there are 8 teams and four games. The chart shows that 45% of games are likely to be upsets, so based on historical data we can round the expected number of upsets to 2 of the 4 games. This model, representing the percentage of games that result in upsets and the number of games that are upsets, gives us an accurate and unbiased upset selection method.
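The arithmetic behind this round-by-round upset count looks like the sketch below; the upset percentages other than the Elite Eight’s 45% are placeholders for illustration, not my actual analysis values:

```python
# Games per round in a 64-team bracket, and illustrative upset percentages per round.
games_per_round = {1: 32, 2: 16, 3: 8, 4: 4, 5: 2, 6: 1}
upset_pct = {1: 0.26, 2: 0.32, 3: 0.36, 4: 0.45, 5: 0.30, 6: 0.20}  # only round 4 is from the text

for rnd, n_games in games_per_round.items():
    expected = round(n_games * upset_pct[rnd])
    print(f"Round {rnd}: about {expected} of {n_games} games end in an upset")
# Round 4 (Elite Eight): 45% of 4 games -> about 2 upsets, as in the example above.
```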

Analysis part five: Using Distributions

A distribution is a mathematical representation that describes the probability that an event will have a specific outcome or a specific set of possible outcomes. People use probability distributions because they help find the likelihood of an event and its long-term average outcome, and they help estimate how uncertain events are. Using distributions in my experiment was important because they allowed me to see the true probability of a specific outcome of an event, which let me select game outcomes more accurately and understand the true probabilities based on historical data. All three of my models included distributions. Model 1 predicted winners based on regular-season performance and included a distribution around the Sagarin data. Model 2 predicted the right number of winners and included a distribution around the seed win probability along with the Sagarin distributions. Lastly, Model 3 predicted upsets by round and included a distribution around the number of upsets per round, to help me determine a realistic number of upsets to select per round, along with a distribution around the Sagarin data.

Monte Carlo Models:  

For my experiment, I created three Monte Carlo simulation models each exploring different ways of predicting a March Madness bracket and exploring probability to help me figure out what components go into predicting an accurate March Madness bracket.

A Monte Carlo simulation is a model used to predict the probability of different outcomes when random variables are present. This type of model randomly selects values for these variables from a probability distribution over many iterations to predict the probability of an outcome.

 

Model 1: Regular Season Performance

The first model that I built predicted winners through regular-season performance. This model used the Sagarin rating of each team’s regular-season performance and defined a normal distribution with the Sagarin value as the mean. Each team’s Sagarin rating was generated randomly using the distribution defined for that team, and then these Sagarin values were compared to simulate the uncertainty of a matchup. The winning team was the team with the higher Sagarin rating for that iteration. The final experiments were run with 5,000 iterations.
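Here is a minimal Python sketch of the Model 1 idea. My real model was built with @Risk in a spreadsheet, and the standard deviation and ratings shown here are assumed values, not the ones from my model:

```python
import random

def simulate_matchup(sagarin_a, sagarin_b, sd=7.0, iterations=5000):
    """Sample each team's rating from a normal distribution centered on its
    Sagarin rating and count how often team A comes out on top."""
    wins_a = sum(random.gauss(sagarin_a, sd) > random.gauss(sagarin_b, sd)
                 for _ in range(iterations))
    return wins_a / iterations

# 2017 East region 1-vs-16 example (illustrative ratings, not the real values):
print(simulate_matchup(97.0, 70.0))  # close to 1.0 -> Villanova wins almost every iteration
```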

 

You can see from the probability distributions for Villanova and Mount St. Mary’s in the 2017 East region 1 vs 16 matchup that Villanova wins most (99%) of the matchups run. The area where the distributions overlap is where the model can predict a Mount St. Mary’s win (1%).

 

Model 2: Predicting the right number of winners through seeding

This model tracks the chance a team with a given seed has of advancing to the next round. The way the model works is that Sagarin ratings are picked from the probability distributions for each team. Then the Seed Win% is picked from a probability distribution based on the historic data for that matchup. For example, based on historic data an 8 seed vs 9 seed matchup favors the eighth seed over the ninth seed by 2% on average, but in some years the ninth seeds outperform the eighth seeds and in other years it's the other way around. As you can see from the image below, a probability distribution describes this better, which is why distributions and many tests are really required to determine what the real chances of an upset are. I used the Sagarin rating to rank all the teams at a certain seed from best to worst; if the Seed Win% is 51% vs 49% (8 vs 9), then the top two eighth seed teams advance and the bottom two drop out, and the two ninth seed teams matched up with those bottom two eighth seed teams advance instead. This is how the model forces upsets, keeping higher-ranked Sagarin-rated teams and dropping lower-ranked teams based on the historic probability distributions for seed matchups.
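A rough sketch of the Model 2 mechanics for a single seed line, with made-up ratings and an assumed spread around the historic Seed Win%; the real model handled every seed line inside @Risk:

```python
import random

# Four hypothetical 8 seeds with made-up Sagarin rating means.
eight_seeds = {"Team A": 84.0, "Team B": 82.5, "Team C": 81.0, "Team D": 79.5}
sampled = {team: random.gauss(mean, 5.0) for team, mean in eight_seeds.items()}

# Sample this iteration's Seed Win% around the historic ~51%, then decide how many
# 8 seeds advance; the remaining 9 seeds go through as forced upsets.
seed_win_pct = random.gauss(0.51, 0.05)
advancing = max(0, min(4, round(4 * seed_win_pct)))
ranked = sorted(sampled, key=sampled.get, reverse=True)
print("8 seeds advancing:", ranked[:advancing])
print("8 seeds upset by their 9 seed opponent:", ranked[advancing:])
```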

 

 

Model 3: Predicting upsets by round

The last model I created, predicting upsets by round, measures how many upsets should happen per round according to historical data dating back to 1985. This model also has distributions around it so that the number of upsets varies each year but the results always tend back toward the mean. This lets us determine the number of upsets that should happen per round, so that we don't pick too many or too few, and so that the number of upsets per round varies like the historic data modeled by the distribution. Similar to the other models, Sagarin ratings are picked for each team from that team’s Sagarin rating probability distribution. Next, the number of upsets is picked from the probability distribution I got from the historic tournament data.

The image below shows how the average upset probability of 44% in the fourth round is really a distribution varying from 0 - 100% with a mean of 44%. This means most of the time there will be 2 upsets, but sometimes there will be one or three, and a very small number of times there will be none, or all four games will be upsets. This matches the historic data.

 

Once we know how many winners and upsets we need per round, we calculate the Sagarin difference for each matchup each time the model runs. The Sagarin difference is the Sagarin rating of the higher-seeded team minus the Sagarin rating of the lower-seeded team. If this difference is negative then the matchup is already an upset, so I keep it for one of the upset spots. Then I take all the positive Sagarin differences and rank the matchups by them, putting the biggest difference at the top and the lowest at the bottom. I drop the lowest-ranked matchups, turning them into upsets, until I have the right number of upsets.
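The upset-selection step of Model 3 can be sketched like this; the matchup names, Sagarin differences, and upset count are made-up values, and the real calculation ran inside the @Risk spreadsheet:

```python
# Each matchup: (higher seed, lower seed, Sagarin difference = higher seed rating - lower seed rating).
matchups = [("Higher A", "Lower A", 12.3), ("Higher B", "Lower B", 4.1),
            ("Higher C", "Lower C", -1.5), ("Higher D", "Lower D", 0.8)]
upsets_needed = 2  # in the real model this comes from the round's upset distribution

upsets = [m for m in matchups if m[2] < 0]                 # negative difference: already an upset
favorites = sorted((m for m in matchups if m[2] >= 0),
                   key=lambda m: m[2], reverse=True)       # biggest difference at the top
while len(upsets) < upsets_needed and favorites:
    upsets.append(favorites.pop())                         # drop the lowest-ranked favorite
print("Upset picks:", [lower for _, lower, _ in upsets])
```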

 

Each of these models has its strengths and weaknesses, but together they allowed me to choose the results of my bracket.

 

Additional Models

I created three additional random model variations based on my initial three models. I did this by setting the Sagarin values for the distributions equal, so the model picked randomly from the same distribution and all teams had the same chance of winning a matchup. For example, in Model 1 both teams in every matchup had a 50/50 chance of winning because the probability distributions were the same. In Model 2 all the Sagarin ratings were set to 50 but I kept the Seed Win%, which helped me understand how seeding by itself affected prediction accuracy. For Model 3 the Sagarin rating was set to 50 for all teams and I kept the round upset probability.

 

 

 

Observations

 

  1. This project was challenging and I needed a lot of help to get through all the work in the time I had. 
    1. Using Sagarin data instead of trying to figure out the regular season data on my own. 
    2. Modifying a model provided by Pulvermacher from Nighthawk Intelligence was way easier than building one from scratch.  
    3. I also got help from my dad to build the models and use the Monte Carlo software. 
  2. When looking at models 101, 102, and 103 I noticed that 101 and 103 acted the same and 102 did a lot better. 
    1. All teams in the models were matched up randomly, but
    2. Model 102 still kept the seed win% which made it hard for lower seeds to advance. 
    3. Models 101 and 103 gave lower seeds an equal chance to advance through the rounds. 
  3. Model 002 could have performed better if it had been built differently. 
    1. The model uses the seed win% to predict how many teams from that seed should advance to the next round.
    2. The way that teams are chosen to advance is based on ranking the strength of the team (Sagarin rating) and not based on the strength of the match-up (Sagarin difference = Sagarin rating of higher seed team - Sagarin rating of lower seed team). 
    3. Making this change would better explore the chances a given team with a given seed has of winning a game against another given team with another given seed. 
  4. Model 003 is built based on the Sagarin differences of each match-up and it performs well which indicates that we should use Sagarin difference rather than just Sagarin to determine the strength of a matchup.
  5. When I first built model 002 I used average seed win%. I ended up updating this model to Model 002.1 by adding probability distributions for seed win%. This increased the range of the prediction accuracy.
  6. I tried running some of the models with 100, 1,000, 5,000, and 10,000 iterations. Increasing the number of iterations widened the range of prediction accuracy and increased the accuracy of the model prediction. The improvement after exceeding 5,000 iterations was very small and the model running time was quite a bit longer, which may show that it is not worth exceeding the 5,000-iteration mark.
  7. Out of models 001, 002, and 003, model 001 picks the fewest upsets. Although model 001 doesn’t pick many upsets it still performs well, showing that predicting winners is most important and that it is very hard to pick the right upsets even if you get the number of upsets in the tournament, and even per round, right.
  8. One way to improve models 002 and 003 may be to use the Sagarin Recent rating to improve upset picking. Since this rating gives more value to a team’s end-of-season performance it may help to show which teams are performing the best going into the tournament which may be a better way of picking upsets.

Analysis

For Berkshire Hathaway employees, picking the Sweet 16 perfectly means a $1 million a year prize. For everyone else, a one billion dollar prize was available in 2014 for a perfect bracket prediction. An impossible feat? Or maybe not. The models I created for 2018 and 2019 use Sagarin rating data and some other data variables in order to pick within a specific range of choices. The models can run these iterations at very high speeds, as many as 100,000 times; I chose 5,000 iterations for my experiments, with each iteration giving a unique bracket selection. After all the iterations are run, the final result displayed is based on the starting values used (true Sagarin ratings, average seed win%, and average upsets per round) and gives a final bracket result. The model also keeps track of the prediction accuracy for every iteration by comparing the model prediction against the actual results of the 2018 or 2019 tournament, and tells us what our correct picking percentage is for both the Sweet Sixteen and Perfect Bracket picks. As you can see, even with 5,000 unique bracket picks none of the models picked a perfect bracket or Sweet Sixteen. The image below is an example of the Perfect Bracket prediction results that the software gives after the 5,000 iterations are run. I simplified this plot to show only the prediction% range and the prediction based on the starting values used for the model variables (red dot). The results are shown in the four graphs below the next paragraph.

Something that I noticed about these models is that the models based on Sagarin ratings (models 001, 002, 002.1, and 003) did better than the models that randomly selected teams (models 101, 102, and 103). The Sagarin-based models also pick better brackets when using the starting values for Sagarin ratings, Seed Win%, and Round Upset%; these are the bracket prediction values shown in red on the graph. This makes the models based on Sagarin data more reliable because it reduces the unknown. Model 003 is the most accurate model for predicting the Sweet Sixteen and the Perfect Bracket for both 2018 and 2019, and Model 001 is a close second.

The models also keep track of the win% for each team for each round over all the iterations, helping me see how likely a team is to progress through the tournament and even to win it. This is helpful because, at the end of the day, these models don’t tell me what is going to happen; they only help me understand what is likely to happen. As much as the computer can help, there is still some unpredictability in this game.


 

 

Conclusion

Conclusions:

Going into this experiment I had a very specific goal: to use my understanding of probability to build Monte Carlo simulation models that accurately predict the Sweet Sixteen and a perfect bracket. However, I did not achieve this goal. The closest to a perfect bracket was 79% accuracy and the closest to a perfect Sweet Sixteen prediction was 85% accuracy, both using Model 003 for the 2019 tournament.

Despite this, I learned so many things along the way that I can apply to my everyday life. My learning about probability will also support my understanding of many things, including health decisions, sports, and even scientific subjects I am interested in.

Looking toward the 2021 NCAA tournament, I am going to make some of the adjustments to the models that I talked about in my project, in hopes of improving their prediction accuracy. Then I am going to make my prediction for the 2021 NCAA tournament as I continue on my journey to the perfect bracket.


What I have learned:

Despite the fact that I didn't achieve the goal I set out to achieve I did gain many things from my project. This project really helped me to understand probability and how I can relate it and use it in real-world situations or events. I learned how to create models and use formulas to tell the computer what I want it to do. Throughout this project, I learned how to interpret sets of data and gain insights related to probability, ranges of results, and prediction percentages. I have also learned how to think scientifically and make decisions in situations that are uncertain based on information that I have.

 

Taking this learning further:

This project has given me insight into how people make decisions in different fields, such as public health decisions about COVID, how sports team owners make decisions for their team on the field and in the boardroom, and even how self-driving vehicles use probability when making decisions on the road. Investment bankers use probability when making decisions about buying companies, funding projects, trading options, or selling stocks. Also, my growing understanding of probability is helping me build a foundation to better understand subjects like quantum mechanics that I have an interest in. Many things in our world today are based on an understanding of probability, so whether you are taking care of a family budget or trying to win an election, you have to be aware of probabilities.

 

Final steps:

Looking toward the 2021 NCAA tournament, I am going to make some of the adjustments to my models that I talked about in my project, in hopes of improving their prediction accuracy. Then I am going to make my prediction for the 2021 NCAA tournament as I continue on my journey to the perfect bracket.


 

Application

First of all, the main thing I am going to do going forward with the new knowledge I have gained is to pick my 2021 March Madness bracket using each of my three probability-based models, in my continued attempt to achieve the perfect bracket prediction.

Other Applications:

As I mentioned in my conclusion, I can apply my new knowledge of probability to real-world events that happen every day in our lives. I can apply it to things such as the public health decisions made around COVID and other events, to how sports teams and team owners make decisions on the field or in the boardroom, and even to self-driving vehicles that use probability to make decisions on the road every minute. My new understanding of probability can also help me build an understanding of probability-based subjects like quantum mechanics, by making the unknown reachable.

Another application of this project is the process of building probability-based models for other probability-based experiments I want to run. There is so much data in our world today, ranging from apps that track my sports performance to home technology that tracks home systems and protects home information. I can now interpret sets of data and gain insights related to probability, ranges of results, and prediction percentages. Lastly, I will now be able to think scientifically and make decisions in uncertain situations based on the information that I have.

 

Sources Of Error

Through my experiment and the gathering of my results, I noticed a few things that could have been sources of error in my project, and that, if fixed, could have meaningfully improved my results. I noticed errors in my data cleaning, data analysis, and model building which I could have fixed with more time.

Data Cleaning:

In the process of data cleaning, I had to go through the school names by hand to account for each team's Sagarin score and ranking. I had to make sure that the names used by Sagarin matched the names used by the people who compiled the NCAA historic data. Because I did this by hand, there is a possibility that I matched one or two of the names incorrectly. This proved to be the case in testing, as some matchups showed up with N/A or an error because the names didn't match. This is a minimal issue, but it still needs to be accounted for as it can cause problems. Another source of error in the data cleaning portion of my project was that all the transferring and cleaning of the Sagarin data from the USA Today website was done by hand, and when something is done that way there is always a possibility of error because no human is perfect.

Data Analysis:

In the process of data analysis, I decided to add distributions to all my models to help tell the whole story of my data and to help me predict upsets more accurately. However, when defining these distributions we took some shortcuts to save time and extra work; for example, when matching the distributions to the historic data I matched the two closely but not perfectly. This could have introduced some error into my project. Another source of error was that I had 33 years of training data and only two years of testing data. It seems like a lot of data, but in reality it is not, because some of the seed match-ups I was trying to predict a win% for almost never happen, or have never happened at all. This made some of the win% information unreliable.

Model Building:

In the process of model building, I noticed a few small errors as well. When I tried to build the models to match the seed win percentages and the round upset percentages, adding this specific data sometimes made the models too complicated or too hard to build. Some of the shortcuts I used to simplify the effort and complexity of building these models led to errors, like the one I talked about in the previous paragraph. Other errors may have arisen because the models used a lot of formulas, and with many formulas it is easy to make a small mistake in one of them. I believe that we cleaned up most of the issues in these formulas, but a few small errors still showed up in model testing.

Citations

 

  1. “Applying Machine Learning To March Madness.” KDnuggets, Adit Deshpande, https://www.kdnuggets.com/2017/03/machine-learning-march-madness.html. Accessed 14 December 2020.
  2. “Build a Winning March Madness Bracket.” Towards Data Science, Jane Thompson, https://towardsdatascience.com/march-madness-9212109bc8e8. Accessed 14 December 2020.
  3. “Machine Learning Applied to March Madness.” eCapital Advisors, Max Craduner, https://ecapitaladvisors.com/blog/machine-learning-applied-to-march-madness/. Accessed 14 December 2020.
  4. “Jeff Sagarin Ratings.” USA Today, Jeff Sagarin, https://www.usatoday.com/sports/ncaab/sagarin/. Accessed 23 January 2021.
  5. “Statistics and Probability.” Khan Academy, Salman Khan, https://www.khanacademy.org. Accessed 14 December 2020.
  6. “Statistics and Probability.” 3Blue1Brown, Grant Sanderson, https://www.youtube.com/c/3blue1brown/videos. Accessed 14 December 2020.
  7. “Risk Analysis Example Model.” Palisade, https://www.palisade.com/risk/. Accessed 20 December 2020.
  8. “NCAA Tournament Results.” data.world, Michael Roy, 8 January 2017, https://data.world/michaelaroy/ncaa-tournament-results. Accessed 31 December 2020.
  9. “Risk Analysis.” Palisade, https://www.palisade.com/risk/. Accessed 23 January 2021.
  10. “College Basketball Stats and History.” Sports Reference (SRCBB), https://www.sports-reference.com/cbb/. Accessed 20 December 2020.
  11. Patou Zeleke (dad)

 

Acknowledgement

I would like to express my thanks and gratitude to all of the people and sources that helped me to successfully complete the science fair project I put together this year.

First, I would like to give a special thanks to the company RPS Energy, who provided me with access to the @Risk software for my personal use during this project. Secondly, I would like to express my appreciation to my dad, Patou Zeleke, for assisting me through the model building process and throughout the research process. I would also like to give a huge thanks to all of the sources whose information I used throughout this project. Lastly, I would like to give a special thank you to my teacher, Karen Burkell, who provided me with the opportunity to do my project and who also provided me with information and assistance throughout.

I would like to thank all the people and sources who contributed to my project and its success as without you this wouldn't have been possible.