Making a model to predict the Premier League winner 2022/23

srikaruchiha15
Dec 28, 2022
Updated: Jan 4, 2023

First of all, as a football fan who loves playing the game more than watching it, I do not believe that any model can truly predict a sports result. We can have models that predict whether you will have a disease, whether an email is spam, or how likely you are to lose a customer, but I don't think the same holds for sports. That's just my opinion.

Why? Look at the 2020-2021 season. Liverpool, the defending champions, lost 7-2 to Aston Villa and still finished the season in third place. Scores don't mean everything and one game doesn't define your season. There are more controllable variables in football than uncontrollable ones, the uncontrollable ones being injuries and unforeseen suspensions, over which you have no influence. That is not the case with many diseases, where most variables other than nutrition and exercise are outside your control. No amount of motivation will get rid of your cold, but comeback stories are more frequent in sports than in medicine.

So I made this model for fun, taking all things lightly, and it has more flaws than the ones mentioned at the end. I just wanted to see if Arsenal would win with their current form, given the data of previous champions. Keep reading to see how this model is built and what seems like a better option!

Inspecting the dataset

I will be using Python for this model. Here is the dataset I scraped from: Premier League Table

import pandas as pd
# Column names are read from the second row of the sheet
df=pd.read_excel('Data.xlsx',header=[1])
df.head(3)

The dataset uses the conventional statistics of a football game. Here is a description of what each variable is about:

Played: Total Matches played by the 15th week
Won: Total matches won
Drawn: Total matches drawn
Lost: Total matches lost
GF: Total number of goals scored by the team
GA: Total number of goals conceded by the team
GD: The net goals scored (GF - GA)
Points: The number of points won
Season_won: If the club won that season or not; 1: Win, 0: Loss
Position During 15th Week: If the club was in first position or not in the 15th week; 1: Top of the table, 0: Not on top of the table

Checking correlation between winning the season and standings during the 15th week

We check whether there is a correlation between the 'Season_won' column and the 'Position During 15th Week' column. Since both are binary categorical variables, we use the phi coefficient (available in scikit-learn as the Matthews correlation coefficient) rather than the usual Pearson's r for continuous variables.

from sklearn import metrics
metrics.matthews_corrcoef(df['Position During 15th Week'],df['Season_won'])

[Out]: -0.30944461244261673

We observe a weak-to-moderate negative correlation, which suggests that, in this dataset, teams that are not on top of the table during the 15th week are somewhat more likely to end up winning the season. This value is not used by the model itself; it is just a check on how the two columns relate. We have 8 other variables that can influence our model.
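For two 0/1 columns, the phi coefficient can be worked out directly from the 2x2 table of counts, and it gives the same number as Pearson's r computed on the 0/1 values. Here is a minimal sketch of that idea; the two lists are made-up values for illustration, not the article's data, and the helper names are my own.

```python
import math

def phi_coefficient(x, y):
    """Phi coefficient of two equal-length 0/1 sequences, from the 2x2 count table."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # If one column is constant the coefficient is undefined; return 0.0 here
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

def pearson_r(x, y):
    """Plain Pearson correlation, used only to show it matches phi on 0/1 data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Hypothetical example values (not from Data.xlsx)
top_at_week_15 = [1, 0, 1, 0, 0, 1, 0, 1]
season_won = [0, 1, 1, 1, 0, 0, 1, 0]

print(phi_coefficient(top_at_week_15, season_won))  # -0.5
print(pearson_r(top_at_week_15, season_won))        # -0.5
```

A negative value means the two 1's rarely occur together in the same row, which is how the result above should be read.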

Here is the correlation between all the variables:

A few variables, such as 'Won', 'Points' and 'GD', are highly correlated with one another. In practice, we usually assess correlation between variables only after building the model, which is what I did in my Jupyter Notebook, but for this article I have skipped that part and dropped these variables directly.

X=df.drop(['Season_won','Club',"Won",'GD','Points'],axis=1)
Y=df['Season_won']
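Part of that correlation is built in by definition: GD is GF minus GA, Won is whatever is left of Played after draws and losses, and under the Premier League's scoring (3 points for a win, 1 for a draw) Points follows from Won and Drawn. A small sketch showing that the dropped columns can be rebuilt from the ones that are kept; the rows are invented numbers, not taken from Data.xlsx, and `rebuild_dropped` is a hypothetical helper.

```python
# Hypothetical rows using only the columns that remain in X
rows = [
    {"Played": 15, "Drawn": 1, "Lost": 2, "GF": 40, "GA": 14},
    {"Played": 15, "Drawn": 4, "Lost": 3, "GF": 28, "GA": 17},
]

def rebuild_dropped(row):
    """Recompute Won, GD and Points from the columns that were kept."""
    won = row["Played"] - row["Drawn"] - row["Lost"]
    return {
        "Won": won,
        "GD": row["GF"] - row["GA"],
        "Points": 3 * won + row["Drawn"],  # 3 per win, 1 per draw
    }

for row in rows:
    print(rebuild_dropped(row))
# {'Won': 12, 'GD': 26, 'Points': 37}
# {'Won': 8, 'GD': 11, 'Points': 28}
```

Since the dropped columns carry no information beyond what is already in X, removing them loses nothing and avoids feeding the model redundant inputs.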

Making the model

Now there are two ways to go about making the model: one is the GLM (Generalized Linear Model) and the other is the Logistic Regression model (logistic regression is itself a special case of the GLM). I have done both in Jupyter, and you can find them in my GitHub, but for this article I have only considered the Logistic Regression model.

Fitting the model

We use the statsmodels module to make our logistic regression model.

import statsmodels.api as sm
logs=sm.Logit(Y,sm.add_constant(X)).fit()

Then we generate predictions from the model:

If the predicted probability is greater than or equal to 0.5, we will consider that as winning the season.

y_pred=logs.predict(sm.add_constant(X)).map(lambda x: 1 if x >= 0.5 else 0)
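Under the hood, a logistic regression turns a weighted sum of the features into a probability using the logistic (sigmoid) function, and the 0.5 cut-off then turns that probability into a 0/1 label. Here is a minimal sketch of those two steps; the intercept and coefficients are invented for illustration and are not the fitted values from `logs`.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, intercept, coefs):
    """Probability of class 1 given feature values and (hypothetical) coefficients."""
    z = intercept + sum(coefs[name] * value for name, value in features.items())
    return sigmoid(z)

def classify(p, threshold=0.5):
    """Same rule as in the article: 1 if p >= threshold, else 0."""
    return 1 if p >= threshold else 0

# Hypothetical coefficients and club stats, for illustration only
intercept = -2.0
coefs = {"GF": 0.10, "GA": -0.05}
club = {"GF": 30, "GA": 12}

p = predict_proba(club, intercept, coefs)
print(round(p, 3), classify(p))
```

A probability of exactly 0.5 is labelled 1 under this rule, matching the `>=` in the lambda above.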

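Here are the first 5 rows of the final dataset with the predictions attached:

test=df.assign(Predicted=y_pred)  # assumption: 'test' and the 'Predicted' column name are not defined in the article
test.head()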

Checking the model accuracy, we get:

# accuracy_score expects (y_true, y_pred); the result is the same either way
print(round(metrics.accuracy_score(Y,y_pred)*100,3),'%')

[Out]: 78.571 %

As a rough rule of thumb, an accuracy above 75% is often considered 'good'. Our model got about 78.6% of its predictions right, although this is measured on the same data it was trained on. Obviously, higher-accuracy models are preferred, but for now this will do.
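Accuracy is simply the share of rows where the predicted label matches the actual one, and it is easier to judge next to a naive baseline that always predicts the most common class. A small sketch with invented labels (hypothetical values, not the model's actual predictions):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Hypothetical labels, for illustration only
y_true = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0, 0, 1]

print(accuracy(y_true, y_pred))            # 0.8
print(majority_baseline_accuracy(y_true))  # 0.7
```

In this made-up case the model beats the baseline by only ten percentage points, which is why an accuracy figure on its own can look better than it is when one class dominates.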

Predicting the winner

We import the current standings dataset (Week 15) of the 2022-2023 season:

top5=pd.read_excel('Data.xlsx',sheet_name="Sheet2")
top5

Then, we use our model to predict the win probability:

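# Assumption: top5_train is not defined in the article; here it is built from the same feature columns as X
top5_train=sm.add_constant(top5[X.columns],has_constant='add')[logs.params.index]
pred=logs.predict(top5_train)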
top5['Prob_value_Log']=pred
top5

What does it say?

The probability value tells us who is most likely to finish first, and from this analysis we can say that it's most likely Manchester City, followed by Newcastle United and then Tottenham Hotspur.

Unfortunately, the model does not suggest that Arsenal have a chance to win, and it even gives Manchester United a better chance of winning the Premier League than Arsenal :(

Critique of the model

The model only uses 'conventional variables' like Goals Against, Goals For and position in the table. A positive goal difference (GF - GA) tells us that the club's overall offence and goal-scoring rate is good, but it doesn't show 'how good', since it is relative. A low Goals Against figure also highlights the defensive capability of the team. There are other variables that influence the game as well, such as the results of the past few games, whether the team has the top scorer, the top assist provider or the goalkeeper with the most clean sheets, home and away games, injuries, yellow cards and suspensions.
The model is biased. The dataset compares each club's position during the 15th week with the winner of that season. Since there is a winner every year, the number of 1's will be greater than the number of 0's.
The p-value tells us how likely we would be to see a coefficient estimate this far from 0 if the true coefficient were 0. It does not measure the actual strength of the variable's influence on the dependent variable, but it does indicate whether there is evidence of any effect at all. Several variables in the model have p-values greater than 0.05, so their effects are not statistically significant at the usual 5% level.
The model uses the club's standing at the 15th match week to decide the overall winner, which is why the logistic model was chosen. A random forest model might be a better choice, since it can use more variables and could be applied to specific games rather than to an entire season.
Football is a game where many variables can be calculated, but factors like motivation, mentality and collectiveness cannot be measured. One could ask every player how they are feeling before a match on a scale of 1-10, but doing that would be inappropriate for many reasons. Since these variables cannot be measured, we do not know how they will change the outcome of a game.

For example, the Liverpool - Aston Villa (2-7) game will always be an outlier for that season, but it also shows that the "tactically" stronger team does not always win and that having the best individuals in a team does not always win you a championship.
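On the p-value point in the critique above: in a logistic regression summary, each coefficient's p-value usually comes from a Wald test, which divides the estimate by its standard error and asks how surprising that ratio would be if the true coefficient were zero. Here is a rough sketch using the normal approximation; the coefficient and standard error values are made up and are not taken from the fitted model, and `wald_p_value` is a hypothetical helper.

```python
import math

def wald_p_value(coef, std_err):
    """Two-sided p-value for H0: coefficient = 0, using a normal approximation."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical (coefficient, standard error) pairs, for illustration only
examples = {"GF": (1.2, 0.4), "GA": (0.8, 0.5)}

for name, (coef, se) in examples.items():
    p = wald_p_value(coef, se)
    verdict = "significant" if p < 0.05 else "not significant"
    print(name, round(p, 4), verdict, "at the 5% level")
```

Note that a large coefficient with a large standard error can still have a high p-value, which is why the p-value says little about how strong an effect is.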

I will edit this article to report whether the model's prediction was right or not. If it was, well, it's most likely a coincidence, unless the model's predictions turn out to be accurate every year. So, stay tuned to know more!

Git Link: https://github.com/Srikar-R/Predicting-PL-Winner