4027502

The effectiveness of satisfying the assumptions of predictive modelling techniques: An exercise in predicting the FIFA World Cup 2006

The assumptions of statistical procedures are enforced more rigorously in some disciplines than in others. Outliers are often removed from data sets due to concerns over measurement error. However, when predicting the outcomes of sports performances, such outliers represent real and valid performances, such as Germany's 8-0 win over Saudi Arabia in the 2002 FIFA World Cup. Previous research into the accuracy of predictive modelling techniques has provided examples where the accuracy of models based on data that violate the relevant assumptions is greater than that of models where the assumptions were satisfied. The purpose of this investigation was to intentionally develop two sets of 6 models: one set based on untransformed data that violated the assumptions of the modelling techniques, and a second set where the data were transformed and outliers were removed in order to satisfy those assumptions. Data from 477 pool matches and 165 knockout matches from World Cups, European Championships, Copa America tournaments and African Cup of Nations tournaments from May 1994 to February 2006 were used to produce predictive models of match outcome (win, draw or lose) or goal difference with respect to the higher ranked team within each match according to the FIFA World rankings. The independent variables used were the difference between the teams' FIFA World rankings, the difference between the teams' distances from their capital cities to the capital city of the host nation, and the difference in recovery days from the previous match within the tournament. The two sets of models were used to predict the 2006 FIFA World Cup; 22 human predictions and 20 weighted random predictions were also produced. An evaluation process marked the predictions against the actual outcomes of matches in the 2006 FIFA World Cup, out of a total possible score of 64 points.
The mean accuracy of the models where the assumptions were satisfied was 38.67 points, which was similar to the 39.00 points for those where the assumptions were violated. However, the best individual model was a simulator in which the assumptions of the underlying multiple linear regression technique were satisfied (44.00 points). The models based on multiple linear regression were more accurate than those based on discriminant function analysis and binary logistic regression. The accuracy score of the 12 model-based predictions (38.83 ± 3.26) was significantly lower than that of the human predictions (42.95 ± 3.36; P < 0.017) but significantly greater than that of the weighted random predictions (31.05 ± 3.86; P < 0.017). These results provide evidence that challenges the value of satisfying the assumptions of discriminant function analysis, binary logistic regression and multiple linear regression.
© Copyright 2006 International Journal of Computer Science in Sport. Sciendo. All rights reserved.

Bibliographic Details
Subjects:
Notations: technical and natural sciences; sport games; training science
Published in: International Journal of Computer Science in Sport
Language: English
Published: 2006
Online Access: http://www.iacss.org/fileadmin/user_upload/IJCSS_Abstracts/Vol5_Ed2/IJCSS-Volume5_Edition2_Abstract_Donoghue.pdf
Volume: 5
Issue: 2
Pages: 5-16
Document types: article
Level: advanced