Statistical enhanced learning for modeling and prediction tennis matches at Grand Slam tournaments
Abstract
In addition, interpretable machine learning (IML) tools are used to gain insight into the factors that drive the tennis match outcomes predicted by complex machine learning (ML) models such as the random forest. Specifically, partial dependence plots (PDPs) and individual conditional expectation (ICE) plots are used to interpret the most promising ML model from this work. Furthermore, we compare different regression and machine learning approaches in terms of several predictive performance measures, such as classification rate, predictive Bernoulli likelihood, and Brier score. This comparison is carried out on external test data using cross-validation, rolling-window, and expanding-window strategies.
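The three performance measures and the two window strategies named above can be illustrated with a minimal Python sketch. It assumes the usual definitions: classification rate as the share of matches predicted correctly at a 0.5 threshold, predictive Bernoulli likelihood as the average probability assigned to the observed outcome, and Brier score as the mean squared difference between predicted probability and outcome. The function names, the `min_train` and `window` parameters, and the idea of splitting by tournament period are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean


def classification_rate(y_true, p_hat, threshold=0.5):
    """Share of matches whose binary outcome is predicted correctly."""
    return mean(
        1.0 if (p > threshold) == bool(y) else 0.0
        for y, p in zip(y_true, p_hat)
    )


def bernoulli_likelihood(y_true, p_hat):
    """Average predicted probability of the outcome that actually occurred."""
    return mean(p if y == 1 else 1.0 - p for y, p in zip(y_true, p_hat))


def brier_score(y_true, p_hat):
    """Mean squared difference between predicted probability and outcome."""
    return mean((p - y) ** 2 for y, p in zip(y_true, p_hat))


def window_splits(periods, min_train=1, window=None):
    """Yield (training periods, test period) pairs in temporal order.

    With window=None the training set expands over time; with an integer
    window it rolls forward and keeps only the most recent `window` periods.
    """
    for i in range(min_train, len(periods)):
        start = 0 if window is None else max(0, i - window)
        yield periods[start:i], periods[i]


# Hypothetical outcomes (1 = first-named player wins) and predicted probabilities.
y = [1, 0, 1, 1]
p = [0.8, 0.3, 0.4, 0.6]
print(classification_rate(y, p), bernoulli_likelihood(y, p), brier_score(y, p))

# Hypothetical tournament years, used only to show the two splitting schemes.
years = [2011, 2012, 2013, 2014]
print(list(window_splits(years, min_train=2)))            # expanding window
print(list(window_splits(years, min_train=2, window=2)))  # rolling window
```

Higher values are better for the classification rate and the Bernoulli likelihood, while lower values are better for the Brier score. Both window schemes only ever train on periods that precede the test period, which is what distinguishes them from ordinary cross-validation.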