One of the main challenges of football analytics is ensuring
that our manipulation of the available data is in fact uncovering underlying
“qualities” of teams and players, instead of just randomly picking up
statistical noise or irrelevant facts. I could certainly use the available data
and assign a number to each player by summing up his number of blocked shots
plus the square root of his number of headed shots inside the area, divided by
the goal difference his team obtained with him on the pitch, multiplied by his
number of interceptions. Can I use this number in any way to advise a club on
whether they should buy him? Probably not. How can I know what is valuable?
Recall from the previous entries on team passing
motifs that a main reason I stated the methodology was picking up on a
stable quality of passing style was
the fact that it was consistent across consecutive seasons. If the methodology were
just randomly assigning motif distributions, then surely there would be no
consistency between different seasons.
The implication then is this: if a certain vectorisation of
the data is in a certain sense “stable” across seasons, then that vectorisation
is representative of an underlying quality of the data observations. Metrics
intended to measure qualities which one would expect to be stable over seasons,
such as “playing style” or “potential”, should lend themselves to validation in this way.
The question then is how the details of this
validation would go. In this entry, I’ll go through a “validating methodology” that
I’ve been working on lately:
Take a vector
representing a team or a player for a given season (something like the 5-dimensional
vector representing a team in the passing motifs methodology). If my
reasoning above is correct and the vector contains valuable information
regarding that player/team, then the equivalent vector for the
season directly before should in theory be in some sense “close” to
it. The “closeness” of two vectors is of course a relative concept, so it
should be measured in relation to the average distance between any pair of
vectors.
As an example: If Juan Mata’s vector for 2014-15 is at a
distance of 2.3 from his vector for 2015-16, and on average the distance
between any two player vectors (not necessarily from the same player) in this
context is 9.5, then we can say with reasonable certainty that Juan Mata’s
vectors are “close” to each other.
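This back-of-the-envelope comparison can be written out directly (the numbers are the ones from the example above):

```python
# Numbers from the Juan Mata example above.
mata_distance = 2.3       # distance between Mata's 2014-15 and 2015-16 vectors
mean_all_distance = 9.5   # average distance between any two player vectors

# A ratio well below 1 means the two vectors are unusually close
# relative to the typical distance between any two players.
ratio = mata_distance / mean_all_distance
print(round(ratio, 2))  # prints 0.24
```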
The method I wrote takes as parameters the two
vectorisation matrices for two consecutive seasons, normalises them, considers
only players who have played at least 18 matches in each season, and prints out
the following:
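A minimal sketch of such a method might look like the following. The function name, the column-wise normalisation, and the input layout (rows aligned by player, with a separate array of matches played) are my assumptions for illustration, not the actual code behind the table.

```python
import numpy as np

def season_stability(vecs_a, vecs_b, matches_a, matches_b, min_matches=18):
    """vecs_a, vecs_b: (players x dims) vectorisation matrices for two
    consecutive seasons, rows aligned by player.
    matches_a, matches_b: matches played by each player in each season."""
    # Keep only players with at least `min_matches` in both seasons.
    keep = (matches_a >= min_matches) & (matches_b >= min_matches)
    a, b = vecs_a[keep], vecs_b[keep]

    # Normalise each coordinate to zero mean and unit variance.
    both = np.vstack([a, b])
    both = (both - both.mean(axis=0)) / both.std(axis=0)
    a, b = both[: len(a)], both[len(a):]

    # Mean distance between the two vectors of each player.
    same_player = np.linalg.norm(a - b, axis=1).mean()

    # Distances over all pairs of vectors, same player or not.
    i, j = np.triu_indices(len(both), k=1)
    all_dists = np.linalg.norm(both[i] - both[j], axis=1)

    # Negative when same-player vectors are closer than average.
    z1 = (same_player - all_dists.mean()) / all_dists.std()
    print(f"mean same-player distance: {same_player:.2f}")
    print(f"mean distance, all pairs:  {all_dists.mean():.2f}")
    print(f"Z-Score 1:                 {z1:.2f}")
    return z1
```

Run on two seasons of genuinely correlated vectors, the printed same-player mean should come out well below the all-pairs mean, giving a negative Z-Score 1.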
Here’s
what we want to look at in this table: first of all, the lower the mean
distance between the two vectors of each player, the better our methodology is,
according to the reasoning above. However, a “low” number is a very relative
concept, so we need a reference against which to measure how low this number
actually is. This methodology provides two such references:
- The first and most important one is the mean distance between all vectors, not just between the two corresponding to each player. This gives us an idea of how far apart any two vectors in this context typically are, and whether the “closeness” of vectors of the same player is significant. Z-Score 1 is the mean distance between the vectors of the same player minus the mean distance between all vectors, divided by the standard deviation of the distances between all vectors. The lower this number (going into negative values, of course), the better.
- The second reference provided is the mean distance between simulated Gaussian vectors in the dimension of the problem. Z-Score 2 is the mean distance between the vectors of the same player minus the mean distance between the simulated Gaussian vectors, divided by the standard deviation of the distances between the Gaussian vectors. I feel this is also an important frame of reference because it gives a measure of just how “normalised” the scaled problem is. It also provides important “dimensional” context: if our vectorisation is in 15 dimensions as opposed to 5, the raw distances will increase, but this does not necessarily mean that the higher-dimensional vectorisation is less valuable; the numbers we deal with in higher dimensions are simply naturally larger, and we need to know this to judge how “low” our mean distance number really is. Hence the importance of the Z-Scores.
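The Gaussian reference can be sketched as follows. The dimension, sample size and the observed same-player distance of 2.3 (borrowed from the Juan Mata example) are illustration values, not real results:

```python
import numpy as np

# Simulate standard normal vectors in the dimension of the problem and
# measure how far apart any two of them typically are.
rng = np.random.default_rng(0)
dim, n_sim = 5, 500
gauss = rng.normal(size=(n_sim, dim))

# Distances between all pairs of simulated vectors.
i, j = np.triu_indices(n_sim, k=1)
gauss_dists = np.linalg.norm(gauss[i] - gauss[j], axis=1)

# Raw distances grow with `dim`, which is why raw numbers alone cannot be
# compared across vectorisations of different dimension.
print(f"mean pairwise distance in {dim} dimensions: {gauss_dists.mean():.2f}")

# Plugging in a hypothetical observed mean same-player distance of 2.3:
z2 = (2.3 - gauss_dists.mean()) / gauss_dists.std()
print(f"Z-Score 2: {z2:.2f}")
```

Rerunning the snippet with `dim = 15` shows the mean pairwise distance rising even though nothing about the "players" changed, which is exactly the dimensional effect the Z-Score is meant to correct for.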
This entry was a bit technical and perhaps less interesting
for the average football fan, but I thought it was important to explain it
because it’s what I’ve been using to understand how to best translate the
passing motifs problem to a player context. I’m looking to follow up this entry
with an applied example comparing different vectorisations of passing motifs at
a player level very soon (2-3 days hopefully if I can find the time), so stay
tuned!