I spent most of the summer away in Colombia and then Croatia, and took a bit of
a break from football and math. Now I’m back in London and settling into my
routine. I didn’t spend much time in front of the computer over the summer and
have no finished results to show you yet, but even on holiday I couldn’t stop my
mind from drifting towards football and new ideas that could be applied to it.
I’m going to use this entry to tell you about my plans to explore some of these
ideas during this new “Analytics Season” leading up to the 2017 OptaPro Forum in
February, where I’m hoping to get the chance to present them.
Up until this point, I’ve spent most of the entries speaking about team passing
sequences and the results of their quantification through network motifs. This
is a very interesting topic, and I think there is still more to come from this.
These are some areas where I still hope to do some more work:
- The vectorisation through motif frequencies can be refined with some more information. For example, I’ve been thinking that different instances of ABAC represent very different kinds of combination play. An ABAC passing sequence can be composed of a short one-two between players A and B followed by a long ball from A to player C. Alternatively, it can simply be composed of three short passes. The distance of the third pass should be the main factor differentiating the types of ABAC, because in the vast majority of cases the ‘ABA’ part will be made up of short passes (if Coutinho gives it to Lallana and Lallana gives it back to Coutinho, we don’t expect either of those passes to have been long, otherwise Coutinho would have had to run a large distance after his first pass). Looking at the weights of the first principal component of the pass lengths across all instances of ABAC, almost 97% of the variance sits on the length of the third pass. It remains to be seen whether this ‘refinement’ can be used to further distinguish team playing styles.
NOTE: Think of Principal Component Analysis as a method that assigns
coefficients to features according to how much of the variance in a set of data
they contain. If we have the height and weight of a population of hippos and a
population of zebras, height is roughly the same across the whole set but
weight differs a lot, and Principal Component Analysis tells us precisely this:
the weight is where the variance is.
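To make the note above a little more concrete, here is a minimal sketch of that kind of check in plain Python. Everything in it is an assumption for illustration: the pass lengths are made-up numbers rather than real match data, the function names are hypothetical, and the first principal component is found with a simple power iteration instead of a statistics library. It is not the analysis behind the 97% figure, just the same idea on toy data.

```python
# Made-up ABAC instances: (length of pass 1, pass 2, pass 3) in metres.
# These numbers are purely illustrative, not real match data.
abac_pass_lengths = [
    (6, 5, 8), (7, 6, 35), (5, 7, 12), (8, 5, 42),
    (6, 6, 9), (7, 8, 28), (5, 5, 50), (9, 6, 15),
]


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length tuples."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centred = [[r[j] - means[j] for j in range(dims)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


def first_principal_component(cov, iterations=200):
    """Power iteration: returns (unit eigenvector, eigenvalue) of the
    largest eigenvalue of a symmetric covariance matrix."""
    dims = len(cov)
    v = [1.0] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)
    )
    return v, eigenvalue


cov = covariance_matrix(abac_pass_lengths)
loadings, eigenvalue = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_share = eigenvalue / total_variance

print("loadings on passes 1-3:", [round(x, 3) for x in loadings])
print("share of variance on first component:", round(explained_share, 3))
```

With toy data built like this (short first two passes, a third pass that varies a lot), the largest loading falls on the third pass, which is the pattern described above.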
- Even though we’ve spoken about “playing style” convincingly through this methodology, we still haven’t related this vectorisation to “success on the field” yet, which is actually what ultimately matters. It would be interesting for example to use Topological Data Analysis (you can read about it in my previous entry) to map out the motif vectorisation and discover where success in the league is being accumulated. We can also fit a probability distribution that gives the probability of a certain motif structure leading to a top 4 finish for instance. In this sense, we could potentially advise clubs on what they need to change in their passing play to increase their probability of finishing in the top 4, or of not being relegated, or of winning the league, etc.
- As I said before and showed with the Xavi example, this ‘passing motifs’ idea can easily be extended to the player level by vectorising each player’s frequency of participation in the different motifs. Once players are represented as vectors in this high-dimensional space, we can apply a whole arsenal of methodologies to answer questions such as which player is best suited to replace an outgoing player, which players have a similar style, and how individual players affect the passing motif structures of their teams. It remains to be seen whether this approach will quantify a meaningful underlying quality in players as we have shown it does for teams, but certainly more information is preferable to less.
- These topics (team style, player style and recruitment) can be combined; so for example we can advise clubs on how the recruitment of a certain player will affect their passing motif structure, and whether this change will improve their probability of a top 4 finish.
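For readers who want to see what the vectorisation in the list above looks like in practice, here is a small sketch. It assumes a motif is a window of three consecutive passes (four player slots) and that a possession is given as an ordered list of player names. The function names `motif_label` and `motif_frequencies` and the players “Player C” and “Player D” are made up for illustration, so treat this as one possible reading of the idea rather than the exact procedure used in earlier entries.

```python
from collections import Counter


def motif_label(players):
    """Relabel players in order of first appearance: A, B, C, ...
    so that only the pattern of the sequence matters, not who is in it."""
    labels = {}
    out = []
    for player in players:
        if player not in labels:
            labels[player] = chr(ord("A") + len(labels))
        out.append(labels[player])
    return "".join(out)


def motif_frequencies(possession, passes_per_motif=3):
    """Relative frequency of each motif over all sliding windows of
    `passes_per_motif` consecutive passes in one possession."""
    window = passes_per_motif + 1  # n passes involve n + 1 player slots
    counts = Counter(
        motif_label(possession[i:i + window])
        for i in range(len(possession) - window + 1)
    )
    total = sum(counts.values())
    return {motif: count / total for motif, count in counts.items()}


# A one-two followed by a pass to a third player is an ABAC.
print(motif_label(["Coutinho", "Lallana", "Coutinho", "Player C"]))

# A longer (made-up) possession yields one motif per sliding window.
possession = ["Coutinho", "Lallana", "Coutinho", "Player C", "Player D"]
print(motif_frequencies(possession))
```

Summing these counts over every possession of a team (or over the windows a given player takes part in) would give the kind of frequency vector discussed above. How overlapping windows are handled is a design choice this sketch does not try to settle.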
These are ambitious plans, but I think they are important ones, because I must
admit that a fair criticism that could be thrown my way is that so far the
results are very interesting theoretically but difficult to translate into
practical, applied contexts within the industry. This is fair, but it doesn’t
take away in the least from the value of the results obtained. The passing
motifs methodology has a lot going for it. It proved to be consistent across
different seasons, which is strong evidence that it identifies some underlying
inherent property, which we called “passing style”, instead of just randomly
picking up statistical noise. It was also used to identify a passing style
unique to Leicester City which was present even before their title-winning
season, something that no one could have predicted or expected. As I said, it
has a lot going for it.
The key counter-argument to this criticism is this (in my opinion): theoretical
work doesn’t need an obvious, direct and immediate practical application to be
valuable for a field. I strongly suggest that those with a true interest in the
topic of Football Analytics read this fascinating entry from StatsBomb author
Marek Kwiatkowski. Here’s an excerpt if you can’t be bothered to read the whole
thing:
“(I) believe that we have now reached the point where all obvious work has been
done, and to progress we must take a step back and reassess the field as a
whole. I think about football analytics as a bona fide scientific discipline:
quantitative study of a particular class of complex systems. Put like this it
is not fundamentally different from other sciences like biology or physics or
linguistics. It is just much less mature. And in my view we have now reached a
point where the entire discipline is held back by a key aspect of this
immaturity: the lack of theoretical developments. Established scientific
disciplines rely on abstract concepts to organise their discoveries and provide
a language in which conjectures can be stated, arguments conducted and findings
related to each other. We lack this kind of language for football analytics. We
are doing biology without evolution; physics without calculus; linguistics
without grammar. As a result, instead of building a coherent and ever-expanding
body of knowledge, we collect isolated factoids.”
When looking for conceptual theoretical developments, passing network motifs fit the
bill of a consistent and robust concept with a clear underlying motivation (representation
of “passing style”). Practical applications will inevitably follow from this
maturation of the discipline, and I have already outlined above some much more
practical approaches which can be looked into.
Finally, and before this entry gets any longer, this approach has given me
valuable insight into a type of conceptual processing that can be applied to raw
football data in order to obtain a meaningful representation. Football events
during a match are very dynamic, complex and interdependent, but they encode all
the information needed to determine results, quality, potential, etc. The
network motifs approach suggests taking the constituent blocks of the passing
events graph and applying an equivalence relation on the identity of the nodes
in order to study their nature (this simply means that instead of focusing on
the specific players performing the passes, we consider, for instance, any
occurrence of a one-two as belonging to the same “class” of pass motif
regardless of who the specific players were). It has made me think: why not
attempt this with other types of events? Consider for example a directed graph
representing a team’s performance in which, instead of the nodes representing
players and the edges passes, each node represents an “area” of the pitch and
an edge is simply the act of going from one area to another through a pass,
dribble, etc. A sequence in this network is simply a movement between different
areas. The equivalence relation on the identity of the nodes which I think would
be useful for this approach would work something like this: a play starting in
one area, then moving to an area three spaces to the right through a pass, and
then forward two spaces through a dribble would be classified exactly the same
regardless of the players involved and of whether it started inside our own
penalty box or on the halfway line.
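As a rough sketch of how that classification might work, the snippet below reduces a play to the sequence of actions and relative displacements between pitch areas, discarding both the players and the starting area. The grid coordinates, the `(across, up)` convention, the action names and the function `zone_motif` are all assumptions made for illustration; I haven’t settled on how the pitch would actually be divided.

```python
def zone_motif(start, steps):
    """Reduce a play to a tuple of (action, d_across, d_up) moves.

    start: (across, up) index of the area where the play begins; `across`
           grows to the right and `up` grows towards the opponent's goal
           (an assumed convention).
    steps: list of (action, destination_area) pairs, in order.
    """
    signature = []
    current = start
    for action, destination in steps:
        d_across = destination[0] - current[0]
        d_up = destination[1] - current[1]
        signature.append((action, d_across, d_up))
        current = destination
    return tuple(signature)


# Pass three areas to the right, then dribble two areas forward,
# starting inside our own box...
from_own_box = zone_motif((2, 0), [("pass", (5, 0)), ("dribble", (5, 2))])

# ...and the same movement starting from the halfway line.
from_halfway = zone_motif((1, 4), [("pass", (4, 4)), ("dribble", (4, 6))])

print(from_own_box)
print(from_own_box == from_halfway)
```

Both plays collapse to the same signature, so they would be counted as the same motif when building a frequency vector. Whether the starting area should really be thrown away entirely is exactly the kind of question I’d want to test.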
Vectorising team or player performance through the frequency of the motifs in
this context could lead to a very robust quantification of playing style, a
performance metric, a probability of success… who knows?! I don’t yet, but
let’s hope I can find out between now and February.