Monday, 16 January 2017

Player Vectorised Representations: What player lists can we draw up with confidence?

I love drawing up lists and rankings of players (who doesn’t?) and giving myself a big “confirmation bias” pat on the back when I see players I like on the list, while casually either dismissing the players I don’t particularly rate as a mistake of the method or updating my bias about them. However, the very exercise of drawing up lists and rankings can be misleading for the probabilistically illiterate because it seems to imply set-in-stone certainty about who the best player is, who the second best is, and so on; this rigid numbering masks the underlying probabilities. And yet, drawing up player lists is key for the recruitment workflows of clubs, be it in drafts or transfer windows, or even just to set up a schedule for their scouts. You don’t have to see the rankings as set in stone, but I can imagine clubs would want things like 15-man shortlists with 2 or 3 ‘favourite choices’. In this entry I’m going to show you a couple of lists I drew up and how vectorised representations of players let us go about our list-making with confidence.

I drew up lists for this entry using the player passing motif ideas from previous entries. The passing motif methodology produces a vectorised representation of players, which basically means that each player is represented by a vector of numbers. In the passing motif methodology I’ve used so far, the vector representing each player has 45 entries or numbers. The key conceptual bit is that when you have this type of vectorised representation, you can imagine each player as being in a “space” of some sort. To imagine it, suppose that instead of 45 you simply had 3 numbers representing each player, something like age, height and weight. If this was the case, you could imagine each player as being represented by a dot in a 3-dimensional space much like your living room. Some players would be closer to others, some would be farther away. Perhaps all the senior, tall and heavy centre-backs are located around your TV, while the shorter and lighter second strikers are hovering around your dining room table. This is just how I conceptualise the result of the 45-dimensional passing motif methodology. It makes it more abstract to picture, but just as in the 3-dimensional case, there are distances, certain dots closer to each other or concentrated around certain areas, etc.

The list I drew up basically took all the players who had at least 18 appearances in last year’s Premier League, and gave them “points” according to how many key passes they made AND how many key passes the players around them made. The closer to a player you are, the more “points” his amount of key passes awards you; the farther away the less. I tried this out in a few ways but that’s the basic idea. The idea is that if you happened to make few key passes in the season but all the players whose motif vector is close to yours made a lot, you should still have a high score. If the information contained in the motif vectorisation is at all useful to recognise players with creative potential, then the best scored players should in a way be the best creative passers in a more profound way than simply looking at the Key Passes Table. The question is precisely, how do we know the vectorisation’s layout of players has anything to do with their “key passing ability” (i.e., players with high ability cluster around certain areas of this “space” and are in general closer to each other)? Let’s look at the list before we begin to answer this question so everybody gets a bit excited before it dawns on them that I’m actually rambling on about some technical stuff.
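The scoring idea can be sketched in a few lines. This is only an illustration under assumptions: the Gaussian weighting and the `bandwidth` parameter are hypothetical choices (the entry only says several weightings were tried), and the toy vectors and key-pass counts are made up.

```python
import math

def neighbour_weighted_scores(vectors, key_passes, bandwidth=1.0):
    """Score each player by a distance-weighted sum of everybody's key passes.

    vectors:    dict name -> list of floats (the player's motif vector)
    key_passes: dict name -> key passes in the season
    bandwidth:  hypothetical kernel width; larger values let distant players count more
    """
    scores = {}
    for p, vp in vectors.items():
        total = 0.0
        for q, vq in vectors.items():
            d = math.dist(vp, vq)
            # Weight is 1 for the player himself and decays with distance.
            weight = math.exp(-(d / bandwidth) ** 2)
            total += weight * key_passes[q]
        scores[p] = total
    return scores

# Toy 2-dimensional example with made-up numbers.
vectors = {"A": [0.0, 0.0], "B": [0.1, 0.0], "C": [5.0, 5.0]}
key_passes = {"A": 10, "B": 80, "C": 40}
scores = neighbour_weighted_scores(vectors, key_passes)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In the toy data, player A made only 10 key passes but sits right next to B, who made 80, so A ends up ranked above C, who made 40 but has no creative neighbours.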

Remark: Notice how this list isn’t simply a ranking by key passes. Drinkwater is better ranked than Eriksen even though the latter had many more key passes. This means that if the list is sound (big if), it’s picking up on information that wasn’t immediately and explicitly available in the key passes tables.

My confirmation bias seems to like that list quite a lot; there are a lot of good names up there. Most readers probably follow the Premier League closely and know that those are all good attacking creative players, arguably the best in the league. Now imagine that instead of the Premier League, we drew up an equivalent list using data from leagues where we didn’t already know the players, and had confidence that, just as in the case of the Premier League, we were getting out a list of most of the best players. Should be useful, huh?

There are also some notable absentees. Coutinho comes to mind as a player who is widely agreed to be amongst the best in the league but isn’t on the list. Why should we trust a list that claims to rank the top 15 creative players in the league but leaves out Coutinho?

As I said before, I think of the vectorised representation as encoding the information regarding players’ key passing ability if players who tend to have a higher number of key passes are more or less clustered together as opposed to randomly located mixed with all the other players. If this is a general trend, then we know that there is a relationship between a player’s key passing ability and his location in the 45-dimensional space we are imagining. Even if a player happened to not have many key passes in a season (this can happen just as strikers have goal droughts or perhaps because a player’s teammates don’t make good runs), we should still pick up on this “ability”.

What we would need then to justify our faith in the list is some sort of indicator which specifies just how “clustered together” players with a higher number of key passes are. There are many ways to approach this problem in mathematics. Readers with a mathematical background might try to fit a model and assess the goodness of fit, or apply some sort of multi-dimensional Kolmogorov-Smirnov technique comparing the actual distribution of vectors and key passes with one where the key passes were distributed randomly. However, all these tests are a bit technical and hard to apply in high dimensions, and all in all we really want an indicator more than a model of “Expected Key Passes”. Here’s a simpler validating technique:

For each player, take his K closest neighbours (in my list K=10) and compute the standard deviation of their key passes. Once we’ve done this for every player, we can compute the mean of these standard deviations over all the K-player “neighbourhoods” (let’s call it the ‘mean of neighbourhoods’ variation’ number, or MNV). If in each neighbourhood the players have a relatively similar number of key passes, then the MNV should be comparatively low. The important question is: what do we compare it to in order to know if it’s low or not?

I feel that there are two important numbers to compare this number to. The first comes from simulating many (many) scenarios where the key passes are randomly permuted amongst the players and comparing the real MNV to the average MNV of these simulated cases. The second is the minimum MNV of any of the simulated scenarios. If the MNV of our actual vectorised representation is “low” in comparison to these simulated scenarios, then we know that the players’ layout in this imaginary 45-dimensional space clusters the key passers of the ball closer together than random distributions do. That in turn would mean that the logic applied to obtain the list has a robust underlying reasoning, because a player’s location in the 45-dimensional space should have something to do with his “key passing ability” (I fear I may have lost half the readers by this point…).
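The validation just described can be sketched as follows. Several details are assumptions on my part rather than the author’s stated choices: Euclidean distance, the population standard deviation, and neighbourhoods that exclude the player himself. The toy data and the small number of simulations are made up for illustration.

```python
import math
import random
import statistics

def k_nearest(vectors, k):
    """For each player index, return the indices of his k closest other players."""
    neighbours = []
    for i, vi in enumerate(vectors):
        others = sorted((j for j in range(len(vectors)) if j != i),
                        key=lambda j: math.dist(vi, vectors[j]))
        neighbours.append(others[:k])
    return neighbours

def mnv(neighbours, values):
    """Mean over players of the std dev of `values` inside each neighbourhood."""
    return statistics.mean(
        statistics.pstdev([values[j] for j in hood]) for hood in neighbours
    )

def permutation_reference(neighbours, values, n_sims, seed=0):
    """MNV under n_sims random reshufflings of the values: returns (mean, minimum)."""
    rng = random.Random(seed)
    shuffled = list(values)
    sims = []
    for _ in range(n_sims):
        rng.shuffle(shuffled)
        sims.append(mnv(neighbours, shuffled))
    return statistics.mean(sims), min(sims)

# Toy data: two tight groups of players, one creative and one not.
vectors = [[i * 0.1, 0.0] for i in range(6)] + [[10 + i * 0.1, 10.0] for i in range(6)]
key_passes = [50, 51, 52, 53, 54, 55, 5, 6, 7, 8, 9, 10]

hoods = k_nearest(vectors, k=3)
real_mnv = mnv(hoods, key_passes)
sim_mean, sim_min = permutation_reference(hoods, key_passes, n_sims=200)
```

The neighbourhoods depend only on the vectors, so they are computed once and reused for every shuffle. With only 12 toy players a lucky shuffle can come close to the real layout, which is why the comparison against the simulated minimum only becomes meaningful with a realistic number of players and simulations.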

Here are some results:

Of the 100,000 simulations, the lowest MNV was 14.62 while the actual MNV is 11.86. This means that if we randomly assigned the players to a position in the 45-dimensional space 100,000 times, none of those simulations has the key passers clustered together better than our actual passing motif representation. This is quite promising, but even then, I suspected that maybe this is because the method clearly recognises the difference between defensive players and attacking players and attacking players are much more likely to get more key passes; so I repeated the validation using only attacking midfielders, wingers and strikers:

The results are less overwhelmingly positive, but even when just looking at attacking players, the actual layout surpasses any random distributions of the players after 100,000 simulations. To appreciate the value of this method and what information this is actually giving us, let’s compare with an equivalent list drawn up using ‘goals’ to award points rather than ‘key passes’ (using only attacking players again for the same reason as before).

The MNV numbers are naturally smaller because players score far fewer goals per season than they make key passes, so the overall scale of the problem is smaller. We can see that even though the real MNV is smaller than the average of the simulations, it’s actually relatively large when compared to the minimum MNV obtained through random simulations (notice how important it is to have a frame of reference to know when the number is small and when it is large in each specific context). This means that random simulations can cluster players and goals together in the 45-dimensional space considerably better than the passing motif vectorised representation does. As opposed to the ‘key passes’ case, this vectorisation doesn’t encode much information pertaining to “goalscoring capability”. This actually makes sense, since the passing motif methodology is designed using only information from the passing network, which doesn’t necessarily contain information regarding finishing. Therefore, the list made using ‘goals’ is much less reliable.

Coming back to Coutinho’s absence from the original list, it’s important to understand that I’m not claiming the list is a know-it-all oracle for creative talent or that this talent can be rigidly ordered. What this entry tried to show is that there is solid evidence that a player’s position in the 45-dimensional space determined by the passing motif methodology encodes a good amount of information about how many key passes he ought to have given a sort of “passing ability”. That doesn’t mean it encodes all the information. Perhaps the vectorised representation is missing out on what it is that makes Coutinho great. Nevertheless, once we’ve accepted and understood what the list will offer us, I doubt any club could claim that a list like this from different leagues around the world is of no use to their organisation just because they might miss out on the Serbian League’s Coutinho (sadly, such is the ‘glass half empty’ prejudice that analytics faces).

Finally, this way of looking at the problem of rating players opens the door to a host of possibilities. When I was doing my bachelor’s in pure mathematics I was actually more interested in differential geometry and topology courses than statistics courses, which is why I tend to think of data observations as vectors in high-dimensional spaces and think that their positions in those spaces encode valuable information. This entry began by taking a vectorised representation (passing motif vectors) and established that if we look at the number of key passes each player made, the position of the players’ vectors in this space seemed to encode this info. On the other hand, it didn’t seem to encode the information pertaining to goalscoring. That isn’t to say it might not encode information regarding other metrics. Expected Assists maybe? It also doesn’t mean that other vector representations don’t encode some of this information better than my own passing motifs representation. It’s a bit of a 3 step thing really: 1. Find a vector representation, 2. Check what sort of information it seems to encode well (especially information that isn’t explicitly available elsewhere), and 3. Find a way to give players a rating using this fact.

I hope this way of thinking encourages other analysts out there to try their hand at this sort of work! 

Wednesday, 30 November 2016

Player Passing Motif Style Application: Most Distinctive Players and Best Recruitment Opportunities

I was recently invited to send some of my work on passing motifs to be used for a Fink Tank column in The Times, but of course a dendrogram such as the one I linked in my previous entry wouldn’t cut it in printed media. Therefore, I thought the best thing to do was set out to answer some concrete applied questions the methodology might answer, which would be easy to display but interesting nevertheless. The content of this entry was the result:

I started by thinking about “distinctive” players; players that couldn’t be replaced easily. Remember that there is solid evidence that the methodology outlined in the previous entry picks up some underlying information on player passing style, and two players are considered similar if their vectors are “close” to each other. With this in mind, I computed the average distance of the 10 closest players for each player. The players for whom this number was highest were considered the most distinctive. The following table shows the top 5 along with their closest neighbours (this whole entry is based on data from the 2015-16 Premier League, without goalkeepers):
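As a rough sketch of this “distinctiveness” number, assuming Euclidean distance between vectors (the function name and the toy data are made up for illustration):

```python
import math

def distinctiveness(vectors, k=10):
    """Mean distance from each player to his k closest other players.

    vectors: dict name -> list of floats (the player's style vector)
    """
    result = {}
    for p, vp in vectors.items():
        dists = sorted(math.dist(vp, vq) for q, vq in vectors.items() if q != p)
        closest = dists[:k]
        result[p] = sum(closest) / len(closest)
    return result

# Toy example: three players bunched together and one far from everybody.
vectors = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [0.0, 1.0], "D": [10.0, 10.0]}
scores = distinctiveness(vectors, k=2)
most_distinctive = max(scores, key=scores.get)
```

Player D has no close neighbours, so the average distance to his two nearest players is by far the largest and he comes out as the most distinctive.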

First of all, I strongly think that Ulloa being there is a bit of an oddity: most of his appearances were late substitutions when Leicester were holding onto a 1-0 lead and therefore his stats are representative of this unique predicament.

Ozil is the most distinctive player in the league. Looking through his closest players, Nacho Monreal and Bacary Sagna might raise a few eyebrows, but even they are considerably far away from his style and therefore don’t reveal much about him. It’s a bit like the US mainland and Australia being amongst the closest countries to Hawaii, so treat that with due suspicion. The rest of the players seem to make good footballing sense. I’ll leave it to the readers to read through the results and make their own judgements.

Another interesting question which I thought of was this: which players represent the best recruitment opportunities in the sense that they have similar styles to players who play for much better ranked teams. Something like Sunderland players having similar styles to players from Arsenal, Manchester City or Tottenham. There are several ways to answer this question. Let’s start by the simplest: for each player I computed the average final league position of the 10 closest players, and subtracted that number from his own team’s final position. The players for whom this number was highest can be considered to represent the “best” opportunities. The following table shows the top 10:
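A minimal sketch of this first “opportunity” score, assuming Euclidean distance and using made-up toy data (position 1 is the champion, so a high score means a player from a low-placed team whose stylistic neighbours play for high-placed teams):

```python
import math

def position_opportunity(vectors, team_position, k=10):
    """Own team's final position minus the mean position of the k closest players' teams."""
    scores = {}
    for p, vp in vectors.items():
        nearest = sorted((q for q in vectors if q != p),
                         key=lambda q: math.dist(vp, vectors[q]))[:k]
        avg_position = sum(team_position[q] for q in nearest) / len(nearest)
        scores[p] = team_position[p] - avg_position
    return scores

# Toy example: X plays for the 18th-placed team but resembles players from 2nd and 3rd.
vectors = {"X": [0.0, 0.0], "Y": [0.1, 0.0], "Z": [0.0, 0.1], "W": [5.0, 5.0]}
team_position = {"X": 18, "Y": 2, "Z": 3, "W": 10}
scores = position_opportunity(vectors, team_position, k=2)
best = max(scores, key=scores.get)
```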

I didn’t watch Aston Villa much last season and have no opinion on Ashley Westwood, to be honest. It’s good to see Idrissa Gueye and Nathan Redmond on there though, considering they have since moved to Everton and Southampton respectively, proving they were in fact capable of playing for better teams. The problem with this methodology, though, is that it assumes that the difference in player quality is linear in league position. That is to say, the difference in quality between a player from the 16th team and one from the 20th team is assumed to be the same as the difference between a player from the champions and a player from the 5th team, when in truth there is no solid basis for this assumption.

One way to deal with this is to apply an increasing concave function to league position so that the same differences in position lower down the table are weighted less than for higher placed teams. I tried out a few functions like log, square root, cubic root, etc., and the results vary marginally but the same core of names seems to pop out for most of them. As an example, the table below shows the results for this methodology applying the fourth root to league position:
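To see what the concave transform does, here is a small illustration using the fourth root mentioned above (the helper name `transformed_gap` is made up for this sketch):

```python
def transformed_gap(lower_position, higher_position, transform=lambda x: x ** 0.25):
    """Gap between two league positions after applying a concave transform."""
    return transform(lower_position) - transform(higher_position)

bottom_gap = transformed_gap(20, 16)  # 20th place versus 16th place
top_gap = transformed_gap(5, 1)       # 5th place versus 1st place
```

Both pairs are four places apart, but after the fourth root the gap at the top of the table is roughly four times the gap at the bottom, which is exactly the reweighting the transform is meant to achieve.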

Some more satisfying names show up on there now (Ashley Williams who moved to Everton and Jason Puncheon who is pretty good), but it’s still unclear whether this methodology is properly representing the differences in quality required to play for different teams. Perhaps a better way to look at it is by points obtained rather than league position. Is the ratio of points a good representation of the difference in quality? The idea would be something like if a team obtains 90 points then its players must be 3 times as good as those of a team which obtains 30 points. The following table shows the top ten players such that the ratio between their teams’ points and the average points of the teams of their 10 closest players is greatest:

NOTE: I had to exclude Aston Villa players here because they obtained such a small amount of points that right away the method assumed it was twice as hard to play for Norwich than for Villa (Aston Villa made 17 points and Norwich 34) and obviously all the Villa players dominated the rankings.

Nathan Redmond is on the list again, which is good to see, as well as Moussa Sissoko, who moved to Tottenham this season. M’Vila, Cisse and Watmore are other players on the list whom I rate highly, but again, let’s allow readers to make their own judgements.

To wrap this theme up, the final way of answering this question I used combines some of the best elements of the previous two ideas: for each player we compute the difference between the average squared points of the 10 closest players and his own team’s final points squared. The square is taken to compensate for the fact that the quality required to go from a 60-points team to an 80-points team is higher than the quality required to go from a 30-points team to a 50-points team. The table below shows the results:
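A sketch of this “squared points difference” score, again assuming Euclidean distance and with made-up toy data:

```python
import math

def squared_points_opportunity(vectors, team_points, k=10):
    """Mean squared points of the k closest players' teams minus own team's points squared."""
    scores = {}
    for p, vp in vectors.items():
        nearest = sorted((q for q in vectors if q != p),
                         key=lambda q: math.dist(vp, vectors[q]))[:k]
        avg_squared = sum(team_points[q] ** 2 for q in nearest) / len(nearest)
        scores[p] = avg_squared - team_points[p] ** 2
    return scores

# Why square? The same 20-point step is worth more higher up the table.
gap_top = 80 ** 2 - 60 ** 2     # 60-point team to 80-point team
gap_bottom = 50 ** 2 - 30 ** 2  # 30-point team to 50-point team

# Toy example: X's team made 39 points, his closest players' teams made 81 and 71.
vectors = {"X": [0.0, 0.0], "Y": [0.1, 0.0], "Z": [0.0, 0.1], "W": [5.0, 5.0]}
team_points = {"X": 39, "Y": 81, "Z": 71, "W": 50}
scores = squared_points_opportunity(vectors, team_points, k=2)
```

Here `gap_top` is 2800 and `gap_bottom` is 1600, so the step from 60 to 80 points counts for noticeably more than the step from 30 to 50.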

There are some good names on there. Redmond definitely seems to be a good catch by Southampton. Now, remember that this methodology isn’t meant to be a magic crystal ball. Some people who know I do this type of work constantly ask me: “So, who’s going to win the league? Who’s the next Messi?”. They fail to understand the subtlety of what there is to actually gain from data mining. Ashley Westwood might be great, but then again he might not. Nevertheless, some players from last season whom traditional scouting methods seem to like, and who were recruited by larger teams, are also liked by this method.

It’s pretty remarkable that this methodology seems to be so rich when it is ignoring a lot of relevant information such as shots, goals, tackles, etc. It only sees what is visible in the passing network, which seems to be enough to make some decisions that very informed professional recruitment makes like picking up Gueye, Redmond, Sissoko, Ashley Williams, etc. That doesn't mean that it has all the answers. For example, it might like a centre back who is good at playing the ball out from the back but it has no way of knowing if he is also defensively sound. If complemented with more sources of information (such as direct traditional scouting), however, this type of work can be very useful for clubs.

Finally and on a bit of a sourer note, I thought it might also be interesting to look at some of the “worst” opportunity players; that is to say players who are similar to players from much worse teams than their own. I had to exclude Leicester players because their players completely dominated the top ten in most lists I drew up; even Kante, Mahrez and Vardy. I’m not sure what to make of this, because even though they were the champions, their players aren’t particularly similar to the players from other top teams. Just so it doesn’t seem like I’m slipping something past you, here are the 10 closest players to Kante, Mahrez and Vardy:

Now, excluding players from Leicester, here are the “worst” opportunities using the “squared points difference” metric outlined above:

Make of it what you will, but keep in mind that everything has a context and even if I claim this method sees a lot, I also recognise it doesn't know everything. If you see a name in there you don't like, keep calm, take a deep breath, look through the closest players and have a think about what might be going on.

Monday, 31 October 2016

Passing Motifs at a Player Level: Player Passing Style

This is a pretty exciting entry, so bear with me if it gets a bit long; I think it’s worth it…

Ever since the first entry on Passing Motifs I mentioned the potential of extrapolating the methodology to study passing styles at a player level. That first entry mentioned the idea set forth by Javier Lopez and Raul Sanchez to answer the question “Who can replace Xavi?”. Nevertheless, that particular example always left me wanting for more because the outcome was noticeably skewed towards players from Barcelona and a few other teams like Arsenal and even Swansea surprisingly. It made me think that the methodology was ignoring individual player traits and rather picking up stats that are reflective of the team the player plays for, not of the player himself.

I’ve been thinking ever since what the best way to extract player passing style from passing motifs is. Here are some of the ideas I’ve had:
  • One first objective is to neutralise the effect of the team passing style on a player. If a team proportionately uses ABAB a lot, then inevitably so will the players. Therefore, if you put Fernandinho in Barcelona, his motif frequencies will start to resemble those of the whole team without it having been something inherent to him all along. The idea I had was to look at how a player’s relative motif frequencies diverged from his team’s frequencies in each match. That is to say, if in a match 40% of Arsenal’s motifs were ABAC and 43% of the motifs Coquelin was involved in were ABAC, then Coquelin had a +3% for that motif for that match. Averaging over the whole season, Coquelin could be seen as a 5-dimensional vector where each entry corresponds to his average divergence for each of the 5 motifs. When the performance of this vectorisation is measured through the methodology outlined in my previous entry using data from the 2014-15 and 2015-16 seasons of the Premier League (only players who had at least 18 appearances to avoid outliers), this was the result:

The fairly negative z-scores reveal that this vectorisation is reasonably stable across those two seasons and is therefore picking up on some underlying quality of the players’ playing style.
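The divergence idea can be sketched like this. The 40% and 43% ABAC figures follow the Coquelin example above; every other count is made up, and the names of the last two motifs (ABCB and ABCD) are my assumption, since this entry only names ABAB, ABAC and ABCA explicitly.

```python
MOTIFS = ["ABAB", "ABAC", "ABCA", "ABCB", "ABCD"]  # last two names are assumed

def relative(counts):
    """Turn raw motif counts into proportions of the total."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def divergence_vector(matches):
    """Average, over matches, of (player's motif proportions - team's motif proportions).

    matches: list of (player_counts, team_counts), each a list of 5 motif counts.
    """
    per_match = []
    for player_counts, team_counts in matches:
        p, t = relative(player_counts), relative(team_counts)
        per_match.append([pi - ti for pi, ti in zip(p, t)])
    n = len(per_match)
    return [sum(column) / n for column in zip(*per_match)]

# One match: 40% of the team's motifs and 43% of the player's motifs are ABAC.
matches = [([5, 43, 22, 20, 10], [10, 40, 20, 20, 10])]
vec = divergence_vector(matches)
```

Because both sets of proportions sum to one, the five divergences always sum to zero, so a surplus in one motif has to be balanced by a deficit in others.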

  • Just as we did for team motifs, instead of considering the raw values of motifs a player performed, we consider each performance in a match by a player as a 5-dimensional vector in which each entry is the percentage of the player’s total motifs that that motif corresponds to. So we can represent a match played by Romelu Lukaku as 5% ABAB, 13% ABAC, 25% ABCA, etc. Averaging over a whole season, each player is represented by a 5-dimensional vector.

Once again, we’re reasonably happy that this vectorisation is picking up on stable player qualities.

  • Another way of seeing that data which I felt might be useful is seeing each player’s match as the proportion of each motif his team performed that he participated in. That is to say, if Southampton completed 50 instances of ABAB in a match, and Jordy Clasie participated in 25 of those, he would have a 50% score for ABAB in that match. If in that same match Southampton completed 80 instances of ABAC and Clasie participated in 20, he would have a 25% score for that motif. Applying this logic to the 5 different motifs and averaging over the whole season, each player is once again represented by a 5-dimensional vector. This is how well it performs:

Out of the three 5-dimensional vectorisations we have shown so far, this is by some margin the one which performs the best. Both its z-scores are considerably lower than those of the other two, meaning it’s capturing pretty stable information for each player.
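For this third vectorisation, a small sketch using the Southampton and Clasie numbers from the example (25 of 50 ABAB, 20 of 80 ABAC). The remaining counts and the second match are invented purely to show the season averaging.

```python
def participation_share(player_counts, team_counts):
    """Share of each of the team's motifs in a match that the player took part in."""
    return [p / t if t else 0.0 for p, t in zip(player_counts, team_counts)]

def season_vector(match_vectors):
    """Average the per-match vectors over a season."""
    n = len(match_vectors)
    return [sum(column) / n for column in zip(*match_vectors)]

first = participation_share([25, 20, 10, 5, 30], [50, 80, 40, 50, 60])
second = participation_share([10, 30, 0, 5, 15], [40, 60, 20, 50, 30])
season = season_vector([first, second])
```

Unlike the first two vectorisations, the entries here do not have to sum to one, since a player can be involved in a large share of every motif his team completes.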

  • In the first entry regarding passing motifs we mentioned how the motifs could be vectorised in a 15-dimensional vector for players. To refresh your memory, for an ABAC sequence a player could participate as the A player, the B player or the C player. It’s straightforward to count that looking at all 5 motifs there are 15 “participation” possibilities for each player. If we count how many times each player was each letter in each of the 5 motifs, we are left with a 15-dimensional vector representing each player. This is basically the methodology used in the “Who can Replace Xavi?” article.

     Comparing things in different dimensions is rather difficult and not too standardised in mathematics, but I would dare say that it performs worse than the previous 5-dimensional vectorisation, especially considering Z-Score 1, which is the most important indicator.

  • Finally, we can take this 15-dimensional idea and slightly alter it to count not the total of each pseudo-motif but rather its relative frequency. For example, if Dimitri Payet performed the B in an ABAC 15 times out of 100 total motifs he participated in, that pseudo-motif has a score of 15%. Once again, each player is represented by a 15-dimensional vector:

Immediately we appreciate that this is the best performing of all the vectorisations we have seen.
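The 15 “participation” slots and their relative frequencies can be sketched as below. The Payet figure (15 out of 100 as the B in an ABAC) comes from the example above; the other count is made up, and the motif names ABCB and ABCD are assumed, as this entry does not list all five motifs by name.

```python
MOTIFS = ["ABAB", "ABAC", "ABCA", "ABCB", "ABCD"]  # last two names are assumed

# One slot per (motif, role) pair: 2 + 3 + 3 + 3 + 4 = 15 slots.
SLOTS = [(motif, role) for motif in MOTIFS for role in sorted(set(motif))]

def pseudo_motif_vector(role_counts):
    """Relative frequency of each (motif, role) slot for one player.

    role_counts: dict (motif, role) -> number of times the player filled that role.
    """
    total = sum(role_counts.get(slot, 0) for slot in SLOTS)
    return [role_counts.get(slot, 0) / total if total else 0.0 for slot in SLOTS]

# 15 times the B in an ABAC out of 100 participations in total.
counts = {("ABAC", "B"): 15, ("ABCD", "A"): 85}
vec = pseudo_motif_vector(counts)
abac_b = vec[SLOTS.index(("ABAC", "B"))]
```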

Now, the first thing we must say is that all 5 ways of obtaining player vectors shown here show evidence of uncovering some stable, underlying qualities of players’ passing style. We have used the indicators to compare them and discuss which might be better, but there is no way of determining whether some information which one of them is picking up on is missed by another.

Here’s the advantage: there is no downside to combining them all. If we simply glue together all these representations to make one long 45-dimensional (5+5+5+15+15) vector representation for players, then all the qualities on which each methodology picked up are represented to some degree. If two players were similar across all representations, they will be similar in the long one as well; if two players were similar across some of the representations but not others, then they will be mildly similar, depending on how dissimilar they were in the others; and so on.
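Gluing the representations together could look something like the sketch below. Standardising each column before concatenating is my assumption (the entry does not say how the pieces were scaled, only that the validation method normalises its inputs), and the random toy vectors are placeholders.

```python
import random
import statistics

def standardise_columns(rows):
    """Scale every column to mean 0 and standard deviation 1 (constant columns stay at 0)."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def concatenate(representations):
    """Join several dicts (name -> vector) into one dict of long vectors."""
    names = sorted(representations[0])
    combined = {name: [] for name in names}
    for rep in representations:
        scaled = standardise_columns([rep[name] for name in names])
        for name, row in zip(names, scaled):
            combined[name].extend(row)
    return combined

# Placeholder data: four players, five representations of sizes 5, 5, 5, 15 and 15.
rng = random.Random(0)
players = ["p1", "p2", "p3", "p4"]
sizes = [5, 5, 5, 15, 15]
reps = [{p: [rng.random() for _ in range(d)] for p in players} for d in sizes]
combined = concatenate(reps)
```

Without some rescaling, the two 15-dimensional pieces and any piece with larger raw numbers would dominate the distances, which is the reason for the standardising step in this sketch.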

Here is the performance of this long 45-dimensional vectorisation:

The results are very satisfying and it proves to be a robust vectorisation for player playing style: the mean distance between a player’s two season vectors is more than 1 standard deviation below the mean distance between all players and more than 4 standard deviations below the Gaussian distances, even in this very high dimensional space.

This vectorisation will surely provide me with a lot of material to explore for a good while; it’s even a little frustrating not finding an easy visual way in which to convey it to the readers. Let’s settle for now on a hierarchical clustering dendrogram as a visualisation tool.

Below is a link to the pdf of the hierarchical clustering dendrogram applied to the data set for the 2015-16 season of the Premier League (only players who played in over 18 matches). Since there are 279 players, the tree labels are really tiny so the image couldn't be uploaded onto the blog directly, but on the pdf you can use your browser's zoom to explore the results.

If you'd rather not, here is a selection of the methodology's results:

  • Mesut Ozil has one of the most distinctive passing styles in the league. Cesc Fabregas is the player closest to him and together they form a subgroup with Juan Mata, Ross Barkley, Yaya Toure and Aaron Ramsey.
  • Alexis Sanchez is in a league of his own but the players with the most similar passing style are Payet, Moussa Sissoko, Jesus Navas, Sterling and Martial.
  • Troy Deeney is in the esteemed company of Aguero, De Bruyne, Oscar and Sigurdsson.
  • David Silva, Willian, Eden Hazard and Christian Eriksen are all pretty similar.
  • Nemanja Matic, Eric Dier and Gareth Barry have a similar passing style.
  • M’Vila, Lanzini, Capoue, Puncheon, Ander Herrera and Drinkwater are all similar, pretty good and perhaps underrated.
  • Walcott, Iheanacho, Scott Sinclair, Jefferson Montero, Wilfried Zaha, Bakary Sako, Albrighton, Bolasie and Michail Antonio form a subgroup of similar wingers.
  • Giroud is more similar to some rather underwhelming strikers such as Gomis, Cameron Jerome and Papiss Cisse than to world class strikers. The same can be said of Harry Kane being similar to Arouna Kone, Son and Marc Pugh. Maybe the methodology is not as convincing for strikers?
  • Shane Long and Odion Ighalo are good alternatives to Jamie Vardy.
  • Diego Costa and Lukaku are similar to Rooney.
  • Victor Moses, Aaron Lennon and Jordon Ibe are similar.
  • Mahrez is similar to Sessegnon, Nathan Redmond and Jesse Lingard. Did Southampton know this?
  • Matt Ritchie (ex-Bournemouth now at Newcastle) is in a group with Lallana, Alli, Pedro and Lamela. An opportunity for the taking?
  • Angel Rangel has (and has always had) unusual stats for a full-back.
  • The methodology recognises who the goalkeepers are and sets them apart without this information being explicitly available in the datasets. The same applies to many other players in similar positions, such as the CBs and full-backs, who are grouped together.

This is a poor man’s substitute to actually exploring the dendrogram yourselves. Not to mention that a clustering dendrogram is not even the most faithful representation of the information being collected by this vectorisation, but I’m more than happy with the results and feel there is some real promise to the methodology. If I can come up with some better visualisations for the results I’ll post those later on.

Please have a look through the results from the dendrogram and comment on whether you feel we’re getting close to convincingly capturing player passing style through passing motifs.

Distinguishing Quality from Random Noise: How do we know we’re getting valuable information?

One of the main challenges of football analytics is ensuring that our manipulation of the available data is in fact uncovering underlying “qualities” of teams and players, instead of just randomly picking up statistical noise or irrelevant facts. I could certainly use the available data and assign a number to each player by summing up the number of blocked shots plus the square root of the number of headed shots inside the area divided by the goal difference his team obtained with him on the pitch multiplied by his number of interceptions. Can I use this number in any way to advise a club on whether they should buy him? Probably not. How can I know what is valuable?

Recall from the previous entries on team passing motifs that a main reason why I stated that the methodology was picking up on a stable quality of passing style was the fact that it was stable for consecutive seasons. If the methodology was just randomly assigning motif distributions, then surely there would be no consistency between different seasons.

The implication then is this: if a certain vectorisation of the data is in a certain sense “stable” across seasons, then this vectorisation is representative of an underlying quality of the data observations. Metrics intended to measure qualities which one would expect to be stable over seasons, such as “playing style” or “potential”, should be able to be validated in this way.

The question then is how the details of this validation would go. In this entry, I’ll go through a “validating methodology” that I’ve been working on lately:

Take a vector representing a team or a player for a given season (something like the 5-dimensional vector representing a team in the passing motifs methodology). If my reasoning above is correct and the vector contains valuable information regarding that player/team, then it and the equivalent vector for the season directly before should in theory be in some sense “close” to each other. The “closeness” of two vectors is of course a relative concept, so this should be measured in relation to the average distance between any pair of vectors.

As an example: If Juan Mata’s vector for 2014-15 is at a distance of 2.3 from his vector for 2015-16, and on average the distance between any two player vectors (not necessarily from the same player) in this context is 9.5, then we can say with reasonable certainty that Juan Mata’s vectors are “close” to each other.

The method which I wrote out takes as parameters the two vectorisation matrices for two consecutive seasons, normalises them, considers only players who have played at least 18 matches in each season; and prints out the following:

Here’s what we want to look at on this table: First of all, the lower the mean distance between the two vectors of each player, the better our methodology is according to the reasoning above. However, a “low” number is a very relative concept, so we need a reference with which to measure how low this number actually is. This methodology provides two such references:

  • The first and most important one is the mean distance between all vectors, not just between the two corresponding to each player. This gives us an idea of how far apart any two vectors in this context are and whether the “closeness” of vectors of the same player is significant. Z-Score 1 is the mean distance between the vectors from the same player minus the mean distance between all vectors, divided by the standard deviation of the distance between all vectors. The lower this number (accepting negative values of course), the better.
  • The second reference provided is the mean distance between Gaussian vectors in the dimension of the problem. Z-Score 2 is the mean distance between the vectors from the same player minus the mean distance between the simulated Gaussian vectors, divided by the standard deviation of the distance between the Gaussian vectors. I feel this is also an important frame of reference because it gives a measure of just how “normalised” the scaled problem is. It also provides important “dimensional” context: if our vectorisation is in 15 dimensions as opposed to 5, the raw distances will increase, but this does not necessarily mean that the higher dimensional vectorisation is less valuable. The numbers we deal with in higher dimensions are simply larger, and we need to account for this to know how “low” our mean distance number really is. Hence the importance of the Z-Scores.
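To make the two references concrete, here is a minimal sketch of the comparison in plain Python. It is not the code behind the table: the function names, the dictionary layout (player name to vector), the joint normalisation of both seasons and the toy data are all assumptions made for illustration, and the 18-match filter is assumed to have been applied beforehand. The sign is chosen so that a lower (more negative) Z-Score means same-player vectors are closer than the reference.

```python
import math
import random
import statistics
from itertools import combinations


def normalise(rows):
    """Scale every column to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in rows]


def pair_stats(vectors):
    """Mean and standard deviation of the distances between all pairs."""
    dists = [math.dist(a, b) for a, b in combinations(vectors, 2)]
    return statistics.fmean(dists), statistics.pstdev(dists)


def stability_report(season_a, season_b, n_gaussian=200, seed=0):
    """season_a / season_b map player name -> vector (hypothetical layout)."""
    players = sorted(set(season_a) & set(season_b))
    # Assumption: both seasons are normalised together so they share a scale.
    stacked = normalise([season_a[p] for p in players] +
                        [season_b[p] for p in players])
    first, second = stacked[:len(players)], stacked[len(players):]
    same = [math.dist(a, b) for a, b in zip(first, second)]
    same_mean = statistics.fmean(same)
    all_mean, all_sd = pair_stats(stacked)
    # Reference 2: simulated standard Gaussian vectors of the same dimension.
    rng = random.Random(seed)
    dim = len(stacked[0])
    gauss = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_gaussian)]
    g_mean, g_sd = pair_stats(gauss)
    return {
        "same_player_mean": same_mean,
        "all_pairs_mean": all_mean,
        "z_score_1": (same_mean - all_mean) / all_sd,
        "z_score_2": (same_mean - g_mean) / g_sd,
    }


# Toy data: 30 invented players whose vectors barely move between seasons.
toy_rng = random.Random(1)
season_1 = {f"player_{i}": [toy_rng.uniform(0, 10) for _ in range(5)]
            for i in range(30)}
season_2 = {p: [x + toy_rng.gauss(0, 0.1) for x in v]
            for p, v in season_1.items()}
report = stability_report(season_1, season_2)
print(report)
```

On this deliberately stable toy data both Z-Scores come out clearly negative. A vectorisation that was only picking up noise would give values near zero.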

This entry was a bit technical and perhaps less interesting for the average football fan, but I thought it was important to explain it because it’s what I’ve been using to understand how to best translate the passing motifs problem to a player context. I’m looking to follow up this entry with an applied example comparing different vectorisations of passing motifs at a player level very soon (2-3 days hopefully if I can find the time), so stay tuned!

Monday, 10 October 2016

New Season, New Ideas

I spent the majority of the summer off in Colombia and then Croatia and took a bit of a break from football and math. But now I’m back in London and settling into my routine, and even though I didn’t spend much time in front of the computer over the summer and have no finished results to show you yet, even on holiday I can’t stop my mind from drifting off towards football and new ideas that could be applied. I’m going to use this entry to tell you about the plans I made to explore some of these ideas during this new “Analytics Season” leading up to the 2017 OptaPro Forum in February where I’m hoping to get the chance to present them.

Up until this point, I’ve spent most of the entries speaking about team passing sequences and the results of their quantification through network motifs. This is a very interesting topic, and I think there is still more to come from this. These are some areas where I still hope to do some more work:

  • The vectorisation through motif frequencies can be refined with some more information. For example, I’ve been thinking that different instances of ABAC represent very different kinds of combination play. An ABAC passing sequence can be composed of a short one-two between players A and B, and in the third and final pass A plays a long ball to player C. Alternatively, it can simply be composed of 3 short passes. The distance of the third pass should be the main factor differentiating different types of ABAC, because in the vast majority of cases the ‘ABA’ part will be made up of short passes (If Coutinho gives it to Lallana and Lallana gives it back to Coutinho, we don’t expect either of those passes to have been long otherwise Coutinho would have to run a large distance after his first pass). When the weights of the first principal component of the lengths of each pass in all instances of ABAC are looked at, effectively almost 97% of the variance is on the length of the third pass. It remains to be seen whether this ‘refinement’ can be used to further discern and distinguish team playing style.

NOTE: Think of Principal Component Analysis as a method that finds the directions along which a set of data varies the most, and assigns each feature a coefficient reflecting how much it contributes to those directions. If we have the height and weight of a population of hippos and a population of zebras, the height across the whole set is roughly the same but the weight differs a lot, and Principal Component Analysis tells us precisely this: the weight is where the variance is.
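The hippo and zebra picture can be checked with a small sketch. The numbers below are invented for illustration (similar heights in metres, very different weights in kilograms), and the function name is hypothetical. For two features, the first principal component can be read straight off the 2x2 covariance matrix, so no external library is needed. Keep in mind that the result depends on the units chosen: on raw data, a feature measured on a larger scale naturally carries more variance.

```python
import math
import random
import statistics


def first_principal_component(xs, ys):
    """Return (weights, explained_share) of the first PC of 2-D data."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of the symmetric matrix [[var_x, cov], [cov, var_y]].
    root = math.sqrt((var_x - var_y) ** 2 + 4 * cov ** 2)
    top = (var_x + var_y + root) / 2
    if abs(cov) > 1e-12:
        vec = (cov, top - var_x)          # eigenvector for eigenvalue `top`
    else:
        vec = (1.0, 0.0) if var_x >= var_y else (0.0, 1.0)
    norm = math.hypot(*vec)
    return (vec[0] / norm, vec[1] / norm), top / (var_x + var_y)


# Invented populations: similar heights (m), very different weights (kg).
rng = random.Random(0)
heights = [rng.gauss(1.4, 0.1) for _ in range(100)]
weights = ([rng.gauss(1500, 150) for _ in range(50)] +   # "hippos"
           [rng.gauss(300, 40) for _ in range(50)])      # "zebras"
pc, share = first_principal_component(heights, weights)
print(pc, share)
```

The second coefficient (the one attached to weight) is close to 1 and the first component explains nearly all of the variance, which is the "weight is where the variance is" statement in numbers.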

  • Even though we’ve spoken about “playing style” convincingly through this methodology, we still haven’t related this vectorisation to “success on the field” yet, which is actually what ultimately matters. It would be interesting for example to use Topological Data Analysis (you can read about it in my previous entry) to map out the motif vectorisation and discover where success in the league is being accumulated. We can also fit a probability distribution that gives the probability of a certain motif structure leading to a top 4 finish for instance. In this sense, we could potentially advise clubs on what they need to change in their passing play to increase their probability of finishing in the top 4, or of not being relegated, or of winning the league, etc.
  • As I said before and showed with the Xavi example, this ‘passing motifs’ idea can be simply extrapolated to a player level by vectorising each player’s frequency of participation in the different motifs. Once players are represented as vectors in this high-dimensional space, we can apply a whole arsenal of methodologies to answer questions such as which player is better suited to replace an outgoing player, which players have a similar style, how individual players affect the passing motif structures of their teams. It remains to be seen whether this approach will quantify a meaningful underlying quality in players as we have shown it does for teams, but certainly, more information is preferable to less information.
  • These topics (team style, player style and recruitment) can be combined; so for example we can advise clubs on how the recruitment of a certain player will affect their passing motif structure, and whether this change will improve their probability of a top 4 finish.

These are ambitious plans, but I think that they are important ones because I must admit that a (fair) criticism that could be thrown my way is that so far the results are very interesting theoretically but difficult to translate into actual practical and applied contexts within the industry. This is fair, but it doesn’t take away in the least bit from the value of the results obtained. The passing motifs methodology has a lot going for it. It proved to be consistent across different seasons, which is strong evidence that it identifies some underlying inherent property which we called “passing style”, instead of just randomly picking up statistical noise. It was also used to identify a passing style unique to Leicester City which was present even before their title-winning season, something that no one could have predicted or expected. As I said, it has a lot going for it.

The key counter-argument for this criticism is this (opinion): there doesn’t have to be an obvious, direct and immediate practical application for theoretical work to be valuable for a field. I strongly suggest those with a true interest in the topic of Football Analytics read this fascinating entry from Statsbomb author Marek Kwiatkowski. Here’s an excerpt if you can’t be bothered to read the whole thing:

“(I) believe that we have now reached the point where all obvious work has been done, and to progress we must take a step back and reassess the field as a whole. I think about football analytics as a bona fide scientific discipline: quantitative study of a particular class of complex systems. Put like this it is not fundamentally different from other sciences like biology or physics or linguistics. It is just much less mature. And in my view we have now reached a point where the entire discipline is held back by a key aspect of this immaturity: the lack of theoretical developments. Established scientific disciplines rely on abstract concepts to organise their discoveries and provide a language in which conjectures can be stated, arguments conducted and findings related to each other. We lack this kind of language for football analytics. We are doing biology without evolution; physics without calculus; linguistics without grammar. As a result, instead of building a coherent and ever-expanding body of knowledge, we collect isolated factoids.”

When looking for conceptual theoretical developments, passing network motifs fit the bill of a consistent and robust concept with a clear underlying motivation (representation of “passing style”). Practical applications will inevitably follow from this maturation of the discipline, and I have already outlined above some much more practical approaches which can be looked into.

Finally, and before this entry gets any longer, this approach has given me valuable insight into a type of conceptual processing that can be done to raw football data in order to obtain a meaningful representation. Football events during a match are very dynamic, complex and interdependent, but they codify all the necessary information to determine results, quality, potential, etc. The network motifs approach suggests taking the constituent blocks of the passing events graph and applying an equivalence relationship on the identity of the nodes in order to study their nature (this simply means that instead of focusing on the specific players performing the passes, we consider for instance any occurrence of a one-two as belonging to the same “class” of pass motif regardless of who the specific players were). It has made me think: why not attempt this with other types of events? Consider for example having a directed graph representing a team’s performance, but instead of the nodes representing players and the edges passes, each node represents an “area” of the pitch and an edge is simply the act of going from one area to another through a pass, dribble, etc. A sequence in this network is simply a movement between different areas. The ‘equivalence relationship’ on the identity of the nodes which I think would be useful for this approach would work something like this: a play starting in one area, then moving to an area three spaces to the right through a pass, and then forward 2 spaces through a dribble would be classified exactly the same regardless of the players involved and of whether it started inside our own penalty box or from the halfway line.
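One possible way to encode that equivalence is sketched below. Everything in it is an assumption made for illustration: the pitch is taken to be a grid of cells indexed as (forward, right), a play is a starting cell plus a list of (action, destination cell) moves, and the function name is hypothetical. The idea is simply to keep the action and the relative displacement of each move and to throw away the absolute position and the players.

```python
from collections import Counter


def area_motif(start, moves):
    """Describe a play by its actions and relative displacements only.

    start: (forward, right) grid cell where the play begins.
    moves: list of (action, (forward, right)) destination cells.
    """
    motif = []
    fwd, right = start
    for action, (next_fwd, next_right) in moves:
        motif.append((action, next_fwd - fwd, next_right - right))
        fwd, right = next_fwd, next_right
    return tuple(motif)


# Same shape of play, once from deep in our own half and once from midfield:
# a pass three cells to the right, then a dribble two cells forward.
play_from_own_box = area_motif((0, 1), [("pass", (0, 4)), ("dribble", (2, 4))])
play_from_halfway = area_motif((5, 0), [("pass", (5, 3)), ("dribble", (7, 3))])

motif_counts = Counter([play_from_own_box, play_from_halfway])
print(motif_counts)
```

Both plays collapse into the same key, so their frequencies can be counted and vectorised in the same way as the passing motifs.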

Vectorising team or player performance through the frequency of the motifs in this context could lead to a very robust quantification of playing style, performance metric, probability of success… who knows?! I don’t yet, but let’s hope I can find out from here to February.

Thursday, 21 July 2016

Quantifying Passing Subsequences: The Mysterious Case of Leicester City Part 2

In the previous entry we left off having seen the result of a clustering dendrogram for the 5-dimensional representation of teams corresponding to the ratio at which they use the 5 passing sequences using data from the 2015-2016 Premier League season.

It came as a surprise that Leicester was singled out by the method as the team with the most distinctive passing style in the league. But then again, Leicester were eventually crowned champions, so surely something qualitative is there to be found. The problem is untangling the true causality relation of what is being discovered. Saying “Leicester were champions because they earned the highest number of points” is a bit moronic. Something like “Leicester were champions because they scored the third highest number of goals and conceded the second fewest” or “Leicester were champions because they were able to name an unchanged starting XI the most times in the season” can provide a bit more insight, but ultimately, when using data from the same season it can be difficult to decipher the true causality of discoveries; i.e., did X happen because Leicester had the potential to be champions, or did Leicester have the potential to be champions because of X?

The essential question then that the whole football world wants answered is whether Leicester’s championship run could have been predicted BEFORE the campaign kicked off. Surely the sports trading community would be interested.

To investigate deeper, we went back to the data from the 2014-2015 season, when Leicester were almost certainly doomed to relegation but miraculously went on an incredible winning run in the season’s final stretch that saw them go from being 7 points from safety in April to ending comfortably 6 points above the relegation zone. No team in the history of the Premier League had ever remained in the first division having fewer than 20 points by the 29th fixture (Leicester had 19).

Could anyone have predicted Leicester’s exploits back then? Should we have known?

This was the resulting dendrogram for the data from the 2014-2015 season using the same methodology from the previous entry:

There are several important things to say regarding the results. First of all, forgetting about Leicester for a minute, it’s very satisfying to see that many of the same pairings from the 2015-2016 season are maintained. Arsenal-Manchester City, Tottenham-Chelsea and Crystal Palace-Sunderland are all examples of pairings that arise in both cases. There are other general trends that are respected, like Liverpool, Southampton and Swansea being similar, or Leicester, Arsenal and Manchester City forming the leftmost group in both cases with either Watford or Aston Villa. This is important because the probability of this happening (similar groups for both seasons) if the method were randomly pairing teams would be extremely low. This means that the method is identifying something (which I will call passing style) which is consistent in teams across a pair of consecutive seasons. This ‘satisfying consistency’ can also be seen in data for the 2013-14 and 2012-13 seasons, for which I also replicated the method.

Let’s return now to Leicester. Just as in the case of the 2015-16 season, Leicester is the team that joins a subgroup highest up the clustering tree, meaning its passing style has the weakest bond to any other group of teams, i.e. it is the most distinctive. There is a very important caveat so we don’t get carried away: “being distinctive” is in no way equivalent to “being successful”. In fact, the second most “distinctive” team is Burnley, who were relegated at the end of 2014-15. Both Leicester and Burnley have a relatively low total number of motifs completed, but this doesn’t necessarily explain their distinctiveness, since both QPR and Crystal Palace completed fewer motifs than them and have relatively “strong” bonds with other teams. Also, a truly fascinating characteristic of Leicester’s results is that in both seasons Leicester’s passing style forms a subgroup with Arsenal and Manchester City’s, arguably the “passing powerhouses” of the Premier League.

To answer the question posed before regarding whether we should have known about Leicester, I would cautiously say “No”. No, I very much doubt any concrete methodology would have pointed to Leicester as the eventual winner. However, keeping in mind that “being distinctive” is not synonymous with “being successful” (poor relegated Burnley), the truth is that with this data before the start of the 2015-16 season I could have said to pundits: ‘Hey, keep an eye on Leicester, there’s something interesting going on there (they are distinctive and are close to Arsenal and Manchester City)’. Moreover, I would also predict that if Leicester keep their players over the summer, this “style” which has led them to be distinctive in both the 2014-15 and 2015-16 seasons will still be there and could once again lead them to success. I wouldn’t go as far as saying they’ll win it again, but I think they’ll be in the contest. Then again, I could be completely wrong and Leicester’s fortunes could fall off a cliff in the upcoming season; but I know better than to think that means that everything I’ve said here is wrong. I hope the readers do as well (if Leicester end up doing well again, I would also be cautious about the omnipotence of my methods; statistics and probability are all about being better informed about the chaotic randomness of the world, not about fortune telling…).

I’ll keep on trying to see what else this methodology has to give. I suspect some sort of “tree/dendrogram” method could be used to quantify how much success (higher finishes in the league table) is being accumulated in what areas of the tree and what a team’s position in the dendrogram says about its final league position. Also, as I mentioned a couple of entries back when I first spoke about this methodology, the really interesting bit could be extrapolating the method to discover how well prospective recruitments will fit within a team’s passing style. I also hope to have a go at this. Finally, some additional variables could also be integrated into the methodology to further distinguish passing sequences. For example, a completely vertical instance of ABCD is very different from a sequence of ABCD composed of horizontal square passes. Integrating this is also something I’m working on.

Keep an eye on the blog to see how it all unfolds.

Friday, 8 July 2016

Quantifying Passing Subsequences: The Mysterious Case of Leicester City

This entry follows up on the previous entry's idea for quantifying teams' and players' passing styles through motifs three passes long (if you haven't read it I recommend you do so before reading this one).

Now, I decided to attempt to replicate the results shown previously from the Spanish La Liga using last season's Premier League data. In my application, I quantify the raw passing data by counting the number of times each motif occurs for each team in each match. The table below shows the total number of times each team performed each of the five motifs throughout the whole season.
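For readers who want to see what "counting motifs" can look like in practice, here is a minimal sketch. It assumes the five motifs are ABAB, ABAC, ABCA, ABCB and ABCD (the patterns four consecutive ball holders can form when nobody passes to himself), that a possession is given as the ordered list of players who held the ball, and that every window of three consecutive passes is counted. The function names and the sliding-window choice are assumptions for illustration, not a description of the exact processing used for the table.

```python
from collections import Counter

# Assumed motif set for sequences of three passes (four ball holders).
MOTIFS = ("ABAB", "ABAC", "ABCA", "ABCB", "ABCD")


def motif_of(holders):
    """Relabel four consecutive ball holders as A, B, C, D by first appearance."""
    labels = {}
    for player in holders:
        if player not in labels:
            labels[player] = "ABCD"[len(labels)]
    return "".join(labels[player] for player in holders)


def count_motifs(possession):
    """Slide a three-pass window over one uninterrupted possession."""
    counts = Counter()
    for i in range(len(possession) - 3):
        counts[motif_of(possession[i:i + 4])] += 1
    return counts


# Toy possession with placeholder player ids: a one-two, then two more passes.
possession = ["p1", "p2", "p1", "p3", "p4"]
counts = count_motifs(possession)
print(counts)
```

Here the first window (p1, p2, p1, p3) is an ABAC and the second (p2, p1, p3, p4) is an ABCD, regardless of who the players actually are.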

As we can appreciate, Arsenal and Manchester City are either 1st or 2nd for every motif category. However, since teams like Arsenal and Manchester City complete the highest amounts of passes in a season, it is to be expected that they also complete the most motifs for each category.

A different way of looking at this data then is to analyse the relative frequencies of the motifs as a percentage of the total number of motifs completed by a team during a match. That is to say, regardless of how many motifs were completed in total by a team, we want to look at which percentage of them were ABAB, which percentage of them were ABAC, etc.
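As a small worked example of the normalisation, the sketch below turns per-match motif counts into percentages and then averages them over matches to get one 5-number vector per team. The counts are invented, and the function names and the motif order are assumptions for illustration.

```python
from statistics import fmean

# Assumed motif order for the five entries of the team vector.
MOTIFS = ("ABAB", "ABAC", "ABCA", "ABCB", "ABCD")


def match_percentages(counts):
    """Share of each motif, in percent, within a single match."""
    total = sum(counts.get(m, 0) for m in MOTIFS)
    if total == 0:
        raise ValueError("no motifs completed in this match")
    return [100 * counts.get(m, 0) / total for m in MOTIFS]


def team_vector(matches):
    """Mean per-match percentage of each motif over a list of matches."""
    per_match = [match_percentages(c) for c in matches]
    return [fmean(column) for column in zip(*per_match)]


# Invented counts for two matches of one hypothetical team.
toy_matches = [
    {"ABAB": 10, "ABAC": 30, "ABCA": 20, "ABCB": 20, "ABCD": 120},
    {"ABAB": 15, "ABAC": 25, "ABCA": 10, "ABCB": 10, "ABCD": 40},
]
vector = team_vector(toy_matches)
print(dict(zip(MOTIFS, vector)))
```

Note that the first match has twice as many motifs as the second, yet both weigh the same in the average: the total volume of passing no longer matters, only the mix.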

The following boxplots show the distribution of the relative frequency of each motif during each match for each team of the 2015-16 Premier League season:

Now, isn't that interesting?! Leicester emerge as a team with a distinctive playing style now. If you return for a moment to the previous entry you can see that both Barcelona and Leicester “win” in the ABAB and ABAC categories and noticeably “lose” in the ABCD category (this similarity isn’t there in the other two categories). I'm obviously not claiming that Leicester and Barcelona have a similar style, I'm sure I would lose all credibility with football fans and might as well just close the blog. The main difference is that Barcelona don't only win the relative frequency battle, but also the overall total usually completing many more passes than their opposition. Leicester have a much more modest return in overall motifs completed (i.e. many fewer passes completed). In fact they complete the second lowest amount of motifs overall (second only to WBA and only marginally below Sunderland), but for the amount of motifs they do complete, there seems to be something there in the sense that they tend to proportionally perform a distinctive choice of passing sequences/motifs.

In fact, forget about the whole Barcelona thing for a moment. Even without having ever seen those results for La Liga, the methodology is pointing towards Leicester as the team with the unique style in the Premier League. The following figure shows the Clustering Dendrogram for the data viewing each team as a vector in R^5 where each feature is the mean percentage each motif constitutes of the total for each match of the season:

NOTE: It’s not easy to explain briefly what a clustering dendrogram is or means so please refer to any of the good sources on mathematics widely available (Wikipedia is pretty good), but basically it represents how teams are sequentially grouped according to their similarity (distance in R^5). For example, we can see that Leicester, Watford, Arsenal and Manchester City are a “group” but within that group there are two subgroups consisting of Leicester on its own and the other three, and similarly Arsenal is more tightly grouped with Manchester City than with Watford. The higher in the tree a grouping is made, the “weaker” its bond is. With that in mind, Leicester is the team with the “weakest bond” to any other subgroup of teams.
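To illustrate how the grouping in a dendrogram is built, here is a bare-bones sketch of agglomerative clustering. The linkage rule behind the figure is not stated here, so average linkage is assumed; the team names and 5-number vectors are invented, and the function name is hypothetical. At each step the two closest groups are merged and the distance at which they merge is recorded as the "height" of that bond.

```python
import math
from itertools import combinations


def average_linkage(points):
    """points: name -> vector. Returns merges as (height, group_a, group_b)."""

    def gap(g1, g2):
        # Average distance between members of the two groups.
        total = sum(math.dist(points[a], points[b]) for a in g1 for b in g2)
        return total / (len(g1) * len(g2))

    groups = [(name,) for name in sorted(points)]
    merges = []
    while len(groups) > 1:
        g1, g2 = min(combinations(groups, 2), key=lambda pair: gap(*pair))
        merges.append((gap(g1, g2), g1, g2))
        groups = [g for g in groups if g not in (g1, g2)] + [g1 + g2]
    return merges


# Invented motif-percentage vectors for four hypothetical teams.
toy_teams = {
    "Team A": [10, 20, 10, 10, 50],
    "Team B": [11, 20, 10, 10, 49],
    "Team C": [12, 22, 9, 10, 47],
    "Team D": [18, 28, 8, 8, 38],
}
merges = average_linkage(toy_teams)
for height, g1, g2 in merges:
    print(round(height, 2), g1, g2)

# The lone team that joins the tree at the greatest height has the "weakest bond".
singles = [(h, g[0]) for h, g1, g2 in merges for g in (g1, g2) if len(g) == 1]
most_distinctive = max(singles)[1]
print(most_distinctive)
```

In this toy example Team A and Team B pair up first, Team C joins them next, and Team D only joins at the very top of the tree, which is the sense in which a team is called the most distinctive.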

Honestly, I don't know any more than you what this means (yet); but it's very interesting that something pointing towards Leicester came up when I wasn't even looking for it. This was a simple probing methodology which pinpointed Leicester on its own, without me asking: “Is Leicester distinctive?” or “What sets Leicester apart?”.

It would be important to validate whether there is any sort of linearity between the frequency of each motif and the total number of motifs completed; if that were the case, Leicester’s low number of motifs could explain why it is being set apart, but I don’t think this would tell the whole story, as then WBA and Sunderland would also be singled out in a way.

In the coming weeks I’ll try to get to the bottom of this…