Sunday, 14 May 2017

Paper: 'Flow Network Motifs Applied to Football Passing Data'

I wrote this paper to be presented at the MathSport 2017 conference at the University of Padua in June. It's a bit heavy on the mathematics in chapter 2 but should be fairly readable from there on. Here's the abstract so you can decide whether you're interested before opening the whole document:

"Network Motifs are important local properties of networks, and have lately drawn increasing attention as promising concepts to unearth structural design characteristics of complex networks. In this document, we push the boundaries of the existing body of literature which has used this theory to study soccer passing networks by attempting to uncover unique team passing network structure, and make a rigorous attempt to formalise a theoretical framework in which to carry out and evaluate these analyses. We contribute to the existing body of knowledge by proposing a framework based on repeatability in which to establish the ideal length of flow motifs with which to study soccer passing networks, and also by considering spatial classifications of flow motifs to achieve greater precision in our claim to discover unique team passing network style."

Anyway, here's the link to the PDF on Google Drive:

Thursday, 13 April 2017

What's the ideal length of passing motifs?

In my last entry I pledged to answer this question using the 'repeatability' methodology I presented there. This will be a quick entry to confirm that luckily, we've been right all along and 3 is the ideal length to consider for passing motifs.

The number of passes considered in a passing motifs analysis is a clear instance of the trade-off between detail and comparable structure that we discussed in the previous entry. The Figure below shows the number of motif types that occurred in the 2015-16 season of the Premier League depending on the number 'k' of passes we choose as 'length' from 3 to 7:

When we choose to consider 3 passes, there are 5 motif types which I hope all my readers know by heart by now: ABAB, ABAC, ABCA, ABCB and ABCD. If we choose to go up to 4 and consider one extra pass, there are 15 different types: ABABA, ABABC, ABACA, ABACB, ABACD, ABCAB, etc.

For 5-pass long motifs there are 52 types (all of which occurred at some point in the 2015-16 season), and for 6-pass long motifs there are 203 (of which only 187 occurred in the 2015-16 data). There were 759 different types of 7-pass motifs in the data. The number of motif types clearly grows quickly with the number of passes we consider, which is precisely what leads to structure being lost in the noisy haze of excessive detail.
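To make the combinatorics concrete, here's a short sketch (mine, not part of the original analysis) that enumerates the motif patterns for a given number of passes: each newly appearing player takes the next unused letter, and no player appears twice in a row, since a player cannot pass to themselves. The counts it produces match the ones above.

```python
def motif_patterns(k):
    """Enumerate the canonical motif patterns for sequences of k passes.

    A k-pass motif involves k+1 touches of the ball. Each newly appearing
    player takes the next unused letter, and no player appears twice in a
    row (a player cannot pass to themselves).
    """
    patterns = []

    def extend(seq, n_players):
        if len(seq) == k + 1:
            patterns.append(''.join(seq))
            return
        for p in range(n_players + 1):    # an existing player, or one new
            letter = chr(ord('A') + p)
            if seq[-1] == letter:         # no self-passes
                continue
            extend(seq + [letter], max(n_players, p + 1))

    extend(['A'], 1)
    return sorted(patterns)

# motif_patterns(3) -> ['ABAB', 'ABAC', 'ABCA', 'ABCB', 'ABCD']
for k in range(3, 7):
    print(k, len(motif_patterns(k)))      # 5, 15, 52, 203
```

These counts (5, 15, 52, 203, ...) are the Bell numbers, which is why the number of types blows up so quickly with motif length.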

The Figure below shows the number of motifs for the 2015-16 season for each number of passes 'k' considered:

There were 138,432 3-pass long sequences compared to 45,820 7-pass long sequences. While this is considerably fewer, the data set is still large enough to believe we can extract meaningful conclusions.

Finally, the figure below shows the repeatability percentage as per the methodology of the previous entry for each choice of length from 3 to 7 (as before, we consider relative frequencies of the different categories rather than raw amounts):

3-pass long motifs have a repeatability of about 82.7%, while 6 and 7-pass long motifs have 57.8% and 52.3% respectively. Considering that under our methodology a random method carrying no structure would have 50% repeatability (equivalent to randomly assigning style), these figures mean that by 6 or 7 passes we've lost almost all structure.

It's an interesting conclusion that sequences of 3 passes are the ideal length to consider, carrying unique team structure better than longer sequences. Considering that passes constitute the great majority of events on a football pitch, it's a far-reaching conclusion. It provides insight into breaking down the sequentiality of football matches into representative constituent blocks: looking at blocks of roughly the size of 3 passes should be best practice.

Wednesday, 5 April 2017

Passing Network Autographs and Overshooting Style

At the end of the last entry I touched on the trade-off between comparable structure and ‘granularity’ or ‘level of detail’ of football data. Imagine this: you have a player who has the ability to pick a certain type of between-lines pass that greatly increases your team’s chance of scoring in that play. With passing data, we could try and identify how this pass is represented in the data and then use the data to identify other players in the world that are also good at making this kind of pass. To do this, we will need to break the data down in a detailed way, and differentiate this type of pass from other vertical passes that perhaps aren’t as effective. We might want to consider where in the pitch this type of pass comes from or where it finishes, what action it is preceded by, where the defenders are, what happens after the pass is made, etc.; all of this in the hope that we can clearly identify the type of pass we’re looking for and tell it apart from other passes.

However, what happens when we go too far and use too much detail? It is unlikely that each time our player performs this type of pass he does it in exactly the same way, in the same coordinates of the pitch and in the same conditions. If we start using too much detail, we might actually start classifying instances of this type of pass as different when in reality they correspond to the same sought-after ‘type’. Once this happens we are no longer capable of identifying the “type of pass” we were initially looking for, and now have hundreds of different passes that at this level of detail are all different from each other. We can no longer identify players who can play this pass because this “type” was obliterated in the detail.

I actually began thinking about this issue when reading Dustin Ward’s piece on clustering different types of passes. He decides to take 100 clusters or types of passes and see how often each team or player completes each of those types of passes. This is a good example of the trade-off mentioned above. 100 seems like a good number: it certainly reveals more about a team than considering just 1 (which would basically amount to looking at overall pass success percentages) or 2 types/clusters of passes.

Choosing 100 is also better than choosing 100,000. If we chose 100,000, then each player or team would perform at most one instance per season of these highly detailed, highly differentiated types of passes. We wouldn’t be able to use this information to compare teams or players in any way. But is choosing 100 better than choosing 120? Or than choosing 80? How do we know when this trade-off is striking the right balance?

The key is having something against which to measure ‘balance’, something we want to optimise. In this entry I’ll show you an example of how this something could be ‘repeatability’:

For a while I’ve been wanting to push the Passing Motifs methodology a bit further and include some spatial information about the passes to see what else it can tell us about teams’ passing networks. Below is an example of two very different instances of ABAC.

The question I wanted to answer was this: will we gain any additional valuable information about teams by differentiating different ‘types’ of motifs according to their angles, distances and coordinates on the pitch? Crucially, I also wanted to know where the right balance would be when doing this differentiation in light of our structure-detail trade-off.

There are two ways of looking at spatial variables associated with motifs that I felt could be revealing:

x-y Coordinates of Passes: In Opta’s data files, each pass has a ‘Start x-y’ Coordinate and an ‘End x-y’ Coordinate, meaning each pass has four variables in terms of coordinates. A 3-pass long motif would therefore have a set of 12 variables representing where its passes began and ended.

Angles and Lengths: Another way of looking at it is by the ‘angles’ and ‘lengths’ of passes in a motif. The figure below illustrates how these are found.

With this idea we would have six variables associated with a motif: the angle of each of the 3 passes of the motif and the length of each of the 3 passes as well.
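As a concrete sketch, assuming each pass comes as an (x_start, y_start, x_end, y_end) tuple (a hypothetical layout of the four Opta coordinates, not their actual field names), the six 'angles+lengths' variables could be computed like this:

```python
import math

def motif_features(passes):
    """'Angles + lengths' variables for a motif.

    `passes` is a list of (x_start, y_start, x_end, y_end) tuples, one per
    pass (hypothetical layout of the four coordinates each pass carries).
    Returns the angle of each pass followed by the length of each pass,
    i.e. 6 numbers for a 3-pass motif.
    """
    angles, lengths = [], []
    for x1, y1, x2, y2 in passes:
        angles.append(math.atan2(y2 - y1, x2 - x1))   # direction of the pass
        lengths.append(math.hypot(x2 - x1, y2 - y1))  # distance covered
    return angles + lengths
```

The 'x-y coordinates' idea needs no computation at all: its 12 variables are simply the four coordinates of each of the three passes concatenated.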

NOTE: The thing I like about this ‘angles+lengths’ idea is that it doesn’t “care” where on the pitch a motif happened, only about its geometric structure. I like this because if it has ‘structure’ or ‘insight’ into teams’ styles, it will not be as heavily determined by whether the team dominates the opposition or not: if we only look at pitch coordinates of motifs, then top teams like Chelsea or Manchester City will perform all of their motifs high up the pitch. The method would therefore be biased towards saying they perform the same ‘types’ of motifs, namely “high up the pitch” motifs. I’m not saying that this isn’t meaningful, but it is information we already know by simply looking at the league table and knowing these teams play deeper into their opposition’s half. If, however, we discover structure in the geometric shape of motifs that is independent of the league table, that is interesting precisely because it isn’t correlated with "obvious" aspects.

Whichever of the two ideas we go for, there is going to be a set of variables associated with each motif, which we can then classify into a certain number of different types of motifs using k-means clustering. Our intuition from the trade-off tells us that there is an intermediate number of categories that best represents “style”. The problem is that to use a k-means clustering algorithm, we need to tell the algorithm how many categories we want before knowing this optimum number.

Consider this: for each choice of number of categories, once we have determined the number of categories and classified the different motifs into the category they correspond to, we can use the best practice we know from the original passing motif methodology and look at what percentage of each motif category (in the ABAC sense) corresponds to which ‘type’ (in either an x-y coordinates or angles+lengths sense). So as an example, if we had chosen 3 different types of motifs, then for each team we would have this set of numbers: what percentage of the team's ABAB motifs are type 1, what percentage of the ABABs are type 2, what percentage of the ABABs are type 3, what percentage of the team's ABAC motifs are type 1, what percentage of the ABACs are type 2, etc. What we’ll have is a vector representing each team.
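This vector-building step can be sketched as follows. The data layout is hypothetical, and a bare-bones Lloyd's algorithm stands in for whatever k-means implementation the original work used:

```python
import numpy as np

MOTIF_TYPES = ['ABAB', 'ABAC', 'ABCA', 'ABCB', 'ABCD']

def kmeans_labels(X, k, n_iter=50, seed=0):
    """Bare-bones Lloyd's algorithm (a stand-in for a library k-means)."""
    rng = np.random.RandomState(seed)
    centres = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)                 # nearest centre per motif
        for j in range(k):
            if (labels == j).any():
                centres[j] = X[labels == j].mean(axis=0)
    return labels

def team_style_vectors(features, motif_types, teams, n_clusters):
    """One style vector per team: for each of the 5 motif types, the share
    of that team's motifs of that type falling in each spatial cluster."""
    labels = kmeans_labels(np.asarray(features, dtype=float), n_clusters)
    vectors = {}
    for team in set(teams):
        vec = []
        for mt in MOTIF_TYPES:
            idx = [i for i, (t, m) in enumerate(zip(teams, motif_types))
                   if t == team and m == mt]
            counts = np.bincount(np.array([labels[i] for i in idx], dtype=int),
                                 minlength=n_clusters).astype(float)
            total = counts.sum()
            vec.extend(counts / total if total else counts)
        vectors[team] = np.array(vec)
    return vectors
```

With 3 spatial clusters this yields a 15-dimensional vector per team (5 motif types × 3 clusters), exactly the kind of object described above.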

Now suppose we randomly divide each team’s motifs into two different sets, so that we have Arsenal’s A motifs and Arsenal’s B motifs, as if we were artificially considering each as the motifs of a different team. If choosing this number of categories reveals teams’ structure or style, then the style attributed to Arsenal’s A vector should be very similar to the style attributed to Arsenal’s B vector. The more underlying structure we’re capturing, the more obvious this effect should be. If on the other hand we’ve gone too far and the extreme detail is overshooting the underlying structure we want to discover, then Arsenal’s A vector will not necessarily be similar to Arsenal’s B vector, because the extreme detail is damaging the comparability of styles. This is what I mean by “repeatability”.

The following graph reveals how “repeatable” each choice of number of categories is for both the ‘x-y coordinates’ idea and the ‘angles+lengths’ idea:

The methodology is as follows: for each number from 2 to 50, we create that number of motif categories using k-means clustering and assign each motif to a category. We then randomly divide each team’s motifs into two different sets to obtain a vector A and a vector B for each team. Then we check how “repeatable” the methodology is by checking, on average, how close each team’s A vector is to its B vector compared to the vectors representing the other teams; and this process (since it involves both the randomness of the k-means algorithm and of the division of a team’s motifs into two sets) is repeated a hundred times for each number. The graph shows the average ‘relative closeness’ over the hundred trials as a percentage, computed as follows: I took each team’s two vectors and determined on a scale of 1 to 39 how close the team’s A vector was to its B vector. Since there are 20 teams and each one is divided into two vectors, there are 40 vectors representing 'styles' in total. Taking a team’s A vector as the focal point, its B vector could be the closest of the other 39 vectors (1), the second closest (2), all the way up to the farthest away (39). I did this for every team and averaged these ranks, finally expressing the outcome as a percentage of 39 (all of this using passing data for the 2015-16 Premier League season).
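Here is a sketch of the rank-based repeatability score for a single trial, given each team's A and B vectors. The conversion to a percentage is one plausible reading of the description above (rank 1 maps to 100%, and the expected rank of 20 under a random assignment of style maps to 50%); it is my reconstruction, not necessarily the exact normalisation used:

```python
import numpy as np

def repeatability(vectors_a, vectors_b):
    """Rank each team's B vector among all other vectors as seen from its
    A vector, then rescale the average rank to a percentage.

    vectors_a, vectors_b: dicts team -> 1-D numpy array (same keys).
    """
    teams = sorted(vectors_a)
    all_vecs = ([(t, 'A', vectors_a[t]) for t in teams]
                + [(t, 'B', vectors_b[t]) for t in teams])
    n_others = len(all_vecs) - 1      # 39 when there are 20 teams
    ranks = []
    for team in teams:
        focal = vectors_a[team]
        others = sorted(
            ((np.linalg.norm(focal - v), t, s)
             for t, s, v in all_vecs if not (t == team and s == 'A')),
            key=lambda x: x[0])       # sort by distance to the focal vector
        rank = next(i + 1 for i, (_, t, s) in enumerate(others)
                    if t == team and s == 'B')
        ranks.append(rank)
    avg_rank = float(np.mean(ranks))
    # rank 1 -> 100%; the random-expectation rank (n_others + 1) / 2 -> 50%
    return 100.0 * (n_others - avg_rank) / (n_others - 1)
```

Running this a hundred times per category count (re-clustering and re-splitting each time) and averaging reproduces the procedure described above.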

Right off the bat we find evidence of the balance we’ve been speaking of. When we start increasing the number of categories we obtain more repeatability, meaning we can more easily recognise two vectors as being the A and B vectors of the same team because they are similar (i.e. close) to each other. I like to interpret this as uncovering more underlying information that uniquely identifies a team’s passing network style: no matter how we randomly divide a team’s motifs into two sets, we roughly still know which sets correspond to the same team because we know the “style”. We then reach an optimum number of categories for which this repeatability is maximised: for the ‘x-y coordinates’ idea it’s at 9 and for the ‘angles+lengths’ idea it’s at 13. After this, the repeatability starts to decrease, meaning that a team’s A and B vectors are no longer as similar to each other because they’re made up of highly detailed motifs that overshoot the underlying “style” of what a team inherently does with its passing networks.

We have answered our initial question: the original passing motif methodology (found in this entry), in which we took the 5 different motifs and compared teams according to how much they relatively used each one, had about 83.3% repeatability as per our methodology. By breaking motifs down into the optimum number of categories for the ‘x-y coordinates’ and ‘angles+lengths’ ideas (9 and 13 respectively), we were able to increase repeatability to 94.3% and 84.4% respectively (evidently the ‘x-y coordinates’ idea has better repeatability than ‘angles+lengths’, but as we said before, any structure arising from a purely geometrical classification is interesting).

Below is a set of boxplots illustrating what the 9 categories represent in terms of the different 'x-y' coordinates:

As an illustration, if you look at categories 4 and 8, they both begin a bit past the halfway line really close to the left touchline, but while in Category 4 motifs the three-pass sequence ends a bit further up but still on the left hand touchline, the Category 8 motifs made their way across the pitch to finish closer to the right hand touchline.

The 94.3% repeatability of the ‘x-y coordinates, 9 categories’ vectorisation is incredibly high. In fact, if we remove Sunderland and West Bromwich, which for some reason only have 80.2% and 83.5% repeatability respectively, the other teams have an average repeatability of 95.7%!

These results mean that we’ve managed to pin down an underlying structure in teams’ passing networks that allows us to identify unique team styles (let’s call them "Passing Network Autographs") with a high degree of confidence. We’re at the point where, given a set of motifs, we could make a robust educated guess at which team they correspond to and most likely be right (except perhaps if they belong to West Brom or Sunderland, for some reason). As an example, below is a comparison of Arsenal's autograph versus Bournemouth's (the team whose 'autograph' most differs from Arsenal's):

Perhaps some readers will be unimpressed with this rather theoretical and un-applied result, but although I admit that in its raw form this seems a bit unmanageable, I would advise them to keep an open mind and think of the potential. For example, having such a reliable ‘passing network autograph’ for teams, we can look through players from outside the Premier League and find those whose current passing network best fits within a team’s autograph. We could also use our measure of team style to try and predict which styles are more effective against each other, or which defences are the best at interrupting the attacking flow through a team’s passing network. These possibilities probably sound more appealing to most readers, but to pursue them in a meaningful way they must be underpinned by theoretical confidence that we are indeed identifying team styles. I will try and follow up this theoretical entry with a more applied one exploring some of these possibilities later this month.

I want to finish this entry off by highlighting the important potential for generalisation these ideas have. I feel they’ve helped me establish best practice when it comes to breaking passing motifs into different categories according to their spatial properties (and by best practice I mean knowing how many categories to break them into); but the method can also be used to determine best practice in other ideas currently being explored by football analysts. For example, during my Opta Forum presentation’s Q&A, Marek Kwiatkowski asked whether the passing motif methodology could be generalised to motifs of more than 3 passes. The answer is that it can, but we run the risk of going too far and overshooting the structure that the methodology helped us identify as team and player passing style: for 3-pass long motifs we had 5 motif types, while by 5 or 6-pass long motifs we’re already at 52 and 203 types respectively, with wacky things like ABCADBA. The ideas presented here can help us answer the question of whether it’s worth looking at longer motifs (another entry soon perhaps?). They can also help Dustin Ward establish exactly how many types of passes he should consider. In general, this helps us establish standardised best practice that the whole of football analytics will benefit from and that it's currently distinctly lacking. Echoing Marek’s piece on the state of analytics: “Established scientific disciplines rely on abstract concepts to organise their discoveries and provide a language in which conjectures can be stated, arguments conducted and findings related to each other. We lack this kind of language for football analytics”. We need common-ground theory in which our public work can be related and compared, and its worth truly understood. The lack of it is holding back all of us who have an active interest in seeing the field really take off.
I hope this approach to improve our understanding of our ideas and take steps towards enhancing them and establishing best practice can inspire other public (and even private) analysts to attempt similar things in their work and establish bridges through which we can compare and complement our work. Valuable applications will inevitably flow from robust, interconnected theory.

MATHEMATICAL FOOT-NOTE: Comparing the distance between A and B vectors via their position from 1 to 39 on the closest-to-farthest scale may seem a bit unorthodox, and one might consider simply using the z-score of the distance between teams’ A and B vectors in the context of all the distances between all 40 vectors. However, the reason I don’t do this is that for each different choice of the number of categories, the dimension in which these vectors live is different, and on a personal mathematical note I deeply mistrust comparing distances between things that live in different dimensions.

P.S.: I want to give a brief mention to BenTorvaney, who made a small but meaningful contribution which I feel greatly enhanced the results of this entry.

Tuesday, 14 March 2017

VIDEO: OptaPro Forum 2017 Talk on Passing Motifs

A talk on some of my passing motifs work was selected for the OptaPro Forum 2017 which took place in Central London on February 8th. You can see the full video (except the Q&A) below:

Most of the content I'd already written entries about: for the general overview of the passing motifs methodology for teams, read this entry. For my applied results on teams from the Premier League (and a take on what the hell happened with Leicester) read this and this entry. Below are the images for the hierarchical clustering dendrograms and the principal component analysis graphs of these 5-dimensional representations of Premier League teams for the 2014-15 and 2015-16 seasons.

Moving on to the player level, you can read up on the general methodology in this entry, and then on the specific scoring system that I presented at the forum to create those lists of players in this entry. Below are some of the lists I had on display, which are respectively: Premier League 15-16 using 'Key Passes' to award points, Bundesliga 15-16 using 'Key Passes' to award points, and finally Premier League using 'Expected Assists (xA)' to award points. For that last one I used Opta's xA numbers, which account for the probability of a pass turning into a shot with a certain xG value.

Finally, I also did a bit on using Topological Data Analysis (TDA) to explore the results for players which I hadn't done before; although to read up on the general methodology of TDA you can read this entry (wow how things have changed since that entry! I of course now know that Opta doesn't really log 'controls with left thigh'. Don't be fooled by how assured I wrote about the analytics industry back then, I honestly didn't know half of what I know now about that world just 10 months later... hopefully my future self in another 10 months will also look back with pity at my current self's ignorance).

Below is the image from the forum:

Finally, I want to use this (non) write-up on the presentation as a platform to discuss some more general reflections about analytics. 'Operationalising' is a hideous-sounding word which was horribly difficult to repeatedly say in front of 300 people; but it actually is very important. There is so much complexity in raw football data that those of us who do analytics really need to broaden our scope when thinking about how we will represent this raw information in numbers, vectors or variables that will help us uncover the rich underlying information that is there for the taking. The 'passing motif' operationalisation of raw passing network data is 'neat'; it seems harmless when you first see it and I wouldn't blame you for doubting that those 5/45 numbers attributed to teams/players will actually say much about them, but evidently they do. I think that what it's got going for it is that it helps to account for the sequentiality of the raw events, something which most of the work I encounter out there fails to do. As I said in the presentation, we're a bit too focused on individual events when it's actually the sequences of these events that matter.

There is a classic problem though (akin to the overfitting problem in modelling) when trying to account for larger and larger sequences: if we become too granular and, for example, apply this methodology not to 3-pass long motifs but to 10-pass long motifs, then the occurrences become so specific that we actually lose comparability in the structure of our information. Alexis would have such a specific distribution of highly differentiated sequences that he would have no neighbours to reward him for their key passes! We need to strike a balance between sequentiality and, let's call it, "non-granularity" (this was actually one of the questions at the forum: can/should the methodology be generalised to more than 3 passes?).

Finding the correct concepts that strike this balance is the challenge of analytics. Passing motifs are "neat"; but even I recognise that they are nowhere near the ambitions of what I would hope to achieve in analytics. Exciting years to come!

Monday, 16 January 2017

Player Vectorised Representations: What player lists can we draw up with confidence?

I love drawing up lists and rankings of players (who doesn’t?) and giving myself a big “confirmation bias” pat on the back when I see players on the list which I like, while casually either ignoring as a mistake of the method or updating my bias for the players on the list which I don’t particularly rate highly. However, the very exercise of drawing up lists and rankings can be misleading for the probabilistically illiterate, because it seems to imply set-in-stone certainty about who the best player is, who the second best is, etc.; this rigid numbering masks the underlying concepts of probability. And yet, drawing up player lists is key for the recruitment workflows of clubs, be it in drafts or transfer windows, or even just to set up a schedule for their scouts. You definitely don’t have to see the rankings as set in stone, but I can imagine clubs would definitely want to have things like 15-man shortlists with 2 or 3 ‘favourite choices’. In this entry I’m going to show you a couple of lists I drew up, and how we can go about our list-making with confidence with vectorised representations of players.

I drew up lists for this entry using the player passing motif ideas from previous entries. The passing motif methodology produces a vectorised representation of players, which basically means that each player is represented by a vector of numbers. In the passing motif methodology I’ve used so far, the vector representing each player has 45 entries or numbers. The key conceptual bit is that when you have this type of vectorised representation, you can imagine each player as being in a “space” of some sort. To imagine it, suppose that instead of 45 you simply had 3 numbers representing each player, something like age, height and weight. If this was the case, you could imagine each player as being represented by a dot in a 3-dimensional space much like your living room. Some players would be closer to others, some would be farther away. Perhaps all the senior, tall and heavy centre-backs are located around your TV, while the shorter and lighter second strikers are hovering around your dining room table. This is just how I conceptualise the result of the 45-dimensional passing motif methodology. It makes it more abstract to picture, but just as in the 3-dimensional case, there are distances, certain dots closer to each other or concentrated around certain areas, etc.

The list I drew up basically took all the players who had at least 18 appearances in last year’s Premier League, and gave them “points” according to how many key passes they made AND how many key passes the players around them made. The closer to a player you are, the more “points” his amount of key passes awards you; the farther away the less. I tried this out in a few ways but that’s the basic idea. The idea is that if you happened to make few key passes in the season but all the players whose motif vector is close to yours made a lot, you should still have a high score. If the information contained in the motif vectorisation is at all useful to recognise players with creative potential, then the best scored players should in a way be the best creative passers in a more profound way than simply looking at the Key Passes Table. The question is precisely, how do we know the vectorisation’s layout of players has anything to do with their “key passing ability” (i.e., players with high ability cluster around certain areas of this “space” and are in general closer to each other)? Let’s look at the list before we begin to answer this question so everybody gets a bit excited before it dawns on them that I’m actually rambling on about some technical stuff.

Remark: Notice how this list isn’t strictly correlated with key passes. Drinkwater is better ranked than Eriksen even though the latter had many more key passes. This means that if the list is sound (big if), it's picking up on information that wasn’t immediately and explicitly available in the key passes tables.
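The neighbour-weighted scoring described above can be sketched as follows. The Gaussian kernel and the bandwidth are illustrative choices of mine (the entry says several weightings were tried, so this is not necessarily the exact one used):

```python
import numpy as np

def creativity_scores(vectors, key_passes, bandwidth=1.0):
    """Score players by their own and their neighbours' key passes.

    vectors:    (n_players, d) array of passing-motif vectors (d=45 here)
    key_passes: length-n array of key-pass counts
    A player's score is a distance-weighted average of key passes, so a
    player surrounded (in motif space) by prolific creators scores high
    even in a personally quiet season. The Gaussian kernel and bandwidth
    are illustrative assumptions, not the entry's exact weighting.
    """
    diffs = vectors[:, None, :] - vectors[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)          # pairwise distances
    weights = np.exp(-(dists / bandwidth) ** 2)    # closer -> more weight
    return weights @ key_passes / weights.sum(axis=1)
```

With this construction, a player with few key passes of his own but close motif-space neighbours who create a lot will still receive a high score, which is exactly the behaviour the list is meant to exhibit.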

My confirmation bias seems to like that list quite a lot, there are a lot of good names up there. Most readers probably follow the Premier League closely and know that those are all good attacking creative players, arguably the best in the league. Now imagine that instead of the Premier League, we drew up an equivalent list using data from leagues where we didn’t already know the players, and had confidence that just as in the case of the Premier League, we were definitely getting out a list of most of the best players. Should be useful huh?

There are also some notable absentees. Coutinho comes to mind as a player which is widely agreed to be amongst the best in the league who isn’t on the list. Why should we trust a list that claims to rank the top 15 creative players in the league but leaves out Coutinho?

As I said before, I think of the vectorised representation as encoding the information regarding players’ key passing ability if players who tend to have a higher number of key passes are more or less clustered together as opposed to randomly located mixed with all the other players. If this is a general trend, then we know that there is a relationship between a player’s key passing ability and his location in the 45-dimensional space we are imagining. Even if a player happened to not have many key passes in a season (this can happen just as strikers have goal droughts or perhaps because a player’s teammates don’t make good runs), we should still pick up on this “ability”.

What we would need then to justify our faith in the list is some sort of indicator which specifies just how “clustered together” players with higher numbers of key passes are. There are many ways to approach this problem in mathematics. For those readers who have mathematical backgrounds: we could try to fit a model and assess the goodness of fit, or apply some sort of multi-dimensional Kolmogorov-Smirnov technique comparing the actual distribution of vectors and key passes with one where the key passes were distributed randomly. However, all these tests are a bit technical and hard to apply in high dimensions, and all in all we really want an indicator more than a model of “Expected Key Passes”. Here’s a simpler validating technique:

For each player, take his K (in my list K=10) closest neighbours and compute the standard deviation of their key passes. Once we’ve done this for every player, we can compute the mean of these standard deviations over all the K-player “neighbourhoods” (let’s call it the ‘mean of neighbourhood variation’ number, or MNV). If in each neighbourhood the players have a relatively similar number of key passes, then the MNV should be comparatively low. The important question is: what do we compare it to in order to know whether it's low or not?

I feel that there are two important numbers to compare it to. The first comes from simulating many (many) scenarios where the key passes are randomly permuted amongst the players, and comparing the real MNV to the average over these simulated cases. The second number would be the minimum MNV of any of the simulated scenarios. If the MNV of our actual vectorised representation is “low” in comparison to these simulated scenarios, then we know that the players’ layout in this imaginary 45-dimensional space clusters the key passers of the ball closer together than random distributions; which in turn would mean that the logic applied to obtain the list has a robust underlying reasoning, because a player’s location in the 45-dimensional space should have something to do with his “key passing ability” (I fear I may have lost half the readers by this point…).
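The MNV and its permutation benchmark can be sketched as follows (an illustrative implementation under the description above: K nearest neighbours by Euclidean distance, with random permutations of the key-pass counts as the null reference):

```python
import numpy as np

def mnv(vectors, key_passes, k=10):
    """Mean of neighbourhood variation: the average, over players, of the
    standard deviation of key passes among each player's k nearest
    neighbours in motif space."""
    dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)
    stds = []
    for i in range(len(vectors)):
        neighbours = np.argsort(dists[i])[1:k + 1]  # skip the player himself
        stds.append(np.std(key_passes[neighbours]))
    return float(np.mean(stds))

def permutation_test(vectors, key_passes, k=10, n_sims=1000, seed=0):
    """Compare the real MNV to MNVs under random permutations of key passes.

    Returns (real MNV, mean simulated MNV, minimum simulated MNV)."""
    rng = np.random.default_rng(seed)
    real = mnv(vectors, key_passes, k)
    sims = [mnv(vectors, rng.permutation(key_passes), k)
            for _ in range(n_sims)]
    return real, float(np.mean(sims)), float(np.min(sims))
```

If the real MNV sits below even the minimum of the simulated MNVs, the layout clusters key passers together better than any of the random assignments, which is the comparison made in the results below.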

Here are some results:

Of the 100,000 simulations, the lowest MNV was 14.62 while the actual MNV is 11.86. This means that if we randomly assigned the players to positions in the 45-dimensional space 100,000 times, not one of those simulations clusters the key passers together as well as our actual passing motif representation. This is quite promising, but even then, I suspected that this might simply be because the method clearly recognises the difference between defensive players and attacking players, and attacking players are much more likely to get more key passes; so I repeated the validation using only attacking midfielders, wingers and strikers:

The results are less overwhelmingly positive, but even when looking only at attacking players, the actual layout surpasses every one of the 100,000 random distributions of the players. To appreciate the value of this method and what information it is actually giving us, let’s compare it with an equivalent list drawn up using ‘goals’ to award points rather than ‘key passes’ (again using only attacking players, for the same reason as before).

The MNV numbers are naturally smaller because players score far fewer goals per season than key passes, so the overall scale of the problem is smaller. We can see that even though the real MNV is smaller than the average of the simulations, it’s actually relatively large when compared to the minimum MNV obtained through random simulations (notice how important it is to have a frame of reference for judging when the number is small and when it is large in each specific context). This means that random simulations can cluster players and goals together in the 45-dimensional space considerably better than the passing motif vectorised representation does. As opposed to the ‘key passes’ case, this vectorisation doesn’t encode much information pertaining to “goalscoring capability”. This actually makes sense, since the passing motif methodology is designed using only information from the passing network, which doesn’t necessarily contain information regarding finishing. Therefore, the list made using ‘goals’ is much less reliable.

Coming back to Coutinho’s absence from the original list, it’s important to understand that I’m not claiming the list is a know-it-all oracle for creative talent, or that this talent can be rigidly ordered. What this entry tried to show is that there is solid evidence that a player’s position in the 45-dimensional space determined by the passing motif methodology encodes a good amount of the information which determines how many key passes he ought to have, given a sort of “passing ability”. That doesn’t mean it encodes all the information. Perhaps the vectorised representation is missing out on what it is that makes Coutinho great. Nevertheless, once we’ve accepted and understood what the list can offer us, I doubt any club could claim that lists like this from different leagues around the world are of no use to their organisation just because they might miss out on the Serbian League’s Coutinho (sadly, such is the ‘glass half empty’ prejudice that analytics faces).

Finally, this way of looking at the problem of rating players opens the door to a host of possibilities. When I was doing my bachelor’s in pure mathematics I was actually more interested in differential geometry and topology courses than statistics courses, which is why I tend to think of data observations as vectors in high-dimensional spaces and believe that their positions in those spaces encode valuable information. This entry began by taking a vectorised representation (passing motif vectors) and established that, if we look at the number of key passes each player made, the players’ positions in this space seem to encode this information. On the other hand, they didn’t seem to encode the information pertaining to goalscoring. That isn’t to say the representation might not encode information regarding other metrics. Expected Assists maybe? It also doesn’t mean that other vector representations don’t encode some of this information better than my own passing motif representation. It’s a bit of a three-step thing really: 1. Find a vector representation; 2. Check what sort of information it seems to encode well (especially information that isn’t explicitly available elsewhere); and 3. Find a way to give players a rating using this fact.

I hope this way of thinking encourages other analysts out there to try their hand at this sort of work! 

Wednesday, 30 November 2016

Player Passing Motif Style Application: Most Distinctive Players and Best Recruitment Opportunities

I was recently invited to send some of my work on passing motifs to be used for a Fink Tank column in The Times, but of course a dendrogram such as the one I linked in my previous entry wouldn’t cut it in printed media. Therefore, I thought the best thing to do was set out to answer some concrete applied questions the methodology might answer, which would be easy to display but interesting nevertheless. The content of this entry was the result:

I started by thinking about “distinctive” players; players that couldn’t be replaced easily. Remember that there is solid evidence that the methodology outlined in the previous entry picks up some underlying information on player passing style, and that two players are considered similar if their vectors are “close” to each other. With this in mind, I computed, for each player, the average distance to his 10 closest players. The players for whom this number was highest were considered the most distinctive. The following table shows the top 5 along with their closest neighbours (this whole entry is based on data from the 2015-16 Premier League, without goalkeepers):

First of all, I strongly suspect that Ulloa’s presence there is a bit of an oddity: most of his appearances were late substitutions when Leicester were holding onto a 1-0 lead, and his stats therefore reflect this unique predicament.

Ozil is the most distinctive player in the league. Looking through his closest players, Nacho Monreal and Bacary Sagna may raise a few eyebrows, but all in all even they are considerably far away from his style and therefore don’t reveal much about him. It’s a bit like the US mainland and Australia being amongst the closest countries to Hawaii; so treat those neighbours with due suspicion. The rest of the players seem to make good footballing sense. I’ll leave it to the readers to read through the results and make their own judgements.
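A minimal sketch of the distinctiveness score (the function name is mine; it assumes the player vectors are already built):

```python
import numpy as np

def distinctiveness(vectors, k=10):
    """For each player, the average distance to his k nearest neighbours in
    the vector space; higher values mean a style that is harder to replace."""
    vectors = np.asarray(vectors, dtype=float)
    scores = []
    for i in range(len(vectors)):
        dists = np.linalg.norm(vectors - vectors[i], axis=1)
        scores.append(np.sort(dists)[1:k + 1].mean())  # skip self (distance 0)
    return np.array(scores)
```

Sorting each row of distances and averaging the first k (after the player himself) is all there is to it; an isolated player like Ozil simply has no one nearby.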

Another interesting question I thought of was this: which players represent the best recruitment opportunities, in the sense that they have similar styles to players who play for much better ranked teams? Something like Sunderland players having similar styles to players from Arsenal, Manchester City or Tottenham. There are several ways to answer this question. Let’s start with the simplest: for each player I computed the average final league position of the teams of his 10 closest players, and subtracted that number from his own team’s final position. The players for whom this number was highest can be considered to represent the “best” opportunities. The following table shows the top 10:

I didn’t watch Aston Villa much last season and honestly have no opinion on Ashley Westwood. It’s good to see Nathan Redmond and Idrissa Gueye on there though, considering they have since moved to Southampton and Everton respectively, proving they were in fact capable of playing for better teams. The problem with this methodology, though, is that it assumes that the difference in players’ quality is linear in league position. That is to say, the difference in quality between a player from the 16th-placed team and one from the 20th is assumed to be the same as the difference between a player from the champions and a player from the 5th-placed team, when in truth there is no solid basis for this assumption.

One way to deal with this is to apply an increasing concave function to league position so that the same difference in positions lower down the table is weighted less than for higher-placed teams. I tried out a few functions like log, square root, cube root, etc., and although the results vary marginally, the same core of names pops out for most of them. As an example, the table below shows the results of this methodology applying the fourth root to league position:
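Both the plain position-difference score and the concave-transform variant can be sketched as follows (the function name and the `transform` hook are my own):

```python
import numpy as np

def opportunity_scores(vectors, positions, k=10, transform=None):
    """League-position opportunity score: the player's own team's
    (transformed) final position minus the average (transformed) position
    of his k closest players' teams.  Higher = better opportunity."""
    vectors = np.asarray(vectors, dtype=float)
    f = transform if transform is not None else (lambda p: p)
    tp = np.array([f(p) for p in positions], dtype=float)
    scores = []
    for i in range(len(vectors)):
        dists = np.linalg.norm(vectors - vectors[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip the player himself
        scores.append(tp[i] - tp[neighbours].mean())
    return np.array(scores)

# Plain version vs. the fourth-root variant:
# opportunity_scores(V, pos)                                 # linear in position
# opportunity_scores(V, pos, transform=lambda p: p ** 0.25)  # fourth root
```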

Some more satisfying names show up now (Ashley Williams, who moved to Everton, and Jason Puncheon, who is pretty good), but it’s still unclear whether this methodology properly represents the differences in quality required to play for different teams. Perhaps a better way to look at it is by points obtained rather than league position. Is the ratio of points a good representation of the difference in quality? The idea would be that if a team obtains 90 points then its players must be three times as good as those of a team which obtains 30 points. The following table shows the top ten players for whom the ratio between the average points of the teams of their 10 closest players and their own team’s points is greatest:

NOTE: I had to exclude Aston Villa players here because they obtained so few points that the method right away assumed it was twice as hard to play for Norwich as for Villa (Aston Villa made 17 points and Norwich 34), and naturally all the Villa players dominated the rankings.
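Reading the ratio as the neighbours’ average points over the player’s own team’s points (the orientation the Aston Villa note implies), a sketch with my own function name:

```python
import numpy as np

def points_ratio_scores(vectors, team_points, k=10):
    """Ratio of the average points of the k closest players' teams to the
    player's own team's points; a ratio well above 1 marks the player as
    similar in style to players from much stronger teams."""
    vectors = np.asarray(vectors, dtype=float)
    pts = np.asarray(team_points, dtype=float)
    scores = []
    for i in range(len(vectors)):
        dists = np.linalg.norm(vectors - vectors[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip the player himself
        scores.append(pts[neighbours].mean() / pts[i])
    return np.array(scores)
```

The Villa problem is visible in the division: a 17-point denominator inflates every ratio for that team’s players.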

Nathan Redmond is on the list again, which is good to see, as is Moussa Sissoko, who moved to Tottenham this season. M’Vila, Cisse and Watmore are other players on the list whom I rate highly but, again, let’s allow the readers to make their own judgements.

To wrap this theme up, the final way of answering this question combines some of the best elements of the previous two ideas: for each player we compute the difference between the average squared points of the teams of his 10 closest players and his own team’s final points squared. The square is taken to compensate for the fact that the quality required to go from a 60-point team to an 80-point team is higher than the quality required to go from a 30-point team to a 50-point team. The table below shows the results:

There are some good names on there. Redmond definitely seems to be a good catch by Southampton. Now, remember that this methodology isn’t meant to be a magic crystal ball. Some people who know I do this type of work constantly ask me: “So, who’s going to win the league? Who’s the next Messi?”. They fail to understand the subtlety of what there is to actually gain from data mining. Ashley Westwood might be great, but then again he might not. Nevertheless, several players from last season whom traditional scouting rated highly enough to be recruited by larger teams are also liked by this method.
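The squared-points variant is only a small change from the earlier scores; a sketch, again with a made-up function name:

```python
import numpy as np

def squared_points_opportunity(vectors, team_points, k=10):
    """Difference between the average squared points of the k closest
    players' teams and the player's own team's points squared.  Squaring
    makes a 60->80 point gap count for more than a 30->50 point gap."""
    vectors = np.asarray(vectors, dtype=float)
    pts2 = np.asarray(team_points, dtype=float) ** 2
    scores = []
    for i in range(len(vectors)):
        dists = np.linalg.norm(vectors - vectors[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip the player himself
        scores.append(pts2[neighbours].mean() - pts2[i])
    return np.array(scores)
```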

It’s pretty remarkable that this methodology seems to be so rich when it is ignoring a lot of relevant information such as shots, goals, tackles, etc. It only sees what is visible in the passing network, which seems to be enough to make some of the same decisions that well-informed professional recruitment made, like picking up Gueye, Redmond, Sissoko, Ashley Williams, etc. That doesn’t mean it has all the answers. For example, it might like a centre back who is good at playing the ball out from the back, but it has no way of knowing whether he is also defensively sound. If complemented with further sources of information (such as direct traditional scouting), however, this type of work can be very useful for clubs.

Finally, and on a bit of a sourer note, I thought it might also be interesting to look at some of the “worst” opportunity players; that is to say, players who are similar to players from much worse teams than their own. I had to exclude Leicester players because they completely dominated the top ten in most lists I drew up; even Kante, Mahrez and Vardy. I’m not sure what to make of this: even though they were the champions, their players aren’t particularly similar to the players from other top teams. Just so it doesn’t seem like I’m slipping something past you, here are the 10 closest players to Kante, Mahrez and Vardy:

Now, excluding players from Leicester, here are the “worst” opportunities using the “squared points difference” metric outlined above:

Make of it what you will, but keep in mind that everything has a context and even if I claim this method sees a lot, I also recognise it doesn't know everything. If you see a name in there you don't like, keep calm, take a deep breath, look through the closest players and have a think about what might be going on.

Monday, 31 October 2016

Passing Motifs at a Player Level: Player Passing Style

This is a pretty exciting entry, so bear with me if it gets a bit long; I think it’s worth it…

Ever since the first entry on passing motifs I’ve mentioned the potential of extrapolating the methodology to study passing styles at a player level. That first entry mentioned the idea set forth by Javier Lopez and Raul Sanchez to answer the question “Who can replace Xavi?”. Nevertheless, that particular example always left me wanting more, because the outcome was noticeably skewed towards players from Barcelona and a few other teams like Arsenal and, surprisingly, even Swansea. It made me think that the methodology was ignoring individual player traits and instead picking up stats reflective of the team the player plays for, not of the player himself.

I’ve been thinking ever since about the best way to extract player passing style from passing motifs. Here are some of the ideas I’ve had:
  • One first objective is to neutralise the effect of the team’s passing style on a player. If a team proportionately uses ABAB a lot, then inevitably so will its players. Put Fernandinho in Barcelona and his motif frequencies will start to resemble those of the whole team, without this ever having been inherent to him. The idea I had was to look at how a player’s relative motif frequencies diverge from his team’s in each match. That is to say, if in a match 40% of Arsenal’s motifs were ABAC while 43% of the motifs Coquelin was involved in were ABAC, then Coquelin scores +3% for that motif in that match. Averaging over the whole season, Coquelin can be seen as a 5-dimensional vector where each entry corresponds to his average divergence for one of the 5 motifs. When the performance of this vectorisation is measured through the methodology outlined in my previous entry, using data from the 2014-15 and 2015-16 seasons of the Premier League (only players who had at least 18 appearances, to avoid outliers), this was the result:

The fairly negative z-scores reveal that this methodology has agreeable stability across those two seasons and is therefore picking up on some underlying quality of the players’ passing style.
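A sketch of this divergence vectorisation, assuming each match provides the five motif counts for the player and for his team (the function name is mine):

```python
import numpy as np

# Motif order assumed throughout: ABAB, ABAC, ABCA, ABCB, ABCD.
def divergence_vector(player_counts_per_match, team_counts_per_match):
    """Average, over a season, of the difference between the player's
    relative motif frequencies and his team's in each match.  Both
    arguments are lists of length-5 count sequences, one per match."""
    diffs = []
    for p, t in zip(player_counts_per_match, team_counts_per_match):
        p = np.asarray(p, dtype=float)
        t = np.asarray(t, dtype=float)
        diffs.append(p / p.sum() - t / t.sum())
    return np.mean(diffs, axis=0)
```

With one match where the team’s motifs are 40% ABAC and the player’s are 43%, the ABAC entry comes out at +0.03, matching the Coquelin example above.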

  • Just as we did for team motifs, instead of considering the raw counts of motifs a player performed, we consider each match performance by a player as a 5-dimensional vector in which each entry is the percentage of the player’s total motifs corresponding to that motif type. So we can represent a match played by Romelu Lukaku as 5% ABAB, 13% ABAC, 25% ABCA, etc. Averaging over a whole season, each player is represented by a 5-dimensional vector.

Once again, we’re reasonably happy that this vectorisation is picking up on stable player qualities.

  • Another way of looking at the data which I felt might be useful is to see each player’s match as the proportion of each motif type his team performed that he participated in. That is to say, if Southampton completed 50 instances of ABAB in a match and Jordy Clasie participated in 25 of those, he would have a 50% score for ABAB in that match. If in that same match Southampton completed 80 instances of ABAC and Clasie participated in 20, he would have a 25% score for that motif. Applying this logic to the 5 different motifs and averaging over the whole season, each player is once again represented by a 5-dimensional vector. This is how well it performs:

Of the three 5-dimensional vectorisations shown so far, this is by some margin the one which performs best. Both its z-scores are considerably lower than the other two’s, meaning it’s capturing pretty stable information for each player.
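This participation-share vectorisation can be sketched in the same shape as the previous one (again, the function name is mine; a motif type the team never completed in a match is scored 0 by assumption):

```python
import numpy as np

def participation_share_vector(player_counts_per_match, team_counts_per_match):
    """For each match, the share of each of the team's five motif types the
    player took part in; averaged over the season into a 5-dim vector."""
    shares = []
    for p, t in zip(player_counts_per_match, team_counts_per_match):
        p = np.asarray(p, dtype=float)
        t = np.asarray(t, dtype=float)
        # Guard against dividing by zero when the team never completed a motif.
        shares.append(np.divide(p, t, out=np.zeros_like(p), where=t > 0))
    return np.mean(shares, axis=0)
```

Feeding in the Clasie example (25 of Southampton’s 50 ABABs, 20 of their 80 ABACs) yields 50% and 25% for those two entries.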

  • In the first entry regarding passing motifs we mentioned how the motifs could be vectorised as a 15-dimensional vector for players. To refresh your memory: in an ABAC sequence, a player can participate as the A player, the B player or the C player. It’s straightforward to count that, looking at all 5 motifs, there are 15 “participation” possibilities for each player. If we count how many times each player took each role in each of the 5 motifs, we are left with a 15-dimensional vector representing each player. This is essentially the methodology used in the “Who can replace Xavi?” article.

     Comparing things across different dimensions is rather difficult and not very standardised in mathematics, but I would dare say that it performs worse than the previous 5-dimensional vectorisation, especially judging by Z-Score 1, which is the most important indicator.

  • Finally, we can take this 15-dimensional idea and alter it slightly: rather than counting the total of each pseudo-motif, we use their relative frequencies. So, if Dimitri Payet performed the B role in an ABAC 15 times out of 100 total motifs he participated in, that pseudo-motif gets a score of 15%. Once again, each player is represented by a 15-dimensional vector:

Immediately we appreciate that this is the best performing of all the vectorisations we have seen.

Now, the first thing to say is that all five ways of obtaining player vectors shown here show evidence of uncovering some stable, underlying qualities of players’ passing style. We have used the indicators to compare them and discuss which might be better, but there is no way of determining whether some information which one of them picks up on is missed by another.

Here’s the advantage: there is no downside to combining them all. If we simply glue all these representations together into one long 45-dimensional (5+5+5+15+15) vector representation, then all the qualities each methodology picked up on are represented at some scale. If two players were similar across all representations, they will be similar in the long one as well; if two players were similar in some of the representations but not others, then they will be mildly similar, depending on how dissimilar they were in the rest; etc.
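The gluing itself is a plain concatenation; in the sketch below I also z-score each block first, which is an assumption of mine (not something stated above) to keep the raw scale of any one representation from dominating the distances:

```python
import numpy as np

def combined_vectors(blocks):
    """`blocks` is a list of (n_players, d_i) arrays, e.g. three 5-dim and
    two 15-dim representations.  Each block is standardised column-wise
    (my assumption) and the blocks are concatenated into one
    (n_players, 45) array."""
    scaled = []
    for b in blocks:
        b = np.asarray(b, dtype=float)
        sd = b.std(axis=0)
        sd[sd == 0] = 1.0  # leave constant columns untouched
        scaled.append((b - b.mean(axis=0)) / sd)
    return np.hstack(scaled)
```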

Here is the performance of this long 45-dimensional vectorisation:

The results are very satisfying, and it proves to be a robust vectorisation of player passing style: more than 1 standard deviation below the mean distance between all players and more than 4 standard deviations below the Gaussian distances, even in this very high-dimensional space.

This vectorisation will surely provide me with a lot of material to explore for a good while; it’s even a little frustrating not finding an easy visual way to convey it to the readers. Let’s settle for now on a hierarchical clustering dendrogram as a visualisation tool.
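As a sketch of how such a dendrogram can be produced (assuming SciPy is available, and using a small random stand-in for the real 279×45 player matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Toy stand-in for the real (279 players x 45 dims) matrix:
# two well-separated groups of five "players" each.
rng = np.random.default_rng(0)
vectors = np.vstack([rng.normal(0.0, 0.1, (5, 45)),
                     rng.normal(3.0, 0.1, (5, 45))])

Z = linkage(vectors, method="ward")              # agglomerative clustering tree
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 groups

# To draw the actual tree (with player names as leaf labels):
# dendrogram(Z, labels=player_names)
```

Ward linkage is one reasonable choice among several; the blog doesn’t state which linkage was used for the pdf below.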

Below is a link to the pdf of the hierarchical clustering dendrogram applied to the data set for the 2015-16 season of the Premier League (only players who played in over 18 matches). Since there are 279 players, the tree labels are really tiny and the image couldn't be uploaded onto the blog directly, but on the pdf you can use your browser's zoom to explore the results.

If you'd rather not, here is a selection of the methodology's results:

  • Mesut Ozil has one of the most distinctive passing styles in the league. Cesc Fabregas is the player closest to him and together they form a subgroup with Juan Mata, Ross Barkley, Yaya Toure and Aaron Ramsey.
  • Alexis Sanchez is in a league of his own, but the players with the most similar passing style are Payet, Moussa Sissoko, Jesus Navas, Sterling and Martial.
  • Troy Deeney is in the esteemed company of Aguero, De Bruyne, Oscar and Sigurdsson.
  • David Silva, Willian, Eden Hazard and Christian Eriksen are all pretty similar.
  • Nemanja Matic, Eric Dier and Gareth Barry have a similar passing style.
  • M’Vila, Lanzini, Capoue, Puncheon, Ander Herrera and Drinkwater are all similar, pretty good and perhaps underrated.
  • Walcott, Iheanacho, Scott Sinclair, Jefferson Montero, Wilfried Zaha, Bakary Sako, Albrighton, Bolasie and Michail Antonio form a subgroup of similar wingers.
  • Giroud is more similar to some rather underwhelming strikers such as Gomis, Cameron Jerome and Papiss Cisse than to world-class strikers. The same can be said of Harry Kane being similar to Arouna Kone, Son and Marc Pugh. Maybe the methodology is not as convincing for strikers?
  • Shane Long and Odion Ighalo are good alternatives to Jamie Vardy.
  • Diego Costa and Lukaku are similar to Rooney.
  • Victor Moses, Aaron Lennon and Jordon Ibe are similar.
  • Mahrez is similar to Sessegnon, Nathan Redmond and Jesse Lingard. Did Southampton know this?
  • Matt Ritchie (ex-Bournemouth now at Newcastle) is in a group with Lallana, Alli, Pedro and Lamela. An opportunity for the taking?
  • Angel Rangel has (and has always had) unusual stats for a full-back.
  • The methodology recognises who the goalkeepers are and sets them apart, without this information being explicitly available in the datasets. The same applies to many other players from similar positions who are grouped together, like the centre backs and full-backs.

This is a poor man’s substitute for actually exploring the dendrogram yourselves. Not to mention that a clustering dendrogram isn’t even the most faithful representation of the information collected by this vectorisation, but I’m more than happy with the results and feel there is real promise to the methodology. If I can come up with better visualisations for the results I’ll post them later on.

Please have a look through the results from the dendrogram and comment on whether you feel we’re getting close to convincingly capturing player passing style through passing motifs.