One of the major flaws identified with Corsi as a metric is that it is very prone to influence from factors outside of the player’s purview: where he starts on the ice, who he plays with, and who he plays against are decisions made by his coach and his general manager. Of course James Neal looks terrific in Corsi; he plays the whole bloody game with either Sidney Crosby or Evgeni Malkin. Of course the Sedins have good Corsis; they start in the offensive end more than the opposition’s netminder. I’ve been trying to think of a way to adjust Corsi to take those factors out, to work out what portion of a player’s Corsi is explained simply by the circumstances he finds himself in, and what portion is the residual talent/results.

To review, Corsi is simply a shot attempt metric: how many shot attempts your team takes at the opponent’s net while a player is on the ice, net of how many shot attempts the opposition takes at yours, expressed per 60 minutes of ice time.
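As a rough sketch of that definition, the rate can be computed from on-ice shot attempts and ice time. The function name, argument names, and sample numbers below are made up for illustration and are not taken from behindthenet.ca.

```python
def corsi_per_60(attempts_for: int, attempts_against: int, toi_minutes: float) -> float:
    """Shot-attempt differential while a player is on the ice, scaled to 60 minutes."""
    if toi_minutes <= 0:
        raise ValueError("toi_minutes must be positive")
    return (attempts_for - attempts_against) * 60.0 / toi_minutes


# Illustrative (made-up) season: 450 attempts for, 400 against, 900 minutes played.
print(round(corsi_per_60(450, 400, 900.0), 2))  # 3.33
```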

I compiled 5 seasons of data using behindthenet.ca, only looking at forwards who played at least 30 games in a season. My original design used three metrics to try to explain a player’s raw Corsi:

- Corsi Quality of Competition – a measure of the average raw Corsi of the players a certain player faces
- Corsi Quality of Team – a measure of the average raw Corsi of the players a certain player plays with
- Offensive Zone Start % – a percentage of how often a player’s shift is started in the offensive zone by his coach

It was obvious from the outset that these variables have a massive amount of explanatory power with respect to a player’s raw Corsi. The only hitch comes when you realize that Corsi QoC correlates POSITIVELY with a player’s Corsi, i.e., the more difficult a player’s competition, the better he seems to do. This is obviously counterintuitive, and is likely due to some recursive effects of in-game strategy: coaches tend to play their best players against each other. To get around this while still trying to incorporate the insights of Corsi QoC, I decided to take a simple differential between Quality of Team and Quality of Competition. And why not? If the average Corsi of a player’s linemates is 2, for instance, and the opposition he plays against also has an average Corsi of 2, shouldn’t that logically mean he’s been placed in a neutral situation, i.e., he has a difficulty factor of 0? If a player’s linemates have an average Corsi of 20, and he plays against opposition with an average Corsi of 0, doesn’t that mean he’s been placed in an incredibly fortunate set of circumstances (with a difficulty factor of +20)? In my analysis, I refer to this new variable as QualDiff.
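In code, QualDiff is a single subtraction. A minimal sketch, with a function name chosen here for illustration, run on the two examples from the paragraph above:

```python
def qual_diff(corsi_qot: float, corsi_qoc: float) -> float:
    """Quality of Team minus Quality of Competition.

    Higher values mean easier circumstances, lower values mean harder ones.
    """
    return corsi_qot - corsi_qoc


print(qual_diff(2.0, 2.0))   # 0.0  (neutral situation)
print(qual_diff(20.0, 0.0))  # 20.0 (very favourable situation)
```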

I then ran regressions using QualDiff and Offensive Zone Starts % on raw Corsi for each of the last five seasons, AND all 5 seasons combined. The results:

Let’s concentrate on the Combined equation, which is greyed above. The formula would then be:

Expected raw Corsi = -11.91 + QualDiff * 1.00 + Off ZS% * 0.24

So, a player with QualDiff of 0, and a zone start rate of 50% would have an expected Raw Corsi of:

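Expected raw Corsi = -11.91 + 0 * 1.00 + 50 * 0.24 = -0.15

(With the coefficients as rounded above, the arithmetic works out to 0.09; the -0.15 figure likely reflects the unrounded regression coefficients.)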

This is so close to zero that not one player with more than 10 GP this year has a raw Corsi nearer to zero. The R-squared of the combined formula is 0.61, meaning 61% of the variation in Corsi can be explained by the variation in these two variables. The p-value of each coefficient is so low that it’s approaching zero.
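For readers who want to try this kind of fit themselves, here is a minimal sketch of a two-variable least-squares regression and its R-squared in plain Python. It is not the tool used for the analysis in this post. The data is synthetic, generated from made-up coefficients plus noise, and the helper names `fit_ols` and `r_squared` are assumptions for illustration.

```python
import random


def fit_ols(rows, y):
    """Least-squares fit of y on the columns of rows plus an intercept.

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination
    and returns [intercept, coef_1, coef_2, ...].
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r_squared(rows, y, beta):
    """Share of the variance in y explained by the fitted line."""
    fitted = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic (made-up) players: QualDiff in [-12, 12], Off ZS% in [35, 65].
random.seed(0)
rows = [(random.uniform(-12, 12), random.uniform(35, 65)) for _ in range(500)]
y = [-12.0 + 1.0 * q + 0.24 * z + random.gauss(0, 5) for q, z in rows]

beta = fit_ols(rows, y)
print([round(b, 2) for b in beta], round(r_squared(rows, y, beta), 2))
```

On real data you would replace `rows` and `y` with each player's (QualDiff, Off ZS%) pair and raw Corsi. A statistics package would also report the p-values, which this sketch does not compute.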

The formula suggests that a player’s Corsi will increase by 1 for every unit increase in his quality of teammates and every unit decrease in his quality of competition. His Corsi will also increase by 1 for roughly every 4 additional percentage points of offensive zone starts. A player with a zero QualDiff and starting every faceoff in the defensive end is expected to have a -11.91 Corsi, which is the intercept above.
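Those readings of the coefficients can be checked with a small sketch of the combined formula. The constant and function names are illustrative, and the coefficients are the rounded ones quoted in this post.

```python
# Rounded coefficients from the combined five-season regression quoted above.
INTERCEPT = -11.91
QUALDIFF_COEF = 1.00
ZONE_START_COEF = 0.24


def expected_corsi(qual_diff: float, off_zone_start_pct: float) -> float:
    """Expected raw Corsi given QualDiff and offensive zone start % (0-100)."""
    return INTERCEPT + QUALDIFF_COEF * qual_diff + ZONE_START_COEF * off_zone_start_pct


# Zero QualDiff and no offensive zone starts: just the intercept.
print(round(expected_corsi(0.0, 0.0), 2))  # -11.91

# One extra unit of QualDiff is worth one unit of Corsi.
print(round(expected_corsi(1.0, 50.0) - expected_corsi(0.0, 50.0), 2))  # 1.0

# Four extra points of zone start are worth roughly one unit of Corsi.
print(round(expected_corsi(0.0, 54.0) - expected_corsi(0.0, 50.0), 2))  # 0.96
```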

Let’s walk through how we can now apply this formula to adjust raw Corsis. First, figure out each player’s expected Corsi using the formula above and the two variables. Then simply take the difference between his actual Corsi and his expected Corsi, i.e., how much better or worse did this player do than expected? I’m calling this “Adjusted Corsi”. All we’re doing is taking away the portion of a player’s Corsi that can be explained by his zone starts and quality factors and seeing what’s left over.
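The subtraction step is short enough to sketch directly. The function name is illustrative, and the two sample inputs are the actual and expected Corsi figures quoted for Jordan Eberle and Lennart Petrell later in this post.

```python
def adjusted_corsi(actual: float, expected: float) -> float:
    """Residual: how far a player's actual Corsi sits above or below expectation."""
    return actual - expected


# Eberle: actual +3.3, expected -9.87.
print(round(adjusted_corsi(3.3, -9.87), 2))     # 13.17

# Petrell: actual -38.8, expected -23.57.
print(round(adjusted_corsi(-38.8, -23.57), 2))  # -15.23
```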

If we apply the formula to this year’s data among players with more than 10 GP, here are the top and bottom 10 players, along with any qualifying Oilers:

Using this table you can follow how the Adjusted Corsi is derived from the formulas at the top. We can see that Jordan Eberle actually has the 9th best Adjusted Corsi in the NHL right now. He plays with crappy teammates (i.e., the Oilers) against relatively difficult competition with middling zone starts, and still has a raw Corsi of +3.3. Since his expected Corsi is -9.87, his Adjusted Corsi is 3.3 - (-9.87) = 13.17. This places into context how strong the Oilers’ kid line is versus the rest of the league, considering the talent they have been surrounded with.

Many of the Oilers do quite well here, while many do quite terribly. Zone starts and Quality Factors actually bump Lennart Petrell out of last place in the league and into 13th last — he is expected to have a Corsi of -23.57, but actually has a Corsi of -38.8, for an Adjusted Corsi of -15.23.

I’d expect the formulas for defencemen to be different, but I will perform that analysis next.

## 14 Comments

This is really interesting stuff, Mark. To see if it passed the sniff test on a larger sample, I ran the formula for last season’s Oilers, and I think the rankings aren’t far from what we would expect:

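| Name | Adjusted Corsi |
| --- | --- |
| HALL | 10.058 |
| PAAJARVI | 9.398 |
| HEMSKY | 7.774 |
| GAGNER | 2.442 |
| HORCOFF | 1.609 |
| SMYTH | 0.31 |
| JONES | -0.052 |
| EBERLE | -0.229 |
| BELANGR | -0.757 |
| NUGE | -1.119 |
| EAGER | -4.793 |
| HORDI | -4.796 |
| LANDER | -7.846 |
| PETRELL | -14.016 |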

Perhaps RNH and Eberle are lower than we would initially expect, but RNH was a raw rookie, and I think there was some real skepticism around Eberle’s underlying numbers.

I’ll be really interested to see your Dman rankings. Cheers!

I like the qualdiff idea. A quick question. If you run the regression with a subsample of players does the coefficient change dramatically? I ask, because as you noted, if you run just the qcomp numbers you get a positive relationship between qcomp and corsi. If you do that with just the players that play the most it is even more positive, while if you run it with players who play fewer games, you get the inverse relationship you expect.

If qualdiff successfully addresses the “recursive effects of in-game strategy” then the coefficient should be consistent across player groupings.

I’ve thought about it some more and I don’t think this method works. Qdiff doesn’t change the Qcomp factor, it simply adds it to Qteam. And because the range of values in Qteam is much greater than Qcomp, Qdiff is really Qteam with a small adjustment. Hence you haven’t addressed the underlying problem with Qcomp and the counterintuitive relationship between Qcomp and Corsi, you’ve hidden it by swamping it with the Qteam data.

That’s part of the realization I had. QualComp *is* a minor factor when compared to qual team. The hardest qualcomps are basically in the 3s, maybe 4s. So when you think about it, it’s as if those players continuously played players the calibre of Matt Stajan, Kyle Turris, Derek Roy, etc., i.e., players with raw Corsis of about 3 or 4. It’s not like Dave Bolland is playing Sidney Crosbys every night and shift; he gets to play a bunch of crap as well to bring his qualcomp down to 1.5.

So when you realize how minor qualcomp is in comparison to qualteam, where players regularly have qualteams of 10+, meaning that they actually *do* play with superstars and very deep talent nearly all the time, that’s when I start thinking about just leaving it out entirely. In the end, you can get a ton of explanatory power out of qualteam by itself.

But I feel like that’s doing a disservice to some of the players who play with crap and play harder comp as well. Those marginal QoC points may mean something to them.

In regression sometimes, the relationship is surprising, and you’ve got to clobber the model into doing what you want. I would dispute that I’m adding QoC to QoT — I’m subtracting it, thereby forcing my logic on the model before any least squares are pulled out of the bag. A high QualDiff denotes easier circumstances, lower means harder. So if a player has a high positive QoC, I’m subtracting it from whatever his QoT is, telling the model that this guy had difficult circumstances by lowering his QualDiff.

I don’t like the idea of restricting the sample in any circumstance, and I have the same hesitation here. Each player has something to tell us.

But seriously, keep the conversation coming, you might convince me to trash QualComp altogether …

Qcomp is a good puzzle. Rationally speaking, it has to have empirical consequences. However, measuring those consequences empirically is very difficult. I take your point about ignoring data; however, the nature of this particular data taken as a whole is that it is subject to the ecological fallacy. And since the use of this data is to measure individual ability, that is a particularly damning problem. Zone starts don’t have this problem because the effect of zone starts is measurably consistent across populations.

That’s why I think you have to isolate a subset of the population that is affected by Qcomp in the way we want to isolate but is less influenced by the distorting iterative effects. This will never be as accurate a number as zone starts but it would be good enough for our purposes. It would provide a tool for allowing for the context of competition as a baseline for more granular analysis. I think that is the best that can be done.

Gee, that Andrew Cogliano guy in #3 is playing with terrible players against tough opposition and with horrible zone starts but is absolutely killing it out there. Too bad the Oilers can never find players like that.

On a related note, do you have any explanation for the type of player we see at the top of this rating? None of these people, while some of them are well regarded, are widely thought of as the type of quality possession driver you would expect to see leading a fully-adjusted Corsi.

The biggest reason you don’t see the Crosbys at the top of this list is that their own quality is leaking into their linemates’ Corsis, infecting their own qualteam. Crosby was 13th, HSedin 29th. So they do show up, but are likely lower than they should be if they are the true bus drivers. But I do like how this brings players I didn’t expect to the forefront. I actually had to google Tye McGinn. Awake GM’s would try to mine stuff like this for value, i.e., finding players that are not having strong absolute Corsi years but are doing very well under the circumstances they’ve been handed.

Is there a way to cancel out a player’s own contributions to his QualTeam number — some kind of WOWY comparison, for example? It seems to me like that might better let the true cream rise to the top in your results.

I am a little confused as to where the zone start factor of .24 comes from.

Is this the average drop in corsi for players starting in own zone vs same player having more offensive zone starts?

Really enjoying the rigour in your analyses.

You noted in your comments that “Awake GM’s would try to mine stuff like this”. Exactly right.

I was thinking from the flip side of the equation: finding whether trades that were busts can be explained.

Many Oiler fans (self included) cheered when Belanger was brought on board – he seemed the defensively reliable, faceoff winning, secondary scoring third/fourth line center we desperately needed. When a few years later the player is dubbed “The Belanger Triangle” (where offense goes to die), things clearly haven’t worked out as expected.

I wonder if your adjusted Corsi methodology would help to explain why. (I use Belanger as an example, but of course the idea is whether any surprise result from a trade, good or bad, might sometimes be explained by the above).

Hope you’re still reading these comments – I just saw this now:

this part of the analysis however is problematic. You are combining variables without any backing for why doing so is appropriate and then fitting a line to data. That’s not likely to be very useful for future data.

There is a decent amount of research (see work by @BSH_EricT and David Johnson of Hockey Analysis) that suggests that Quality of Competition measures matter a hell of a lot less than Quality of Teammates. And yet you’re attempting to weight the two equally.

Actually as a matter of fact, you’re not but what you’re doing isn’t much different. Corsi QOT when combined with Corsi QOC will be extremely close to Corsi QOT because Corsi QOC values over seasons are small. Corsi QOC ranges over a full season between +2 and -2, while Corsi QOT ranges from +12 to -12. So your results are more likely measuring impact of teammates than Competition, and thus the use of “Qualdiff” is misleading.

“You are combining variables without any backing for why doing so is appropriate and then fitting a line to data. That’s not likely to be very useful for future data.”

I explained my logic for doing so, and showed that the relationship holds explanatory power over a period of 5 sequential years. But this shouldn’t surprise anyone — any way you arrange QoT or QoC will result in them having explanatory power over the dependent variable chosen. I must say that I am skeptical of the necessity of using QoC at all, and may very well decide to strip it out entirely in future iterations of this.

And first you accuse me of weighting them equally, but then say I’m not? The entire point of combining the two in this fashion is to observe that QoT’s are regularly an order of magnitude larger than QoC’s… therefore, their effect on the QualDiff variable will be commensurately huge. This is weighting QoT very heavily over QoC over the dataset.

But to me, it’s also about the logic of the approach. QualDiff ponders a number to depict what average game state a player found himself in over a season. We see what tools he was given (QualTeam) and what tools the enemy had (QualComp). After all, we’re just talking Corsi numbers here, it’s not like I’m trying to subtract pterodactyls from puppies.

This is good stuff. Nice to see this. Kudos. About time somebody did this.

One thought: you might weight players by time on the ice in the regression.

You know, I actually did have that variable in there initially, and if I remember correctly it wasn’t really showing a ton of marginal explanatory power. I’ve had the thought to try dummy variables to estimate what line a player was played on based on average TOI. A line 1 player is expected to have better numbers than a line 4 player, for instance…

## One Trackback

[...] on some previous work by Michael Parkatti over at Boys on the Bus, I’ve been trying to improve on his initial try at an Expected Corsi of [...]