This post continues a series of examinations into the concept of team ‘luck’. I initially wrote a three part series on team PDO, testing both it and its constituent parts to find out if they were truly random over time. I found that while team shooting percentage does seem to be random over time, team save percentage shows distinct evidence of not being random. The influence from SV% was so strong that it made PDO statistically predictable over time — kind of a bad thing when it’s traditionally used to reflect luck.

The first post in this current series tried to envisage a new way to calculate PDO, taking into account the expected performance of goaltenders entering a season and weighting it based on how many starts that goaltender received in the current season. I then applied the methodology to the 2013 season and promised to analyze it with the same vigour I’d used in my initial 3-part series on PDO. And here we are.

Firstly, I’d like to report that my initial methodology for a modified PDO statistic (MPDO) did *not* pass my statistical tests — it definitely showed more randomness than PDO, but any given MPDO could still be used as a statistically significant predictor of a team’s subsequent season. It really made me think more about what this entire process is trying to accomplish. If you recall, my initial methodology for coming up with a goalie’s predicted SV% for any upcoming season was to simply take his save percentage over the last 5 seasons of data (as long as he had > 500 shots against). My rationale was that we were trying to get an idea of a goalie’s “true” talent level using a large data sample, so that any deviation from this monolithic number could be seen as manifested luck. However, when implementing this methodology in practice, I realized that the 5-year average was simply too long — many instances of years long past were polluting my dataset, in that they were no longer useful predictors of expected future performance. A quick example would be including some of Martin Brodeur’s monster post-2005-lockout years to help predict how he would do in his most recent seasons. His performance varied so much over this period, with high-SV% years giving way to low ones, that the 5-year average essentially evened out to league average, leaving his “standard” the same as it would be in traditional PDO.

I then realized that this is essentially a forecasting exercise. We’re trying to come up with a goaltender’s expected performance this season, not his long-term true talent level. Those are related, but slightly different concepts. Knowing Martin Brodeur’s long-run SV% is interesting, but it’s certainly not going to be very useful in trying to predict what his SV% “should” be when he’s 40. I put the word “should” in quotation marks to emphasize this point: **we need to find an expected save percentage where it is EQUALLY LIKELY for his actual SV% to be above or below that number**. Read that sentence again, it’s important. Meaning, we really need to give it the old college try and predict (or forecast) what a goaltender’s performance will be this season.

Now, I’ve worked with SV% numbers and forecasting a lot, so I had some ideas of how this could be accomplished. I needed something both simple to understand and decently accurate. After about 10 different attempts at trial and error, I came up with a new methodology that satisfied these requirements. Instead of using a 5-year average SV%, I only used the last 3 years of data. This concentrates the expected performance on a goalie’s most recent history, but includes enough data to be considered a large sample size (in my eyes, anyways). I then weighted the 3 years in an exponential fashion, with the most recent year having the most weight and 3 years ago having the least (I used the arbitrary weights 2, 4, and 8 for the save percentages recorded three, two, and one year ago). A goalie still needed to have seen 500 shots within the last three years to qualify, and any non-qualifying or new goalie was given the same “average new goalie” SV% I used in the first part of this series (which is 4-5 points below league average SV%).
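In code, the weighting rule above might look something like the following minimal Python sketch. The 2/4/8 weights and the 500-shot threshold come from the post; the function name, the data shape, and the 0.905 replacement-level figure are illustrative assumptions on my part.

```python
# Sketch of the expected-SV% rule. The 0.905 replacement figure is an
# assumed "average new goalie" SV%, a few points below league average.
REPLACEMENT_SV = 0.905

def expected_sv(last3):
    """last3: list of (sv_pct, shots_against) tuples, oldest season first.
    Returns the exponentially weighted expected SV%, or the replacement-level
    figure if the goalie saw fewer than 500 shots over the three seasons."""
    if sum(shots for _, shots in last3) < 500:
        return REPLACEMENT_SV
    weights = [2, 4, 8][-len(last3):]  # three years ago -> 2, last year -> 8
    return sum(w * sv for w, (sv, _) in zip(weights, last3)) / sum(weights)
```

So a goalie who posted .900, .910, and .920 over the last three seasons gets an expectation much closer to .920 than a flat average would give.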

Once I had my expected SV% for each goalie, I then weighted them by the starts each team’s goalies saw to come up with an expected team SV% for that team for that season. If the actual SV% was above this, the team was considered lucky; if it was below, unlucky. For instance, the Oilers this year had an expected team SV% of 0.921 but an actual team SV% of 0.924, meaning they did a bit better than expected, or got “luckier”. I took the difference between the expected and actual SV%, added it to the difference between the team’s expected and actual shooting percentage, and came up with a modified PDO (MPDO) for each team for each of the last 6 seasons.
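The two steps described here, rolling per-goalie expectations up to a team number and then differencing against actuals, can be sketched as follows. The function names and the around-1.000 scaling convention are my assumptions; the start-weighting and the two differentials are as the post describes.

```python
def team_expected_sv(goalies):
    """goalies: list of (expected_sv, starts) pairs for one team-season.
    Start-weighted average of the goalies' expected save percentages."""
    total_starts = sum(starts for _, starts in goalies)
    return sum(esv * starts for esv, starts in goalies) / total_starts

def mpdo(actual_sv, expected_sv, actual_sh, expected_sh):
    """Sum of the SV% and SH% 'luck' differentials, centred on 1.000
    (the centring is an assumed convention, chosen to mirror PDO)."""
    return 1.0 + (actual_sv - expected_sv) + (actual_sh - expected_sh)
```

On the Oilers example, the +0.003 SV% differential flows straight into MPDO; a team matching its expectations in both components lands at exactly 1.000.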

To normalize the MPDO results, I transformed all the numbers for each season into normal cumulative distribution scores between 0 and 1, a technique I used extensively in the initial 3-part series on PDO. A score of 0.50 is league average; anything above that is above league average, anything below is below. It’s just a technique to compare apples to apples over time. If MPDO is random, I would expect the long-term average scores to tend towards 0.50, or league average. Here are the 6 years of data:
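A standard-library sketch of this normalization step. It assumes each season's scores are computed against that season's own mean and standard deviation, which the apples-to-apples framing implies but the post doesn't spell out.

```python
from statistics import NormalDist

def cdf_scores(values):
    """Map one season's MPDO values to normal-CDF scores in (0, 1);
    0.50 corresponds to league average for that season."""
    mu = sum(values) / len(values)
    sd = (sum((v - mu) ** 2 for v in values) / (len(values) - 1)) ** 0.5
    return [NormalDist(mu, sd).cdf(v) for v in values]
```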

The teams in this table are ranked in descending order from the most “lucky” over the last 6 seasons to the least. Already we see a promising band of scores around 0.50. Let’s compare the top and bottom 5 teams to their normal PDO counterparts:

The top team in traditional PDO was Vancouver by a landslide, with an average 6-year score of 0.87. However, in MPDO Vancouver slides down to the 4th luckiest — why is this? Well, Vancouver has employed consistently good goaltenders over this time-frame, meaning that one would “expect” them to do better over time than the league average. In other words, Vancouver would consistently have high PDO, and people would construe this as being “lucky”, when in fact it was only because they had good goaltenders. Their MPDO in 2013 was a touch below league average because their expected SV% was 0.931, while their actual SV% was 0.928.

And just eyeballing this table should tell you something about what’s going on here: instead of the top and bottom PDO teams being the ones who’ve had consistently good or bad goaltending, it seems to be more of an unexpected mix of teams. “Hey, Dallas is the third luckiest team over the last 6 years!” instead of “Hey, Boston’s had two Vezina winners and the best young goalie in the game, neat!”.

Compare this graph of the 6-year average scores using MPDO:

To this original one using PDO:

You’ll immediately notice more teams in the arbitrary yellow band of +/- 10% from the expected long-run tendency towards 0.50. You’ll also notice fewer and less severe outliers. The new MPDO does seem to increase the gravitational pull towards 0.50 (league average).

You may also recall a technique I used in my PDO series where I found the expected probability of any team having, say, 6 above average and 0 below average seasons out of 6 seasons, etc. The expected probabilities were calculated using a probability tree that assumed the chance of being above or below 0.50 should be 50% if the measure truly is random (the measure in this case being MPDO). The following table shows the actual numbers of teams that had the specified number of above/below average seasons and compares it to the expected numbers.
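The probability tree described here reduces to a binomial distribution with p = 0.5, which is easy to sketch; the function name is mine.

```python
from math import comb

def split_probability(n_above, n_seasons=6, p=0.5):
    """Probability of exactly n_above above-average seasons out of
    n_seasons under pure chance (a fair-coin binomial)."""
    return comb(n_seasons, n_above) * p ** n_above * (1 - p) ** (n_seasons - n_above)
```

For example, a 6-above/0-below run has probability 1/64, so across 30 teams you'd expect roughly half a team to manage it by chance alone.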

The effect is better seen graphically. Compare this graph of how MPDO compares to the expected probabilities:

To this graph using the original PDO:

What we’re looking for is evidence of a central tendency — if the measures are random, we want to see them bunch up in the middle where teams have more equal numbers of good and bad seasons instead of a sustained number of good or bad seasons. Now, perhaps the two graphs don’t seem all that different, but check out how the 6 Above/0 Below bar in the PDO graph is completely absent in the MPDO graph, with its weight being added to the more central 4 Above/2 Below category. It’s a small change, but does show a more central tendency. How can we test this?

I chose to use a Chi-Squared test to make my point here. What Chi-Squared answers is, basically: “does a set of actual numbers match a set of expected numbers?” It’s often used in situations like this, where the expected numbers are assumed to reflect random chance, and the observed actual numbers are tested to see if they deviate significantly from the expected ones. If so, whatever you’re analyzing can be found not congruent with simple random chance. This test uses P-values: any P-value below 0.05 suggests whatever you’re analyzing is not congruent with random chance; anything above 0.05 suggests that it is. I ran chi-squared tests not only for MPDO but also for PDO and the original parts of PDO (team save and shooting percentages) to illustrate the point.
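A minimal standard-library sketch of the test statistic. The observed counts below are made up for illustration; they are not the post's actual table.

```python
from math import comb

def chi_squared_stat(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Expected team counts under pure chance: 30 teams spread across the
# binomial probabilities of 0..6 above-average seasons out of 6.
expected = [30 * comb(6, k) * 0.5 ** 6 for k in range(7)]
observed = [1, 3, 6, 10, 6, 3, 1]  # illustrative counts only

stat = chi_squared_stat(observed, expected)
# With 7 categories (6 degrees of freedom) the 5% critical value is about
# 12.59, so a statistic below that is consistent with random chance.
```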

Here again we see evidence that SH% is truly random, coming in well above the 0.05 cutoff. SV% is found not to be congruent with random chance, well below 0.05 at 0.0004. What’s interesting is that PDO is found to be consistent with random chance by this test (P-value: 0.18), but it is obviously dragged downwards by SV%. The new MPDO statistic, by contrast, has a much higher P-value of 0.59. The difference in P-values suggests MPDO displays much more randomized characteristics than PDO.

To finish off, I performed my favourite test of randomness: a regression using one season’s MPDO normal cumulative distribution score as an independent variable to predict the next season’s score as a dependent variable. The rationale here is that if I can use one year’s MPDO to predict the next year’s with statistical significance, the measure cannot be random. If you recall, I performed this test with PDO and came up with a P-value of 0.0014, well below the 0.05 boundary needed for me to accept the hypothesis that the two seasons have a statistical relationship. What I’m hoping for with MPDO is a high P-value, showing a very weak statistical relationship from one year to the next.
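The year-over-year significance test can be sketched with the slope's t-statistic from a simple regression; this is a generic standard-library sketch, not the author's actual code.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def slope_t_stat(prev_scores, next_scores):
    """t-statistic for the slope when regressing next-season scores on
    prev-season scores; |t| above roughly 2 means a significant
    year-to-year relationship, exactly what a luck measure should NOT show."""
    r = pearson_r(prev_scores, next_scores)
    n = len(prev_scores)
    return r * sqrt((n - 2) / (1 - r ** 2))
```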

I was not disappointed. Using the MPDO normdist scores, I calculated a P-value of 0.32, meaning I cannot accept the hypothesis that one year’s MPDO has any bearing on the next year’s — the relationship is indistinguishable from random.

To conclude, I do not want to suggest this new MPDO stat is supreme over all potential others — the point is to show that finding a measure that properly reflects luck in hockey is possible. Showing which teams are riding luck and which ones are due to break out has been one of the most useful developments of the advanced stats community. It’s my belief that refining our methods to more accurately depict this concept is important, and will provide a great many insights for years to come. This new MPDO statistic behaves as a random measure, influenced by a force we can approximate as luck.

For those interested, here’s a table that compares MPDO to PDO for each team for the last 6 years. If you’d like this in Excel format, just email me and I’d be happy to provide a copy of my work.

## 6 Comments

I hate to repeat myself, but you still haven’t really addressed the concerns I expressed in the comments on the last piece.

The 2013 MPDO that you described there had a correlation to PDO of 0.98. The new 2013 MPDO correlation is 0.97 — still high enough to leave me wondering if the added complication is worth the marginal gain.

I can now see from the larger table in this piece that the correlation is a bit lower for past seasons (though still quite high, around 0.85). I presume that’s because the range of PDO gets a lot smaller as more games are played, which makes the minor corrections employed in MPDO more significant. But since almost all of the estimations of team luck and references to PDO are done when looking at the ~15-50 game timespans where luck plays a larger role, I wonder about the value of a correction that is tiny compared to the PDO spread we’re looking at over those timespans.

And I still think that if we do find these corrections to be useful, it’d be more reasonable to use every goalie’s actual save percentage, regressed to the mean based on their shots faced. It just can’t be right to use a threshold that causes a sharp difference in how you handle a guy who’s faced 499 shots and a guy who’s faced 500 shots, but no difference between a guy who’s faced 500 and a guy who’s faced 5000.

Hi Eric,

Thanks for the comment. I would expect PDO and MPDO to have a high correlation — after all, half of it remains the same (SH%) and many teams employ something near league average goaltending, in which case their MPDO will not change all that much (the Leafs this year, for instance). With the constituent parts being very similar, the correlation between the two will be high no matter how it’s modified, that’s just a reality.

However, I think I’ve done a pretty thorough job of statistically proving that traditional PDO is not a random measure. That’s what this entire thing is about — if PDO is not random, then it does not reflect luck. How comfortable are we, as a community, with continually referencing a statistic that purports to show luck, when in actual fact it does not? That’s the kind of stuff that really irks me, and should irk anyone aware of the issue. Formulating MPDO was a way for me to prove to myself that you can create a randomized combination of SH% and SV% to show how a team’s luck is turning out.

PDO and MPDO are similar, yes, but they are worlds apart in terms of their randomness, and are shown to be statistically different in terms of independence. In this light, the correction is not ‘tiny’: it transforms the measure from half-random to random.

And I’m not sure the use of thresholds is ill-conceived. Are you suggesting that SV% in small sample sizes will always be biased in one direction? I think it’s just as likely for a small-sample SV% to be crazy high (Fasth) as crazy low (Leland Irving). If I regress based on shots faced, the coefficient will only be applied in one direction, further biasing either the very high or very low SV% of a low-sample-size goalie. SIDE NOTE — I’d already thought about doing an analysis on this very concept: does SV% change considerably with sample size? Without knowing the answer to that, I think having a 500-shot threshold is reasonable. I’m actually skeptical that a guy’s SV% is appreciably different after facing 5000 shots rather than 500. I’m moving more and more to Cosh’s position of “goaltenders never get better” as the butterfly era continues.

Would really appreciate your thoughts on that…

No, I’m not suggesting that sv% in small sizes is always biased in one direction (it might be, since we’re only looking at goalies who get to play again next year, but that’s not part of the concern I was raising). And I don’t understand what you mean by your last sentence.

You’re trying to come up with an estimate for what we should expect a given team’s save percentage to be. I’ll use some examples to make my point more concrete, since something seems to be getting lost in translation.

Imagine that each team has two goalies that get exactly half the playing time. Goalie 1 comes into the season having posted a .930 career save percentage on 5000 shots. Goalie 2 is as follows:

Team A: .930 on 5000 shots

Team B: .930 on 500 shots

Team C: .930 on 499 shots

Team D: .900 on 5000 shots

Team E: .900 on 500 shots

Team F: .900 on 499 shots

By your method, their expected team save percentages are:

A: .930

B: .930

C: (.930 + .918) / 2 = .924

D: (.930 + .900) / 2 = .915

E: (.930 + .900) / 2 = .915

F: (.930 + .918) / 2 = .924

I’m saying it’s silly that A and B (or D and E) get the same correction (the backups on teams B and E are a lot more likely to turn out to be average goalies than A and D are), and it’s silly that B and C (or E and F) get very different corrections (their backups are almost identical and are treated very differently), and it’s silly that C and F get the same correction (their backups appear to be different).

Instead, I encourage you to regress based on the number of shots faced. Do something along the lines of this article to calculate the luck component over a given number of shots. Regress the 5000-shot guy 40% of the way back to the mean, the 500-shot guy 84.3% of the way back to the mean, and the 499-shot guy 84.31% of the way back to the mean (or whatever the right figures might be). Then you end up putting more weight on the more established save percentages — as you should — rather than having an arbitrary binary threshold for believable or unbelievable.
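The smooth regression-to-the-mean idea proposed here might be sketched like this; the league mean and the half-trust shot count are illustrative assumptions, not the "right figures" the comment deliberately leaves open.

```python
LEAGUE_MEAN = 0.912    # assumed league-average SV%
SHOTS_TO_HALF = 1000   # assumed sample size where observation and prior weigh equally

def regressed_sv(observed_sv, shots):
    """Pull the observed SV% toward the league mean; trust in the observed
    number grows smoothly with shots faced, so there is no 499/500 cliff."""
    w = shots / (shots + SHOTS_TO_HALF)
    return w * observed_sv + (1 - w) * LEAGUE_MEAN
```

Under these assumed constants, a .930 goalie on 5000 shots regresses only to .927, while the same .930 on 500 shots regresses to .918, and the 499-shot and 500-shot goalies differ by a rounding error rather than a cliff.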

As for the first part of your comment, there’s “high correlation” and then there’s “so high it’s pointless to bother switching from one to the other”. The ~0.85 correlation that’s observed after 82 games falls in the former category; the ~0.97 correlation that’s observed after 48 games falls in the latter. I’d bet that the tests that distinguished the two metrics in your tests of 82-game samples don’t show nearly as much of a difference between them in 48-game samples. This seems almost necessarily true, given how similar the metrics were after 48 games.

What I think is happening is that you are making small corrections. Mid-season, when PDO is all over the place, those small corrections are all but irrelevant. By the end of the season, when things have evened out and the luck component of PDO gets a lot smaller, subtracting out the talent component starts to be meaningful. If that’s true, MPDO might be useful for end-of-season reviews where we ask whether a team got lucky this year, but for midseason predictions of which teams’ records are misleading (which is where PDO usage comes into play most often), MPDO wouldn’t be appreciably better than PDO.

Ok, I get what you’re saying now in terms of regression; I misinterpreted it before. Some things to point out here:

1. My approach envisions coming up with the expected SV% for each goalie *before* the season starts (using the previous 3 years of data available). Then, this expected SV% is weighted by the number of games started *in the current season* to come up with the team’s expected save percentage up to that point.

2. Obviously this approach makes incorporating any data about brand new goalies (eg Fasth) impossible, as we don’t even have one shot to base our conclusions on. We could talk about incorporating current season SV% data, which is a different approach, but that’s obviously a bit more complex as opposed to just setting the expected SV% at the season’s outset and tracking the numbers against that.

3. Out of 1440 total goalie starts this season, 197 were by goalies who had not faced 500 shots in the last 3 years, or 13.7%. This percentage is middle of the road among seasons, as it ranges from just 7.5% to 18.7% during the 6 years of this data. So, we’re talking about a fairly small amount of starts by “new” goalies each year.

4. Out of 21 “new” goalies this year, 10 of them had never started a game before in the NHL. 6 had started fewer than 10 games before this year, and 5 had started between 10 and 18 games. This is a scant amount of data.

5. Of the 21 new goalies, 12 of them started 5 or fewer games this year. 5 of them started more than 5 but fewer than 20 games. 4 of them started more than 20 games (Bishop, Holtby, Markstrom, Fasth). Only Holtby started the majority of his team’s games.

I guess where I’m going with this is that a) we have little or no data on these guys; b) even if we were to incorporate this season’s data as it proceeded, we’d still have a truly small amount of data on many of them; c) from a historical point of view, we know that new goalies perform notably worse than league average; and d) regressing their SV% 80% (or however much) of the way to league average based on a few games of results means they’d approach league average in their expected performance, which I don’t agree with.

I do realize having a threshold introduces discontinuity, but we’re talking about a demographic that represents a small minority of any given year’s starts, and in those cases usually have a tiny amount of data available. The reality is that most “new goalies” come and go without much fanfare, and if they’re anything approximating a starter we’ll gain enough info to incorporate them for next season.

As I said, I’m not married to this methodology; all I’m trying to do is provide a good estimate of what we “expect” a goalie’s SV% to be. And I think you might be missing this point. I realized through doing this that I couldn’t care less what a goalie’s true “talent” level is — all we care about in this kind of exercise is how that goalie will perform this year. Stripping out all the real-world reasons why a goalie has performed as he has over the last 3 years does a disservice to the analysis. If we can come up with a better forecasting methodology that does away with thresholds that decrease the predictive value of MPDO from year to year, then great! I’m all for it, because that’s my only objective here. The methodology suggested here is but one way you could skin this cat, and IMO a decent cut at it.

As to the correlation factor, we can agree to disagree about the importance of the difference. I started this series because I simply do not feel comfortable referring to a luck statistic that has a long history of not being truly random. For each year, we can take a team’s PDO from last season and use it as a statistically significant predictor of this season’s. IMO, that’s bunk, and calls out for a proper evaluation. Any MPDO will be close to PDO because a lot of teams are at or near league average, but many are not, and those are the teams we’re trying to correct for here. A team like Vancouver can see its score drop by 10 points, as the goaltending being good is not surprising. A team like Chicago’s score can rise by 7 points because there’s nothing in Crawford’s or Emery’s past that suggested they’d perform as well as they did this year.

Your first several paragraphs seem to be focused on the question about how much it matters that you ignore the past history of guys with fewer than 500 NHL shots and the sketchiness of having a step function at 499/500. But you haven’t addressed the larger concern, that you have a completely flat function from 500 shots up to 50000.

Braden Holtby entered the season with a .929 overall save percentage on 524 career shots. (I don’t have the ES numbers handy; he might have been just below 500 5v5 shots, but the point will stand even if this exact example doesn’t work.) Your approach assumes he will be a .929 goaltender going forwards, which seems unreasonable given how few goaltenders actually sustain performance at that level and how small a sample of his talent we have. It makes a lot more sense to me to regress his estimated future save percentage based on the known distribution of talent and the known amount of luck contained in a 524-shot career, rather than spend all season saying the Caps got really unlucky because Holtby didn’t post a .929 save percentage. I don’t really see any argument against doing this, to be honest — once you’ve introduced the complexity of a correction that’s tied to past performance and usage, why not make it a smooth function?

I agree with you that a team like Vancouver might be expected to have a PDO of over 1000 (though not 1010 — again, this is a problem with assuming Schneider is a .928 goalie after 1880 shots; you’re overvaluing his small sample and giving him too extreme a forecast as a result). But the point I’m trying to make is that 20 games into a season, when we look at PDO and write articles about who’s running hot and likely to drop off, those articles are based on teams having PDOs of 1040, not 1015. If Vancouver starts off at 1040, I’m going to predict regression without even looking at whether their true talent is 995 or 1007.

A correction in the last digit won’t have an appreciable impact on those articles. By the end of an 82-game season, PDO has converged into a tight enough range that corrections of a few thousandths become significant — but by that point, people aren’t really looking at PDO to make forecasts any more, both because the tighter grouping of PDO makes it less relevant and because there’s no season left to forecast.

I absolutely agree with your principle that not all teams have the same PDO in the long run, and that this is worth keeping in mind. But through the first half of the season — during the period when PDO is most used — the spread in PDO is much larger than the differences between teams in PDO talent, which is precisely why it gets used as a measure of luck.

I’d encourage you to look at how much randomness there is in PDO and MPDO over a 20-game stretch, which is the timeframe where we use PDO the most. I think you’ll find that they’re awfully similar, and almost entirely luck.

There’s a strange bipolarity to my stance on this, which I recognize. I think that the corrections in MPDO probably do improve things a little, but the difference is negligible over the sample sizes where we use PDO the most and so the added complexity is not worth it in most scenarios. And yet I’m calling for even more complexity, because if we’re going to add in factors of historical performance and sample size to get something that’s more theoretically accurate, we might as well get the theory right.

I like this idea. I suggested this exact methodology (minus the 2, 4, 8 weighting) the very first time I heard about PDO, so it definitely passes my common-sense test. I don’t have the statistical background to comment on what impact Eric’s proposed change would have, although he seems to make a strong argument.

However, I strongly disagree with Eric’s stance that the changes are negligible. During the season, sure, use either metric, as it doesn’t make much difference in small samples; but just as much dissection goes on in the offseason about what to expect the following year, so having a more accurate luck metric is very valuable imo.

If I can spitball another possible improvement: team shooting percentage has been shown to be extremely random; however, individual on-ice shooting percentages have been shown to be somewhat skill-driven, at least for top-6 forwards. Defencemen have been shown to have no impact and “bottom-6 forwards” very little. Or so I have gathered from blog posts around the web. With this in mind, might an expected SH% also be possible?

This would almost certainly be an even more marginal difference than the save percentage one, but if we’re going to go for it, why not take the full plunge? The downside is that the first step would be somewhat subjective: you would have to predict who the top-6 forwards will be for a team in the upcoming year. Although I suppose you could just include the forwards with the top 6 5v5 ice times from the year in question (or the previous year for predictions). Then use the same kind of 3-year rolling average for those players, but when calculating the team average, include the other 12 players at expected league averages. Perhaps you would have to weight this by ice time.
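A rough sketch of this expected-SH% idea. The league-average figure, the equal weighting across 18 skaters, and the reuse of the post's 2/4/8 season weights are assumptions layered on top of the comment's outline (ice-time weighting, as suggested, would refine this).

```python
LEAGUE_AVG_SH = 0.080  # assumed league-average on-ice SH%

def expected_team_sh(top6_sh, weights=(2, 4, 8), n_other=12):
    """top6_sh: for each top-6 forward, three seasons of on-ice SH%,
    oldest first. Everyone else is set to league average."""
    per_player = [sum(w * s for w, s in zip(weights, seasons)) / sum(weights)
                  for seasons in top6_sh]
    return (sum(per_player) + n_other * LEAGUE_AVG_SH) / (len(per_player) + n_other)
```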

Very possibly this isn’t the correct methodology, but I believe there is value in accomplishing this somehow. Pittsburgh, for example, has been able to sustain high shooting percentages, likely because Crosby and Malkin are able to drive high on-ice shooting percentages on their top lines. For them to do this next year wouldn’t be lucky, but expected.

Any comments for possible ways for this to be calculated or why the whole idea is off base would be appreciated.
