In litigation, proving damages often requires bringing in the statisticians. Both sides rely on statistical experts, who then set forth competing statistical strategies. Working from different assumptions and interpretations, different statisticians may reach different conclusions about how best to analyze a dataset. A recent case* about drywall from the Eastern District of Pennsylvania shows a common statistical fight: how to treat data that have a time component, and specifically, whether to “first difference” the data or not.

In the case, plaintiffs had sued alleging certain antitrust violations by the defendants. To prove damages, plaintiffs’ statistician performed a regression analysis. Regression analysis is a broad term for taking a set of explanatory variables and finding their relationship with one or more dependent variables. In simple linear regression, this means finding a line that best fits a cloud of points in two dimensions. With more than two variables, the same logic means finding hyperplanes of best fit in high-dimensional point clouds, something that statisticians find as thrilling as it sounds.
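
To make the two-dimensional case concrete, here is a minimal sketch in Python of fitting a least-squares line through a handful of points. The function name fit_line and the five data points are made up for illustration; nothing here comes from the case or from the expert's actual model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented points that lie roughly along a line
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

With several explanatory variables the idea is the same, but the arithmetic is usually handed to a statistics library rather than written out by hand.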

At any rate, in the present case the regression analysis showed damages of around $250 million. The other side disagreed, arguing that the data should have been “first differenced” before performing the regression. This means that instead of using Y(t), the value of Y at time t, as the dependent variable, the difference between Y(t) and Y(t-1) should have been used. In everyday terms, you can think of first differencing like this:

Imagine rating your happiness every day. For many people happiness might go up or down depending on the activities of the day, whether it’s a weekend or not, and so on. We might be able to predict your level of happiness over time based on a variety of factors. Now, instead of looking at the raw data, we might be interested in the change in happiness between days. Instead of happiness at level 5 on Friday and level 8 on Saturday, we look at 8-5 = 3, the difference between Friday and Saturday. I’m writing this blog post in the middle of a COVID-19 shelter-in-place order, which means every day is pretty much the same, and my first-differenced happiness levels over time are a long series of zeros. We are reasonably happy at home, perhaps a 7.5 out of 10, yet the difference between each day’s 7.5 and the next is zero.
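
The arithmetic in the happiness example can be sketched in a few lines of Python. The helper name first_difference and the ratings below are illustrative assumptions that mirror the numbers in the paragraph above.

```python
def first_difference(series):
    """Return the change between each value and the one before it."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


# Friday at level 5, Saturday at level 8
weekend = [5, 8]
print(first_difference(weekend))  # [3]

# Five shelter-in-place days, all rated 7.5
shelter_in_place = [7.5] * 5
print(first_difference(shelter_in_place))  # [0.0, 0.0, 0.0, 0.0]
```

Note that differencing a series of n values leaves n-1 values, since the first day has no earlier day to compare against.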

First differencing has several statistical advantages and disadvantages. Statisticians like to work with time-series data that are “stationary”, that is, whose basic statistical properties, such as the mean and variance, don’t change over time. If the data are stationary, they are more predictable and useful. If the data aren’t stationary, then we often want to transform them to make them closer to stationary. One price of transforming the data is that the story the data have to tell can start to get lost in translation. For instance, in our happiness example, when I first difference I lose the level of happiness in the model as I condense Friday and Saturday into just the difference between those two days. In a sense, we start throwing out data in the name of assumptions. All those pretty happy shelter-in-place days disappear and are replaced with zeros, not because we were unhappy, but because our happiness level didn’t change over time.
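
Both sides of that trade-off can be seen in a toy simulation. The sketch below, in Python, builds a random walk (a textbook non-stationary series) out of coin-flip steps, then first differences it. The seed, the number of steps, and the shift of 100 are arbitrary choices for illustration.

```python
import random


def first_difference(series):
    """Return the change between each value and the one before it."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


rng = random.Random(0)  # fixed seed so the run is repeatable
steps = [rng.choice([-1, 1]) for _ in range(200)]

# A random walk: each level is the previous level plus a coin-flip step
walk = [0]
for step in steps:
    walk.append(walk[-1] + step)

# The same walk, but starting from a level of 100 instead of 0
shifted = [level + 100 for level in walk]

diffs = first_difference(walk)

# Advantage: differencing recovers the well-behaved coin-flip steps
print(diffs == steps)  # True

# Cost: the level is gone, so both walks look identical after differencing
print(first_difference(shifted) == diffs)  # True
```

The wandering levels are replaced by steps that are always plus or minus one, which is the kind of stability statisticians want, but nothing in the differenced series says whether we started at 0 or at 100.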

Whether to first difference the data, apply more complicated transformations such as removing seasonal trends, or deal with time in a different manner, such as adjusting standard errors, depends on the situation at hand. In this case, the central question was whether the data were “cointegrated”, that is, whether the variables, even if non-stationary, moved together closely enough to be analyzed without transformation. The plaintiffs noted the example of “a drunkard and her leashed dog walking on the street. The drunkard’s path may resemble a random walk ... and so does the path of the dog. But, because they are walking forward together and in more or less the same direction, they will not ‘deviate’ too far from one another ....”
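
The drunkard-and-dog picture is easy to simulate. The Python sketch below is only a toy illustration of the intuition, not a formal cointegration test and not anything the experts in the case ran. The leash length, the seed, and the number of steps are all assumptions chosen for the example.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
LEASH = 2.0             # hypothetical leash length

# The drunkard follows a random walk: each step adds random noise
drunkard = [0.0]
for _ in range(500):
    drunkard.append(drunkard[-1] + rng.gauss(0, 1))

# The dog is always somewhere within the leash length of the drunkard
dog = [pos + rng.uniform(-LEASH, LEASH) for pos in drunkard]

# How far each path wanders on its own
drunkard_range = max(drunkard) - min(drunkard)
dog_range = max(dog) - min(dog)

# The gap between them never exceeds the leash
# (up to floating-point rounding)
gap = [d - p for d, p in zip(dog, drunkard)]
max_gap = max(abs(g) for g in gap)

print(f"drunkard wandered over a range of {drunkard_range:.1f}")
print(f"dog wandered over a range of {dog_range:.1f}")
print(f"largest gap between them: {max_gap:.2f}")
```

Each path on its own is non-stationary, yet the gap between them stays bounded. In practice an analyst would not eyeball this but would run a formal test, such as the Engle-Granger procedure, on the real data.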

Ultimately, the court decided that whether to first difference the data was a question for the jury. After hearing long statistical arguments about stationarity and cointegration, the jurors might well envy the drunkard in the plaintiffs’ hypothetical. Indeed, if we “first differenced” their happiness on the day they hear those arguments against the day before, we would expect a number that’s pretty negative!

*In Re: Domestic Drywall Antitrust Litigation, 2020 WL 1695434 (E.D. Penn. Apr. 7, 2020)

Image by Alexander Baranov from Montpellier, France - Flickr, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=45065904