SummaryStatistics vs. DescriptiveStatistics

Apache Commons Math provides the classes DescriptiveStatistics and SummaryStatistics, both of which implement the interface StatisticalSummary (there are a few more classes in the same hierarchy, but they are not the focus of this post). Both compute basic univariate statistics, but there are a few reasons that help us decide which one is more suitable for a specific task.

Let's start with what they have in common. Neither class is synchronized; if you really need a thread-safe implementation, look at SynchronizedDescriptiveStatistics and SynchronizedSummaryStatistics. Both classes implement all the basic functionality defined in StatisticalSummary.

It is a handful of methods that should look straightforward to any reader with some basic knowledge of statistics:
double getMean(); // arithmetic mean
double getVariance();
double getStandardDeviation();
double getMax();
double getMin();
long getN(); // number of available values
double getSum(); // sum of the values
All these methods return NaN if no values have been added to the object, with the obvious exception of getN(), which returns 0.

The substantial difference is that SummaryStatistics does not store the data values in memory, which makes it leaner than DescriptiveStatistics. On the other hand, DescriptiveStatistics offers more functionality. So, if everything you need is in StatisticalSummary, you can process a huge collection of data with SummaryStatistics and avoid paying a large price in memory usage.
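To see why statistics such as the mean and the variance do not require the values to be kept in memory, here is a minimal plain-Java sketch based on Welford's online algorithm. The class RunningStats and its fields are hypothetical names made up for this illustration; this is not the Commons Math implementation, only one well-known way of getting the same kind of result in constant memory.

```java
// Illustrative sketch only: a hypothetical class, NOT part of Commons Math.
// It keeps a constant amount of state no matter how many values are added.
final class RunningStats {
    private long n;       // number of values seen so far
    private double mean;  // running arithmetic mean
    private double m2;    // running sum of squared deviations from the mean

    // Welford's update: adjusts mean and m2 using only the new value.
    void addValue(double value) {
        n++;
        double delta = value - mean;
        mean += delta / n;
        m2 += delta * (value - mean);
    }

    long getN() {
        return n;
    }

    // NaN when empty, to match the behavior described in the text.
    double getMean() {
        return n == 0 ? Double.NaN : mean;
    }

    // Sample variance (n - 1 in the denominator); an assumption of this sketch.
    double getVariance() {
        if (n == 0) {
            return Double.NaN;
        }
        return n == 1 ? 0.0 : m2 / (n - 1);
    }

    double getStandardDeviation() {
        return Math.sqrt(getVariance());
    }
}
```

Each call to addValue() updates three fields and then forgets the value, so memory usage stays flat even for millions of entries. The price is that anything needing the individual values, such as a percentile, cannot be computed this way.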

There are also a few methods that are defined for both SummaryStatistics and DescriptiveStatistics, even though they are not part of the shared interface StatisticalSummary.

To load the data we use public void addValue(double value), which could be called like this, where stats is an instance of either class and generator is a previously initialized Random object:
for(int i = 0; i < 1000; ++i) {
    stats.addValue(generator.nextDouble());
}
From objects of both classes we can get the sum of the squares, getSumsq(), and the geometric mean, getGeometricMean(). Sometimes it is useful to reset the values we are working on, and this is done by calling clear().

getSumOfLogs() and getSecondMoment() are defined only for SummaryStatistics.
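The sum of the logs and the geometric mean are closely related: for positive values, the geometric mean equals exp(sumOfLogs / n). The plain-Java sketch below shows that relation, together with the sum of the squares. The class LogSumSketch and its static methods are hypothetical helpers written for this post, not library calls.

```java
// Illustrative sketch only: hypothetical helpers, NOT part of Commons Math.
final class LogSumSketch {

    // Sum of the natural logarithms of the values.
    static double sumOfLogs(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += Math.log(v);
        }
        return sum;
    }

    // Sum of the squares of the values.
    static double sumOfSquares(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v * v;
        }
        return sum;
    }

    // Geometric mean computed as exp(sumOfLogs / n); NaN for an empty array
    // (the empty-case result is an assumption of this sketch).
    static double geometricMean(double[] values) {
        if (values.length == 0) {
            return Double.NaN;
        }
        return Math.exp(sumOfLogs(values) / values.length);
    }
}
```

Working with logs rather than multiplying the values directly avoids overflow when many values are combined.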

The following are available only for DescriptiveStatistics:

void removeMostRecentValue(): discards just the last value inserted in the underlying dataset or throws an exception.
double replaceMostRecentValue(double v): replaces the last inserted value or throws an exception.
double getSkewness(): the skewness is a measure of the asymmetry of the current distribution.
double getKurtosis(): the kurtosis is a measure of how peaked, or heavy-tailed, the current distribution is.
double[] getValues(): creates a copy of the current data set.
double[] getSortedValues(): creates a sorted copy of the current data set.
double getElement(int index): gets a specific element or throws an exception.
double getPercentile(double p): an estimation of the requested percentile, or throws an exception.
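Methods such as getSortedValues() and getPercentile() need the whole data set, which is why they make sense only for a class that stores its values. As an illustration, here is a plain-Java sketch of one common percentile estimator: it takes position p * (n + 1) / 100 in the sorted data and interpolates linearly between neighbors. The class PercentileSketch is hypothetical, and the estimator actually used by the library may differ depending on version and configuration.

```java
import java.util.Arrays;

// Illustrative sketch only: a hypothetical helper, NOT the Commons Math code.
final class PercentileSketch {

    // Estimates the p-th percentile (0 < p <= 100) of the given values.
    static double percentile(double[] values, double p) {
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        int n = values.length;
        if (n == 0) {
            return Double.NaN;
        }
        // Work on a sorted copy, so the caller's array is left untouched.
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        if (n == 1) {
            return sorted[0];
        }
        double pos = p * (n + 1) / 100.0; // 1-based position in the sorted data
        if (pos < 1) {
            return sorted[0];
        }
        if (pos >= n) {
            return sorted[n - 1];
        }
        int lowerIndex = (int) Math.floor(pos);
        double fraction = pos - lowerIndex;
        double lower = sorted[lowerIndex - 1];
        double upper = sorted[lowerIndex];
        // Linear interpolation between the two neighboring values.
        return lower + fraction * (upper - lower);
    }
}
```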

Window size

When we have no idea how many values may be entered, it can be dangerous to use DescriptiveStatistics in its default mode, which lets the underlying data collection grow without limit. It is better to define the size of the "window" we want to work with using setWindowSize(int windowSize). When the limit is reached, the oldest value is discarded to make room for the new entry. To find out the current size, call getWindowSize(), which returns it as an int. The "no window" value is represented by DescriptiveStatistics.INFINITE_WINDOW, defined as -1.
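The window behavior can be pictured as a bounded queue. The following plain-Java sketch mimics the idea with an ArrayDeque. The class WindowedValues is hypothetical and only borrows the method names from the text; how the library handles the window internally is not shown here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only: a hypothetical class, NOT the Commons Math code.
final class WindowedValues {
    static final int INFINITE_WINDOW = -1; // "no window", as in the text

    private final Deque<Double> values = new ArrayDeque<>();
    private int windowSize = INFINITE_WINDOW;

    void setWindowSize(int windowSize) {
        if (windowSize < 1 && windowSize != INFINITE_WINDOW) {
            throw new IllegalArgumentException("window size must be positive");
        }
        this.windowSize = windowSize;
        // If the window shrinks, drop the oldest values that no longer fit
        // (an assumption of this sketch).
        while (windowSize != INFINITE_WINDOW && values.size() > windowSize) {
            values.removeFirst();
        }
    }

    int getWindowSize() {
        return windowSize;
    }

    void addValue(double value) {
        // When the window is full, the oldest value makes room for the new one.
        if (windowSize != INFINITE_WINDOW && values.size() == windowSize) {
            values.removeFirst();
        }
        values.addLast(value);
    }

    long getN() {
        return values.size();
    }

    // Mean of the values currently inside the window; NaN when empty.
    double getMean() {
        if (values.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }
}
```

With a window of 3, adding the values 1 to 5 leaves only 3, 4 and 5 in the window, so the statistics describe the most recent entries rather than the whole history.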
