Classification metrics in `yardstick`

where both the `truth`

and `estimate`

columns are factors are implemented for the binary and the multiclass case. The multiclass implementations use `micro`

, `macro`

, and `macro_weighted`

averaging where applicable, and some metrics have their own specialized multiclass implementations.

Macro averaging reduces your multiclass predictions down to multiple sets of binary predictions, calculates the corresponding metric for each of the binary cases, and then averages the results together. As an example, consider `precision`

for the binary case.

\[ Pr = \frac{TP}{TP + FP} \]

In the multiclass case, if there were levels `A`

, `B`

, `C`

and `D`

, macro averaging reduces the problem to multiple one-vs-all comparisons. The `truth`

and `estimate`

columns are recoded such that the only two levels are `A`

and `other`

, and then precision is calculated based on those recoded columns, with `A`

being the “relevant” column. This process is repeated for the other 3 levels to get a total of 4 precision values. The results are then averaged together.

The formula representation looks like this. For `k`

classes:

\[ Pr_{macro} = \frac{Pr_1 + Pr_2 + \ldots + Pr_k}{k} = Pr_1 \frac{1}{k} + Pr_2 \frac{1}{k} + \ldots + Pr_k \frac{1}{k} \]

where \(PR_1\) is the precision calculated from recoding the multiclass predictions down to just `class 1`

and `other`

.

Note that in macro averaging, all classes get equal weight when contributing their portion of the precision value to the total (here `1/4`

). This might not be a realistic calculation when you have a large amount of class imbalance. In that case, a *weighted macro average* might make more sense, where the weights are calculated by the frequency of that class in the `truth`

column.

\[ Pr_{weighted-macro} = Pr_1 \frac{\#Obs_1}{N} + Pr_2 \frac{\#Obs_2}{N} + \ldots + Pr_k \frac{\#Obs_k}{N} \]

Micro averaging treats the entire set of data as an aggregate result, and calculates 1 metric rather than `k`

metrics that get averaged together.

For precision, this works by calculating all of the true positive results for each class and using that as the numerator, and then calculating all of the true positive and false positive results for each class, and using that as the denominator.

\[
Pr_{micro} = \frac{TP_1 + TP_2 + \ldots + TP_k}{(TP_1 + TP_2 + \ldots + TP_k) + (FP_1 + FP_2 + \ldots + FP_k)}
\] In this case, rather than each *class* having equal weight, each *observation* gets equal weight. This gives the classes with the most observations more power.

Some metrics have known analytical multiclass extensions, and do not need to use averaging to get an estimate of multiclass performance.

Accuracy and Kappa use the same definitions as their binary counterpart, with accuracy counting up the number of correctly predicted true values out of the total number of true values, and kappa being a linear combination of two accuracy values.

Matthews correlation coefficient (MCC) has a known multiclass generalization as well, sometimes called the \(R_K\) statistic. Refer to this page for more details.

ROC AUC is an interesting metric in that it intuitively makes sense to perform macro averaging, which computes a multiclass AUC as the average of the area under multiple binary ROC curves. However, this loses an important property of the ROC AUC statistic in that its binary case is insensitive to class distribution. To combat this, a multiclass metric was created that retains insensitivity to class distribution, but does not have an easy visual interpretation like macro averaging. This is implemented as the `"hand_till"`

method, and is the default for this metric.