Compute the normalized Gini coefficient, which measures the ranking ability of a regression model based on the Lorenz curve. This metric is useful for evaluating models that predict risk or loss costs, such as insurance pricing models.
Usage
gini_coef(data, ...)
# S3 method for class 'data.frame'
gini_coef(data, truth, estimate, na_rm = TRUE, case_weights = NULL, ...)
gini_coef_vec(truth, estimate, na_rm = TRUE, case_weights = NULL, ...)
Arguments
- data
A data.frame containing the columns specified by the truth and estimate arguments.
- ...
Not currently used.
- truth
The column identifier for the true results (which must be numeric). This should be an unquoted column name, although this argument is passed by expression and supports quasiquotation (you can unquote column names). For _vec() functions, a numeric vector.
- estimate
The column identifier for the predicted results (which must also be numeric). As with truth, this can be specified in different ways, but the primary method is to use an unquoted variable name. For _vec() functions, a numeric vector.
- na_rm
A logical value indicating whether NA values should be stripped before the computation proceeds.
- case_weights
The optional column identifier for case weights. This should be an unquoted column name that evaluates to a numeric column in data. For _vec() functions, a numeric vector, hardhat::importance_weights(), or hardhat::frequency_weights().
Value
A tibble with columns .metric, .estimator,
and .estimate and 1 row of values.
For grouped data frames, the number of rows returned will be the same as the number of groups.
For gini_coef_vec(), a single numeric value (or NA).
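A minimal sketch of the vector interface, assuming a yardstick version that provides gini_coef_vec() is attached and that its bundled solubility_test data is available, as in the Examples section. The argument names follow the Usage signature above.

```r
library(yardstick)

# Pass numeric vectors directly instead of column names;
# the result is a single numeric value rather than a tibble
gini_coef_vec(
  truth = solubility_test$solubility,
  estimate = solubility_test$prediction
)
```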
Details
The normalized Gini coefficient is a metric that should be maximized. The output ranges from 0 to 1, with 1 indicating that the predicted values rank the true values perfectly.
The Gini coefficient is calculated from the Lorenz curve, which plots the cumulative proportion of the total truth values against the cumulative proportion of observations when sorted by predicted values. The raw Gini is the area between the Lorenz curve and the diagonal line of equality. The normalized Gini divides this by the maximum possible Gini (achieved when observations are sorted by the true values).
The formula is:
$$\text{Normalized Gini} = \frac{G(\text{estimate})}{G(\text{truth})}$$
where \(G(x)\) is the Gini coefficient when sorting by \(x\).
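The ratio above can be illustrated with a small base R sketch. It is not yardstick's implementation: the helpers raw_gini() and normalized_gini() are hypothetical names, observations are assumed to be sorted from highest to lowest score, the area is approximated by a simple discrete sum, and ties and case weights are ignored.

```r
# Hypothetical helper: raw Gini of `truth` when rows are ordered by `score`.
# Assumes no missing values and a nonzero sum of `truth`.
raw_gini <- function(truth, score) {
  n <- length(truth)
  # Sort observations from highest to lowest score (an assumption of this sketch)
  ord <- order(score, decreasing = TRUE)
  # Lorenz curve: cumulative share of the total truth
  lorenz <- cumsum(truth[ord]) / sum(truth)
  # Diagonal of equality: cumulative share of observations
  diagonal <- seq_len(n) / n
  # Discrete approximation of the area between the curve and the diagonal
  sum(lorenz - diagonal) / n
}

# Hypothetical helper: G(estimate) / G(truth)
normalized_gini <- function(truth, estimate) {
  raw_gini(truth, estimate) / raw_gini(truth, truth)
}

truth <- c(1, 2, 3, 4, 5)

# Sorting by the true values themselves gives a ratio of 1
normalized_gini(truth, truth)

# Two neighbouring observations are ranked in the wrong order: close to 0.9
normalized_gini(truth, c(1, 3, 2, 4, 5))
```

Dividing by raw_gini(truth, truth) rescales the result so that a model reproducing the true ordering scores 1 regardless of how concentrated the true values are.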
Note that gini_coef() is a regression metric based on ranking, distinct
from gain_capture() which is a classification metric.
Unlike many other metrics, gini_coef() is not symmetric with respect to
truth and estimate. The estimate values determine the sorting order,
while the truth values are accumulated along the Lorenz curve. Swapping
them will produce different results.
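The asymmetry can be checked with a compact base R sketch. The helpers lorenz_area() and norm_gini() are hypothetical and only mirror the definition given in this section (descending sort assumed, no ties, no case weights); they are not yardstick's code.

```r
# Hypothetical helper: mean gap between the Lorenz curve of `truth`,
# ordered by `score`, and the diagonal of equality
lorenz_area <- function(truth, score) {
  ord <- order(score, decreasing = TRUE)
  mean(cumsum(truth[ord]) / sum(truth) - seq_along(truth) / length(truth))
}

# Hypothetical helper: normalized ratio G(estimate) / G(truth)
norm_gini <- function(truth, estimate) {
  lorenz_area(truth, estimate) / lorenz_area(truth, truth)
}

truth    <- c(1, 2, 3, 10)
estimate <- c(2, 1, 3, 4)

# truth is accumulated in the order given by estimate
norm_gini(truth, estimate)

# Roles swapped: estimate is accumulated in the order given by truth,
# which yields a different value
norm_gini(estimate, truth)
```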
When the true values are constant (zero variance), the Gini coefficient is
undefined and NA is returned with a warning.
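A short derivation shows why, using the Lorenz-curve definition above. The symbols \(c\), \(n\), and \(L\) are introduced here for illustration only. If every true value equals the same nonzero constant \(c\), the cumulative share of the truth after the first \(i\) of \(n\) sorted observations is
$$L\left(\frac{i}{n}\right) = \frac{i \cdot c}{n \cdot c} = \frac{i}{n}$$
which coincides with the diagonal of equality for any ordering. Both \(G(\text{estimate})\) and \(G(\text{truth})\) are then 0, so the ratio \(0/0\) has no defined value.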
See also
Other numeric metrics:
ccc(),
huber_loss(),
huber_loss_pseudo(),
iic(),
mae(),
mape(),
mase(),
mpe(),
msd(),
mse(),
poisson_log_loss(),
rmse(),
rmse_relative(),
rpd(),
rpiq(),
rsq(),
rsq_trad(),
smape()
Examples
library(yardstick)
# Supply truth and predictions as bare column names
gini_coef(solubility_test, solubility, prediction)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 gini_coef standard 0.935
library(dplyr)
set.seed(1234)
size <- 100
times <- 10
# create 10 resamples
solubility_resampled <- bind_rows(
  replicate(
    n = times,
    expr = sample_n(solubility_test, size, replace = TRUE),
    simplify = FALSE
  ),
  .id = "resample"
)
# Compute the metric by group
metric_results <- solubility_resampled |>
  group_by(resample) |>
  gini_coef(solubility, prediction)
metric_results
#> # A tibble: 10 × 4
#> resample .metric .estimator .estimate
#> <chr> <chr> <chr> <dbl>
#> 1 1 gini_coef standard 0.929
#> 2 10 gini_coef standard 0.946
#> 3 2 gini_coef standard 0.940
#> 4 3 gini_coef standard 0.945
#> 5 4 gini_coef standard 0.946
#> 6 5 gini_coef standard 0.923
#> 7 6 gini_coef standard 0.931
#> 8 7 gini_coef standard 0.921
#> 9 8 gini_coef standard 0.951
#> 10 9 gini_coef standard 0.936
# Resampled mean estimate
metric_results |>
  summarise(avg_estimate = mean(.estimate))
#> # A tibble: 1 × 1
#> avg_estimate
#> <dbl>
#> 1 0.937
