Au
the response class proportions of each daughter node can be expressed as a weighted average
over the response class proportions of the present levels $q \in Q_P$ that are being sent to it,
it follows that the left daughter node $N_L$ is always less likely to classify an observation to
the $k = 1$ response class than the right daughter node $N_R$ when splitting on a categorical
predictor $p$ in a binary classification tree that uses a weighted Gini index node impurity
measure.
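The weighted-average argument can be checked numerically. The following is a minimal sketch, assuming that the present levels are ordered by their $k = 1$ response class proportions and that the levels with the smaller proportions are sent left; the per-level counts and the helper name `daughter_proportions` are hypothetical and chosen only for illustration.

```python
def daughter_proportions(counts, left_levels):
    """Return the k = 1 proportion of the left and right daughter nodes.

    counts maps each present level to (number of class-1 observations,
    total number of observations); left_levels lists the levels sent left.
    """
    def proportion(levels):
        n_class_one = sum(counts[q][0] for q in levels)
        n_total = sum(counts[q][1] for q in levels)
        # Equivalent to a count-weighted average of the per-level proportions.
        return n_class_one / n_total

    right_levels = [q for q in counts if q not in left_levels]
    return proportion(left_levels), proportion(right_levels)


# Hypothetical counts for four present levels (illustrative data only).
counts = {"a": (1, 10), "b": (4, 10), "c": (6, 8), "d": (9, 10)}

# Order the levels by their k = 1 proportion, smallest first.
ordered = sorted(counts, key=lambda q: counts[q][0] / counts[q][1])

# Evaluate every cut of the ordered levels into a left and a right group.
results = [daughter_proportions(counts, ordered[:cut])
           for cut in range(1, len(ordered))]

for left, right in results:
    print(f"left = {left:.3f}, right = {right:.3f}")
```

For every cut of the ordered levels in this example, the left daughter node has the smaller $k = 1$ proportion, because it averages over levels whose proportions are all below those sent right.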
In terms of implementation, the randomForest R package uses the pseudo value procedure
for binary classification that was described in Section 2.1.2 when determining the
optimal splitting criterion $S_p^*$ for a categorical predictor $p$ with a “large” number of
unordered levels.² However, the code that is responsible for computing the $k = 1$ response
class proportion $\gamma_p(q)$ within each unordered level $q \in Q$ as in equation (9) executes as
follows:
$$
\gamma_p(q) =
\begin{cases}
\dfrac{|\{n \in N_M : x_{np} = q \text{ and } y_n = 1\}|}{|\{n \in N_M : x_{np} = q\}|} & \text{if } q \in Q_P \\
0 & \text{if } q \in Q_A.
\end{cases}
$$
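As a minimal sketch of this computation (not the package's actual code), the proportions can be written in a few lines of Python. The function name `class_one_proportions` and the toy data are hypothetical; the only behavior taken from the text is that absent levels receive a proportion of zero.

```python
from collections import Counter


def class_one_proportions(levels, x, y):
    """Return the k = 1 response class proportion for every level.

    levels: all unordered levels of the categorical predictor
    x:      level observed for each observation in the mother node
    y:      response class (1 or 2) for each observation in the mother node
    """
    totals = Counter(x)
    ones = Counter(level for level, label in zip(x, y) if label == 1)
    # Present levels get the observed proportion; absent levels get 0.
    return {q: (ones[q] / totals[q] if totals[q] > 0 else 0.0) for q in levels}


# Hypothetical mother node: level "d" is absent, and level "c" is present
# but contains no class-1 observations.
levels = ["a", "b", "c", "d"]
x = ["a", "a", "b", "b", "b", "c"]
y = [1, 2, 1, 1, 2, 2]

gamma = class_one_proportions(levels, x, y)
print(gamma)
```

In this toy example, the absent level "d" and the present level "c" both end up with a proportion of zero, so the imputed value cannot be told apart from a genuinely observed one.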
Therefore, the issues that arise here are similar to the ones that were described for regression.
Even though this “zero imputation” of the $k = 1$ response class proportions $\gamma_p(q)$ for the
absent levels $q \in Q_A$ is unimportant when determining the optimal numeric pseudo split
point $\tilde{s}_p^*$ during training, it can have a large effect on the subsequent classifications that are
made for observations with absent levels. In particular, since the proportions $\gamma_p(q) \geq 0$ for
all of the present levels $q \in Q_P$, it follows from our discussions in Section 2.1.2 that the
numeric pseudo split point $\tilde{s}_p^* \geq 0$. And because the “imputed” proportions $\gamma_p(q) = 0 \leq \tilde{s}_p^*$
for all $q \in Q_A$, the absent levels will always be sent to the left daughter node. But,
due to the innate differences that exist between the two daughter nodes, this arbitrary
choice of sending the absent levels left can significantly affect the classifications that are
made on observations with absent levels. Although the model’s final classifications will also
depend on any successive splits that take place after the absent levels problem occurs,
the classifications for observations with absent levels will tend to be biased towards the
$k = 2$ response class. Moreover, this behavior also implies that the random forest binary
classification models which are trained using the randomForest R package may be sensitive
to the actual ordering of the response classes: since observations with absent levels are
always sent to the left daughter node $N_L$, which is more likely to classify them to the $k = 2$
response class than the right daughter node $N_R$, the classifications for these observations
can be influenced by interchanging the indices of the two response classes.
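The sensitivity to the ordering of the response classes can be illustrated with a single split. The sketch below is a simplified stand-in for the pseudo value procedure and not the randomForest implementation: it assumes that the split point is the midpoint between adjacent proportions, that each daughter node predicts its majority class, and that no two present levels share the same proportion. All names and counts are hypothetical.

```python
def weighted_gini(n1_left, n_left, n1_right, n_right):
    """Weighted Gini index of a binary split (two response classes)."""
    def gini(n1, n):
        p = n1 / n
        return 2 * p * (1 - p)
    total = n_left + n_right
    return (n_left * gini(n1_left, n_left)
            + n_right * gini(n1_right, n_right)) / total


def fit_stump(counts, all_levels):
    """Fit one split; counts maps each present level to (n_class1, n_total)."""
    # Zero imputation: absent levels receive a proportion of 0.
    gamma = {q: (counts[q][0] / counts[q][1] if q in counts else 0.0)
             for q in all_levels}
    present = sorted(counts, key=lambda q: gamma[q])
    best = None
    for cut in range(1, len(present)):
        left, right = present[:cut], present[cut:]
        n1_l = sum(counts[q][0] for q in left)
        n_l = sum(counts[q][1] for q in left)
        n1_r = sum(counts[q][0] for q in right)
        n_r = sum(counts[q][1] for q in right)
        score = weighted_gini(n1_l, n_l, n1_r, n_r)
        # Assumption: the split point is the midpoint of adjacent proportions.
        split_point = (gamma[left[-1]] + gamma[right[0]]) / 2
        if best is None or score < best[0]:
            best = (score, split_point, n1_l / n_l, n1_r / n_r)
    _, split_point, p_left, p_right = best
    goes_left = {q: gamma[q] <= split_point for q in all_levels}
    # Assumption: each daughter node predicts its majority class.
    left_class = 1 if p_left > 0.5 else 2
    right_class = 1 if p_right > 0.5 else 2
    return goes_left, left_class, right_class


def predict(stump, level):
    goes_left, left_class, right_class = stump
    return left_class if goes_left[level] else right_class


all_levels = ["a", "b", "c"]                    # "c" is absent in training
counts = {"a": (1, 10), "b": (9, 10)}           # hypothetical counts
swapped = {q: (n - n1, n) for q, (n1, n) in counts.items()}  # swap 1 and 2

pred_original = predict(fit_stump(counts, all_levels), "c")
# Map the prediction under swapped labels back to the original labels.
pred_swapped_back = 3 - predict(fit_stump(swapped, all_levels), "c")
print(pred_original, pred_swapped_back)
```

In this toy setting, the absent level "c" is sent left in both fits, yet it is classified to the second response class under the original labeling and to the first response class (in the original labels) once the two class indices are interchanged.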
Meanwhile, for cases where the pseudo value procedure is not or cannot be used, the
random forests FORTRAN code and the randomForest R package will instead adopt a more
brute force approach that either exhaustively or randomly searches through the space of
possible splits. However, to understand the potential problems that absent levels can cause
in these situations, we must first briefly digress into a discussion of how categorical splits
are internally represented in both code bases.
Specifically, a split on a categorical predictor $p$ is both encoded and
decoded as an integer whose binary representation identifies which unordered levels go left
² The exact condition for using the pseudo value procedure for binary classification in version 4.6-12 of
the randomForest R package is when a categorical predictor $p$ has $Q > 10$ unordered levels. Meanwhile,
although the random forests FORTRAN code for binary classification references the pseudo value procedure, it
does not appear to be implemented in the code.