Question 35
Domain 5
You are developing an ML model using a dataset with categorical input variables. You have randomly split half of the data into training and test sets. After applying one-hot encoding on the categorical variables in the training set, you discover that one categorical variable is missing from the test set. What should you do?
Correct answer: C
Explanation
One-hot encoding must be applied to the test set using the same categorical schema as the training set so both datasets have matching feature columns. If a category is missing in the test set, encoding the test data with the training-set schema still produces that category's column (filled with zeros), which preserves the model’s expected input structure and avoids a feature mismatch at prediction time.
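A minimal sketch of this idea in plain Python, assuming a single categorical column. The helper names `fit_categories` and `one_hot` and the sample values are hypothetical and used only for illustration. The category list is learned from the training split and reused for the test split, so a category absent from the test data still gets a column of zeros.

```python
def fit_categories(values):
    """Learn the category schema (sorted unique values) from the training split."""
    return sorted(set(values))


def one_hot(values, categories):
    """Encode values against a fixed schema: one column per training category."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        # Assumption: a value never seen in training becomes an all-zero row.
        # Other policies (for example, raising an error) are equally valid.
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return rows


# Hypothetical data: "green" appears in training but not in the test split.
train_colors = ["red", "green", "blue", "red"]
test_colors = ["blue", "red"]

categories = fit_categories(train_colors)          # schema comes from training only
train_encoded = one_hot(train_colors, categories)  # 4 rows x 3 columns
test_encoded = one_hot(test_colors, categories)    # 2 rows x 3 columns, "green" column all zeros
```

Libraries such as scikit-learn follow the same fit-on-train, transform-on-test pattern with `OneHotEncoder`; the pure-Python version above is only meant to make the mechanism visible.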
Why each option is right or wrong
A. Use sparse representation in the test set.
B. Randomly redistribute the data, with 70% for the training set and 30% for the test set.
C. Apply one-hot encoding on the categorical variables in the test data.
Under standard supervised-learning preprocessing practice, the model must receive the same feature vector shape at prediction time as it saw during training; otherwise the input matrix dimensions will not align. One-hot encoding the test set with the training-set category schema ensures the encoded columns match, even if one category has zero occurrences in the test split. That category's column is simply all zeros, which avoids a feature-mismatch error when the model expects it.
D. Collect more data representing all categories.
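To illustrate the dimension mismatch described under option C, the following sketch compares encoding the test split on its own with encoding it against the training schema. The data, the `encode` helper, and the weight vector are all hypothetical; the weights stand in for any model that expects one input per training column.

```python
def encode(values, categories):
    """One-hot encode values against the given ordered category list."""
    return [[1 if value == cat else 0 for cat in categories] for value in values]


train = ["cat", "dog", "bird"]         # hypothetical training column
test = ["dog", "cat"]                  # "bird" never appears in the test split

train_schema = sorted(set(train))      # 3 categories -> 3 columns
test_only_schema = sorted(set(test))   # 2 categories -> 2 columns

weights = [0.5, -1.0, 2.0]             # hypothetical model: one weight per training column

mismatched = encode(test, test_only_schema)  # rows of width 2, cannot line up with 3 weights
aligned = encode(test, train_schema)         # rows of width 3, matches the model input

# Score each aligned row with a simple dot product against the weights.
scores = [sum(w * x for w, x in zip(weights, row)) for row in aligned]
```

Only the `aligned` rows have the width the weights expect; the `mismatched` rows are one column short because the schema was derived from the test split alone.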