ML Associate Practice Q30

A. Remove values only after sorting the column and dropping the highest and lowest rows in the Spark DataFrame.

Outlier removal is based on standard deviation rules or the IQR method, not simply trimming endpoints.

B. Calculate either standard deviation thresholds or IQR bounds, then filter rows whose column values fall outside those limits.

According to the source material, outliers in a Spark DataFrame are identified and removed using either standard deviation-based rules or the interquartile range (IQR) method. Computing those bounds for the target column and filtering out the rows that fall beyond them is therefore the supported approach.
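As an illustration of option B, the following is a minimal PySpark sketch, not a definitive implementation. The function names, the 1.5 IQR multiplier, and the 3-sigma cutoff are assumptions chosen for the example (they are common conventions, but the question does not specify them). It relies on `DataFrame.approxQuantile`, `pyspark.sql.functions.mean` / `stddev`, and `Column.between`.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) limits from the quartiles; k=1.5 is an assumed convention."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def stddev_bounds(mean, std, n_sigma=3.0):
    """Return (lower, upper) limits around the mean; n_sigma=3 is an assumed convention."""
    return mean - n_sigma * std, mean + n_sigma * std


def remove_outliers_iqr(df, column, k=1.5):
    """Keep only rows whose `column` value lies inside the IQR bounds."""
    # Imported lazily so the bound helpers above can be used without Spark installed.
    from pyspark.sql import functions as F

    # A relative error of 0.0 requests exact quantiles (more expensive on large data).
    q1, q3 = df.approxQuantile(column, [0.25, 0.75], 0.0)
    lower, upper = iqr_bounds(q1, q3, k)
    # between() is inclusive; rows where the column is null are also dropped
    # by this filter, because the comparison evaluates to null.
    return df.filter(F.col(column).between(lower, upper))


def remove_outliers_stddev(df, column, n_sigma=3.0):
    """Keep only rows whose `column` value lies within n_sigma standard deviations."""
    from pyspark.sql import functions as F

    stats = df.select(
        F.mean(column).alias("mean"),
        F.stddev(column).alias("std"),
    ).first()
    lower, upper = stddev_bounds(stats["mean"], stats["std"], n_sigma)
    return df.filter(F.col(column).between(lower, upper))
```

For a hypothetical DataFrame `df` with a numeric column `price`, `remove_outliers_iqr(df, "price")` would return a new DataFrame containing only the rows inside the IQR limits. Either function can be used, depending on which rule fits the distribution of the column.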

C. Convert the column to categorical values first, then remove any category that appears less often than the others.

Outlier detection here uses numeric spread measures, not category frequency after recoding values.

D. Keep all rows unless the DataFrame contains null values, because null handling determines whether a value is an outlier.

Null handling and outlier detection are separate tasks; outliers are identified by standard deviation or IQR.
