Question 30
Domain 2: Data Processing
A data engineer needs to remove unusually extreme values from a Spark DataFrame column. Which approach correctly identifies outliers using a commonly accepted spread-based rule?
Correct answer: B
Explanation
Outliers in a Spark DataFrame can be identified and removed using either standard deviation-based rules or the interquartile range method, both of which rely on how far values fall from the main distribution. — official.txt
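As a minimal sketch of the two rules, the PySpark code below computes bounds for one numeric column and keeps only the rows inside them. The multipliers (1.5 for the IQR fences, 3 for standard deviations) are common conventions assumed here, not values taken from the source, and the helper names iqr_bounds, stddev_bounds, and filter_outliers are hypothetical.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def stddev_bounds(mean, std, k=3.0):
    """Return (lower, upper) limits: mean - k*std and mean + k*std."""
    return mean - k * std, mean + k * std


def filter_outliers(df, column, method="iqr"):
    """Keep only rows of a Spark DataFrame whose `column` lies within the bounds."""
    # Imported here so the two helpers above can be used without Spark installed.
    from pyspark.sql import functions as F

    if method == "iqr":
        # approxQuantile(col, probabilities, relativeError); 0.0 asks for exact quantiles.
        q1, q3 = df.approxQuantile(column, [0.25, 0.75], 0.0)
        lower, upper = iqr_bounds(q1, q3)
    elif method == "stddev":
        stats = df.select(
            F.mean(column).alias("mean"),
            F.stddev(column).alias("std"),  # sample standard deviation
        ).first()
        lower, upper = stddev_bounds(stats["mean"], stats["std"])
    else:
        raise ValueError("method must be 'iqr' or 'stddev'")

    # between() is inclusive. Rows with a null in `column` are dropped as well,
    # because a comparison against null never evaluates to true.
    return df.filter(F.col(column).between(lower, upper))
```

For a hypothetical DataFrame df with a numeric column named amount, filter_outliers(df, "amount") would apply the IQR rule and filter_outliers(df, "amount", method="stddev") the standard deviation rule. Since the final filter also discards nulls, handle them in a separate step beforehand if those rows must be kept.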
Why each option is right or wrong
A. Remove values only after sorting the column and dropping the highest and lowest row in the Spark DataFrame.
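Outlier removal is based on standard deviation rules or the IQR method, not on trimming the endpoints. Dropping the single highest and lowest rows always removes exactly two values, whether or not they are extreme, and leaves any other outliers in place.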
B. Calculate either standard deviation thresholds or IQR bounds, then filter rows whose column values fall outside those limits.
The source material states that outliers in a Spark DataFrame should be identified and removed using either standard deviation-based rules or the interquartile range (IQR) method. In this situation, computing those bounds for the target column and filtering out rows beyond them is the supported approach.
C. Convert the column to categorical values first, then remove any category that appears less often than the others.
Outlier detection here relies on numeric spread measures, not on how often a category appears after the values are recoded.
D. Keep all rows unless the DataFrame contains null values, because null handling determines whether a value is an outlier.
Null handling and outlier detection are separate tasks; outliers are identified by standard deviation or IQR.