Question 39
Domain 2: Data Processing
A data engineer needs to remove unusually extreme values from a Spark DataFrame column. Which approach correctly identifies outliers using a commonly accepted spread-based rule from the provided methods?
Correct answer: D
Explanation
Outliers in a Spark DataFrame can be identified and removed using either a standard deviation-based rule or the interquartile range method, both of which rely on how far values fall from the main distribution. — official.txt
Why each option is right or wrong
A. Flag values as outliers only when they appear more than once in the DataFrame.
Outlier detection is based on statistical spread, not repetition count.
B. Treat all values below the column average as outliers and remove them.
Values below the average are not automatically outliers; spread-based thresholds are required.
C. Remove rows only after sorting the column and discarding a fixed percentage from each end.
The provided methods are the standard deviation-based rule and the IQR method. Trimming a fixed percentage from each end discards values regardless of how extreme they actually are.
D. Use either a standard deviation-based threshold or the interquartile range to identify extreme values.
The source material states that outliers in a Spark DataFrame can be identified and removed using either a standard deviation-based rule or the interquartile range (IQR) method. Both measure how far a value falls from the main distribution, so this option names exactly the spread-based approaches the question asks for.
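As a rough illustration of the two rules named in option D, the sketch below computes the cutoff bounds in plain Python using only the standard library. The multipliers (3 standard deviations, 1.5 times the IQR) are conventional defaults assumed here, not values given in the source, and the function names and sample data are hypothetical. In PySpark the same bounds would typically come from the `mean` and `stddev` aggregate functions or from `DataFrame.approxQuantile`, and be applied with `filter`; that mapping is an assumption about how one might port this, not the source's stated implementation.

```python
import statistics


def stddev_bounds(values, k=3.0):
    """Bounds for the standard deviation rule: mean +/- k * sample std dev.

    k=3.0 is a common convention, assumed here for illustration.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return mean - k * std, mean + k * std


def iqr_bounds(values, k=1.5):
    """Bounds for the IQR rule: [Q1 - k * IQR, Q3 + k * IQR].

    k=1.5 is the usual convention, assumed here for illustration.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, bounds):
    """Keep only the values that fall inside the (low, high) bounds."""
    low, high = bounds
    return [v for v in values if low <= v <= high]


# Hypothetical column values: twenty ordinary readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95]

cleaned_std = remove_outliers(data, stddev_bounds(data))
cleaned_iqr = remove_outliers(data, iqr_bounds(data))
```

Both rules drop the extreme value in this sample while keeping the ordinary readings. On small or heavily skewed samples they can disagree, because a single extreme value inflates the mean and standard deviation far more than it moves the quartiles.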