Question 16
Content Domain 4: Machine Learning Implementation and Operations
A data scientist is developing a pipeline to ingest streaming web traffic data. The data scientist needs to implement a process to identify unusual web traffic patterns as part of the pipeline. The patterns will be used downstream for alerting and incident response. The data scientist has access to unlabeled historic data to use, if needed. The solution needs to do the following:
- Calculate an anomaly score for each web traffic entry.
- Adapt unusual event identification to changing web patterns over time.
Which approach should the data scientist implement to meet these requirements?
Correct answer: D
Explanation
Amazon Random Cut Forest is designed to "calculate anomaly scores" on streaming data and works with unlabeled historical data, which fits the need to identify unusual web traffic patterns. Using a "sliding window" lets the model adapt to changing traffic patterns over time, and Kinesis Data Analytics can run the SQL query in real time on the stream.
Why each option is right or wrong
A. Use historic web traffic data to train an anomaly detection model using the Amazon SageMaker Random Cut Forest (RCF) built-in model. Use an Amazon Kinesis Data Stream to process the incoming web traffic data. Attach a preprocessing AWS Lambda function to perform data enrichment by calling the RCF model to calculate the anomaly score for each record.
RCF fits anomaly detection, but a model trained once on historic data scores every record against a fixed baseline. It does not adapt to changing traffic patterns unless it is retrained and redeployed.
B. Use historic web traffic data to train an anomaly detection model using the Amazon SageMaker built-in XGBoost model. Use an Amazon Kinesis Data Stream to process the incoming web traffic data. Attach a preprocessing AWS Lambda function to perform data enrichment by calling the XGBoost model to calculate the anomaly score for each record.
XGBoost is a supervised algorithm that needs labeled examples, and the historic data here is unlabeled. Like option A, a model trained once would also not adapt to changing patterns.
C. Collect the streaming data using Amazon Kinesis Data Firehose. Map the delivery stream as an input source for Amazon Kinesis Data Analytics. Write a SQL query to run in real time against the streaming data with the k-Nearest Neighbors (kNN) SQL extension to calculate anomaly scores for each record using a tumbling window.
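kNN can measure similarity, but the anomaly-scoring function that Kinesis Data Analytics SQL provides is RCF, not kNN. A tumbling window also groups records into non-overlapping batches, whereas a sliding window re-evaluates as each record arrives.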
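The difference between the tumbling window in option C and the sliding window in option D can be sketched in plain Python. This is only an illustration of the two windowing behaviors, not Kinesis Data Analytics code; the function names and the sample records are assumptions made for the example.

```python
from collections import deque


def tumbling_windows(records, size):
    """Yield non-overlapping batches of `size` records (one output per batch)."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield list(batch)
            batch.clear()


def sliding_windows(records, size):
    """Yield, for every record, the window of up to `size` most recent records."""
    window = deque(maxlen=size)
    for record in records:
        window.append(record)
        yield list(window)


# Hypothetical record IDs standing in for web traffic entries.
records = [1, 2, 3, 4, 5, 6]
tumbling = list(tumbling_windows(records, 3))
sliding = list(sliding_windows(records, 3))
```

With six records and a window of three, the tumbling version produces two outputs while the sliding version produces one per record, which is why a sliding window suits per-record scoring.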
D. Collect the streaming data using Amazon Kinesis Data Firehose. Map the delivery stream as an input source for Amazon Kinesis Data Analytics. Write a SQL query to run in real time against the streaming data with the Amazon Random Cut Forest (RCF) SQL extension to calculate anomaly scores for each record using a sliding window.
Amazon Kinesis Data Analytics supports real-time SQL over streaming sources, and its Amazon Random Cut Forest SQL function is specifically intended to produce an anomaly score for each incoming record from unlabeled data. The use of a sliding window is what lets the scoring logic continuously retrain on the most recent traffic, so the anomaly detection adapts as web patterns drift over time rather than relying on a fixed historical baseline.
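To see why scoring against a sliding window adapts to drift, here is a minimal sketch that swaps in a simple z-score in place of Random Cut Forest. It is not the RCF algorithm and not the Kinesis Data Analytics API; the function name, the window size, and the traffic values are all hypothetical and chosen only to show the adaptation effect.

```python
from collections import deque
from statistics import mean, pstdev


def sliding_scores(values, window_size=20):
    """Score each value by its absolute z-score against the recent window."""
    window = deque(maxlen=window_size)
    scores = []
    for value in values:
        if len(window) >= 2:
            mu = mean(window)
            sigma = pstdev(window) or 1.0  # avoid dividing by zero on a flat window
            scores.append(abs(value - mu) / sigma)
        else:
            scores.append(0.0)  # not enough history to score yet
        window.append(value)  # the window always holds the most recent values
    return scores


# Hypothetical requests-per-minute: the baseline shifts from about 100 to about 200.
traffic = [100, 102, 98, 101, 99] * 4 + [200, 202, 198, 201, 199] * 4
scores = sliding_scores(traffic, window_size=10)
shift_score = scores[20]    # first record after the baseline shift
settled_score = scores[-1]  # same traffic level once the window has caught up
```

The first record at the new traffic level gets a very high score, but once the window contains only recent values the same level scores low again. A model fitted once to the old baseline would keep flagging the new level indefinitely.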