Question 22
Domain 3: Deployment and Orchestration of ML Workflows
A company is planning to create several ML prediction models. The training data is stored in Amazon S3. The entire dataset is more than 5 TB in size and consists of CSV, JSON, Apache Parquet, and simple text files. The data must be processed in several consecutive steps. The steps include complex manipulations that can take hours to finish running. Some of the processing involves natural language processing (NLP) transformations. The entire process must be automated. Which solution will meet these requirements?
Correct answer: D
Explanation
Amazon SageMaker Pipelines is built to orchestrate the “several consecutive steps” of an automated ML workflow, including long-running processing and NLP transformations. SageMaker Processing jobs read directly from Amazon S3 and can handle the mix of CSV, JSON, Parquet, and text files in a dataset of more than 5 TB. Amazon EventBridge can then start the pipeline automatically on a schedule or in response to an event.
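The idea of consecutive steps with dependencies can be sketched in plain Python. This is only an illustration of how a dependency graph resolves to a run order; it is not the SageMaker Pipelines API, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical step names (not from the question). Each step maps to the
# set of steps that must finish before it can start.
steps = {
    "ingest_raw_files": set(),
    "normalize_formats": {"ingest_raw_files"},
    "nlp_transform": {"normalize_formats"},
    "build_features": {"nlp_transform"},
}


def execution_order(dependencies):
    """Return a run order in which every step follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(steps)
print(order)
```

An orchestrator applies the same principle: a step starts only after the steps it depends on have completed, which is what makes a chain of hours-long jobs safe to run unattended.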
Why each option is right or wrong
A. Process data at each step by using Amazon SageMaker Data Wrangler. Automate the process by using Data Wrangler jobs.
Data Wrangler focuses on interactive data preparation, not full multi-step ML workflow orchestration at this scale.
B. Use Amazon SageMaker notebooks for each data processing step. Automate the process by using Amazon EventBridge.
Notebooks are intended mainly for interactive development and are a poor fit for orchestrating production processing steps.
C. Process data at each step by using AWS Lambda functions. Automate the process by using AWS Step Functions and Amazon EventBridge.
Lambda is unsuitable for hours-long data transformations: a single invocation can run for at most 15 minutes, and its memory and temporary storage limits are too small for large-scale ML preprocessing.
D. Use Amazon SageMaker Pipelines to create a pipeline of data processing steps. Automate the pipeline by using Amazon EventBridge.
Amazon SageMaker Pipelines is the managed service designed to orchestrate multi-step ML workflows. It chains processing, training, and transform jobs through step dependencies, and each step can run for hours. SageMaker Processing jobs read directly from Amazon S3, so they fit both the data volume (more than 5 TB) and the file types (CSV, JSON, Parquet, and text), including NLP preprocessing. Amazon EventBridge supplies the automation trigger, starting the pipeline on a schedule or in response to an event without manual intervention.
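As a rough sketch of the automation part of option D, the following builds the request arguments for an EventBridge scheduled rule that targets a SageMaker pipeline. The rule name, ARNs, schedule, and pipeline parameter are hypothetical placeholders, and the sketch assumes the pipeline and an IAM role that EventBridge can assume already exist. The boto3 calls sit in a separate function so the request shapes can be inspected without AWS credentials.

```python
def build_schedule_requests(rule_name, schedule_expression, pipeline_arn,
                            role_arn, parameters=None):
    """Build keyword arguments for EventBridge put_rule and put_targets."""
    rule_kwargs = {
        "Name": rule_name,
        "ScheduleExpression": schedule_expression,
        "State": "ENABLED",
    }
    target = {"Id": "start-pipeline", "Arn": pipeline_arn, "RoleArn": role_arn}
    if parameters:
        # Optional pipeline parameters passed to each execution.
        target["SageMakerPipelineParameters"] = {
            "PipelineParameterList": [
                {"Name": name, "Value": value}
                for name, value in parameters.items()
            ]
        }
    targets_kwargs = {"Rule": rule_name, "Targets": [target]}
    return rule_kwargs, targets_kwargs


def apply_schedule(rule_kwargs, targets_kwargs):
    """Send the requests to EventBridge (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder works without boto3 installed

    events = boto3.client("events")
    events.put_rule(**rule_kwargs)
    events.put_targets(**targets_kwargs)


# Hypothetical values for illustration only.
rule_request, targets_request = build_schedule_requests(
    rule_name="nightly-preprocessing",
    schedule_expression="rate(1 day)",
    pipeline_arn="arn:aws:sagemaker:us-east-1:123456789012:pipeline/example-pipeline",
    role_arn="arn:aws:iam::123456789012:role/example-eventbridge-role",
    parameters={"InputPrefix": "s3://example-bucket/raw/"},
)
print(rule_request)
print(targets_request)
```

Calling apply_schedule(rule_request, targets_request) would register the rule. An event pattern could be used in place of the schedule expression if the pipeline should start in response to an event rather than on a timer.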