Question 1
Domain 1: Data Preparation for Machine Learning (ML)
An ML engineer needs to use data with Amazon SageMaker Canvas to train an ML model. The data is stored in Amazon S3 and is complex in structure. The ML engineer must use a file format that minimizes processing time for the data. Which file format will meet these requirements?
Correct answer: D
Explanation
Apache Parquet is a columnar file format, so SageMaker Canvas can read only the needed columns instead of scanning entire rows, which minimizes processing time for complex data. Parquet files also embed their schema and data types and support nested fields, so complex structure is preserved without re-parsing text. These properties make it a common choice for analytics and ML datasets stored in Amazon S3.
Why each option is right or wrong
A. CSV files compressed with Snappy
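CSV is row-based text. Snappy compression reduces storage and transfer size, but every row must still be decompressed and parsed in full, and CSV has no native way to represent complex or nested structure.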
B. JSON objects in JSONL format
JSONL handles semi-structured records, but each line must be parsed as text, which is typically slower than reading typed columns from Parquet.
C. JSON files compressed with gzip
Gzip compression reduces file size, but the JSON must still be decompressed and parsed as text, field by field, which adds processing time.
D. Apache Parquet files
Amazon SageMaker Canvas supports tabular data ingestion from Amazon S3, and for complex datasets the most efficient choice is a columnar format such as Apache Parquet. Parquet stores data by column rather than by row, so Canvas can load only the required fields and avoid full-row scans, which reduces parsing and I/O overhead for large S3 objects. This makes it the best fit when the requirement is to minimize processing time for training data.
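The row-versus-column argument above can be illustrated with a small sketch. This is a toy model using only the Python standard library, not Parquet's actual on-disk format and not a description of what SageMaker Canvas does internally. The sample records, column names, and helper functions are all hypothetical. It counts how many field values must be touched to read a single column from each layout.

```python
import csv
import io

# Hypothetical sample data; the column names are illustrative only.
FIELDS = ["customer_id", "region", "spend"]
ROWS = [
    {"customer_id": "1", "region": "us-east-1", "spend": "120.50"},
    {"customer_id": "2", "region": "eu-west-1", "spend": "75.00"},
    {"customer_id": "3", "region": "us-east-1", "spend": "310.25"},
]


def to_row_layout(rows):
    """Serialize as CSV text: the values of one record sit next to each other."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_column_layout(rows):
    """Group values by column: the values of one field sit next to each other."""
    return {name: [row[name] for row in rows] for name in FIELDS}


def read_column_from_rows(csv_text, column):
    """Row layout: every field of every record is parsed to extract one column."""
    values, fields_touched = [], 0
    for record in csv.DictReader(io.StringIO(csv_text)):
        fields_touched += len(record)
        values.append(record[column])
    return values, fields_touched


def read_column_from_columns(columns, column):
    """Column layout: only the requested column is touched."""
    values = list(columns[column])
    return values, len(values)


row_values, row_cost = read_column_from_rows(to_row_layout(ROWS), "spend")
col_values, col_cost = read_column_from_columns(to_column_layout(ROWS), "spend")
print(row_values == col_values, row_cost, col_cost)  # True 9 3
```

Both layouts return the same values, but the row layout touches all nine fields while the column layout touches three. Real Parquet files add row groups, per-column encoding, and compression on top of this idea, so the gap generally widens as the number of columns grows.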