Question 25
Content Domain 1: Data Engineering
A Data Scientist needs to create a serverless ingestion and analytics solution for high-velocity, real-time streaming data. The ingestion process must buffer and convert incoming records from JSON to a query-optimized, columnar format without data loss. The output datastore must be highly available, and Analysts must be able to run SQL queries against the data and connect to existing business intelligence dashboards. Which solution should the Data Scientist build to satisfy the requirements?
Correct answer: A
Explanation
Amazon Kinesis Data Firehose is a serverless service that buffers streaming records and converts JSON into a columnar format such as Apache Parquet or ORC before delivering the data to Amazon S3. Amazon S3 provides highly available storage, Amazon Athena lets analysts run SQL directly on the data in S3, and the Athena JDBC driver lets existing BI tools connect.
Why each option is right or wrong
A. Create a schema in the AWS Glue Data Catalog of the incoming data format. Use an Amazon Kinesis Data Firehose delivery stream to stream the data and transform the data to Apache Parquet or ORC format using the AWS Glue Data Catalog before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
Correct. Amazon Kinesis Data Firehose is a fully managed, serverless service that buffers incoming records and, with record format conversion enabled, uses a table schema defined in the AWS Glue Data Catalog to convert JSON into Apache Parquet or ORC before writing to Amazon S3. Buffering combined with managed delivery addresses the requirement to avoid data loss, and the columnar output is query-optimized. Amazon S3 provides the highly available datastore. Analysts can query the objects with Amazon Athena, which runs SQL over data in S3, and existing BI dashboards can connect through the Athena JDBC driver.
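As a rough illustration of option A, the sketch below builds the S3 destination settings that a Firehose delivery stream would need for JSON-to-columnar conversion backed by a Glue Data Catalog table. The helper name `build_firehose_s3_config`, the ARNs, the database and table names, and the buffer sizes are all hypothetical placeholders; the dictionary keys follow the shape accepted by the boto3 `create_delivery_stream` call, which is shown only as a comment so the block runs without AWS credentials.

```python
def build_firehose_s3_config(bucket_arn, role_arn, glue_database, glue_table,
                             region, output_format="parquet",
                             buffer_mb=128, buffer_seconds=300):
    """Build an ExtendedS3DestinationConfiguration dict (illustrative sketch)."""
    # Firehose offers a Parquet and an ORC serializer for format conversion.
    serializers = {"parquet": {"ParquetSerDe": {}}, "orc": {"OrcSerDe": {}}}
    if output_format not in serializers:
        raise ValueError("output_format must be 'parquet' or 'orc'")

    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        # Records are buffered until either the size or the interval is reached.
        "BufferingHints": {
            "SizeInMBs": buffer_mb,
            "IntervalInSeconds": buffer_seconds,
        },
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # The schema comes from a table in the AWS Glue Data Catalog.
            "SchemaConfiguration": {
                "RoleARN": role_arn,
                "DatabaseName": glue_database,
                "TableName": glue_table,
                "Region": region,
                "VersionId": "LATEST",
            },
            # Incoming records are JSON; output is the chosen columnar format.
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": serializers[output_format]},
        },
    }


# Placeholder values, assumed purely for illustration.
config = build_firehose_s3_config(
    bucket_arn="arn:aws:s3:::example-analytics-bucket",
    role_arn="arn:aws:iam::123456789012:role/example-firehose-role",
    glue_database="example_db",
    glue_table="example_events",
    region="us-east-1",
)

# With boto3 installed and credentials configured, the stream could be created with:
# import boto3
# boto3.client("firehose").create_delivery_stream(
#     DeliveryStreamName="example-stream",
#     DeliveryStreamType="DirectPut",
#     ExtendedS3DestinationConfiguration=config,
# )
```

Once the Parquet or ORC objects land in S3, the same Glue table can be queried from Athena, so the schema is defined once and shared by ingestion and analytics.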
B. Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and writes the data to a processed data location in Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
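Incorrect. An S3 Put event invokes Lambda once per object, so writing each JSON record as its own object provides no buffering. At high velocity this produces a very large number of invocations and many small output files, which are inefficient for columnar formats and for Athena queries. The pipeline also requires custom conversion code for something Firehose offers as a managed feature.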
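The cost of skipping the buffer can be made concrete with some back-of-the-envelope arithmetic. The sketch below compares the number of S3 objects written per hour when every record becomes its own object against a size-based buffer. The traffic figures (5,000 records per second, 1 KiB per record) and the 128 MiB buffer are assumed values chosen for illustration, and the model ignores time-based flushes, which is reasonable here only because the buffer fills well before a typical flush interval.

```python
import math


def objects_written(records_per_second, avg_record_bytes, seconds, buffer_bytes=None):
    """Estimate how many S3 objects are written over a period.

    With buffer_bytes=None, every record becomes its own object (option B).
    Otherwise records are grouped into objects of roughly buffer_bytes each.
    """
    total_records = records_per_second * seconds
    if buffer_bytes is None:
        return total_records
    total_bytes = total_records * avg_record_bytes
    return math.ceil(total_bytes / buffer_bytes)


# Assumed workload: 5,000 records/s, 1 KiB each, over one hour.
unbuffered = objects_written(5000, 1024, 3600)
buffered = objects_written(5000, 1024, 3600, buffer_bytes=128 * 1024 * 1024)

print(unbuffered)  # 18000000 objects, each triggering one Lambda invocation
print(buffered)    # 138 objects of roughly 128 MiB each
```

Under these assumptions the unbuffered design writes about five orders of magnitude more objects, which is the small-file problem that Firehose buffering avoids.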
C. Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and inserts it into an Amazon RDS PostgreSQL database. Have the Analysts query and run dashboards from the RDS database.
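Incorrect. Amazon RDS for PostgreSQL is an instance-based, row-oriented relational database, so it is neither serverless nor a columnar analytics store, and converting records to Parquet or ORC before inserting them as rows serves no purpose. This option also inherits the unbuffered, per-object S3-and-Lambda ingestion of option B.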
D. Use Amazon Kinesis Data Analytics to ingest the streaming data and perform real-time SQL queries to convert the records to Apache Parquet before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
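Incorrect. Kinesis Data Analytics runs SQL over streams for real-time analysis. Its SQL queries do not convert records to Parquet, and it is not the service designed for buffered delivery with format conversion to Amazon S3; that is the role of Kinesis Data Firehose.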