Question 14

Domain 2: Data Ingestion & Acquisition

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used. Which strategy will yield the best performance without shuffling data?

A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.

C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

E. Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.

Correct answer: A

Explanation

`spark.sql.files.maxPartitionBytes` sets the maximum number of bytes packed into a single partition when reading files, so setting it to 512 MB makes each input partition roughly match the target part-file size. Because every later step is a narrow transformation, that partitioning carries through to the Parquet write without a shuffle. Since the Delta Lake file-sizing features are unavailable for plain Parquet, this is the best-performing strategy among the options.
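
The partition counts behind this reasoning can be checked with a little arithmetic. The sketch below is a rough estimate only: the helper name `estimated_input_partitions` is hypothetical, it assumes binary units (1 TB = 1024^4 bytes), and it ignores the other factors Spark weighs when splitting files, such as `spark.sql.files.openCostInBytes` and the available parallelism.

```python
TB = 1024 ** 4
MB = 1024 ** 2


def estimated_input_partitions(dataset_bytes: int, max_partition_bytes: int) -> int:
    """Rough number of read partitions: dataset size divided by the split cap, rounded up."""
    return (dataset_bytes + max_partition_bytes - 1) // max_partition_bytes


# Default split cap of 128 MB versus the 512 MB cap from option A.
print(estimated_input_partitions(1 * TB, 128 * MB))  # 8192
print(estimated_input_partitions(1 * TB, 512 * MB))  # 2048
```

The second figure matches the 2,048 partitions quoted in options B, C, and D, but here it falls out of the read itself rather than from a later repartitioning step.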

Why each option is right or wrong

A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

Spark’s file split sizing is governed by `spark.sql.files.maxPartitionBytes`, whose default is 128 MB. Setting it to 512 MB makes Spark read the 1 TB of JSON in larger input partitions that better match the desired output part-file size, without introducing a shuffle. By contrast, `repartition` and a sort (whose output partition count comes from `spark.sql.shuffle.partitions`) each force a wide shuffle, while `coalesce` avoids a shuffle but only merges the partitions the read already produced rather than controlling how the input is split.
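
Option A might look like the following PySpark sketch. The function name, the paths, the `id` column, the filter, and the overwrite mode are hypothetical placeholders, and `spark` is assumed to be an existing `SparkSession`. The byte count is passed as a plain number string, which should be accepted regardless of whether a given Spark version also parses size suffixes for this setting.

```python
TARGET_PARTITION_BYTES = 512 * 1024 * 1024  # 512 MB


def write_json_as_parquet(spark, source_path, target_path):
    """Read JSON in ~512 MB partitions, apply a narrow step, write Parquet.

    `spark` is assumed to be an active SparkSession.
    """
    # Set the split cap before the read so it applies when the scan is planned.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(TARGET_PARTITION_BYTES))

    df = spark.read.json(source_path)

    # A filter is a narrow transformation: each output partition depends on
    # exactly one input partition, so no shuffle is introduced.
    # The `id` column is a hypothetical example.
    cleaned = df.filter("id IS NOT NULL")

    # One part-file is written per partition; overwrite mode is an assumption.
    cleaned.write.mode("overwrite").parquet(target_path)
```

Because Parquet is columnar and compressed, the written files will usually be smaller than the input partitions they come from, so the exact value may need tuning against real data.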

B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.

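`spark.sql.shuffle.partitions` only sets the number of partitions produced by a shuffle, and the sort in this option is a wide transformation that triggers one, so it violates the no-shuffle requirement.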

C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

`spark.sql.adaptive.advisoryPartitionSizeInBytes` is an Adaptive Query Execution setting that guides the size of shuffle partitions, so it has no effect on a job made up only of narrow transformations. `coalesce` does avoid a shuffle, but it only merges whatever partitions the read produced and does not control how the input is split.

D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

`repartition` triggers a full shuffle, so it is not the no-shuffle strategy.

E. Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.

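`spark.sql.shuffle.partitions` sets the number of shuffle partitions, not a 512 MB file-size target. With only narrow transformations no shuffle occurs, so the setting has no effect on the files written.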
