Question 12
Domain 2: Data Ingestion & Acquisition
Which configuration parameter directly affects the size of a Spark partition when data is ingested into Spark?
Correct answer: A
Explanation
`spark.sql.files.maxPartitionBytes` sets the maximum number of bytes packed into a single partition when reading files, so it directly controls how Spark splits input data into partitions during ingestion. The exam guide’s file-sizing question uses this setting to target a 512 MB part-file size, which shows that it changes partition size without shuffling data.
Why each option is right or wrong
A. spark.sql.files.maxPartitionBytes
Spark’s file-source split sizing is governed by `spark.sql.files.maxPartitionBytes`, which defaults to 128 MB and caps the number of bytes Spark packs into each input partition when reading files. In the exam guide’s file-sizing scenario, setting it to 512 MB directly changes how the 1 TB JSON input is partitioned during ingestion, without requiring a shuffle. By contrast, `spark.sql.shuffle.partitions` sets the partition count produced by a shuffle, and `repartition`/`coalesce` reshape a DataFrame after it has been read; none of them changes the initial file scan partitioning.
B. spark.sql.autoBroadcastJoinThreshold
Controls the broadcast join size limit, not file ingestion partitioning.
C. spark.sql.adaptive.advisoryPartitionSizeInBytes
Sets the advisory target size for shuffle partitions when adaptive query execution coalesces or splits them, not the initial file split size.
D. spark.sql.adaptive.coalescePartitions.minPartitionNum
Sets the minimum number of shuffle partitions that remain after adaptive execution coalesces them, not ingestion partition size.
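
To make the reasoning for option A concrete, here is a rough sketch in plain Python of how a file scan's split size can be estimated from `spark.sql.files.maxPartitionBytes`. It is a simplified approximation, not Spark's actual code: the function names, the 200-way parallelism, and the 1,000-file count are assumptions chosen for illustration, and the 4 MB value assumes the default of `spark.sql.files.openCostInBytes`. Real partition counts also depend on file boundaries and on whether the format is splittable.

```python
import math

MB = 1024 * 1024
TB = 1024 * 1024 * MB


def max_split_bytes(total_bytes, num_files, parallelism,
                    max_partition_bytes=128 * MB, open_cost_bytes=4 * MB):
    """Approximate the split size chosen for a file scan (simplified sketch)."""
    # Each file is padded with an "open cost" before averaging over the cores.
    bytes_per_core = (total_bytes + num_files * open_cost_bytes) // parallelism
    # maxPartitionBytes is the upper bound; the open cost is the lower bound.
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))


def estimated_partitions(total_bytes, split_bytes):
    """Rough partition count for splittable input (ignores file bin-packing)."""
    return math.ceil(total_bytes / split_bytes)


# Hypothetical scenario: 1 TB of splittable input in 1,000 files, 200 cores.
total = 1 * TB
default_split = max_split_bytes(total, num_files=1000, parallelism=200)
tuned_split = max_split_bytes(total, num_files=1000, parallelism=200,
                              max_partition_bytes=512 * MB)

print(estimated_partitions(total, default_split))  # 8192 partitions at 128 MB
print(estimated_partitions(total, tuned_split))    # 2048 partitions at 512 MB
```

Under these assumptions, raising the cap from 128 MB to 512 MB cuts the estimated number of input partitions by a factor of four, with no shuffle involved. In a live session the setting would typically be applied before the read, for example with `spark.conf.set("spark.sql.files.maxPartitionBytes", ...)`.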