Question 6
Domain 1: Developing Code for Data Processing using Python and SQLWhat is the best way to unit test a PySpark transformation that standardizes customer data?
Correct answer: B
Explanation
Unit tests should use small, deterministic input DataFrames so the transformation has a fixed, repeatable input. Then assert on the transformed output to verify the standardization logic produces the expected rows and columns. This follows the basic testing rule of checking known inputs against expected outputs.
Why each option is right or wrong
A. Run the full production job against shared bronze tables for every code change
Production jobs and shared bronze tables are for integration or validation, not unit tests.
B. Build small deterministic input DataFrames and assert on the transformed output
PySpark unit tests should isolate the transformation logic with tiny, fixed inputs so the output is predictable and can be compared exactly against expected rows and schema. In practice, that means constructing deterministic DataFrames in the test and asserting the transformed result, rather than relying on live sources, randomness, or cluster state, which would make the test nondeterministic and flaky.
C. Rely only on manual notebook inspection after deployment
Manual notebook review is ad hoc; tests should automatically assert expected outputs.
D. Skip tests because Spark plans can change between runtimes
Changing Spark plans does not eliminate the need to test transformation logic.