DE Professional Exam Glossary - 86 Terms

Terminology for the Databricks Certified Data Engineer Professional exam. Use these definitions with the study guide and practice questions.


A

ACID: Atomicity, consistency, isolation, and durability; the exam references ACID transaction behavior for Delta Lake operations.
ACL: Access control list; a security mechanism for controlling access to workspace objects and data assets.
ACLs: Access control lists used to secure workspace objects and enforce least-privilege access.
Apache Spark: The distributed processing engine used on Databricks for ETL, streaming, SQL, and large-scale data transformations.
APPLY CHANGES APIs: APIs used in Lakeflow Declarative Pipelines to simplify change data capture (CDC).
assertDataFrameEqual: A testing utility used to verify that two DataFrames are equal.
assertSchemaEqual: A testing utility used to verify that two schemas are equal.
Auto Loader: A Databricks ingestion capability used to build reliable batch and streaming data pipelines that efficiently ingest new files from sources such as cloud storage.

C

CDC: Change Data Capture; the exam references APPLY CHANGES APIs as a way to simplify CDC in Lakeflow Declarative Pipelines.
CDF: Change Data Feed; the Delta Lake feature that exposes row-level changes made to a table.
Change Data Feed: A Delta Lake feature that exposes row-level changes and is used here to address limitations of streaming tables and improve latency.
checkpoint directory: The directory that stores streaming state and checkpoint information for a stream.
CI/CD: Continuous integration and continuous delivery/deployment; the exam covers integrating Databricks development and deployment workflows with CI/CD.
coalesce: A Spark operation that reduces the number of partitions without a full shuffle. The text uses it in a strategy for writing Parquet without shuffling data.
column masks: Table security controls that hide or transform sensitive column values for unauthorized users.
CTAS: An acronym for CREATE TABLE AS SELECT, used to create a derivative table from a query result. The text mentions it as a possible solution for creating a sales table from a marketing table.
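
The CDC entries above describe applying a stream of change events to a target table. The following is a minimal conceptual sketch of that idea in plain Python, not the APPLY CHANGES API itself. The event fields (`id`, `seq`, `op`, `data`) and the function name `apply_changes` are hypothetical, chosen only for illustration. It keeps the latest version of each key by sequence number and ignores late or duplicate events.

```python
from typing import Any, Dict, Iterable


def apply_changes(target: Dict[Any, dict], events: Iterable[dict]) -> Dict[Any, dict]:
    """Apply change events to `target`, keeping the newest row per key.

    Each event is a dict with hypothetical fields:
      id   - the business key
      seq  - a sequence number that orders changes for that key
      op   - "INSERT", "UPDATE", or "DELETE"
      data - the row payload (ignored for deletes)
    """
    latest_seq: Dict[Any, int] = {}
    for event in events:
        key, seq = event["id"], event["seq"]
        # Skip events that arrive late or are duplicates of one already applied.
        if key in latest_seq and seq <= latest_seq[key]:
            continue
        latest_seq[key] = seq
        if event["op"] == "DELETE":
            target.pop(key, None)
        else:
            target[key] = event["data"]
    return target


events = [
    {"id": 1, "seq": 1, "op": "INSERT", "data": {"name": "a"}},
    {"id": 2, "seq": 1, "op": "INSERT", "data": {"name": "b"}},
    {"id": 1, "seq": 3, "op": "UPDATE", "data": {"name": "a3"}},
    {"id": 1, "seq": 2, "op": "UPDATE", "data": {"name": "a2"}},  # arrives late
    {"id": 2, "seq": 2, "op": "DELETE", "data": None},
]
table = apply_changes({}, events)
```

Here the late update with `seq` 2 is dropped because a newer change for the same key was already applied, which is the kind of ordering problem the CDC tooling is meant to take off the pipeline author's hands.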

D

D2D: Databricks-to-Databricks sharing; sharing data securely between Databricks deployments.
D2O: Databricks-to-Open sharing; sharing data from Databricks to external platforms using an open sharing protocol.
DABs: Databricks Asset Bundles; a shorthand used for Databricks deployment packaging and automation.
data purging: The process of removing data to comply with retention requirements or other compliance obligations.
data retention policies: Policies that define how long data must be kept before it is purged or deleted for compliance.
data skipping: A query optimization technique that avoids reading irrelevant data files or partitions to improve performance.
Databricks Asset Bundles: A Databricks deployment mechanism used to package resources for modular development, deployment automation, and CI/CD integration.
Databricks Certified Data Engineer Professional: A Databricks certification that validates advanced skills in building, optimizing, and maintaining production-grade data engineering solutions on the Databricks Lakehouse Platform.
Databricks CLI: A command-line interface used to manage Databricks resources, jobs, pipelines, and monitoring tasks.
Databricks Compute: Databricks compute resources used to run workloads; the guide specifically notes serverless as part of the platform knowledge area.
Databricks Lakehouse Platform: Databricks' platform for building production-grade data engineering solutions; the exam expects knowledge of its core features and operational practices.
Databricks secrets module: A Databricks feature used to retrieve sensitive values such as passwords securely in code. In the text, it is accessed with dbutils.secrets.get.
DataFrame.transform: A DataFrame method used in testing and transformation workflows to apply a function to a DataFrame.
DBFS: Databricks File System, a storage layer mentioned in the text as a place where an encoded password would be saved in one answer choice.
dbutils.secrets.get: A Databricks API call that retrieves a secret value from a named scope and key, such as scope='db_creds' and key='jdbc_password'.
DEEP CLONE: A Delta Lake cloning feature that creates a new, independent table by copying both the data files and the metadata of the source. Rerunning the clone command incrementally brings the clone up to date with changes committed to the source.
deletion vectors: A Delta optimization technique that marks rows as deleted without immediately rewriting the underlying data files; mentioned alongside liquid clustering for improving table performance.
Delta Lake: A core Databricks data storage and table technology used for scalable data modeling, ACID table operations, and data ingestion/transformation in the Lakehouse.
Delta Sharing: A secure data-sharing capability for sharing live data from Databricks to other Databricks deployments or external platforms.
Delta transaction log: The log that records changes to a Delta Lake table, including operations such as table renames.
dependency graph: A structure representing task dependencies in a multi-task job. The text notes that tasks are managed as a dependency graph.
dimensional model: A data model designed for analytical workloads, emphasizing efficient querying and aggregation.
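
The `dbutils.secrets.get` and JDBC entries describe reading a password from a secret scope rather than hard-coding it. A minimal sketch follows, assuming the scope and key named in the glossary (`db_creds`, `jdbc_password`). The `_FakeDbutils` stub, the URL, the table name, and the user name are hypothetical and exist only so the example runs outside a Databricks notebook.

```python
def build_jdbc_options(dbutils, url: str, table: str, user: str) -> dict:
    """Return Spark JDBC reader options with the password pulled from a secret scope."""
    password = dbutils.secrets.get(scope="db_creds", key="jdbc_password")
    return {"url": url, "dbtable": table, "user": user, "password": password}


class _FakeSecrets:
    """Hypothetical stand-in for dbutils.secrets, for local runs only."""

    def get(self, scope: str, key: str) -> str:
        return f"<secret {scope}/{key}>"


class _FakeDbutils:
    """Hypothetical stand-in for the notebook-provided dbutils object."""

    secrets = _FakeSecrets()


options = build_jdbc_options(
    _FakeDbutils(),
    url="jdbc:postgresql://example-host:5432/sales",  # hypothetical URL
    table="orders",                                   # hypothetical table
    user="etl_user",                                  # hypothetical user
)
```

In a Databricks notebook, the real `dbutils` object would be passed instead of the stub, and the resulting dictionary could be handed to `spark.read.format("jdbc").options(**options).load()`.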

E

ETL: Extract, transform, load; the exam focuses on designing secure, reliable, and cost-effective ETL pipelines.

F

file pruning: A query optimization technique that reduces the number of files scanned during query execution.
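
File pruning can be pictured as a check against per-file minimum and maximum values for a column: a file whose range cannot overlap the query's filter is never opened. The sketch below is a simplified model in plain Python, not the engine's actual implementation. The `FileStats` class, file paths, and value ranges are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    """Hypothetical per-file statistics for one column."""
    path: str
    min_value: int
    max_value: int


def prune_files(files: List[FileStats], lower: int, upper: int) -> List[FileStats]:
    """Keep only files whose [min_value, max_value] overlaps [lower, upper]."""
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


files = [
    FileStats("part-000.parquet", 1, 100),
    FileStats("part-001.parquet", 101, 200),
    FileStats("part-002.parquet", 201, 300),
]
# A filter such as "value BETWEEN 150 AND 220" only needs the last two files.
to_scan = prune_files(files, lower=150, upper=220)
```

The tighter the value ranges per file, the more files can be skipped, which is why layout techniques such as Z-ordering and liquid clustering tend to be discussed together with pruning.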

G

Generalization: An anonymization method that reduces data specificity to protect confidential information.
Git Folders: A Databricks feature for integrating Git-based CI/CD workflows for notebook and code deployment.

H

Hashing: An anonymization or pseudonymization method that transforms confidential data into a hash value.
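
As a rough illustration of hashing for pseudonymization, the sketch below maps a value to a keyed SHA-256 digest using Python's standard library. Using a keyed hash (HMAC) instead of a bare hash is an assumption made here to make guessing the original value harder. The key and the sample email addresses are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a deterministic keyed SHA-256 digest of `value` as hex."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"example-key-kept-outside-the-code"  # hypothetical; store real keys as secrets
token_a = pseudonymize("alice@example.com", key)
token_b = pseudonymize("bob@example.com", key)
```

Because the same input always yields the same digest under a given key, the hashed column can still be used for joins and counts without exposing the original value.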

I

interactive cluster: A cluster intended for interactive use, contrasted in the text with a job cluster when choosing the lowest-cost configuration.
is_member: A function used in the view definition to test whether the current user belongs to a specified group, such as 'marketing'.
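
The `is_member` entry refers to a view that exposes a value only to members of a group. The sketch below builds the text of such a view definition in Python. The view, table, and column names are hypothetical, as is the choice of `'REDACTED'` as the masked value. The helper does no escaping, so it assumes trusted identifiers.

```python
from typing import List


def masked_view_sql(view: str, table: str, masked_column: str,
                    other_columns: List[str], group: str) -> str:
    """Build a CREATE VIEW statement that masks one column for non-members.

    Identifiers are interpolated as-is; this sketch assumes they are trusted.
    """
    select_list = ",\n  ".join(
        other_columns
        + [f"CASE WHEN is_member('{group}') THEN {masked_column} "
           f"ELSE 'REDACTED' END AS {masked_column}"]
    )
    return f"CREATE OR REPLACE VIEW {view} AS\nSELECT\n  {select_list}\nFROM {table}"


sql = masked_view_sql(
    view="customers_vw",          # hypothetical view name
    table="customers",            # hypothetical table name
    masked_column="email",        # hypothetical sensitive column
    other_columns=["customer_id", "country"],
    group="marketing",
)
```

On Databricks, the resulting string could be run with `spark.sql(sql)`. Users outside the `marketing` group would then see `REDACTED` in place of the column value.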

J

JDBC: Java Database Connectivity; a database connectivity standard and Spark data source format, used in the text to read from an external database by specifying a URL, table name, user, and password.
job cluster: A cluster created for and used by a scheduled job. The text contrasts it with a dedicated interactive cluster as the lower-cost option.
Jobs API: A Databricks API used to manage jobs and set up job status and performance notifications.

L

Lakeflow Declarative Pipelines: A Databricks pipeline framework used to build and manage reliable production-ready batch and streaming ETL pipelines, including declarative pipeline development and event-log monitoring.
Lakeflow Jobs: Databricks job orchestration for creating and automating ETL workloads via UI, APIs, or CLI.
Lakehouse Federation: A Databricks federation capability for accessing and governing data across supported source systems.
least privilege: A security principle requiring users and systems to have only the minimum access necessary to perform their tasks.
liquid clustering: A Delta Lake optimization technique used to improve query performance and simplify data layout decisions, presented as an alternative to partitioning and Z-ordering.

M

Medallion Architecture: A layered data architecture using bronze, silver, and gold tables to organize data by refinement and usage.
metadata: Descriptive information about enterprise data used to improve discoverability.
microbatch: A small batch of streaming data processed as part of Structured Streaming.
multi-task job: A Databricks job composed of multiple tasks with dependencies between them. The text describes a job with tasks A, B, and C where B and C depend on A.
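
The multi-task job entry describes tasks A, B, and C where B and C depend on A. One way to picture how a dependency graph determines run order is a topological sort, sketched below with Python's standard `graphlib` module (Python 3.9+). This is an illustration of the concept, not how the Databricks scheduler is implemented. The function name `execution_waves` is hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into waves; every task in a wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose upstream tasks are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Each task maps to the set of tasks it depends on.
job = {"A": set(), "B": {"A"}, "C": {"A"}}
waves = execution_waves(job)
```

With this graph, A forms the first wave, and B and C form the second, since neither depends on the other.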

P

Pandas UDF: A vectorized User-Defined Function written in Python that operates on pandas Series or DataFrames, using Apache Arrow to transfer data, for distributed data processing in Spark on Databricks.
Parquet: A columnar file format used in the text as the output format for a one-TB JSON dataset.
partitioning: A data layout strategy that divides a table into partitions to improve query performance and data management.
permission inheritance model: Unity Catalog's model for inheriting permissions across data objects and related resources.
PII: Personally identifiable information; sensitive personal data that the exam expects candidates to protect and mask in compliant pipelines.
PyPI: The Python Package Index; the exam mentions installing third-party dependencies from PyPI packages in Databricks.
Python: A programming language used in the exam context for data processing, project structure, UDFs, and testing on Databricks.

Q

Query Profiler UI: A Databricks interface for analyzing query execution and identifying performance bottlenecks.

R

repartition: A Spark operation that redistributes data across partitions, typically involving a shuffle. The text contrasts it with coalesce in a file-sizing strategy.
REST API: An HTTP-based application programming interface used here to manage and monitor Databricks resources.
row filters: Table security controls that restrict which rows a user can see.

S

shuffle partitions: A Spark configuration that controls parallelism during shuffle operations.
Spark Structured Streaming: Spark's streaming processing model for microbatch-based streaming workloads; the guide compares it with Lakeflow Declarative Pipelines for scalable ETL.
Spark UI: A Databricks/Spark monitoring interface used to inspect workloads, diagnose issues, and debug pipelines.
spark.sql.adaptive.advisoryPartitionSizeInBytes: A Spark SQL configuration that sets the target partition size used by adaptive query execution. The text mentions setting it to 512 MB.
spark.sql.files.maxPartitionBytes: A Spark SQL configuration that controls the maximum number of bytes to pack into a file scan partition. The text sets it to 512 MB.
spark.sql.shuffle.partitions: A Spark SQL configuration that controls the number of partitions used for shuffle operations. The text sets it to 2,048 partitions.
SQL: A query language used in the exam for data processing, transformations, and pipeline development on Databricks.
SQL Alerts: Alerts based on SQL queries, used here to monitor data quality.
Structured Streaming: Spark's streaming framework used for microbatch processing and for pipelines that must meet production SLAs.
Suppression: An anonymization method that removes or hides sensitive data values.
system tables: Databricks tables used for observability, including monitoring resource utilization, cost, auditing, and workload activity.
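
The glossary values of a one-TB dataset, 512 MB partitions, and 2,048 shuffle partitions fit together arithmetically if binary units are assumed (1 TB = 1024^4 bytes). The source does not state this derivation, so the sketch below is only one plausible reading. The helper name `partitions_needed` is hypothetical.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def partitions_needed(dataset_bytes: int, target_partition_bytes: int) -> int:
    """Ceiling division: partitions required so none exceeds the target size."""
    return -(-dataset_bytes // target_partition_bytes)


target = 512 * MB
num_partitions = partitions_needed(1 * TB, target)

# Configuration values as strings, ready to be applied to a Spark session.
spark_conf = {
    "spark.sql.files.maxPartitionBytes": str(target),
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": str(target),
    "spark.sql.shuffle.partitions": str(num_partitions),
}
```

In a notebook, each pair could be applied with `spark.conf.set(key, value)`. Whether 512 MB is a good target depends on the workload, so the numbers here should be read as an exam-scenario example rather than a recommendation.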

T

Tokenization: An anonymization or pseudonymization method that replaces sensitive data with tokens.
trigger interval: The time interval at which a Structured Streaming job is triggered to process data. The text uses a 60-minute trigger interval.

U

UDF: User-Defined Function; a custom function created by the user, including Pandas/Python UDFs in Databricks.
Unity Catalog: Databricks' governance layer for managing data access, permissions, metadata, and managed tables; the exam emphasizes permission inheritance and reduced operational overhead.

V

view: A virtual table defined by a SQL query. In the text, a view is created from a table and used to control which values are exposed to users.

W

Workflows UI: A Databricks user interface used to manage workflows and configure job notifications.

Z

Z-order: A data layout optimization technique used to cluster related data for faster query performance.

About These Definitions

These definitions are loaded from the shared release pack. Use them with the study guide and practice questions to connect vocabulary to exam scenarios.
