A data engineer is creating a data ingestion pipeline to understand where customers are taking their rented bicycles during use. The engineer noticed that over time, data being transmitted from the bicycle sensors fails to include key details like latitude and longitude. Downstream analysts need both the clean records and the quarantined records available for separate processing.
The data engineer already has this code:
    import dlt
    from pyspark.sql.functions import expr

    rules = {
        "valid_lat": "(lat IS NOT NULL)",
        "valid_long": "(long IS NOT NULL)"
    }
    quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))

    @dlt.view
    def raw_trips_data():
        return spark.readStream.table("ride_and_go.telemetry.trips")
How should the data engineer meet the requirements to capture good and bad data?
A. @dlt.view
   @dlt.expect_or_drop("lat_long_present", "(lat IS NOT NULL AND long IS NOT NULL)")
   def trips_data_quarantine():
       return spark.readStream.table("ride_and_go.telemetry.trips")
B. @dlt.table(partition_cols=["is_quarantined"])
   @dlt.expect_all(rules)
   def trips_data_quarantine():
       return (
           spark.readStream.table("raw_trips_data")
           .withColumn("is_quarantined", expr(quarantine_rules))
       )
C. @dlt.table
   @dlt.expect_all_or_drop(rules)
   def trips_data_quarantine():
       return spark.readStream.table("raw_trips_data")
D. @dlt.table(name="trips_data_quarantine")
   def trips_data_quarantine():
       return (
           spark.readStream.table("raw_trips_data")
           .filter(expr(quarantine_rules))
       )
Correct answer: B
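The quarantine expression built in the question's code can be checked in plain Python, without a Spark or DLT runtime. In the sketch below, `is_quarantined` is a hypothetical pure-Python stand-in for evaluating that SQL predicate on one record (it is not part of the `dlt` API), and the two trip records are made-up sample values.

```python
# Rules copied from the question; each value is a SQL boolean expression.
rules = {
    "valid_lat": "(lat IS NOT NULL)",
    "valid_long": "(long IS NOT NULL)",
}

# A record is quarantined when it fails at least one of the rules.
quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))
print(quarantine_rules)


def is_quarantined(record: dict) -> bool:
    """Hypothetical pure-Python mirror of the SQL predicate above."""
    lat_ok = record.get("lat") is not None
    long_ok = record.get("long") is not None
    return not (lat_ok and long_ok)


# Made-up sample records: one complete, one missing its latitude.
trips = [
    {"trip_id": 1, "lat": 52.37, "long": 4.90},
    {"trip_id": 2, "lat": None, "long": 4.90},
]
flags = [is_quarantined(trip) for trip in trips]
print(flags)
```

Tagging each row with a boolean flag, rather than dropping failing rows, is what keeps both the clean and the quarantined records available downstream.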
Question 2:
A data engineer is masking a column containing email addresses. The goal is to produce output strings of identical length for all rows, while generating different outputs for different email values.
Which SQL function should be used to achieve this?
A. mask(email, '?')
B. sha1(email)
C. sha2(email, 0)
D. hash(email)
Correct answer: D
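The fixed-length property the question describes can be illustrated outside Spark. The sketch below uses Python's standard `hashlib` rather than Spark SQL, so it is an analogy only; the email addresses are made-up sample values.

```python
import hashlib

# Made-up sample inputs of very different lengths.
emails = ["a@example.com", "a.much.longer.address@example.org"]

# Hex digests have a fixed width regardless of input length:
# SHA-1 is 160 bits (40 hex chars), SHA-256 is 256 bits (64 hex chars).
sha1_digests = [hashlib.sha1(e.encode("utf-8")).hexdigest() for e in emails]
sha256_digests = [hashlib.sha256(e.encode("utf-8")).hexdigest() for e in emails]

sha1_lengths = {len(d) for d in sha1_digests}
sha256_lengths = {len(d) for d in sha256_digests}

# Different inputs give different digests for these samples.
all_distinct = len(set(sha256_digests)) == len(emails)
print(sha1_lengths, sha256_lengths, all_distinct)
```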
Question 3:
The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.
Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?
A. "Can Manage" privileges on the required cluster
B. "Can Restart" privileges on the required cluster
C. Workspace Admin privileges, cluster creation allowed. "Can Attach To" privileges on the required cluster
D. Cluster creation allowed. "Can Attach To" privileges on the required cluster
E. Cluster creation allowed. "Can Restart" privileges on the required cluster
Correct answer: B
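The reasoning behind picking the least-privileged level can be sketched as a small lookup. The table below is a simplified, hand-written model of the three cluster permission levels named in the options, not an API call; the ability names and the `minimal_level` helper are hypothetical labels chosen for illustration.

```python
# Simplified model, ordered from least to most privileged.
# Each level includes everything the previous level allows.
CLUSTER_ABILITIES = {
    "Can Attach To": {"attach"},
    "Can Restart": {"attach", "start", "restart", "terminate"},
    "Can Manage": {
        "attach", "start", "restart", "terminate",
        "edit", "modify_permissions",
    },
}


def minimal_level(required: set) -> str:
    """Return the least-privileged level that covers every required ability."""
    # Dicts preserve insertion order, so the first match is the minimal one.
    for level, abilities in CLUSTER_ABILITIES.items():
        if required <= abilities:
            return level
    raise ValueError(f"No permission level grants: {sorted(required)}")


# A user who must start a terminated cluster and then attach to it.
needed = minimal_level({"start", "attach"})
print(needed)
```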
Question 4:
A Delta Lake table with Change Data Feed (CDF) enabled in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources. The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours. Which approach would simplify the identification of these changed records?
A. Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
B. Replace the current overwrite logic with a MERGE statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the Change Data Feed.
C. Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
D. Modify the overwrite logic to include a field populated by calling current_timestamp() as data are being written; use this field to identify records written on a particular date.
Correct answer: B
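Reading changed records from the Change Data Feed might look like the sketch below. It assumes an active `SparkSession` is passed in as `spark` and that the table is already maintained with MERGE; the function name `read_recent_changes` and its parameters are hypothetical, while `readChangeFeed`, `startingTimestamp`, and the `_change_type` column are the Delta Lake reader options and metadata column for CDF.

```python
def read_recent_changes(spark, table_name: str, since_timestamp: str):
    """Return rows inserted or updated since `since_timestamp` via the CDF."""
    return (
        spark.read
        .option("readChangeFeed", "true")
        .option("startingTimestamp", since_timestamp)
        .table(table_name)
        # Keep new rows and post-update images; skip deletes and
        # pre-update images, which are not needed for fresh predictions.
        .filter("_change_type IN ('insert', 'update_postimage')")
    )
```

A nightly job could call `read_recent_changes(spark, "customer_churn_params", <timestamp of the previous run>)` and score only the returned rows.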
Question 5:
The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating the job run request was submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents?
A. The job_id and number of times the job has been run are concatenated and returned.
B. The job_id is returned in this field.
C. The globally unique ID of the newly triggered run.
D. The number of times the job definition has been run in this workspace.
Correct answer: C
Question 6:
A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. That field was further missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months.
Which describes how Delta Lake can help to avoid data loss of this nature in the future?
A. Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.
B. Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.
C. The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.
D. Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.
E. Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.
Correct answer: B
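A bronze ingestion along these lines could be sketched as follows. It assumes an active `SparkSession` is passed in as `spark` with the Kafka and Delta connectors available; the function name and all of its parameters are hypothetical placeholders, not values from the scenario.

```python
def ingest_kafka_to_bronze(spark, bootstrap_servers, topic,
                           bronze_table, checkpoint_path):
    """Append every Kafka record, unparsed, to a bronze Delta table."""
    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .load()
    )
    # Deliberately no select(): key, value, topic, partition, offset,
    # timestamp and timestampType are all kept, so a field missed by a
    # downstream parser can still be recovered by replaying from bronze.
    return (
        raw.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .toTable(bronze_table)
    )
```

Because the bronze table keeps the full payload after Kafka's seven-day retention has expired, downstream tables can be rebuilt from it when a parsing omission is discovered later.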