A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.
Which approach would simplify the identification of these changed records?
A. Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
B. Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
C. Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
D. Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
E. Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.
Correct answer: B
Explanation: (Visible to Pass4Test members only)
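A minimal PySpark sketch of option B, assuming customer_churn_params is a Delta table with the change data feed enabled; the starting version and the churn_model object are illustrative stand-ins:

    # Enable the change data feed on the target table (one-time setup).
    spark.sql("""
        ALTER TABLE customer_churn_params
        SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """)

    # Read only rows that changed since the last processed table version;
    # startingVersion here is illustrative and would normally be checkpointed.
    changed = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 5)
        .table("customer_churn_params")
        .filter("_change_type IN ('insert', 'update_postimage')")
    )

    # Score only the changed records (churn_model is a stand-in for the
    # ML team's trained model).
    predictions = churn_model.transform(changed)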
Question 2:
The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the following schema:
item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING
The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify whether any of 30 keywords exist in this field.
A junior data engineer suggests converting this data to Delta Lake will improve query performance.
Which response to the junior data engineer's suggestion is correct?
A. The Delta log creates a term matrix for free text fields to support selective filtering.
B. Text data cannot be stored with Delta Lake.
C. ZORDER ON review will need to be run to see performance gains.
D. Delta Lake statistics are only collected on the first 4 columns in a table.
E. Delta Lake statistics are not optimized for free text fields with high cardinality.
Correct answer: E
Explanation: (Visible to Pass4Test members only)
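For context, a hypothetical version of the keyword query the data science team would run (table name and keywords are illustrative). File-level min/max statistics on a long, high-cardinality free-text column like review almost never prune files for predicates of this shape, which is why E is correct:

    # Illustrative subset of the 30 keywords.
    keywords = ["refund", "broken", "excellent"]
    predicate = " OR ".join(f"review LIKE '%{k}%'" for k in keywords)

    # Min/max statistics on free text do not help skip files for
    # substring matches, so this scans most of the table regardless
    # of the storage format.
    matches = spark.sql(f"SELECT review_id, rating FROM reviews WHERE {predicate}")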
Question 3:
An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:
df = spark.read.format("parquet").load(f"/mnt/source/{date}")
Which code block should be used to create the date Python variable used in the above code block?
A. date = dbutils.notebooks.getParam("date")
B. date = spark.conf.get("date")
C. input_dict = input()
date = input_dict["date"]
D. dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")
E. import sys
date = sys.argv[1]
Correct answer: D
Explanation: (Visible to Pass4Test members only)
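A minimal sketch of option D inside the scheduled notebook; the parameter value shown in the comment is illustrative, and the widget default is only used for interactive runs:

    # The Jobs API call passes e.g. notebook_params={"date": "2024-01-15"};
    # dbutils.widgets surfaces that parameter here.
    dbutils.widgets.text("date", "null")  # default for interactive runs
    date = dbutils.widgets.get("date")

    df = spark.read.format("parquet").load(f"/mnt/source/{date}")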
Question 4:
A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.
Which statement describes the contents of the workspace audit logs concerning these events?
A. Because User B last configured the jobs, their identity will be associated with both the job creation events and the job run events.
B. Because the REST API was used for job creation and triggering runs, user identity will not be captured in the audit logs.
C. Because these events are managed separately, User A will have their identity associated with the job creation events and User B will have their identity associated with the job run events.
D. Because User A created the jobs, their identity will be associated with both the job creation events and the job run events.
E. Because the REST API was used for job creation and triggering runs, a Service Principal will be automatically used to identify these events.
Correct answer: C
Explanation: (Visible to Pass4Test members only)
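A hedged sketch of how to confirm this split in practice, assuming audit logs are surfaced through the Unity Catalog system table system.access.audit; 'create' and 'runNow' are the jobs-service action names recorded for these two event types:

    # User A's email should appear on 'create' events and User B's on
    # 'runNow' events, matching answer C.
    spark.sql("""
        SELECT event_time, user_identity.email, action_name
        FROM system.access.audit
        WHERE service_name = 'jobs'
          AND action_name IN ('create', 'runNow')
        ORDER BY event_time
    """).show(truncate=False)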
Question 5:
The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company. A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.
Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?
A. "Manage" permission should be set on a secret scope containing only those credentials that will be used by a given team.
B. No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.
C. "Read'' permissions should be set on a secret key mapped to those credentials that will be used by a given team.
D. "Read" permissions should be set on a secret scope containing only those credentials that will be used by a given team.
Correct answer: D
Explanation: (Visible to Pass4Test members only)
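Once READ permission on a per-team scope is granted to the team's group, a team member consumes the credentials like this; the scope, key, and JDBC details below are illustrative:

    # Retrieve the team's credentials from the scoped secrets.
    user = dbutils.secrets.get(scope="finance-team-creds", key="username")
    password = dbutils.secrets.get(scope="finance-team-creds", key="password")

    # Connect to the external database over JDBC (URL and table illustrative).
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db.example.com:5432/sales")
        .option("dbtable", "public.orders")
        .option("user", user)
        .option("password", password)
        .load()
    )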
Question 6:
A data architect has heard about Delta Lake's built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full history of all valid street addresses as they appear in the customers table.
The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability.
Which piece of information is critical to this decision?
A. Delta Lake only supports Type 0 tables; once records are inserted to a Delta Lake table, they cannot be modified.
B. Data corruption can occur if a query fails in a partially completed state because Type 2 tables require setting multiple fields in a single update.
C. Delta Lake time travel does not scale well in cost or latency to provide a long-term versioning solution.
D. Delta Lake time travel cannot be used to query previous versions of these tables because Type 1 changes modify data files in place.
E. Shallow clones can be combined with Type 1 tables to accelerate historic queries for long-term versioning.
Correct answer: C
Explanation: (Visible to Pass4Test members only)
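Answer C hinges on retention: time travel queries are easy to write, but keeping every version queryable long-term means retaining old transaction log entries and data files indefinitely. A sketch, with illustrative durations:

    # Querying an old version is simple...
    old_addresses = spark.sql("SELECT * FROM customers VERSION AS OF 12")

    # ...but long-term auditing via time travel requires retaining years
    # of history, which inflates storage cost and reconstruction latency.
    spark.sql("""
        ALTER TABLE customers SET TBLPROPERTIES (
            delta.logRetentionDuration = 'interval 3650 days',
            delta.deletedFileRetentionDuration = 'interval 3650 days'
        )
    """)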
Question 7:
A data pipeline uses Structured Streaming to ingest data from Kafka to Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline was deployed, the data engineering team noticed some latency issues during certain times of the day.
A senior data engineer updates the Delta table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use the additional metadata fields to diagnose the transient processing delays.
Which limitation will the team face while diagnosing this problem?
A. New fields cannot be added to a production Delta table.
B. Updating the table schema requires a default value provided for each field added.
Get Latest & Actual Certified-Data-Engineer-Professional Exam's Question and Answers from
C. New fields will not be computed for historic records.
D. Updating the table schema will invalidate the Delta transaction log metadata.
E. Spark cannot capture the topic and partition fields from the Kafka source.
Correct answer: C
Explanation: (Visible to Pass4Test members only)
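A sketch of the updated ingestion logic; the bootstrap servers and topic are illustrative. Rows ingested before the schema change keep NULL in the new columns, which is the limitation named in C:

    from pyspark.sql import functions as F

    bronze_updates = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "host1:9092")  # illustrative
        .option("subscribe", "events")                     # illustrative
        .load()
        # Kafka-generated fields plus the new diagnostic metadata columns.
        .select(
            "timestamp", "key", "value",
            F.current_timestamp().alias("processing_time"),
            "topic", "partition",
        )
    )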
Question 8:
A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:
An analyst who is not a member of the auditing group executes the following query:
SELECT * FROM user_ltv_no_minors
Which statement describes the results returned by this query?
A. All records from all columns will be displayed with the values in user_ltv.
B. All age values less than 18 will be returned as null values; all other columns will be returned with the values in user_ltv.
C. All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.
D. All values for the age column will be returned as null values, all other columns will be returned with the values in user_ltv.
E. All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.
Correct answer: C
Explanation: (Visible to Pass4Test members only)
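The view definition itself is an image that is not reproduced above. A hedged reconstruction consistent with answer C, using the is_member() pattern for conditional row filtering (the exact predicate is an assumption):

    # Hypothetical reconstruction: auditing members see every row;
    # everyone else only sees rows with age > 18.
    spark.sql("""
        CREATE OR REPLACE VIEW user_ltv_no_minors AS
        SELECT * FROM user_ltv
        WHERE CASE
                WHEN is_member('auditing') THEN TRUE
                ELSE age > 18
              END
    """)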
Question 9:
The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.
The following logic is used to process these records.
MERGE INTO customers
USING (
  -- Changed customers produce two staged rows: a keyed row that matches
  -- and expires the current record, and a NULL-keyed row that fails to
  -- match and inserts the new version.
  SELECT updates.customer_id AS merge_key, updates.*
  FROM updates
  UNION ALL
  SELECT NULL AS merge_key, updates.*
  FROM updates
  JOIN customers
    ON updates.customer_id = customers.customer_id
  WHERE customers.current = true AND updates.address <> customers.address
) staged_updates
ON customers.customer_id = merge_key
WHEN MATCHED AND customers.current = true AND customers.address <> staged_updates.address THEN
  UPDATE SET current = false, end_date = staged_updates.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, current, effective_date, end_date)
  VALUES (staged_updates.customer_id, staged_updates.address, true, staged_updates.effective_date, null)

Which statement describes this implementation?
A. The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.
B. The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.
C. The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.
D. The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.
Correct answer: C
Explanation: (Visible to Pass4Test members only)
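A quick way to see the Type 2 behavior after the merge runs: each customer keeps its full address history, with at most one row where current is true (a sketch):

    spark.sql("""
        SELECT customer_id, address, current, effective_date, end_date
        FROM customers
        ORDER BY customer_id, effective_date
    """).show()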