Question 1:
An MLOps engineer is building a Pandas UDF that applies a language model to translate English strings into Spanish. The initial code loads the model on every call to the UDF, which hurts the performance of the data pipeline.
The initial code is:

import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')  # runs on every batch
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())
How can the MLOps engineer change this code to reduce how many times the language model is loaded?
A. Convert the Pandas UDF from a Series → Series UDF to a Series → Scalar UDF
B. Convert the Pandas UDF from a Series → Series UDF to an Iterator[Series] → Iterator[Series] UDF
C. Convert the Pandas UDF to a PySpark UDF
D. Run the in_spanish_inner() function in a mapInPandas() function call
Correct answer: B
Explanation: (Only visible to Pass4Test members)
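For reference, a minimal sketch of answer B's pattern, reusing the question's get_translation_model() helper: with an Iterator[Series] → Iterator[Series] UDF, the model is loaded once per task and reused for every batch in the partition, instead of once per batch.

from typing import Iterator
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

def in_spanish_inner(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Loaded once per task, before iterating over the incoming batches
    model = get_translation_model(target_lang='es')
    for batch in batches:
        yield batch.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())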
Question 2:
A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user. Before further processing the data, the data scientist wants to create another DataFrame, df_user_non_pii, containing only the non-PII columns. The PII columns in df_user are first_name, last_name, email, and birthdate.
Which code snippet can be used to meet this requirement?
A. df_user_non_pii = df_user.dropfields("first_name, last_name, email, birthdate")
B. df_user_non_pii = df_user.dropfields("first_name", "last_name", "email", "birthdate")
C. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
D. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
Correct answer: D
Explanation: (Only visible to Pass4Test members)
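As a quick sketch of why D is correct (toy data, illustrative values): DataFrame.drop() accepts one or more column names and returns a new DataFrame without them, whereas dropFields() exists only on struct-typed columns, not on DataFrames.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy df_user containing the PII columns named in the question
df_user = spark.createDataFrame(
    [("Ana", "Diaz", "ana@example.com", "1990-01-01", 42)],
    ["first_name", "last_name", "email", "birthdate", "user_id"],
)

df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
df_user_non_pii.show()  # only user_id remains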
Question 3:
A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns for transaction_id, account_number, transaction_amount, and timestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove rows with duplicates across all fields to ensure accurate financial reporting.
Which approach should the data scientist use to deduplicate the transactions using PySpark?
A. df = df.dropDuplicates(["transaction_amount"])
B. df = df.dropDuplicates()
C. df = df.groupBy("transaction_id").agg(F.first("account_number"), F.first("transaction_amount"), F.first("timestamp"))
D. df = df.filter(F.col("transaction_id").isNotNull())
Correct answer: B
Explanation: (Only visible to Pass4Test members)
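A short sketch (toy rows, illustrative values) of why the no-argument form is the right choice: dropDuplicates() without a column list compares entire rows, while passing a subset such as ["transaction_amount"] would also drop distinct transactions that merely share an amount.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, "ACC-1", 100.0, "2024-01-01 10:00:00"),
        (1, "ACC-1", 100.0, "2024-01-01 10:00:00"),  # exact duplicate row
        (2, "ACC-2", 100.0, "2024-01-01 11:00:00"),  # same amount, distinct txn
    ],
    ["transaction_id", "account_number", "transaction_amount", "timestamp"],
)

df = df.dropDuplicates()  # deduplicates across all fields
print(df.count())  # 2: only the exact duplicate was removed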
Question 4:
A data engineer is working on a streaming DataFrame streaming_df with the given streaming data:
[streaming data snapshot not shown in the source]
Which operation is supported with streaming_df?
A. streaming_df.groupby("Id").count()
B. streaming_df.filter(col("count") < 30).show()
C. streaming_df.orderBy("timestamp").limit(4)
D. streaming_df.select(countDistinct("Name"))
Correct answer: A
Explanation: (Only visible to Pass4Test members)
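For context, a hedged sketch using Spark's built-in rate source as a stand-in for the pictured stream: aggregations such as groupBy().count() are supported on streaming DataFrames, while orderBy()/limit() without aggregation, countDistinct(), and eager actions like show() are not.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Built-in rate source as a stand-in for the question's streaming_df
streaming_df = spark.readStream.format("rate").load()

counts = streaming_df.groupBy("value").count()  # supported aggregation

query = (
    counts.writeStream
    .format("console")
    .outputMode("complete")  # streaming aggregations need complete/update mode
    .start()
)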
Question 5:
A developer runs:
[code snippet not shown in the source]
What is the result?
Options:
A. It throws an error if there are null values in either partition column.
B. It creates separate directories for each unique combination of color and fruit.
C. It appends new partitions to an existing Parquet file.
D. It stores all data in a single Parquet file.
Correct answer: B
Explanation: (Only visible to Pass4Test members)
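Although the snippet itself is not reproduced above, the options imply a partitioned Parquet write. A minimal sketch (data and path are illustrative) of what partitionBy() produces on disk:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("red", "apple", 3), ("red", "cherry", 5), ("green", "apple", 2)],
    ["color", "fruit", "qty"],
)

# Writes one directory per unique (color, fruit) combination, e.g.
#   /tmp/fruit/color=red/fruit=apple/part-...parquet
df.write.partitionBy("color", "fruit").parquet("/tmp/fruit")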
Question 6:
Given:
spark.sparkContext.setLogLevel("<LOG_LEVEL>")
Which set contains only suitable LOG_LEVEL settings for the Spark driver?
A. FATAL, NONE, INFO, DEBUG
B. WARN, NONE, ERROR, FATAL
C. ERROR, WARN, TRACE, OFF
D. ALL, DEBUG, FAIL, INFO
Correct answer: C
Explanation: (Only visible to Pass4Test members)
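For reference, the levels accepted by sparkContext.setLogLevel() are ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN; NONE and FAIL are not valid, which rules out the other options. A one-line sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")  # any of the valid levels above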
Question 7:
A data engineer is streaming data from Kafka and requires:
Minimal latency
Exactly-once processing guarantees
Which trigger mode should be used?
A. .trigger(processingTime='1 second')
B. .trigger(continuous=True)
C. .trigger(availableNow=True)
D. .trigger(continuous='1 second')
Correct answer: A
Explanation: (Only visible to Pass4Test members)
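A hedged sketch of the answer in context (Kafka settings are placeholders, and the Kafka connector is assumed to be on the classpath): a short processingTime micro-batch trigger keeps latency low while preserving Structured Streaming's exactly-once guarantees, whereas continuous mode only provides at-least-once.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")  # placeholder broker
    .option("subscribe", "events")                   # placeholder topic
    .load()
)

query = (
    stream.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .trigger(processingTime="1 second")  # frequent micro-batches
    .start()
)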
Question 8:
Which command overwrites an existing JSON file when writing a DataFrame?
A. df.write.overwrite.json("path/to/file")
B. df.write.mode("overwrite").json("path/to/file")
C. df.write.format("json").save("path/to/file", mode="overwrite")
D. df.write.json("path/to/file", overwrite=True)
Correct answer: B
Explanation: (Only visible to Pass4Test members)
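A minimal sketch (toy DataFrame, illustrative path): mode("overwrite") replaces any existing output at the target path before writing the JSON files.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])

# Replaces /tmp/out if it already exists
df.write.mode("overwrite").json("/tmp/out")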