A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.
Which code fragment meets the requirements?
A. regions = dict(
       regions_df
       .select('region_id', 'region')
       .limit(3)
       .collect()
   )
B. regions = dict(
       regions_df
       .select('region_id', 'region')
       .sort('region_id')
       .take(3)
   )
C. regions = dict(
       regions_df
       .select('region', 'region_id')
       .sort(desc('region_id'))
       .take(3)
   )
D. regions = dict(
       regions_df
       .select('region', 'region_id')
       .sort('region_id')
       .take(3)
   )
Correct answer: D
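Why option D fits can be checked without a cluster. `take(3)` returns a list of Row objects, and a Row behaves like a tuple, so `dict()` uses the first selected column as the key and the second as the value. Sorting ascending by `region_id` before `take(3)` keeps the three smallest ids. The sketch below is a plain-Python stand-in rather than Spark code: the `rows` list, its values, and the helper name `smallest_three` are hypothetical, and ordinary tuples stand in for Rows.

```python
# Hypothetical sample data standing in for regions_df; each tuple is (region, region_id).
rows = [("West", 4), ("North", 1), ("East", 3), ("South", 2), ("Central", 5)]

def smallest_three(rows):
    # Mirrors .sort('region_id').take(3): order by id ascending, keep the first three.
    ordered = sorted(rows, key=lambda r: r[1])
    # dict() treats each 2-item tuple as (key, value), so the column order
    # chosen in select() decides the direction of the mapping.
    return dict(ordered[:3])

regions = smallest_three(rows)
print(regions)
```

Selecting `('region_id', 'region')` instead, as in options A and B, would produce a mapping from id to name, and sorting descending, as in option C, would keep the three largest ids.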
Question 2:
A developer runs:

What is the result?
Options:
A. It throws an error if there are null values in either partition column.
B. It creates separate directories for each unique combination of color and fruit.
C. It appends new partitions to an existing Parquet file.
D. It stores all data in a single Parquet file.
Correct answer: B
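The command itself is not reproduced on this page. Assuming it is a Parquet write partitioned by `color` and then `fruit`, as the options imply, Spark produces nested `column=value` directories, one per distinct combination of values. The sketch below only computes that directory layout in plain Python; the records, the base path `/data/out`, and the helper `partition_dirs` are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical records; the real DataFrame is not shown in the question.
records = [
    {"color": "red", "fruit": "apple", "qty": 3},
    {"color": "red", "fruit": "cherry", "qty": 7},
    {"color": "green", "fruit": "apple", "qty": 2},
    {"color": "red", "fruit": "apple", "qty": 5},
]

def partition_dirs(records, base, columns):
    # One directory per distinct combination of partition values,
    # nested in the order the partition columns are listed.
    dirs = set()
    for rec in records:
        path = PurePosixPath(base)
        for col in columns:
            path = path / f"{col}={rec[col]}"
        dirs.add(str(path))
    return sorted(dirs)

layout = partition_dirs(records, "/data/out", ["color", "fruit"])
for d in layout:
    print(d)
```

Four records yield three directories here because two of them share the same `color` and `fruit` values.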
Question 3:
23 of 55.
A data scientist is working with a massive dataset that exceeds the memory capacity of a single machine. The data scientist is considering using Apache Spark™ instead of traditional single-machine languages like standard Python scripts.
Which two advantages does Apache Spark™ offer over a normal single-machine language in this scenario? (Choose 2 answers)
A. It processes data solely on disk storage, reducing the need for memory resources.
B. It can distribute data processing tasks across a cluster of machines, enabling horizontal scalability.
C. It has built-in fault tolerance, allowing it to recover seamlessly from node failures during computation.
D. It requires specialized hardware to run, making it unsuitable for commodity hardware clusters.
E. It eliminates the need to write any code, automatically handling all data processing.
Correct answer: B, C
Question 4:
28 of 55.
A data analyst builds a Spark application to analyze finance data and performs the following operations:
filter, select, groupBy, and coalesce.
Which operation results in a shuffle?
A. coalesce
B. groupBy
C. filter
D. select
Correct answer: B
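The reason groupBy stands out is that an aggregation per key needs all rows with the same key in one place, so rows must move between partitions. filter and select work on each partition independently, and coalesce merges existing partitions without a full shuffle. The following is a plain-Python illustration of that rerouting step, not Spark's implementation; the data, the helper names, and the byte-sum partitioner are hypothetical stand-ins.

```python
from collections import defaultdict

# Hypothetical input: two partitions of (account, amount) rows.
partitions = [
    [("acct_a", 10), ("acct_b", 5)],
    [("acct_a", 7), ("acct_c", 1)],
]

def shuffle_by_key(partitions, num_out):
    # Route every row to an output partition chosen from its key, so rows
    # with equal keys end up together (data crosses partition boundaries).
    out = [[] for _ in range(num_out)]
    for part in partitions:
        for key, amount in part:
            target = sum(key.encode()) % num_out  # deterministic stand-in for a hash partitioner
            out[target].append((key, amount))
    return out

def sum_within_partitions(partitions):
    # After the shuffle, each partition can aggregate its keys locally.
    results = []
    for part in partitions:
        totals = defaultdict(int)
        for key, amount in part:
            totals[key] += amount
        results.append(dict(totals))
    return results

shuffled = shuffle_by_key(partitions, 2)
totals = sum_within_partitions(shuffled)
print(totals)
```

Before the shuffle, `acct_a` appears in both input partitions, which is why a per-partition sum alone could not give the right total.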
Question 5:
49 of 55.
In the code block below, aggDF contains aggregations on a streaming DataFrame:
aggDF.writeStream \
    .format("console") \
    .outputMode("???") \
    .start()
Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?
A. AGGREGATE
B. REPLACE
C. COMPLETE
D. APPEND
Correct answer: C
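In complete mode, the whole updated result table is written to the sink after every trigger, not just the rows that changed. A rough way to picture this, using plain Python in place of a streaming query, is a running aggregate that is snapshotted in full after each micro-batch. The function `run_complete_mode` and the batches below are hypothetical.

```python
from collections import Counter

def run_complete_mode(batches):
    # Stand-in for outputMode("complete"): after every micro-batch,
    # emit the entire aggregated table.
    counts = Counter()
    emitted = []
    for batch in batches:
        counts.update(batch)
        emitted.append(dict(counts))  # full snapshot, including unchanged keys
    return emitted

# Hypothetical micro-batches of event keys.
outputs = run_complete_mode([["a", "b"], ["a"], ["c"]])
for i, table in enumerate(outputs):
    print(f"trigger {i}: {table}")
```

The key `b` is still present in the second and third snapshots even though no new `b` events arrived, which is the behaviour the question asks for.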
Question 6:
37 of 55.
A data scientist is working with a Spark DataFrame called customerDF that contains customer information.
The DataFrame has a column named email with customer email addresses.
The data scientist needs to split this column into username and domain parts.
Which code snippet splits the email column into username and domain columns?
A. customerDF = customerDF.select("email").alias("username", "domain")
B. customerDF = customerDF.withColumn("domain", col("email").split("@")[1])
C. customerDF = customerDF.withColumn("username", regexp_replace(col("email"), "@", ""))
D. customerDF = customerDF \
       .withColumn("username", split(col("email"), "@").getItem(0)) \
       .withColumn("domain", split(col("email"), "@").getItem(1))
Correct answer: D
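Option D splits each email at "@" and takes element 0 as the username and element 1 as the domain. The per-value logic can be sketched in plain Python as below. This is a stand-in for what happens to a single value, not PySpark code; the helper `split_email`, the sample address, and the choice to return `None` for a missing domain are assumptions of the sketch.

```python
def split_email(email):
    # Mirrors split(col("email"), "@") followed by getItem(0) and getItem(1)
    # for a single value.
    parts = email.split("@")
    username = parts[0]
    # Returning None when there is no second part is this sketch's own choice.
    domain = parts[1] if len(parts) > 1 else None
    return username, domain

username, domain = split_email("ana@example.com")
print(username, domain)
```

Option C, by comparison, only deletes the "@" character and never produces a separate domain value.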