Question 1:
Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?
A. pandas API on Spark DataFrames are more performant than Spark DataFrames
B. pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata
C. pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata
D. pandas API on Spark DataFrames are less mutable versions of Spark DataFrames
Correct answer: C
Question 2:
Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?
A. pandas
B. Spark ML
C. PyTorch
D. Scikit-learn
E. Keras
Correct answer: B
Question 3:
Which of the following Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use?
A. TrainValidationSplit
B. TrainValidationSplitModel
C. DataFrame.where
D. DataFrame.randomSplit
E. CrossValidator
Correct answer: D
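In PySpark the split itself is a single call, `train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)`. The sketch below is a plain-Python analogy of the idea rather than Spark's implementation: each row gets one random draw and lands in the split whose cumulative weight range contains that draw. The function name `random_split` and the sample data are assumptions made for illustration.

```python
import random


def random_split(rows, weights, seed=42):
    """Assign each row to exactly one split using a single uniform draw per row."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.8, 1.0].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))  # stand-in for DataFrame rows
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))
```

Because assignment is random per row, the split sizes only approximate the requested weights; fixing the seed makes the split repeatable on the same input.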
Question 4:
A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult.
Which of the following describes why?
A. Gradient boosting calculates gradients in evaluation metrics using all cores which prevents parallelization.
B. Gradient boosting is not a linear algebra-based algorithm, which is required for parallelization.
C. Gradient boosting requires access to all data at once which cannot happen during parallelization.
D. Gradient boosting is an iterative algorithm that requires information from the previous iteration to perform the next step.
Correct answer: D
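The sequential dependency can be seen in a minimal sketch of gradient boosting with squared-error loss, where the negative gradient is simply the residual. This is a toy illustration using one-feature decision stumps, not Spark ML's implementation; the helper names `fit_stump` and `boost` and the sample data are assumptions.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump on a single feature by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Train trees one after another; each tree is fit to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    trees, losses = [], []
    for _ in range(n_trees):
        # The residuals depend on the predictions of ALL earlier trees,
        # so tree t cannot be trained before tree t-1 has finished.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, trees, losses


xs = list(range(10))
ys = [x ** 2 for x in xs]
base, trees, losses = boost(xs, ys)
print(losses[0], losses[-1])
```

The loop over trees is inherently serial. Work inside a single iteration, such as searching candidate splits, can still be parallelized, which is why distributed boosting is hard but not impossible.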
Question 5:
A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.
The Spark DataFrame train_df has the following schema:

The machine learning engineer shares the following code block:

Which of the following changes does the machine learning engineer need to make to complete the task?
A. They do not need to make any changes
B. They need to utilize a Pipeline to fit the model
C. They need to call the transform method on train_df
D. They need to convert the features column to be a vector
E. They need to split the features column out into one column for each feature
Correct answer: D
Question 6:
A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.
Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?
A. Reproducibility is achievable when using a train-validation split
B. Fewer models need to be trained when using a train-validation split
C. Bias is avoidable when using a train-validation split
D. A holdout set is not necessary when using a train-validation split
E. Fewer hyperparameter values need to be tested when using a train-validation split
Correct answer: B
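The difference in training cost is simple arithmetic, sketched below. The hyperparameter grid is hypothetical, and the counts cover only the fits made during the search; a final refit of the best model on the full training data would add one more fit in either approach.

```python
from itertools import product

# Hypothetical hyperparameter grid: 3 x 4 = 12 combinations.
param_grid = {
    "maxDepth": [2, 4, 6],
    "numTrees": [10, 20, 50, 100],
}
combos = list(product(*param_grid.values()))


def fits_cross_validation(n_combos, k):
    """k-fold cross-validation trains one model per fold for every combination."""
    return n_combos * k


def fits_train_validation_split(n_combos):
    """A single train-validation split trains one model per combination."""
    return n_combos


cv_fits = fits_cross_validation(len(combos), k=5)
tvs_fits = fits_train_validation_split(len(combos))
print(cv_fits, tvs_fits)  # 60 12
```

The same number of hyperparameter values is tested either way; what changes is how many models are trained per value. The trade-off is that a single split yields a noisier estimate of each configuration's performance.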
Ichikawa -
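The Databricks-Machine-Learning-Associate question set is easy to read and comes with clear explanations. I took the exam feeling that this would be enough to pass, and I really did pass. Amazing.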