Question 1:
You want to use a BigQuery table as a data sink. In which writing mode(s) can you use BigQuery as a sink?
A. Only streaming
B. BigQuery cannot be used as a sink
C. Only batch
D. Both batch and streaming
Answer: D
Explanation: (Visible to Pass4Test members only)
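A minimal Apache Beam (Python SDK) sketch of answer D; the table name and schema are placeholder assumptions. The same WriteToBigQuery transform serves both modes: batch pipelines typically use FILE_LOADS, while streaming pipelines can use STREAMING_INSERTS.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder table reference and schema.
    TABLE = "my-project:analytics.events"

    with beam.Pipeline(options=PipelineOptions()) as p:
        rows = p | "Create" >> beam.Create([{"user_id": "u1", "clicks": 3}])
        rows | "WriteToBQ" >> beam.io.WriteToBigQuery(
            TABLE,
            schema="user_id:STRING,clicks:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            # Swap to Method.STREAMING_INSERTS in a streaming pipeline.
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        )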
Question 2:
What are the minimum permissions needed for a service account used with Google Dataproc?
A. Read and write to Google Cloud Storage; write to Google Cloud Logging
B. Execute to Google Cloud Storage; write to Google Cloud Logging
C. Execute to Google Cloud Storage; execute to Google Cloud Logging
D. Write to Google Cloud Storage; read to Google Cloud Logging
Answer: A
Explanation: (Visible to Pass4Test members only)
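A hedged sketch for answer A using the Cloud Storage Python client; the bucket name is a placeholder. Run with the Dataproc service account's credentials, testIamPermissions returns the subset of permissions the caller actually holds. The Cloud Logging write side (roles/logging.logWriter) is granted through IAM and is not checked here.

    from google.cloud import storage

    # Placeholder staging bucket used by the Dataproc cluster.
    bucket = storage.Client().bucket("my-dataproc-staging-bucket")

    # GCS read/write permissions the cluster needs.
    needed = ["storage.objects.get", "storage.objects.list", "storage.objects.create"]
    granted = bucket.test_iam_permissions(needed)
    missing = set(needed) - set(granted)
    print("Missing GCS permissions:", missing or "none")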
Question 3:
You maintain ETL pipelines. You notice that a streaming pipeline running on Dataflow is taking a long time to process incoming data, which causes output delays. You also noticed that the pipeline graph was automatically optimized by Dataflow and merged into one step. You want to identify where the potential bottleneck is occurring. What should you do?
A. Insert a Reshuffle operation after each processing step, and monitor the execution details in the Dataflow console.
B. Verify that the Dataflow service accounts have appropriate permissions to write the processed data to the output sinks.
C. Insert output sinks after each key processing step, and observe the writing throughput of each block.
D. Log debug information in each ParDo function, and analyze the logs at execution time.
Answer: A
Explanation: (Visible to Pass4Test members only)
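A minimal Beam (Python SDK) sketch of answer A; the step bodies are placeholders. beam.Reshuffle() acts as a fusion break, so the steps that Dataflow would otherwise merge appear as separate stages in the console's execution details, making the slow one visible.

    import apache_beam as beam

    def heavy_step(element):
        # Stand-in for an expensive ParDo/Map step.
        return element

    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.Create(range(100))
            | "StepA" >> beam.Map(heavy_step)
            # Fusion break: StepA and StepB now show up as separate stages.
            | "BreakFusion" >> beam.Reshuffle()
            | "StepB" >> beam.Map(heavy_step)
        )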
Question 4:
You have historical data covering the last three years in BigQuery and a data pipeline that delivers new data to BigQuery daily. You have noticed that when the Data Science team runs a query filtered on a date column and limited to 30-90 days of data, the query scans the entire table. You also noticed that your bill is increasing more quickly than you expected. You want to resolve the issue as cost-effectively as possible while maintaining the ability to conduct SQL queries. What should you do?
A. Recommend that the Data Science team export the table to a CSV file on Cloud Storage and use Cloud Datalab to explore the data by reading the files directly.
B. Modify your pipeline to maintain the last 30-90 days of data in one table and the longer history in a different table to minimize full table scans over the entire history.
C. Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP or DATE type.
D. Write an Apache Beam pipeline that creates a BigQuery table per day. Recommend that the Data Science team use wildcards on the table name suffixes to select the data they need.
Answer: C
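A sketch of answer C using the BigQuery Python client; project, dataset, table, and column names are assumptions. Once the table is partitioned on the date column, a 30-90 day filter scans only the matching partitions instead of the full three years.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder names; DDL re-creates the table partitioned by date.
    ddl = """
    CREATE TABLE `my-project.analytics.events_partitioned`
    PARTITION BY DATE(event_ts) AS
    SELECT * FROM `my-project.analytics.events`
    """
    client.query(ddl).result()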
Question 5:
You work for a large ecommerce company. You are using Pub/Sub to ingest clickstream data into Google Cloud for analytics. You observe that when a new subscriber connects to an existing topic to analyze data, they are unable to read older data. For an upcoming yearly sale event in two months, you need a solution that, once implemented, will enable any new subscriber to read the last 30 days of data. What should you do?
A. Ask the source system to re-push the data to Pub/Sub, and subscribe to it.
B. Set the topic retention policy to 30 days.
C. Set the subscriber retention policy to 30 days.
D. Create a new topic, and publish the last 30 days of data each time a new subscriber connects to an existing topic.
Answer: B
Explanation: (Visible to Pass4Test members only)
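A sketch of answer B with the Pub/Sub Python client; project and topic names are placeholders. Setting message_retention_duration on the topic keeps published messages for 30 days, and a new subscription can then seek back to any timestamp inside that window.

    from google.cloud import pubsub_v1
    from google.protobuf import duration_pb2, field_mask_pb2

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "clickstream")  # placeholders

    # Retain messages on the topic itself for 30 days.
    topic = pubsub_v1.types.Topic(
        name=topic_path,
        message_retention_duration=duration_pb2.Duration(seconds=30 * 24 * 60 * 60),
    )
    publisher.update_topic(
        request={
            "topic": topic,
            "update_mask": field_mask_pb2.FieldMask(paths=["message_retention_duration"]),
        }
    )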
Question 6:
You use a dataset in BigQuery for analysis. You want to provide third-party companies with access to the same dataset. You need to keep the costs of data sharing low and ensure that the data is current. What should you do?
A. Create a separate dataset in BigQuery that contains the relevant data to share, and provide third-party companies with access to the new dataset.
B. Use Analytics Hub to control data access, and provide third-party companies with access to the dataset.
C. Create a Dataflow job that reads the data in frequent time intervals and writes it to the relevant BigQuery dataset or Cloud Storage bucket for third-party companies to use.
D. Use Cloud Scheduler to export the data on a regular basis to Cloud Storage, and provide third-party companies with access to the bucket.
Answer: B
Explanation: (Visible to Pass4Test members only)
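A hedged sketch of answer B, assuming the bigquery_analyticshub_v1 Python client; every resource name here is a placeholder. A listing points at the live dataset, so subscribers always see current data without any copy pipeline, and their queries are billed to them rather than to you.

    from google.cloud import bigquery_analyticshub_v1

    client = bigquery_analyticshub_v1.AnalyticsHubServiceClient()

    # Placeholder project/location; create an exchange, then list the dataset.
    exchange = client.create_data_exchange(
        request={
            "parent": "projects/my-project/locations/us",
            "data_exchange_id": "partner_exchange",
            "data_exchange": {"display_name": "Partner exchange"},
        }
    )
    client.create_listing(
        request={
            "parent": exchange.name,
            "listing_id": "shared_dataset",
            "listing": {
                "display_name": "Shared dataset",
                "bigquery_dataset": {
                    "dataset": "projects/my-project/datasets/shared_dataset"
                },
            },
        }
    )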
Question 7:
Business owners at your company have given you a database of bank transactions. Each row contains the user ID, transaction type, transaction location, and transaction amount. They ask you to investigate what type of machine learning can be applied to the data. Which three machine learning applications can you use? (Choose three.)
A. Unsupervised learning to determine which transactions are most likely to be fraudulent.
B. Clustering to divide the transactions into N categories based on feature similarity.
C. Reinforcement learning to predict the location of a transaction.
D. Supervised learning to predict the location of a transaction.
E. Supervised learning to determine which transactions are most likely to be fraudulent.
F. Unsupervised learning to predict the location of a transaction.
Answer: A, B, D
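A toy scikit-learn sketch of the three viable applications (A, B, D); the feature matrix is invented for illustration.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import IsolationForest

    # Toy rows: [amount, location_id, type_id].
    X = np.array([[120.0, 3, 1], [15.5, 1, 2], [9800.0, 7, 1], [42.0, 3, 2]])

    # B: clustering into N categories by feature similarity.
    categories = KMeans(n_clusters=2, n_init=10).fit_predict(X)

    # A: unsupervised anomaly detection flags likely fraud without labels.
    outliers = IsolationForest(random_state=0).fit_predict(X)  # -1 = anomaly

    # D: location appears in every row, so it can serve as the label for a
    # supervised classifier trained on the remaining columns.
    print(categories, outliers)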
Question 8:
Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day's events. They also want to use streaming ingestion. What should you do?
A. Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.
B. Create a table called tracking_table with a TIMESTAMP column to represent the day.
C. Create a partitioned table called tracking_table and include a TIMESTAMP column.
D. Create a table called tracking_table and include a DATE column.
Answer: C
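A sketch of answer C using the BigQuery Python client; names and schema are assumptions. A single tracking_table partitioned by day keeps each day's analysis cheap, and streaming inserts land in the correct partition automatically.

    from google.cloud import bigquery

    client = bigquery.Client()

    table = bigquery.Table(
        "my-project.mjtelco.tracking_table",  # placeholder name
        schema=[
            bigquery.SchemaField("event_ts", "TIMESTAMP"),
            bigquery.SchemaField("record", "STRING"),
        ],
    )
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="event_ts"
    )
    client.create_table(table)

    # Streaming ingestion into the partitioned table.
    errors = client.insert_rows_json(
        "my-project.mjtelco.tracking_table",
        [{"event_ts": "2024-01-15T12:00:00Z", "record": "sample"}],
    )
    print(errors or "streamed OK")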
Question 9:
You need to look at BigQuery data from a specific table multiple times a day. The underlying table you are querying is several petabytes in size, but you want to filter your data and provide simple aggregations to downstream users. You want to run queries faster and get up-to-date insights quicker. What should you do?
A. Create a materialized view based off of the query being run.
B. Use a cached query to accelerate time to results.
C. Run a scheduled query to pull the necessary data at specific intervals daily.
D. Limit the query columns being pulled in the final result.
Answer: A
Explanation: (Visible to Pass4Test members only)
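A sketch of answer A using the BigQuery Python client; table and column names are assumptions. BigQuery refreshes the materialized view incrementally, so the precomputed filter and aggregation stay fast and current for downstream users.

    from google.cloud import bigquery

    client = bigquery.Client()

    ddl = """
    CREATE MATERIALIZED VIEW `my-project.analytics.txn_summary` AS
    SELECT user_id, COUNT(*) AS txn_count, SUM(amount) AS total_amount
    FROM `my-project.analytics.transactions`
    WHERE amount > 0
    GROUP BY user_id
    """
    client.query(ddl).result()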