You are analyzing customer transaction data in Snowflake to identify fraudulent activities. The 'TRANSACTION_AMOUNT' column exhibits a right-skewed distribution. Which of the following Snowflake queries is MOST effective in identifying outliers based on the Interquartile Range (IQR) method, specifically targeting unusually large transaction amounts? Assume the IQR has already been calculated, along with Q1 and Q3, and that they are available as the variables iqr, q1, and q3 in the Snowflake session.
A. SELECT TRANSACTION_ID FROM TRANSACTIONS WHERE TRANSACTION_AMOUNT < q1 - (1.5 * iqr);
B. SELECT TRANSACTION_ID FROM TRANSACTIONS WHERE TRANSACTION_AMOUNT > q3 + (1.5 * iqr);
C. SELECT TRANSACTION_ID FROM TRANSACTIONS WHERE TRANSACTION_AMOUNT > (SELECT + 3 FROM TRANSACTIONS);
D. SELECT TRANSACTION_ID FROM TRANSACTIONS WHERE TRANSACTION_AMOUNT > (SELECT MEDIAN(TRANSACTION_AMOUNT) FROM TRANSACTIONS);
E. SELECT TRANSACTION_ID FROM TRANSACTIONS WHERE TRANSACTION_AMOUNT > (SELECT WITHIN GROUP (ORDER BY TRANSACTION_AMOUNT) FROM TRANSACTIONS);
Correct answer: B
Explanation: (only visible to Pass4Test members)
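For reference, a minimal sketch of the IQR rule option B applies, run against the TRANSACTIONS table from the question. It computes Q1 and Q3 inline with PERCENTILE_CONT rather than reading pre-set session variables, and it assumes an active Snowpark `session`:

```python
# Sketch only: flag transactions above Q3 + 1.5 * IQR (unusually large amounts).
outlier_sql = """
WITH stats AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY TRANSACTION_AMOUNT) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY TRANSACTION_AMOUNT) AS q3
    FROM TRANSACTIONS
)
SELECT t.TRANSACTION_ID, t.TRANSACTION_AMOUNT
FROM TRANSACTIONS t, stats s
WHERE t.TRANSACTION_AMOUNT > s.q3 + 1.5 * (s.q3 - s.q1)
"""
outliers = session.sql(outlier_sql).collect()  # session: assumed active Snowpark session
```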
Question 2:
A data scientist is building a churn prediction model using Snowflake data. They want to load a large dataset (50 million rows) from a Snowflake table 'customer_data' into a Pandas DataFrame for feature engineering. They are using the Snowflake Python connector. Given the code snippet below and considering performance and memory usage, which approach would be the most efficient for loading the data into the Pandas DataFrame? Assume you have a properly configured connection and cursor 'cur'. Furthermore, assume that the 'customer_id' column is the primary key and uniquely identifies each customer. You are also aware that network bandwidth limitations exist within your environment.

```python
import snowflake.connector
import pandas as pd

# Assume conn and cur are already initialized
# conn = snowflake.connector.connect(...)
# cur = conn.cursor()

query = "SELECT * FROM customer_data"
```
A.
```python
cur.execute(query)
df = pd.read_sql(query, conn)
```
B.
```python
import snowflake.connector
import pandas as pd
import pyarrow
import pyarrow.parquet

# Enable Arrow result format
conn.cursor().execute("ALTER SESSION SET PYTHON_USE_ARROW_RESULT_FORMAT=TRUE")
cur.execute(query)
df = cur.fetch_pandas_all()
```
C.
```python
cur.execute(query)
results = cur.fetchmany(size=1000000)
df_list = []
while results:
    df_list.append(pd.DataFrame(results, columns=[col[0] for col in cur.description]))
    results = cur.fetchmany(size=1000000)
df = pd.concat(df_list, ignore_index=True)
```
D.
```python
cur.execute(query)
df = pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])
```
E.
```python
with conn.cursor(snowflake.connector.DictCursor) as cur:
    cur.execute(query)
    df = pd.DataFrame(cur.fetchall())
```
Correct answer: B
Explanation: (only visible to Pass4Test members)
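A minimal sketch of the Arrow-based loading path the correct answer points at, reusing the `conn` and `query` from the question: `fetch_pandas_all()` returns one DataFrame, while `fetch_pandas_batches()` keeps memory bounded by converting one Arrow result chunk at a time.

```python
import pandas as pd

cur = conn.cursor()
cur.execute(query)

# Whole result at once (needs enough memory for all 50M rows as a DataFrame)
df = cur.fetch_pandas_all()

# Or: bounded memory, concatenating one Arrow result batch at a time
cur.execute(query)
df = pd.concat(cur.fetch_pandas_batches(), ignore_index=True)
```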
Question 3:
A data scientist is tasked with creating features for a machine learning model predicting customer churn. They have access to a Snowflake table with the columns 'CUSTOMER_ID', 'DATE', 'ACTIVITY_TYPE' (e.g., 'login', 'purchase', 'support_ticket'), and 'ACTIVITY_VALUE' (e.g., amount spent, duration of login). Which of the following feature engineering strategies, leveraging Snowflake's capabilities, could be useful for predicting customer churn? (Select all that apply)
A. Directly use the ACTIVITY_TYPE column as a categorical feature without any transformation or engineering.
B. Create a feature representing the number of days since the customer's last login using DATEDIFF and window functions.
C. Use APPROX_COUNT_DISTINCT to estimate the number of unique product categories purchased by each customer within the last 3 months to create a feature.
D. Create features that capture the trend of customer activity over time (e.g., increasing or decreasing activity) using LAG and LEAD window functions.
E. Calculate the recency, frequency, and monetary value (RFM) for each customer using window functions and aggregate functions.
Correct answers: B, C, D, E
Explanation: (only visible to Pass4Test members)
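As an illustration of option B, a minimal Snowpark sketch. The activity table name CUSTOMER_ACTIVITY is an assumption (the question does not name the table), and an active `session` is assumed:

```python
from snowflake.snowpark import functions as F

activity_df = session.table("CUSTOMER_ACTIVITY")  # assumed table name

# Days since each customer's last login: take the most recent 'login' row
# per customer, then DATEDIFF against the current date.
days_since_last_login = (
    activity_df
    .filter(F.col("ACTIVITY_TYPE") == "login")
    .group_by("CUSTOMER_ID")
    .agg(F.max("DATE").alias("LAST_LOGIN"))
    .select(
        F.col("CUSTOMER_ID"),
        F.datediff("day", F.col("LAST_LOGIN"), F.current_date()).alias("DAYS_SINCE_LAST_LOGIN"),
    )
)
```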
Question 4:
You are tasked with preparing a Snowflake table named 'PRODUCT_REVIEWS' for sentiment analysis. This table contains columns like 'REVIEW_ID', 'PRODUCT_ID', 'REVIEW_TEXT', 'RATING', and 'TIMESTAMP'. Your goal is to remove irrelevant fields to optimize model training. Which of the following options represent valid and effective strategies, using Snowpark SQL, for identifying and removing irrelevant or problematic fields from the 'PRODUCT_REVIEWS' table, considering both storage efficiency and model accuracy? Assume that the model only needs REVIEW_ID, REVIEW_TEXT, and RATING.
A. Dropping rows with NULL values in REVIEW_TEXT and then dropping the PRODUCT_ID and TIMESTAMP columns using ALTER TABLE. SQL: CREATE OR REPLACE TABLE PRODUCT_REVIEWS AS SELECT * FROM PRODUCT_REVIEWS WHERE REVIEW_TEXT IS NOT NULL; ALTER TABLE PRODUCT_REVIEWS DROP COLUMN PRODUCT_ID; ALTER TABLE PRODUCT_REVIEWS DROP COLUMN TIMESTAMP;
B. Using ALTER TABLE DROP COLUMN to directly remove the TIMESTAMP column, which is deemed irrelevant for the sentiment analysis model. SQL: ALTER TABLE PRODUCT_REVIEWS DROP COLUMN TIMESTAMP;
C. Creating a VIEW that only selects the REVIEW_TEXT, REVIEW_ID, and RATING columns, effectively hiding the irrelevant columns from the model. SQL: CREATE OR REPLACE VIEW REVIEWS_FOR_ANALYSIS AS SELECT REVIEW_TEXT, REVIEW_ID, RATING FROM PRODUCT_REVIEWS;
D. All of the above.
E. Creating a new table REVIEWS_CLEANED containing only the relevant columns (REVIEW_TEXT, REVIEW_ID, and RATING) using CREATE TABLE AS SELECT. SQL: CREATE OR REPLACE TABLE REVIEWS_CLEANED AS SELECT REVIEW_TEXT, REVIEW_ID, RATING FROM PRODUCT_REVIEWS;
Correct answer: D
Explanation: (only visible to Pass4Test members)
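A minimal sketch of the CTAS strategy from option E, run through Snowpark SQL and assuming an active `session`; the NULL filter from option A is folded in as well:

```python
# Materialize only the columns the model needs, dropping rows with no review text.
session.sql("""
    CREATE OR REPLACE TABLE REVIEWS_CLEANED AS
    SELECT REVIEW_ID, REVIEW_TEXT, RATING
    FROM PRODUCT_REVIEWS
    WHERE REVIEW_TEXT IS NOT NULL
""").collect()
```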
Question 5:
You are tasked with identifying fraudulent transactions from unstructured log data stored in Snowflake. The logs contain various fields, including timestamps, user IDs, and transaction details embedded within free-text descriptions. You plan to use a supervised learning approach, having labeled a subset of transactions as 'fraudulent' or 'not fraudulent.' Which of the following methods best describes the extraction and processing of this data for training a machine learning model within Snowflake?
A. Treat the unstructured log description as a categorical feature and directly apply one-hot encoding within Snowflake, then train a classification model. Due to high dimensionality perform PCA for dimensionality reduction before training.
B. Export the entire log data to an external machine learning platform (e.g., AWS SageMaker) and perform feature extraction, NLP processing, and model training there. Import the trained model back into Snowflake as a UDF for prediction.
C. Use a combination of regular expressions and natural language processing (NLP) techniques within Snowflake UDFs to extract key features such as transaction amounts, product categories, and sentiment scores from the log descriptions. Then, combine these extracted features with other structured data (e.g., user demographics) and train a classification model using these features. The NLP steps include tokenization, stop word removal, and TF-IDF vectorization.
D. Extract the entire log description field and train a word embedding model (e.g., Word2Vec) on the entire dataset. Average the word vectors for each transaction's log description to create a document vector. Train a classification model (e.g., Random Forest) on these document vectors within Snowflake.
E. Use regular expressions within a Snowflake UDF to extract relevant information (e.g., amount, item description) from the log descriptions. Convert extracted data into numerical features using one-hot encoding within the UDF. Then, train a model using the extracted numerical features directly within Snowflake using SQL extensions for machine learning.
Correct answer: C
Explanation: (only visible to Pass4Test members)
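A minimal sketch of the regex half of the chosen approach, assuming an active Snowpark `session`; the UDF name, the log format, and the dollar-amount pattern are illustrative only, and the tokenization/TF-IDF steps are omitted:

```python
import re
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType, StringType

# Register a Python UDF that pulls a dollar amount out of a free-text
# log description; returns NULL when no amount is present.
@udf(name="extract_txn_amount", input_types=[StringType()],
     return_type=FloatType(), replace=True, session=session)
def extract_txn_amount(description: str) -> float:
    match = re.search(r"\$([0-9]+(?:\.[0-9]+)?)", description or "")
    return float(match.group(1)) if match else None
```

The registered UDF can then be applied to the log description column with `with_column` and the extracted values combined with structured features (e.g., user demographics) before training.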
Question 6:
You are designing a feature engineering pipeline using the Snowpark Feature Store for a fraud detection model. You have a transaction table in Snowflake. One crucial feature is 'average_transaction_amount_last_7_days' for each customer. You want to implement this feature using Snowpark Python and materialize it in the Feature Store. You have a Snowpark DataFrame 'transactions_df' containing 'customer_id' and 'transaction_amount'. Which of the following code snippets correctly defines and registers this feature in the Snowpark Feature Store, ensuring efficient computation and storage?
A.

B.

C.

D.

E.

Correct answer: E
Explanation: (only visible to Pass4Test members)
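The option bodies are not reproduced here, but as a rough sketch of the feature computation itself (assuming a TRANSACTION_DATE column on transactions_df and an active `session`; the Feature Store registration step via snowflake.ml.feature_store is omitted):

```python
from snowflake.snowpark import functions as F

# Trailing-7-day average transaction amount per customer, computed as of today.
# A production feature view would typically express the window relative to a
# timestamp column rather than CURRENT_DATE.
feature_df = (
    transactions_df
    .filter(F.col("TRANSACTION_DATE") >= F.dateadd("day", F.lit(-7), F.current_date()))
    .group_by("CUSTOMER_ID")
    .agg(F.avg("TRANSACTION_AMOUNT").alias("AVERAGE_TRANSACTION_AMOUNT_LAST_7_DAYS"))
)
```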
Question 7:
A data scientist is analyzing website traffic data stored in Snowflake. The data includes daily page views for different pages. The data scientist suspects that the variance of page views for a particular page, 'home', has significantly increased recently. Which of the following steps and Snowflake SQL queries could be used to identify a potential change in the variance of 'home' page views over time (e.g., comparing variance before and after a specific date)? Select all that apply.

A. Option E
B. Option C
C. Option A
D. Option D
E. Option B
Correct answers: A, B, D, E
Explanation: (only visible to Pass4Test members)
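The option bodies are not reproduced here, but one way to frame the before/after comparison, as a hedged sketch with assumed table and column names ('daily_page_views', 'view_date', 'page_views') and an illustrative cutoff date:

```python
# Compare the variance of 'home' page views before vs. after a cutoff date
# using conditional aggregation; VARIANCE is Snowflake's sample variance.
session.sql("""
    SELECT
        VARIANCE(CASE WHEN view_date <  '2024-01-01' THEN page_views END) AS var_before,
        VARIANCE(CASE WHEN view_date >= '2024-01-01' THEN page_views END) AS var_after
    FROM daily_page_views
    WHERE page = 'home'
""").collect()
```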
Question 8:
You are building a machine learning model using Snowpark for Python and have a feature column called 'TRANSACTION_AMOUNT' in your 'transaction_df' DataFrame. This column contains some missing values (NULL). Your model is sensitive to missing data. You want to impute the missing values using the median 'TRANSACTION_AMOUNT', but ONLY for specific customer segments (e.g., customers with a 'CUSTOMER_TIER' of 'Gold' or 'Platinum'). For other customer tiers, you want to impute with the mean. Which of the following Snowpark Python code snippets BEST achieves this selective imputation?
A.

B.

C.

D.

E.

Correct answer: E
Explanation: (only visible to Pass4Test members)
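The option bodies are not reproduced here; a minimal sketch of one way to do the selective imputation in Snowpark, assuming global median/mean statistics (per-tier statistics would need a group_by and a join back to the DataFrame):

```python
from snowflake.snowpark import functions as F

# Compute the global median and mean of TRANSACTION_AMOUNT once.
stats = transaction_df.agg(
    F.median("TRANSACTION_AMOUNT").alias("MED_AMT"),
    F.avg("TRANSACTION_AMOUNT").alias("MEAN_AMT"),
).collect()[0]

# NULLs in Gold/Platinum rows get the median; NULLs elsewhere get the mean.
imputed_df = transaction_df.with_column(
    "TRANSACTION_AMOUNT",
    F.when(
        F.col("TRANSACTION_AMOUNT").is_null()
        & F.col("CUSTOMER_TIER").isin("Gold", "Platinum"),
        F.lit(stats["MED_AMT"]),
    )
    .when(F.col("TRANSACTION_AMOUNT").is_null(), F.lit(stats["MEAN_AMT"]))
    .otherwise(F.col("TRANSACTION_AMOUNT")),
)
```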
Question 9:
You're developing a model to predict customer churn using Snowflake. Your dataset is large and continuously growing. You need to implement partitioning strategies to optimize model training and inference performance. You consider the following partitioning strategies: 1. Partitioning by 'customer_segment' (e.g., 'High-Value', 'Medium-Value', 'Low-Value'). 2. Partitioning by 'signup_date' (e.g., monthly partitions). 3. Partitioning by 'region' (e.g., 'North America', 'Europe', 'Asia'). Which of the following statements accurately describe the potential benefits and drawbacks of these partitioning strategies within a Snowflake environment, specifically in the context of model training and inference?
A. Partitioning by 'signup_date' is ideal for capturing temporal dependencies in churn behavior and allows for easy retraining of models with the latest data. It also naturally aligns with a walk-forward validation approach. However, it might not be effective if churn drivers are independent of signup date.
B. Implementing partitioning requires modifying existing data loading pipelines and may introduce additional overhead in data management. If the cost of partitioning outweighs the performance gains, it's better to rely on Snowflake's built-in micro-partitioning alone. Also, data skew in partition keys is a major concern.
C. Partitioning by 'customer_segment' is beneficial if churn patterns are significantly different across segments, allowing for training separate models for each segment. However, if any segment has very few churned customers, it may lead to overfitting or unreliable models for that segment.
D. Partitioning by 'region' is useful if churn is heavily influenced by geographic factors (e.g., local market conditions). It can improve query performance during both training and inference when filtering by region. However, it can create data silos, making it difficult to build a global churn model that considers interactions across regions. Furthermore, the 'region' column must have low cardinality.
E. Using clustering in Snowflake on top of partitioning will always improve query performance significantly and reduce compute costs irrespective of query patterns.
Correct answers: A, B, C, D
Explanation: (only visible to Pass4Test members)
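For context on options B and E: Snowflake has no user-declared table partitions, so the closest lever on top of automatic micro-partitioning is a clustering key. A hedged sketch, with an assumed table name:

```python
# Cluster the (assumed) feature table on the signup month so that time-sliced
# training and walk-forward retraining queries can prune micro-partitions.
session.sql(
    "ALTER TABLE customer_churn_features CLUSTER BY (DATE_TRUNC('month', signup_date))"
).collect()
```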