
Databricks-Machine-Learning-Associate Questions and Answers

Question # 6

Which statement describes a Spark ML transformer?

A.

A transformer is an algorithm which can transform one DataFrame into another DataFrame

B.

A transformer is a hyperparameter grid that can be used to train a model

C.

A transformer chains multiple algorithms together to transform an ML workflow

D.

A transformer is a learning algorithm that can use a DataFrame to train a model

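For orientation, a transformer in Spark ML exposes a transform() method that maps one DataFrame to another without learning from the data. A minimal sketch using the built-in Binarizer (the column name and threshold are illustrative):

from pyspark.ml.feature import Binarizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.1,), (0.8,), (0.5,)], ["score"])

# A transformer converts one DataFrame into another via transform();
# unlike an estimator, it has no fit() step.
binarizer = Binarizer(threshold=0.5, inputCol="score", outputCol="label")
binarizer.transform(df).show()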
Question # 7

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

A.

import pyspark.pandas as ps

df = ps.DataFrame(spark_df)

B.

import pyspark.pandas as ps

df = ps.to_pandas(spark_df)

C.

spark_df.to_pandas()

D.

import pandas as pd

df = pd.DataFrame(spark_df)

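For reference, wrapping an existing Spark DataFrame so the pandas API on Spark can be used typically looks like the sketch below (assuming spark_df already exists in the notebook):

import pyspark.pandas as ps

# Wrap the Spark DataFrame in a pandas-on-Spark DataFrame; pandas-style
# methods now run distributed on the cluster.
df = ps.DataFrame(spark_df)

# On Spark 3.2+, the DataFrame method pandas_api() achieves the same result.
df = spark_df.pandas_api()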
Question # 8

A machine learning engineer is trying to scale a machine learning pipeline by distributing its single-node model tuning process. After broadcasting the entire training data onto each core, each core in the cluster can train one model at a time. Because the tuning process is still running slowly, the engineer wants to increase the level of parallelism from 4 cores to 8 cores to speed up the tuning process. Unfortunately, the total memory in the cluster cannot be increased.

In which of the following scenarios will increasing the level of parallelism from 4 to 8 speed up the tuning process?

A.

When the tuning process is randomized

B.

When the entire data can fit on each core

C.

When the model is unable to be parallelized

D.

When the data is particularly long in shape

E.

When the data is particularly wide in shape

Question # 9

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

A.

One-hot encoding is not supported by most machine learning libraries.

B.

One-hot encoding is dependent on the target variable's values which differ for each application.

C.

One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.

D.

One-hot encoding is not a common strategy for representing categorical feature variables numerically.

E.

One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

Question # 10

A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.

In which situation will the machine learning engineer be correct?

A.

When the new solution requires if-else logic determining which model to use to compute each prediction

B.

When the new solution's models have an average latency that is larger than the latency of the original model

C.

When the new solution requires the use of fewer feature variables than the original model

D.

When the new solution requires that each model computes a prediction for every record

E.

When the new solution's models have an average size that is larger than the size of the original model

Question # 11

A data scientist wants to explore summary statistics for the Spark DataFrame spark_df. They want to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature.

Which of the following lines of code can the data scientist run to accomplish the task?

A.

spark_df.summary()

B.

spark_df.stats()

C.

spark_df.describe().head()

D.

spark_df.printSchema()

E.

spark_df.toPandas()

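As a point of comparison, summary() reports count, mean, stddev, min, the 25%/50%/75% percentiles (from which the IQR can be derived), and max, while describe() omits the percentiles. A quick sketch:

# Includes approximate quartiles in addition to the basic statistics
spark_df.summary().show()

# Reports only count, mean, stddev, min, and max
spark_df.describe().show()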
Question # 12

A data scientist is utilizing MLflow Autologging to automatically track their machine learning experiments. After completing a series of runs for the experiment experiment_id, the data scientist wants to identify the run_id of the run with the best root-mean-square error (RMSE).

Which of the following lines of code can be used to identify the run_id of the run with the best RMSE in experiment_id?

Options A through D are code snippets provided as images in the original source and are not reproduced in this extract.

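Since the option images are not reproduced above, here is a hedged sketch of how the best-RMSE run is commonly identified with the MLflow search API (the metric key metrics.rmse is an assumption about what autologging recorded):

import mlflow

# search_runs returns a pandas DataFrame of runs; sorting by RMSE ascending
# puts the best run first.
runs = mlflow.search_runs(
    experiment_ids=[experiment_id],
    order_by=["metrics.rmse ASC"],
)
best_run_id = runs["run_id"].iloc[0]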
Question # 13

A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically.

Which of the following lines of code will return the metadata description?

A.

There is no way to return the metadata description programmatically.

B.

fs.create_training_set("new_table")

C.

fs.get_table("new_table").description

D.

fs.get_table("new_table").load_df()

E.

fs.get_table("new_table")

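For reference, a sketch of retrieving a feature table's metadata with the Feature Store client (fs is assumed to be an existing FeatureStoreClient, as stated in the question):

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# get_table returns the feature table's metadata object; its description
# attribute holds the text supplied when the table was created.
print(fs.get_table("new_table").description)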
Question # 14

A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:

They have written the following incomplete code block to use predict to score each record of the Spark DataFrame spark_df:

Which of the following lines of code can be used to complete the code block to successfully complete the task?

A.

predict(*spark_df.columns)

B.

mapInPandas(predict)

C.

predict(Iterator(spark_df))

D.

mapInPandas(predict(spark_df.columns))

E.

predict(spark_df.columns)

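The incomplete code block is not reproduced here; as a rough sketch of the pattern being completed, a Pandas function is applied batch-wise to a Spark DataFrame with mapInPandas, where the schema string describes the columns the function returns (predict's output and the schema below are illustrative assumptions):

# Apply predict to each batch of spark_df; schema describes predict's output
preds_df = spark_df.mapInPandas(predict, schema="id long, prediction double")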
Question # 15

A data scientist has produced two models for a single machine learning problem. One of the models performs well when one of the features has a value of less than 5, and the other model performs well when the value of that feature is greater than or equal to 5. The data scientist decides to combine the two models into a single machine learning solution.

Which of the following terms is used to describe this combination of models?

A.

Bootstrap aggregation

B.

Support vector machines

C.

Bucketing

D.

Ensemble learning

E.

Stacking

Question # 16

Which of the following approaches can be used to view the notebook that was run to create an MLflow run?

A.

Open the MLmodel artifact in the MLflow run page

B.

Click the "Models" link in the row corresponding to the run in the MLflow experiment paqe

C.

Click the "Source" link in the row corresponding to the run in the MLflow experiment page

D.

Click the "Start Time" link in the row corresponding to the run in the MLflow experiment page

Question # 17

A data scientist wants to use Spark ML to impute missing values in their PySpark DataFrame features_df. They want to replace missing values in all numeric columns in features_df with each respective numeric column’s median value.

They have developed the following code block to accomplish this task:

The code block is not accomplishing the task.

Which of the following reasons describes why the code block is not accomplishing the imputation task?

A.

It does not impute both the training and test data sets.

B.

The inputCols and outputCols need to be exactly the same.

C.

The fit method needs to be called instead of transform.

D.

It does not fit the imputer on the data to create an ImputerModel.

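The data scientist's code block is not shown above; for contrast, a working median imputation fits the Imputer to obtain an ImputerModel before transforming, roughly as sketched below (the column selection and output column names are illustrative):

from pyspark.ml.feature import Imputer

# Select the double/float columns to impute
numeric_cols = [f.name for f in features_df.schema.fields
                if f.dataType.typeName() in ("double", "float")]

imputer = Imputer(
    strategy="median",
    inputCols=numeric_cols,
    outputCols=[f"{c}_imputed" for c in numeric_cols],
)

# fit() learns the medians and returns an ImputerModel; only the fitted
# model can transform the DataFrame.
imputer_model = imputer.fit(features_df)
imputed_df = imputer_model.transform(features_df)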
Question # 18

A data scientist has developed a random forest regressor rfr and included it as the final stage in a Spark ML Pipeline pipeline. They then set up a cross-validation process with pipeline as the estimator in the following code block:

Which of the following is a negative consequence of including pipeline as the estimator in the cross-validation process rather than rfr as the estimator?

A.

The process will have a longer runtime because all stages of pipeline need to be refit or retransformed with each model

B.

The process will leak data from the training set to the test set during the evaluation phase

C.

The process will be unable to parallelize tuning due to the distributed nature of pipeline

D.

The process will leak data prep information from the validation sets to the training sets for each model

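The code block referenced in the question is not reproduced here; the setup it describes generally resembles the sketch below (the parameter grid, evaluator, label column, and training DataFrame are illustrative assumptions):

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

param_grid = (ParamGridBuilder()
              .addGrid(rfr.numTrees, [10, 50])
              .build())

# With pipeline as the estimator, every pipeline stage is refit (or
# retransformed) for each parameter combination and each fold.
cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=RegressionEvaluator(labelCol="price"),
    numFolds=3,
)
cv_model = cv.fit(train_df)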
Question # 19

A machine learning engineer is using the following code block to scale the inference of a single-node model on a Spark DataFrame with one million records:

Assuming the default Spark configuration is in place, which of the following is a benefit of using an Iterator?

A.

The data will be limited to a single executor preventing the model from being loaded multiple times

B.

The model will be limited to a single executor preventing the data from being distributed

C.

The model only needs to be loaded once per executor rather than once per batch during the inference process

D.

The data will be distributed across multiple executors during the inference process

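The engineer's code block is not shown above; the Iterator-of-Series pandas UDF pattern the question alludes to roughly looks like this sketch (the model URI and feature column are placeholders), with the expensive model load happening once for the whole iterator of batches rather than once per batch:

from typing import Iterator
import pandas as pd
import mlflow
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def predict(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the model once, then reuse it for every batch in the iterator
    model = mlflow.sklearn.load_model("models:/my_model/1")  # placeholder URI
    for batch in batches:
        yield pd.Series(model.predict(batch.to_frame()))

preds_df = spark_df.withColumn("prediction", predict("feature"))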
Question # 20

A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df.

They have written the following incomplete code block:

Which of the following pieces of code can be used to fill in the above blank to complete the task?

A.

applyInPandas

B.

mapInPandas

C.

predict

D.

train_model

E.

groupedApplyIn

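The incomplete code block is not reproduced here; the grouped-map pattern the question targets generally looks like the sketch below (the grouping column and output schema are illustrative assumptions):

# applyInPandas calls train_model once per group, passing that group's rows
# to it as a pandas DataFrame; the schema describes train_model's output.
models_df = (
    df.groupBy("device_id")
      .applyInPandas(train_model, schema="device_id long, model_path string")
)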
Question # 21

A data scientist has written a feature engineering notebook that utilizes the pandas library. As the size of the data processed by the notebook increases, the notebook's runtime increases drastically.

Which of the following tools can the data scientist use to spend the least amount of time refactoring their notebook to scale with big data?

A.

PySpark DataFrame API

B.

pandas API on Spark

C.

Spark SQL

D.

Feature Store

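To illustrate why the refactoring effort can be small, moving from pandas to the pandas API on Spark is often little more than an import change, roughly as sketched below (the file path and column names are placeholders):

# Before: single-node pandas
# import pandas as pd
# df = pd.read_csv("/path/to/data.csv")

# After: pandas API on Spark keeps the pandas-style syntax but runs distributed
import pyspark.pandas as ps

df = ps.read_csv("/path/to/data.csv")
df["ratio"] = df["a"] / df["b"]  # placeholder pandas-style feature engineering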
Question # 22

A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.

Which of the following approaches can the data scientist use to accomplish this MLflow run organization?

A.

They can turn on Databricks Autologging

B.

They can specify nested=True when starting the child run for each unique combination of hyperparameter values

C.

They can start each child run inside the parent run's indented code block using mlflow.start_run()

D.

They can start each child run with the same experiment ID as the parent run

E.

They can specify nested=True when starting the parent run for the tuning process

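For reference, the parent/child layout described above is typically produced with nested runs, roughly as sketched below (the hyperparameter values and logged parameters are illustrative):

import mlflow

with mlflow.start_run(run_name="tuning") as parent_run:
    for max_depth in [3, 5, 10]:
        # nested=True attaches this run as a child of the active parent run
        with mlflow.start_run(run_name=f"max_depth={max_depth}", nested=True):
            mlflow.log_param("max_depth", max_depth)
            # ... train and evaluate the model for this hyperparameter value ...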