I am running a classification model in Microsoft Azure using RandomForestClassifier from the pyspark.ml.classification library.
My question:
I know that in sklearn.ensemble.RandomForestClassifier you can specify the n_jobs parameter to configure the number of jobs to run in parallel.
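For reference, a minimal sketch of what I mean in scikit-learn (the toy dataset is only for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# n_jobs=4 fits the trees using 4 parallel workers; n_jobs=-1 uses all cores
clf = RandomForestClassifier(n_estimators=100, n_jobs=4, random_state=42)
clf.fit(X, y)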
When using pyspark.ml.classification.RandomForestClassifier in Azure, I find that the jobs run one at a time: Job 1 runs first, and only when it finishes does Job 2 start, and so on.
Is there a way to specify the number of jobs to run in parallel in the pyspark.ml.classification.RandomForestClassifier function?
CodePudding user response:
Is there a way to specify the number of jobs to run in parallel in the pyspark.ml.classification.RandomForestClassifier function?
You can use RandomForest.trainClassifier() from the older RDD-based pyspark.mllib API, or write a Pandas user-defined function (UDF) and train a scikit-learn random forest inside each group (the snippet below uses a regressor on the Boston housing data, but the same pattern applies to a classifier).
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

# Toy dataset: label 0.0 for feature values below 2.0, label 1.0 otherwise.
# Assumes a SparkContext `sc` (available as `sc` in the pyspark shell).
data = [
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(0.0, [1.0]),
    LabeledPoint(1.0, [2.0]),
    LabeledPoint(1.0, [3.0])
]

# Train a forest of 3 trees on 2 classes; Spark distributes the training.
model = RandomForest.trainClassifier(sc.parallelize(data), 2, {}, 3, seed=42)
model.numTrees()       # 3
model.totalNumNodes()  # 7
print(model)
print(model.toDebugString())

# Predict single feature vectors
model.predict([2.0])   # 1.0
model.predict([0.0])   # 0.0

# Predict an RDD of feature vectors
rdd = sc.parallelize([[3.0], [1.0]])
model.predict(rdd).collect()  # [1.0, 0.0]
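The pyspark.mllib example above already distributes training across the executors. The second approach moves the training into a grouped Pandas UDF, so one scikit-learn model per group is trained in parallel on the cluster. A cleaned-up version of the snippet follows (the imports and the schema definition were missing from the original, so the schema below is an assumption about its intended shape):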
import pandas as pd
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor as RFR
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

# Output schema for the UDF: one row per group
schema = StructType([
    StructField('trees', LongType(), True),
    StructField('r_squared', DoubleType(), True)
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def train_RF(boston_pd):
    trees = boston_pd['trees'].unique()[0]
    # get the train and test groups
    boston_train = boston_pd[boston_pd['training'] == 1]
    boston_test = boston_pd[boston_pd['training'] == 0]
    # create data and label groups
    y_train = boston_train['target']
    X_train = boston_train.drop(['target'], axis=1)
    y_test = boston_test['target']
    X_test = boston_test.drop(['target'], axis=1)
    # train a random forest regressor on this group's rows
    rf = RFR(n_estimators=trees)
    model = rf.fit(X_train, y_train)
    # make predictions and compute the Pearson correlation
    y_pred = model.predict(X_test)
    r = pearsonr(y_pred, y_test)
    # return the number of trees, and the R-squared value
    return pd.DataFrame({'trees': trees, 'r_squared': (r[0]**2)}, index=[0])

# use the Pandas UDF: each 'trees' group is trained in parallel on the cluster
results = full_df.groupby('trees').apply(train_RF)
# print the results
print(results.take(3))
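Note that PandasUDFType.GROUPED_MAP is the older decorator style; on Spark 3.0 and later the same pattern is written as full_df.groupby('trees').applyInPandas(train_RF, schema=schema), with train_RF as a plain function.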
You can refer to 3 Methods for Parallelization in Spark, Parallelism in spark.mllib, and Training random forest classifier spark.
CodePudding user response:
The Spark jobs you're describing do not mean the same thing as sklearn's jobs (where the n_jobs parameter defines the parallelism).
Spark does run your classifier in parallel behind the scenes. "Job 1", "Job 2", etc. are sequential steps of the overall computation, run one after another, and each of them is still executed by multiple executors in parallel.
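For illustration, here is a minimal sketch of distributed training with the DataFrame-based API (the DataFrame df and its column names are assumptions): there is no n_jobs-style parameter, because the degree of parallelism comes from how the input is partitioned and how many executors the cluster provides.

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# Assumed input: a Spark DataFrame `df` with numeric feature columns
# 'f1', 'f2' and a numeric 'label' column.
assembler = VectorAssembler(inputCols=['f1', 'f2'], outputCol='features')
train_df = assembler.transform(df)

# More partitions -> more tasks that can run concurrently across executors.
train_df = train_df.repartition(8)

# No n_jobs here: Spark distributes the tree training on its own.
rf = RandomForestClassifier(labelCol='label', featuresCol='features', numTrees=100)
model = rf.fit(train_df)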