I am running a classification model in Microsoft Azure using RandomForestClassifier from the pyspark.ml.classification library.
My question:
I know that in sklearn.ensemble.RandomForestClassifier you can specify the n_jobs parameter to configure the number of jobs to run in parallel.
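For reference, a minimal sketch of what I mean in scikit-learn (the toy dataset is only for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# n_jobs=4 fits the trees using 4 parallel workers; n_jobs=-1 uses all cores
clf = RandomForestClassifier(n_estimators=100, n_jobs=4, random_state=42)
clf.fit(X, y)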
When using pyspark.ml.classification.RandomForestClassifier in Azure, I find that the jobs run one at a time: Job 1 runs first, and only when it finishes does Job 2 start, and so on.
Is there a way to specify the number of jobs to run in parallel in the pyspark.ml.classification.RandomForestClassifier function?
CodePudding user response:
Is there a way to specify the number of jobs to run in parallel in the pyspark.ml.classification.RandomForestClassifier function?
You can use RandomForest.trainClassifier() from the older RDD-based pyspark.mllib API, or write a Pandas user-defined function (UDF) and train a scikit-learn random forest inside each group (the snippet below uses a regressor on the Boston housing data, but the same pattern applies to a classifier).
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

# Toy dataset: label 0.0 for feature values below 2.0, label 1.0 otherwise.
# Assumes a SparkContext `sc` (available as `sc` in the pyspark shell).
data = [
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(0.0, [1.0]),
    LabeledPoint(1.0, [2.0]),
    LabeledPoint(1.0, [3.0])
]

# Train a forest of 3 trees on 2 classes; Spark distributes the training.
model = RandomForest.trainClassifier(sc.parallelize(data), 2, {}, 3, seed=42)
model.numTrees()       # 3
model.totalNumNodes()  # 7
print(model)
print(model.toDebugString())

# Predict single feature vectors
model.predict([2.0])   # 1.0
model.predict([0.0])   # 0.0

# Predict an RDD of feature vectors
rdd = sc.parallelize([[3.0], [1.0]])
model.predict(rdd).collect()  # [1.0, 0.0]
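The pyspark.mllib example above already distributes training across the executors. The second approach moves the training into a grouped Pandas UDF, so one scikit-learn model per group is trained in parallel on the cluster. A cleaned-up version of the snippet follows (the imports and the schema definition were missing from the original, so the schema below is an assumption about its intended shape):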
import pandas as pd
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor as RFR
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

# Output schema for the UDF: one row per group
schema = StructType([
    StructField('trees', LongType(), True),
    StructField('r_squared', DoubleType(), True)
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def train_RF(boston_pd):
    trees = boston_pd['trees'].unique()[0]
    # get the train and test groups
    boston_train = boston_pd[boston_pd['training'] == 1]
    boston_test = boston_pd[boston_pd['training'] == 0]
    # create data and label groups
    y_train = boston_train['target']
    X_train = boston_train.drop(['target'], axis=1)
    y_test = boston_test['target']
    X_test = boston_test.drop(['target'], axis=1)
    # train a random forest regressor on this group's rows
    rf = RFR(n_estimators=trees)
    model = rf.fit(X_train, y_train)
    # make predictions and compute the Pearson correlation
    y_pred = model.predict(X_test)
    r = pearsonr(y_pred, y_test)
    # return the number of trees, and the R-squared value
    return pd.DataFrame({'trees': trees, 'r_squared': (r[0]**2)}, index=[0])

# use the Pandas UDF: each 'trees' group is trained in parallel on the cluster
results = full_df.groupby('trees').apply(train_RF)
# print the results
print(results.take(3))
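Note that PandasUDFType.GROUPED_MAP is the older decorator style; on Spark 3.0 and later the same pattern is written as full_df.groupby('trees').applyInPandas(train_RF, schema=schema), with train_RF as a plain function.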
You can refer to 3 Methods for Parallelization in Spark, Parallelism in spark.mllib, and Training random forest classifier spark.
CodePudding user response:
The Spark jobs you're describing do not mean the same thing as sklearn's jobs (where the n_jobs parameter defines the parallelism).
Spark does run your classifier in parallel behind the scenes. "Job 1", "Job 2", etc. are sequential steps of the overall computation, run one after another, and each of them is still executed by multiple executors in parallel.
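For illustration, here is a minimal sketch of distributed training with the DataFrame-based API (the DataFrame df and its column names are assumptions): there is no n_jobs-style parameter, because the degree of parallelism comes from how the input is partitioned and how many executors the cluster provides.

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# Assumed input: a Spark DataFrame `df` with numeric feature columns
# 'f1', 'f2' and a numeric 'label' column.
assembler = VectorAssembler(inputCols=['f1', 'f2'], outputCol='features')
train_df = assembler.transform(df)

# More partitions -> more tasks that can run concurrently across executors.
train_df = train_df.repartition(8)

# No n_jobs here: Spark distributes the tree training on its own.
rf = RandomForestClassifier(labelCol='label', featuresCol='features', numTrees=100)
model = rf.fit(train_df)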