What is the difference between running a PySpark program with and without a cluster?


I am new to PySpark and I have a program in which only a few lines use PySpark (the rest is normal Python).

The portion of my code that uses PySpark:

import time

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.linalg.distributed import RowMatrix

sc = SparkContext.getOrCreate()

# X is a pandas DataFrame built earlier in the program (normal Python code);
# dump it to a space-separated text file
X.to_csv(r'first.txt', header=None, index=None, sep=' ', mode='a')

# load the dataset
rows = np.loadtxt('first.txt')

# distribute the rows and wrap them in a distributed row matrix
rows = sc.parallelize(rows)
mat = RowMatrix(rows)

start_time = time.time()  # to measure the execution time of the function below

# compute the SVD, keeping the 20 leading singular values
svd = mat.computeSVD(20, computeU=True)

example_one = time.time() - start_time
print("---Example one : %s seconds ---" % example_one)

first.txt is a text file that contains a 2346x27 matrix; each line looks like this:

0.0 0.0 ... 0.0 0.0 0.06664409020350408 0.0 0.0 0.0 0.0 0.0 .... 0 0.0 0.0

So my question is: is there any difference between running my program on a cluster (such as YARN) and running it on my own machine (with the python command)? And if so, what are these differences?

I would highly appreciate any help :)

CodePudding user response:

There are no differences in terms of the result you will get.
Depending on your workload, you might have resource issues when running locally.

Spark enables you to use a resource manager (such as YARN) in order to scale your application by acquiring executors from the resource manager.
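To make that concrete, here is a minimal sketch, assuming you create the SparkSession/SparkContext yourself rather than getting it from a shell (the app name and script name are just placeholders for illustration). The PySpark code stays identical; only the master you connect to decides whether the work runs on your own machine or on executors obtained from the cluster:

from pyspark.sql import SparkSession

# Local mode: the driver and the executors all run on your own machine,
# using as many worker threads as there are cores ("local[*]").
spark = (SparkSession.builder
         .master("local[*]")
         .appName("svd-example")   # hypothetical app name
         .getOrCreate())
sc = spark.sparkContext  # the same sc.parallelize(...) / RowMatrix code follows

# Cluster mode: you would normally leave .master() out of the code and
# choose it when launching the application, e.g.:
#   spark-submit --master yarn --deploy-mode client my_program.py
# The result of computeSVD is the same; what changes is where the executors
# run and how much CPU and memory they can draw from the resource manager.

That is also why the timing you measure can differ: locally it is bounded by your machine's cores and memory, while on YARN it depends on the executors your application is granted.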

Please check Spark's official documentation (for example, the cluster mode overview and the guide on submitting applications) and see if you have more specific questions.
