I'm trying to create a DataFrame from a CSV file in PySpark, but it's taking too much time for one specific file.
import glob
from datetime import datetime

step1 = datetime.now()
file_path = r'csv/'
csv_data = glob.glob(file_path + '*mycsv*.txt') or glob.glob(file_path + '*test*.txt')
print('-------------------------------',csv_data[0])
print(F"\nStep-1 | {(datetime.now() - step1).total_seconds()}\n")
step2 = datetime.now()
df = spark.read.options(header=True, delimiter="|").csv(csv_data[0]) # here it's taking time
print(F"\nStep-2 | {(datetime.now() - step2).total_seconds()}\n")
step3 = datetime.now()
df = subset_df(csv_header, df)
print(F"\nStep-3 | {(datetime.now() - step3).total_seconds()}\n")
This is the output of the code above:
Step-1 | 0.000465
Step-2 | 3.708599
Step-3 | 0.38075
In this output Step-2 takes about 3 seconds, sometimes 5, yet the CSV file contains only 4 rows including the header. Any help is appreciated.
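For what it's worth, the file itself is tiny. Here is a quick sanity check (just a sketch, using the same csv_data[0] as above): a plain-Python read, which should return immediately for a 4-row file, so the file on disk doesn't look like the bottleneck.

# sanity check (sketch): read the same file without Spark
with open(csv_data[0]) as f:
    lines = f.readlines()        # only a handful of lines, returns at once
print(len(lines), 'lines read in plain Python')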
CodePudding user response:
I'd say that's expected.
Step 1 only collects the target file names; nothing touches Spark yet.
Step 2, however, is really the first time Spark has to run anything: the read kicks off the Spark application. As you know, submitting a Spark job means first starting a driver, then registering the ApplicationMaster, requesting resources, dispatching tasks to executors, and so on; a lot of complicated work happens before the first job actually runs.
So spending several seconds to kick off the first Spark job is normal.
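One way to verify this (a minimal sketch, assuming the same spark session and csv_data list from the question) is to fire a throwaway action first, so the application is already up, and only then time the CSV read:

from datetime import datetime

spark.range(1).count()   # throwaway action: pays the one-time driver/executor startup cost

step2 = datetime.now()
df = spark.read.options(header=True, delimiter="|").csv(csv_data[0])
print(f"CSV read after warm-up | {(datetime.now() - step2).total_seconds()}")

With the startup cost paid by the warm-up action, the timed read itself should drop to a small fraction of the original 3-5 seconds.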
Step 3 is presumably just some transformation on df. Once the Spark application is started, later work no longer needs that whole startup sequence; Spark only has to dispatch a new job, so it shouldn't take much time.
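You can see the same effect inside one session (again just a sketch, assuming the df from the question is already loaded): even a full action on the 4-row DataFrame comes back quickly once the application is running.

from datetime import datetime

t0 = datetime.now()
df.count()   # a real action, but only a new job is dispatched; no application startup
print(f"count after startup | {(datetime.now() - t0).total_seconds()}")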