I have a use case: I have multiple files in S3, which I am reading like this:
df1 = spark.read.csv("s3://bucket/fact/*.dat")
All of the .dat files have 6 digits at the start of the file name, which is the PO id:
190234_purcahse.dat
125134_purcahse.dat
I need this PO id in the dataframe df1 as a new column while reading. How can I achieve this in the most efficient way? Is there any way to get the file name while reading the files?
CodePudding user response:
Use the input_file_name() function from pyspark.sql.functions:
from pyspark.sql.functions import input_file_name
df1 = spark.read.csv("s3://bucket/fact/*.dat").withColumn("fn", input_file_name())
Then use the regexp_extract() function on that column to extract your PO id.
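Putting the two steps together, a minimal sketch could look like the following. It assumes every file is named <6 digits>_<anything>.dat as in the question. The names PO_ID_PATTERN, with_po_id, po_id_from_path and the po_id column are made up for illustration. The small po_id_from_path helper is only there to try the pattern locally with Python's re module, which should behave the same as Spark's Java regex for a pattern this simple.

```python
import re

# Six digits at the start of the file name (right after the last "/"),
# followed by an underscore. Assumes names like 190234_purcahse.dat.
PO_ID_PATTERN = r"(?:^|/)([0-9]{6})_[^/]*$"


def with_po_id(df):
    """Return df with the source file path ("fn") and the PO id ("po_id")."""
    # Imported here so the pattern can be tried out without Spark installed.
    from pyspark.sql.functions import input_file_name, regexp_extract

    return (
        df.withColumn("fn", input_file_name())
          # Group 1 is the six-digit prefix; rows with no match get "".
          .withColumn("po_id", regexp_extract("fn", PO_ID_PATTERN, 1))
    )


def po_id_from_path(path):
    """Plain-Python check of the same pattern, mirroring the "" on no match."""
    match = re.search(PO_ID_PATTERN, path)
    return match.group(1) if match else ""
```

You would then call it as df1 = with_po_id(spark.read.csv("s3://bucket/fact/*.dat")). The po_id column comes back as a string, so cast it if you need a number, and drop the fn column if you don't need the full path.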