How to read the file name while reading files from S3 using PySpark

I have a use case with multiple files in S3, which I am reading like this:

df1 = spark.read.csv("s3://bucket/fact/*.dat")

All of the .dat file names start with 6 digits, which is the PO id:

190234_purcahse.dat
125134_purcahse.dat

I need this PO id in the dataframe df1 as a new column while reading. How can I achieve this in the most efficient way? Is there any way to get the file name while reading the files?

CodePudding user response：

Use the input_file_name() function from pyspark.sql.functions. It returns the full path of the file each row was read from.

from pyspark.sql.functions import input_file_name
df1 = spark.read.csv("s3://bucket/fact/*.dat").withColumn("fn", input_file_name())

Then use the regexp_extract() function on that column to extract the 6-digit PO id.
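
A minimal sketch of that extraction step, assuming every file name starts with exactly six digits followed by an underscore, as in the examples from the question. The names PO_ID_PATTERN, with_po_id, po_id_from_path and the column name po_id are made up for illustration; only input_file_name() and regexp_extract() come from PySpark.

```python
import re

# Assumed layout: the last path segment starts with exactly six digits
# followed by an underscore, e.g. ".../190234_purcahse.dat".
PO_ID_PATTERN = r"(?:^|/)([0-9]{6})_[^/]*$"


def with_po_id(df, column="po_id"):
    """Return df with a string column holding the PO id from the source file name.

    Hypothetical helper; df is expected to be a DataFrame read from files.
    """
    # Imported here so the pattern can be checked without PySpark installed.
    from pyspark.sql.functions import input_file_name, regexp_extract

    # regexp_extract returns capture group 1, or an empty string if nothing matches.
    return df.withColumn(column, regexp_extract(input_file_name(), PO_ID_PATTERN, 1))


def po_id_from_path(path):
    """Plain-Python check of the same pattern against a single path."""
    match = re.search(PO_ID_PATTERN, path)
    return match.group(1) if match else ""
```

It could be used as df1 = with_po_id(spark.read.csv("s3://bucket/fact/*.dat")). The new column is a string; cast it if you need a numeric type. Rows from files that do not follow the naming scheme would get an empty string rather than an error.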