In databricks I have several CSV files that I need to load. I would like to add a column to my table with the file path, but I can't seem to find that option
My data is structured with
FileStore/subfolders/DATE01/filenameA.csv
FileStore/subfolders/DATE01/filenameB.csv
FileStore/subfolders/DATE02/filenameA.csv
FileStore/subfolders/DATE02/filenameB.csv
I'm using this SQL function in databricks, as this can loop through all the dates and add all filenameA into clevertablenameA, and all filenameB into clevertablenameB etc.
DROP view IF EXISTS clevertablenameA;
create temporary view clevertablenameA
USING csv
OPTIONS (path "dbfs:/FileStore/subfolders/*/filenameA.csv", header = true)
My desired outcome is something like this
col1 | col2|....| path
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
Is there a clever option, or should I load my data another way?
CodePudding user response:
The function input_file_name()
could be used to retrieve the file name while reading.
SELECT *, input_file_name() as path FROM clevertablenameA
Note that this does not add a column to the view and merely returns the name of the file being read.
Refer to below link for more information.
Alternatively you could try reading the files in a pyspark/scala cell and add the file name using the same function using the .withColumn("path", input_file_name())
function and then create the view on top of it.