Home > other >  Databricks: Adding path to table from csv
Databricks: Adding path to table from csv

Time:08-31

In databricks I have several CSV files that I need to load. I would like to add a column to my table with the file path, but I can't seem to find that option

My data is structured with

FileStore/subfolders/DATE01/filenameA.csv
FileStore/subfolders/DATE01/filenameB.csv
FileStore/subfolders/DATE02/filenameA.csv
FileStore/subfolders/DATE02/filenameB.csv

I'm using this SQL function in databricks, as this can loop through all the dates and add all filenameA into clevertablenameA, and all filenameB into clevertablenameB etc.

DROP view  IF EXISTS clevertablenameA;
create temporary view clevertablenameA
USING csv
OPTIONS (path "dbfs:/FileStore/subfolders/*/filenameA.csv", header = true)

My desired outcome is something like this

col1 | col2|....| path
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv

Is there a clever option, or should I load my data another way?

CodePudding user response:

The function input_file_name() could be used to retrieve the file name while reading.

SELECT *, input_file_name() as path FROM clevertablenameA

Note that this does not add a column to the view and merely returns the name of the file being read.

Refer to below link for more information.

https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/functions/input_file_name

Alternatively you could try reading the files in a pyspark/scala cell and add the file name using the same function using the .withColumn("path", input_file_name()) function and then create the view on top of it.

  • Related