How do I know the Databricks File System (DBFS) file storage format

Time:09-28

I am trying to read a file from the DBFS folder dbfs/testdatasets/nsk as a DataFrame. However, I don't see a file type for the files in this directory. Is there a command to find out what format the files in this DBFS directory are stored in?

CodePudding user response:

You can try to probe the individual files with the file tool:

%sh
file /dbfs/testdatasets/nsk

You can also probe all files in a dir with one command:

%sh
file /dbfs/testdatasets/*

At the very least, this will detect the Parquet format.
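If the file tool is not available, or you want the result in Python, the same idea can be sketched by reading the first few bytes of each file and comparing them with well-known magic numbers. This is only a minimal sketch: sniff_format and sniff_directory are hypothetical helper names, the list of signatures is deliberately short, and text formats such as CSV or JSON have no magic number, so they come back as "unknown".

```python
import os

# Leading "magic" bytes of a few common binary formats.
MAGIC_PREFIXES = {
    b"PAR1": "parquet",    # Parquet files start (and end) with PAR1
    b"ORC": "orc",         # ORC files start with ORC
    b"Obj\x01": "avro",    # Avro object container files
    b"\x1f\x8b": "gzip",   # gzip-compressed data
}


def sniff_format(path):
    """Guess a file's format from its first bytes; 'unknown' if nothing matches."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    for prefix, name in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return name
    return "unknown"


def sniff_directory(directory):
    """Map each regular file directly under `directory` to its guessed format."""
    result = {}
    for entry in sorted(os.listdir(directory)):
        full = os.path.join(directory, entry)
        if os.path.isfile(full):
            result[entry] = sniff_format(full)
    return result
```

On a Databricks driver this could be called as sniff_directory("/dbfs/testdatasets/nsk"), since the /dbfs mount is readable with ordinary Python file APIs.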

CodePudding user response:

/dbfs/, like any other mounted directory in Databricks, is just a storage container, such as a blob container (Azure) or a bucket (AWS), that is mounted on a Linux VM (your Databricks driver node), so it behaves like any other Linux drive. You can store any kind of file there: .csv, .parquet, .txt, and so on. If a file has no extension, it was written without one; Databricks did not remove it.

Furthermore, Parquet is the most common storage format on Databricks (Delta tables, the current default table format, also keep their data in Parquet files). The data is usually split across multiple such files under a common folder, and you normally pass the folder path when reading the data rather than any particular single file. You might be mistaking a folder for a file.
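To check whether a path is really a dataset folder rather than a single file, you can look at what it contains. The sketch below is an assumption-laden illustration, not a Databricks API: describe_dataset_dir is a hypothetical helper that reports "delta" when a _delta_log subfolder is present, and otherwise the one extension shared by the data files. Names starting with "_" or "." (for example _SUCCESS marker files) are skipped, following the convention Spark uses when it reads a folder.

```python
import os


def describe_dataset_dir(directory):
    """Return 'delta', a single shared extension, 'mixed', or 'unknown'."""
    names = os.listdir(directory)

    # A Delta table keeps its transaction log in a _delta_log subfolder.
    if os.path.isdir(os.path.join(directory, "_delta_log")):
        return "delta"

    # Ignore marker and hidden files such as _SUCCESS or .part-0000.crc.
    data_files = [
        n for n in names
        if not n.startswith(("_", "."))
        and os.path.isfile(os.path.join(directory, n))
    ]
    extensions = {os.path.splitext(n)[1].lower() for n in data_files}
    extensions.discard("")  # files without any extension tell us nothing

    if not extensions:
        return "unknown"
    if len(extensions) == 1:
        return extensions.pop().lstrip(".")
    return "mixed"
```

If describe_dataset_dir("/dbfs/testdatasets/nsk") returned "parquet", for instance, the folder could then be read as a whole with spark.read.format("parquet").load("dbfs:/testdatasets/nsk").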

You can explore the DBFS directories with the %sh magic command, which lets you run shell commands on the driver node.

%sh
ls -l /dbfs/testdatasets/nsk 

The -l flag makes sure to show more details about the files in the directory.

You can also use the dbutils utility, in a language of your choice, to work with the directories and files programmatically:

%python
# dbutils.fs takes DBFS paths, so the /dbfs mount prefix used by %sh is dropped here
files_and_dirs = dbutils.fs.ls("dbfs:/testdatasets/nsk")

for item in files_and_dirs:
    print(item)
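To get a quick overview instead of a long listing, the names returned by dbutils.fs.ls can be grouped by extension. This is a rough sketch: count_extensions is a hypothetical helper, and it relies on the entries exposing a name attribute in which directory names end with "/". It takes plain strings, so it can be tried outside Databricks as well.

```python
import os
from collections import Counter


def count_extensions(names):
    """Count entries per file extension; directories and bare names get their own keys."""
    counts = Counter()
    for name in names:
        if name.endswith("/"):
            counts["<dir>"] += 1          # directory entry
        else:
            ext = os.path.splitext(name)[1].lower()
            counts[ext or "<none>"] += 1  # "<none>" marks files without an extension
    return dict(counts)


# In a Databricks notebook (dbutils exists only there; note the DBFS-style path):
# count_extensions([f.name for f in dbutils.fs.ls("dbfs:/testdatasets/nsk")])
```

A result dominated by ".parquet" suggests a Parquet dataset folder, while a large "<none>" count means the extensions alone will not tell you the format and the file contents need to be inspected.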

Lastly, depending on your Databricks setup, you can look inside the storage mounted on your cluster, i.e. your Blob storage (Azure) or buckets (AWS), through the respective web portal, which gives you a UI for browsing the files.
