I have a need to move files from one S3 bucket directory to two others. I have to do this from a Databricks notebook. If the file has a json extension, I will move it into jsonDir. Otherwise, I will move it into otherDir. Presumably I would do this with pyspark and Databricks utilities (dbutils).
I do not know the name of the S3 bucket, only the relative path off of it (call it MYPATH). For instance, I can do:
dbutils.fs.ls(MYPATH)
and it lists all the files in the S3 directory. Unfortunately with dbutils, you can move one file at a time or all of them (no wildcards). The bulk of my program is:
for file in fileList:
    if file.endswith("json"):
        dbutils.fs.mv(file, jsonDir)
        continue
    if not file.endswith("json"):
        dbutils.fs.mv(file, otherDir)
        continue
My Problem: I do not know how to retrieve the list of files from MYPATH to put them in array "fileList". I would be grateful for any ideas. Thanks.
CodePudding user response:
I think your code will run if you make these minor changes:
fileList = dbutils.fs.ls(MYPATH)

for file in fileList:
    if file.name.endswith("/"):  # Don't copy dirs
        continue
    if file.name.endswith("json"):
        dbutils.fs.mv(file.path, jsonDir + file.name)
        continue
    if not file.name.endswith("json"):
        dbutils.fs.mv(file.path, otherDir + file.name)
        continue
Here, file.name is appended to keep the name of the file in the new directory. I needed this on Azure DBFS-backed storage, otherwise everything gets moved to the same blob. It is critical that jsonDir and otherDir end with a / character.
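For illustration, here is a minimal end-to-end sketch of how the pieces fit together. The path values are hypothetical examples, not from your setup; replace them with your own mount points or S3 URIs, and note the trailing / on the two destination directories:

# Hypothetical example paths (assumptions, not your actual bucket layout).
# The trailing / on the destinations matters: file.name is concatenated
# directly, so without it everything collapses onto one object name.
MYPATH = "/mnt/mybucket/incoming/"
jsonDir = "/mnt/mybucket/jsonDir/"
otherDir = "/mnt/mybucket/otherDir/"

for file in dbutils.fs.ls(MYPATH):
    if file.name.endswith("/"):      # dbutils.fs.ls also returns subdirectories
        continue
    if file.name.endswith("json"):
        dbutils.fs.mv(file.path, jsonDir + file.name)
    else:
        dbutils.fs.mv(file.path, otherDir + file.name)

The else branch replaces the second if from above; the behavior is the same, it just avoids testing the extension twice.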