I have a need to move files from one S3 bucket directory to two others. I have to do this from a Databricks notebook. If the file has a json extension, I will move it into jsonDir. Otherwise, I will move it into otherDir. Presumably I would do this with pyspark and Databricks utilities (dbutils).
I do not know the name of the S3 bucket, only the relative path off of it (call it MYPATH). For instance, I can do:
dbutils.fs.ls(MYPATH)
and it lists all the files in the S3 directory. Unfortunately with dbutils, you can move one file at a time or all of them (no wildcards). The bulk of my program is:
for file in fileList:
    if file.endswith("json"):
        dbutils.fs.mv(file, jsonDir)
        continue
    if not file.endswith("json"):
        dbutils.fs.mv(file, otherDir)
        continue
My Problem: I do not know how to retrieve the list of files from MYPATH to put them in array "fileList". I would be grateful for any ideas. Thanks.
CodePudding user response:
I think your code will run if you make these minor changes:
fileList = dbutils.fs.ls(MYPATH)

for file in fileList:
    if file.name.endswith("/"):  # Don't copy dirs
        continue
    if file.name.endswith("json"):
        dbutils.fs.mv(file.path, jsonDir + file.name)
        continue
    if not file.name.endswith("json"):
        dbutils.fs.mv(file.path, otherDir + file.name)
        continue
Here, file.name is appended to keep the name of the file in the new directory. I needed this on Azure DBFS-backed storage, otherwise everything gets moved to the same blob. It is critical that jsonDir and otherDir end with a / character.
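For illustration, here is a minimal end-to-end sketch of how the pieces fit together. The path values are hypothetical examples, not from your setup; replace them with your own mount points or S3 URIs, and note the trailing / on the two destination directories:

# Hypothetical example paths (assumptions, not your actual bucket layout).
# The trailing / on the destinations matters: file.name is concatenated
# directly, so without it everything collapses onto one object name.
MYPATH = "/mnt/mybucket/incoming/"
jsonDir = "/mnt/mybucket/jsonDir/"
otherDir = "/mnt/mybucket/otherDir/"

for file in dbutils.fs.ls(MYPATH):
    if file.name.endswith("/"):      # dbutils.fs.ls also returns subdirectories
        continue
    if file.name.endswith("json"):
        dbutils.fs.mv(file.path, jsonDir + file.name)
    else:
        dbutils.fs.mv(file.path, otherDir + file.name)

The else branch replaces the second if from above; the behavior is the same, it just avoids testing the extension twice.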