How to save a PySpark dataframe as a CSV with custom file name?

Here is the Spark DataFrame I want to save as a CSV.

type(MyDataFrame)
--Output: <class 'pyspark.sql.dataframe.DataFrame'>

To save this as a CSV, I have the following code:

MyDataFrame.write.csv(csv_path, mode='overwrite', header=True)

When I save this, the file name is something like this:

part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv

Is there a way I can give this a custom name while saving it, such as "MyDataFrame.csv"?

CodePudding user response:

No. That's how Spark works (at least for now). MyDataFrame.csv would be a directory name, and under that directory you'd have one or more part files, one per partition, named in the same format: part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv, part-0001-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv, and so on.

It's not recommended, but if your data is small enough to fit in the driver's memory (what counts as "small enough" is debatable), you can always convert it to Pandas and save it to a single CSV file with any name you want.
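
A minimal sketch of the Pandas route, assuming the whole DataFrame fits in driver memory and that pandas is installed alongside PySpark. The helper name save_small_df_as_csv is hypothetical; toPandas() and to_csv() are the usual PySpark and pandas calls.

```python
def save_small_df_as_csv(spark_df, target_path):
    """Collect a small Spark DataFrame to the driver and write one CSV file.

    Hypothetical helper: assumes the data fits in driver memory.
    """
    # toPandas() pulls every row to the driver as a pandas DataFrame
    pandas_df = spark_df.toPandas()
    # index=False keeps the pandas row index out of the file
    pandas_df.to_csv(target_path, index=False)
    return target_path


# Usage (assumes MyDataFrame is the Spark DataFrame from the question):
# save_small_df_as_csv(MyDataFrame, "MyDataFrame.csv")
```

The file is written on the driver's local filesystem, not on HDFS, so this only suits cases where that is where the CSV is needed.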

CodePudding user response:

I have the same requirement. You can write to one path and then rename the generated part file. This is my solution.

def write_to_hdfs_specify_path(df, spark, hdfs_path, file_name):
    """
    Write df as a single CSV file named file_name under hdfs_path.

    :param df: DataFrame which you want to save
    :param spark: SparkSession
    :param hdfs_path: target directory (must not already exist)
    :param file_name: CSV file name
    :return: True if the rename succeeded, False otherwise
    """
    sc = spark.sparkContext
    # Reach the Hadoop FileSystem classes through the Py4J gateway
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
    # coalesce(1) makes Spark write exactly one part file (pipe-delimited here)
    df.coalesce(1).write.option("header", True).option("delimiter", "|").option("compression", "none").csv(hdfs_path)
    fs = FileSystem.get(Configuration())
    # Find the part file Spark generated, then rename it inside the same directory
    part_name = fs.globStatus(Path("%s/part*" % hdfs_path))[0].getPath().getName()
    full_path = "%s/%s" % (hdfs_path, file_name)
    result = fs.rename(Path("%s/%s" % (hdfs_path, part_name)), Path(full_path))
    return result
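
The same write-then-rename idea can be sketched for the local filesystem with only the standard library. This is an assumption-laden variant, not the code above: promote_part_file and the "tmp_out" directory are hypothetical names, it does not work on HDFS paths, and target_path must lie outside output_dir because the directory is deleted afterwards.

```python
import glob
import os
import shutil


def promote_part_file(output_dir, target_path):
    """Move the single part file Spark wrote in output_dir to target_path.

    Hypothetical helper for the local filesystem only.
    target_path must be outside output_dir, which is removed at the end.
    """
    part_files = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(part_files) != 1:
        raise ValueError("expected exactly one part file, found %d" % len(part_files))
    shutil.move(part_files[0], target_path)
    # Remove the leftover directory (_SUCCESS marker, checksum files)
    shutil.rmtree(output_dir)
    return target_path


# Usage (assumes MyDataFrame from the question and a scratch directory):
# MyDataFrame.coalesce(1).write.csv("tmp_out", mode="overwrite", header=True)
# promote_part_file("tmp_out", "MyDataFrame.csv")
```

Raising when more than one part file is found guards against forgetting coalesce(1), in which case there is no single file to rename.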