import os

from varname import nameof
from pyspark.sql import SparkSession

cwd = os.getcwd()
def output_to_csv(df):
    df.coalesce(1).write.option("header", "true")\
        .mode('overwrite')\
        .csv(cwd + '/output_files/' + nameof(df))
    return None
def main():
    spark = SparkSession.builder.appName('test').getOrCreate()
    ## other functions here ##
    output_to_csv(dataframe_abc)
    spark.stop()
What I am trying to do is dynamically name the output CSV files produced by the PySpark function output_to_csv(). My desired output would be /output_files/dataframe_abc. The function works correctly in that it outputs the correct data from dataframe_abc. However, the name of the folder containing the CSV file from the Spark output is "df" - the parameter name from the original function. I'm new to Python and very new to PySpark. Can anyone give me a steer please?
CodePudding user response:
The issue is that when you call nameof in output_to_csv, the variable name is in fact 'df', since it's local to the function and doesn't have a sense of the name of the variable that was passed in. My suggestion would be to move the call to nameof into your main function, then pass the result as an argument to the output_to_csv function:
import os

from varname import nameof
from pyspark.sql import SparkSession

cwd = os.getcwd()
def output_to_csv(df, fname):
    df.coalesce(1).write.option("header", "true")\
        .mode('overwrite')\
        .csv(cwd + '/output_files/' + fname)
    return None
def main():
    spark = SparkSession.builder.appName('test').getOrCreate()
    ## other functions here ##
    output_to_csv(dataframe_abc, nameof(dataframe_abc))
    spark.stop()
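As an aside, you can avoid varname entirely by passing the name as a plain string (the caller already knows it), and building the path with os.path.join rather than string concatenation, which handles separators safely. A minimal sketch, with a hypothetical build_output_path helper (not part of the code above):

```python
import os

def build_output_path(base_dir, df_name):
    # os.path.join inserts path separators correctly,
    # unlike manual '+' string concatenation.
    return os.path.join(base_dir, 'output_files', df_name)

# The caller supplies the name it already knows as a string:
path = build_output_path('/tmp/project', 'dataframe_abc')
print(path)  # /tmp/project/output_files/dataframe_abc
```

You would then call df.coalesce(1).write...csv(path) with this path, with no dependency on inspecting variable names at runtime.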