Passing dataframe between 2 functions python


import os

from varname import nameof
from pyspark.sql import SparkSession

cwd = os.getcwd()

def output_to_csv(df):
    df.coalesce(1).write.option("header", "true")\
        .mode('overwrite')\
        .csv(cwd + '/output_files/' + nameof(df))
    return None


def main():
    spark = SparkSession.builder.appName('test').getOrCreate()
    ## other functions here ##
    output_to_csv(dataframe_abc)
    spark.stop()

What I am trying to do is dynamically name the output csv files from the (PySpark) function output_to_csv(). My desired output would be /output_files/dataframe_abc. The function works correctly in that it writes the correct data from dataframe_abc. However, the name of the folder containing the csv file is "df", the parameter name from the original function. I'm new to Python and very new to PySpark. Can anyone give me a steer please?

CodePudding user response:

The issue is that when you call nameof inside output_to_csv, the only name bound to the dataframe there is the local parameter name 'df'; the function has no knowledge of the variable name the caller used.
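You can see the same renaming with plain Python, no Spark or varname required:

```python
def f(df):
    # Inside the function, the only local name bound to the object
    # is the parameter name 'df'; the caller's name is not visible.
    return list(locals())

dataframe_abc = object()
print(f(dataframe_abc))  # -> ['df']
```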

My suggestion would be to move the call to nameof into your main function, then pass that as an argument to the output_to_csv function:

import os

from varname import nameof
from pyspark.sql import SparkSession

cwd = os.getcwd()

def output_to_csv(df, fname):
    df.coalesce(1).write.option("header", "true")\
        .mode('overwrite')\
        .csv(cwd + '/output_files/' + fname)
    return None


def main():
    spark = SparkSession.builder.appName('test').getOrCreate()
    ## other functions here ##
    output_to_csv(dataframe_abc, nameof(dataframe_abc))
    spark.stop()
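If several dataframes need to be written, a variant of the same idea (a sketch, not from the answer; build_output_path and the outputs dict are hypothetical) is to key each dataframe by its intended file name, so the name never has to be recovered with varname at all:

```python
import os

def build_output_path(base_dir, fname):
    # Same path logic as output_to_csv above, but os.path.join
    # handles the separators portably.
    return os.path.join(base_dir, 'output_files', fname)

# Keying each dataframe by its intended file name makes the name
# explicit; the df values are stubbed with None here for illustration.
outputs = {'dataframe_abc': None, 'dataframe_xyz': None}
for fname, df in outputs.items():
    path = build_output_path(os.getcwd(), fname)
    # df.coalesce(1).write...csv(path) would go here in the real script
    print(path)
```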