How do I pass multiple column names dynamically in PySpark?


I am writing a Python function that will do a left anti join on two DataFrames, where the joining condition may vary: sometimes the two DataFrames have just one column as the unique key to join on, and sometimes they have more than one.

So I have written the code below. Please suggest what changes I should make.

def integrity_check(testdata, refdata, cond=[]):
    # left anti join: keep rows of testdata that have no match in refdata
    df = func.join_dataframe(testdata, refdata, cond, "leftanti", logger)
    df = df.select(cond)
    func.write_df_as_parquet_file(df, curate_path, logger)
    return df

Here the parameter cond may contain one or more column names.

So, how do I pass a dynamic list of column names when calling the function?

Please suggest the best way to achieve this.

CodePudding user response:

You can use Python's unpacking operator (PEP 448):

df = df.select(*cond)
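The asterisk unpacks the list into separate positional arguments, so select receives each column name individually. At the call site you then pass any number of join columns as a plain Python list; for instance, with hypothetical DataFrames test_df and ref_df and made-up column names:

integrity_check(test_df, ref_df, ["customer_id", "order_date"])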

You can find more examples of how to use the asterisk operator in Packing and Unpacking Arguments in Python.
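For reference, here is a minimal self-contained sketch of the whole flow, using the built-in DataFrame.join in place of the custom func.join_dataframe helper (whose internals are not shown in the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

testdata = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "val"])
refdata = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

cond = ["id", "val"]  # dynamic list of join columns

# left anti join keeps the rows of testdata that have no match in refdata
df = testdata.join(refdata, on=cond, how="leftanti")
df = df.select(*cond)  # unpack the list so each name becomes its own argument
df.show()  # only the row (3, "c") survives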
