how to make dataframe select query generic in pyspark


Description: I have a list of column names that I need. I want to check whether all of these column names are present in a dataframe, and if only some of them are, select just those columns with generic code like:

df1 = df.select(df["column1"], df["column2"])

List = ["column1", "column2", "column3", "column4"]. I want to check which of these columns are present in the dataframe and use whichever ones exist in the select query.

CodePudding user response:

You need to do this in an iterative fashion:

select_list = ['col1', 'col2', 'col3']
df_columns = sparkDF.columns  ### e.g. ['col1', 'col2', 'col5', 'col7']

final_select_list = []

for col in select_list:
    if col in df_columns:
        final_select_list += [col]

### final_select_list --> ['col1', 'col2']


sparkDF.select(*final_select_list).show()
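For completeness, here is a minimal self-contained sketch of the same approach; the SparkSession setup, sample data, and column names are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample dataframe; its columns only partially overlap select_list
sparkDF = spark.createDataFrame(
    [(1, 'a', 10.0, True)],
    ['col1', 'col2', 'col5', 'col7'],
)

select_list = ['col1', 'col2', 'col3']

# Keep only the requested columns that actually exist in the dataframe
final_select_list = []
for col in select_list:
    if col in sparkDF.columns:
        final_select_list += [col]

sparkDF.select(*final_select_list).show()
### +----+----+
### |col1|col2|
### +----+----+
### |   1|   a|
### +----+----+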

CodePudding user response:

The other answer(s) work perfectly, but this can also be written as a one-liner.

# predefined list of all required columns
reqd_cols = ['id', 'dt', 'name', 'phone']

data_sdf. \
    select(*[k for k in data_sdf.columns if k in reqd_cols])

The list comprehension inside select() checks, for each column of the data_sdf dataframe, whether it is present in the reqd_cols list, and keeps only the overlapping ones.
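Note that this variant keeps the dataframe's own column order. If the output should instead follow the order of reqd_cols, a small variation that iterates over reqd_cols and checks membership in data_sdf.columns does that (same assumed names as above):

data_sdf. \
    select(*[k for k in reqd_cols if k in data_sdf.columns])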
