how to make dataframe select query generic in pyspark


Description: I have a list of column names that I need. I want to check whether all of these column names are present in a dataframe, and if only some of them are, select just those columns with generic code like:

df1 = df.select(df["column1"], df["column2"])

List = ["column1", "column2", "column3", "column4"]. I want to check which of these columns are present in the dataframe and use whichever ones exist in the select query.

CodePudding user response:

You need to do this in an iterative fashion:

select_list = ['col1', 'col2', 'col3']
df_columns = sparkDF.columns  ### e.g. ['col1', 'col2', 'col5', 'col7']

final_select_list = []

for col in select_list:
    if col in df_columns:
        final_select_list += [col]

### final_select_list --> ['col1', 'col2']


sparkDF.select(*final_select_list).show()
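For completeness, here is a minimal self-contained sketch of the same approach; the SparkSession setup, sample data, and column names are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample dataframe; its columns only partially overlap select_list
sparkDF = spark.createDataFrame(
    [(1, 'a', 10.0, True)],
    ['col1', 'col2', 'col5', 'col7'],
)

select_list = ['col1', 'col2', 'col3']

# Keep only the requested columns that actually exist in the dataframe
final_select_list = []
for col in select_list:
    if col in sparkDF.columns:
        final_select_list += [col]

sparkDF.select(*final_select_list).show()
### +----+----+
### |col1|col2|
### +----+----+
### |   1|   a|
### +----+----+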

CodePudding user response:

The other answer(s) work perfectly, but this can also be written as a one-liner.

# predefined list of all required columns
reqd_cols = ['id', 'dt', 'name', 'phone']

data_sdf. \
    select(*[k for k in data_sdf.columns if k in reqd_cols])

The list comprehension inside select() checks, for each column of the data_sdf dataframe, whether it is present in the reqd_cols list, and keeps only the overlapping ones.
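Note that this variant keeps the dataframe's own column order. If the output should instead follow the order of reqd_cols, a small variation that iterates over reqd_cols and checks membership in data_sdf.columns does that (same assumed names as above):

data_sdf. \
    select(*[k for k in reqd_cols if k in data_sdf.columns])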
