Description: I have a list of column names that I need. I want to check whether all of these column names are present in a DataFrame. If only some of them are present, I want to use just those columns, in generic code like:
Df1 = df.select(df[column1], df[column2])
List = [column1, column2, column3, column4]
I want to check which columns in the list are present in the DataFrame and use whichever ones exist in the select query.
CodePudding user response:
You need to do it iteratively: loop over the requested columns and keep only the ones that exist in the DataFrame.
select_list = ['col1','col2','col3']
df_columns = sparkDF.columns ### ['col1','col2','col5','col7']
final_select_list = []
for col in select_list:
    # keep the column only if the DataFrame actually has it
    if col in df_columns:
        final_select_list.append(col)
### final_select_list --> ['col1','col2']
sparkDF.select(*final_select_list).show()
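The loop above can also be wrapped in a small reusable helper. The following is only a sketch: the function names `pick_existing` and `missing_columns` are hypothetical, and a plain list stands in for `sparkDF.columns` so the snippet runs without a Spark session.

```python
def pick_existing(requested, available):
    """Return the requested column names found in `available`, in requested order."""
    available_set = set(available)  # set lookup instead of scanning the list each time
    return [name for name in requested if name in available_set]


def missing_columns(requested, available):
    """Return the requested column names that are NOT in `available`."""
    available_set = set(available)
    return [name for name in requested if name not in available_set]


select_list = ['col1', 'col2', 'col3']
df_columns = ['col1', 'col2', 'col5', 'col7']  # stand-in for sparkDF.columns

final_select_list = pick_existing(select_list, df_columns)
skipped = missing_columns(select_list, df_columns)

print(final_select_list)  # ['col1', 'col2']
print(skipped)            # ['col3']
```

With a real DataFrame, the result would be passed on in the same way as before, for example `sparkDF.select(*pick_existing(select_list, sparkDF.columns))`. Reporting the skipped names is optional, but it can help spot a misspelled column.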
CodePudding user response:
The other answer works, but the same thing can also be written as a one-liner.
# predefined list of all required columns
reqd_cols = ['id', 'dt', 'name', 'phone']
data_sdf.select(*[k for k in data_sdf.columns if k in reqd_cols])
The list comprehension inside select() checks each column of the data_sdf DataFrame against the reqd_cols list and keeps only the columns that appear in both.
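One detail worth noting is that the comprehension iterates over the DataFrame's columns, so the selected columns come out in DataFrame order, not in the order of reqd_cols. The sketch below illustrates the difference with plain lists; `sdf_columns` is a hypothetical stand-in for `data_sdf.columns`, chosen only to make the ordering visible.

```python
reqd_cols = ['id', 'dt', 'name', 'phone']
sdf_columns = ['phone', 'city', 'id', 'name']  # hypothetical stand-in for data_sdf.columns

# Same shape as the one-liner: iterate over the DataFrame's columns
by_df_order = [k for k in sdf_columns if k in reqd_cols]

# Alternative: iterate over the required list to keep its order instead
by_reqd_order = [k for k in reqd_cols if k in sdf_columns]

print(by_df_order)    # ['phone', 'id', 'name']
print(by_reqd_order)  # ['id', 'name', 'phone']
```

Either list can be unpacked into select(); which one to use depends on whether the output should follow the DataFrame's column order or the order of the required list.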