Pyspark dynamic column selection from dataframe


I have a dataframe with multiple columns, such as t_orno, t_pono, t_sqnb, t_pric, and so on (it's a table with many columns). A second dataframe contains the names of certain columns from the first one. E.g.:

columnname
t_pono
t_pric
:
:

I need to select only those columns from the first dataframe whose names appear in the second. In the example above: t_pono, t_pric.

How can this be done?

CodePudding user response:

Let's say you have the following column lists (each can be obtained with df.columns, which returns a list of column-name strings):

df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]
df2_cols = ["columnname", "t_pono", "t_pric"]

To get only the columns of the first dataframe that also appear in the second, take the set intersection (cast back to a list so it can be used for selection):

list(set(df1_cols).intersection(df2_cols))

which gives (set iteration order is not guaranteed, so the order may differ):

["t_pono", "t_pric"]

To put it all together and select only those columns:

select_columns = list(set(df1_cols).intersection(df2_cols))

new_df = df1.select(*select_columns)
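Note that in the question the second dataframe stores the names as row values under a columnname column, not as its column headers. In that case you would first collect those values into a Python list. A sketch of that step, with the PySpark collect call shown as a comment and its result stood in by a plain list:

```python
# With PySpark (assuming df2 has a "columnname" column) this would be:
#   names = [row["columnname"] for row in df2.select("columnname").collect()]
# Stand-in for the collected values:
names = ["t_pono", "t_pric"]

df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]

# Keep df1's column order and silently skip any names not actually
# present in df1, so select() never fails on a missing column.
select_columns = [c for c in df1_cols if c in set(names)]
# new_df = df1.select(*select_columns)   # PySpark selection
```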