How to use a function with select to batch select columns

Time:11-11

I want to select columns from a dataframe;

however, I want to get the names and aliases from a config file and keep it variable. The config.json file gives me a dict like

"data":
{
    "a": "a",
    "b": "great",
    "c": "example"
}

Now, I can select my columns like this:

from pyspark.sql import functions as F

df = df.select(
    F.col("a").alias("a"),
    F.col("b").alias("great"),
    F.col("c").alias("example")
)

But I would rather do it in a loop, something like:

for each item in data.items(): df = df.select(F.col(item[0]).alias(item[1]))

But I cannot wrap my head around it (maybe I should go to bed earlier). Thanks

CodePudding user response:

You could do a list comprehension with the dict items.

Here's an example:

cols = {
    "a":"a",
    "b":"great",
    "c":"example"
}

spark.sparkContext.parallelize([(1, 2, 3)]).toDF(['a', 'b', 'c']). \
    selectExpr(*['{0} as {1}'.format(item[0], item[1]) for item in cols.items()]). \
    show()

# +---+-----+-------+
# |  a|great|example|
# +---+-----+-------+
# |  1|    2|      3|
# +---+-----+-------+
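One caveat: selectExpr parses each string as a SQL expression, so if a name or alias coming out of the config contains spaces or a reserved word, the expression will fail to parse. Backtick-quoting both sides keeps it valid. A small sketch of building quoted expressions (the "great name" value here is made up for illustration):

```python
cols = {
    "a": "a",
    "b": "great name",  # alias with a space, hypothetical
    "c": "example"
}

# Wrap both sides in backticks so selectExpr accepts aliases
# containing spaces or SQL keywords.
exprs = [f"`{src}` as `{dst}`" for src, dst in cols.items()]
print(exprs)
# Then: df.selectExpr(*exprs)
```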

CodePudding user response:

You can use df.select([F.col(k).alias(v) for k, v in data.items()]).

Full example:

from pyspark.sql import functions as F

df = spark.createDataFrame(data=[ ["s1", 10, True], ["s2", 20, False] ], schema=["a", "b", "c"])

[Out]:
+---+---+-----+
|  a|  b|    c|
+---+---+-----+
| s1| 10| true|
| s2| 20|false|
+---+---+-----+

data = {
    "a":"a",
    "b":"great",
    "c":"example"
}

df = df.select([F.col(k).alias(v) for k, v in data.items()])

[Out]:
+---+-----+-------+
|  a|great|example|
+---+-----+-------+
| s1|   10|   true|
| s2|   20|  false|
+---+-----+-------+
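To tie this back to the question, the mapping can be read straight from config.json with the standard json module. A minimal sketch (the file layout and the extra "d" entry are assumptions for illustration) that also skips any configured column missing from the dataframe:

```python
import json

# Hypothetical config.json contents, matching the question's layout,
# plus one entry ("d") that the dataframe does not have.
config_text = '{"data": {"a": "a", "b": "great", "c": "example", "d": "extra"}}'
data = json.loads(config_text)["data"]

# Columns actually present in the dataframe (df.columns in Spark).
df_columns = ["a", "b", "c"]

# Keep only mappings whose source column exists, then select:
#   df = df.select([F.col(k).alias(v) for k, v in pairs])
pairs = [(k, v) for k, v in data.items() if k in df_columns]
print(pairs)
```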