I want to select columns from a dataframe; however, I want to get the names and aliases from a config file and keep them variable. The config.json file gives me a dict like
"data": {
    "a": "a",
    "b": "great",
    "c": "example"
}
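(For context, a minimal sketch of loading that mapping with the standard-library `json` module; the file name `config.json` and the `"data"` key are just the question's setup, shown here as an inline string so the snippet is self-contained:)

```python
import json

# Hypothetical contents of config.json, matching the question.
raw = '{"data": {"a": "a", "b": "great", "c": "example"}}'

# In practice you would read the file instead:
#   with open("config.json") as f:
#       conf = json.load(f)
conf = json.loads(raw)
data = conf["data"]
print(data)  # {'a': 'a', 'b': 'great', 'c': 'example'}
```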
Now, I can select my columns like this:
from pyspark.sql import functions as F
df = df.select(
    F.col("a").alias("a"),
    F.col("b").alias("great"),
    F.col("c").alias("example")
)
But I would rather do it in a loop, something like:
for each (name, alias) in data.items(), do df = df.select(F.col(name).alias(alias))
But I cannot wrap my head around it (maybe I should go to bed earlier). Thanks!
CodePudding user response:
You could do a list comprehension with the dict items. Here's an example:
cols = {
    "a": "a",
    "b": "great",
    "c": "example"
}
spark.sparkContext.parallelize([(1, 2, 3)]).toDF(['a', 'b', 'c']). \
    selectExpr(*['{0} as {1}'.format(k, v) for k, v in cols.items()]). \
    show()
# +---+-----+-------+
# |  a|great|example|
# +---+-----+-------+
# |  1|    2|      3|
# +---+-----+-------+
CodePudding user response:
You can use df.select([F.col(k).alias(v) for k, v in data.items()]).

Full example:
df = spark.createDataFrame(data=[["s1", 10, True], ["s2", 20, False]], schema=["a", "b", "c"])
[Out]:
+---+---+-----+
|  a|  b|    c|
+---+---+-----+
| s1| 10| true|
| s2| 20|false|
+---+---+-----+
data = {
"a":"a",
"b":"great",
"c":"example"
}
df = df.select([F.col(k).alias(v) for k, v in data.items()])
[Out]:
+---+-----+-------+
|  a|great|example|
+---+-----+-------+
| s1|   10|   true|
| s2|   20|  false|
+---+-----+-------+
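The renaming logic itself is plain Python, so it can be sanity-checked without a Spark session. A small sketch of turning the config dict into the expressions both answers use (the dict contents are just the question's example):

```python
data = {"a": "a", "b": "great", "c": "example"}

# Select-list for the first approach:
#   df.select([F.col(k).alias(v) for k, v in data.items()])
# Equivalent SQL expression strings for the selectExpr approach:
exprs = [f"{src} as {dst}" for src, dst in data.items()]
print(exprs)  # ['a as a', 'b as great', 'c as example']
```

Since Python 3.7, dicts preserve insertion order, so the resulting columns come out in the same order as the config file lists them.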