How can I check if a dataframe contains a column according to a list of column names in PySpark?

numerical_cols = ["temperature","timestamp"]

   ID  temperature  system_state  timestamp
0  B   12           inactive      1632733508
1  B   13           active        1632733508
2  A   4            NULL          1632733511
3  A   11           NULL          1632733512
4  D   20           450           1632733513
5  D   22           431           1632733515
6  C   25           20            1632733518
7  C   19           30            1632733521

I have a dataframe with several columns and a list containing the names of some of those columns. For each dataframe column, I want to check whether its name appears in the list. If it does, the column should be cast to double type. How can I do this?

CodePudding user response:

Here's an example of how to do that:

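from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
data = [{"a": "12.1", "b": "23.2", "c": "33.2"}]
columns = ["a", "c"]  # names of the columns to cast to double
df = spark.createDataFrame(data)
# Cast the columns named in the list; pass all others through unchanged.
df = df.select(
    [F.col(c).cast(DoubleType()) if c in columns else F.col(c) for c in df.columns]
)
df.printSchema()
df.show(truncate=False)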

Result:

root
 |-- a: double (nullable = true)
 |-- b: string (nullable = true)
 |-- c: double (nullable = true)

+----+----+----+
|a   |b   |c   |
+----+----+----+
|12.1|23.2|33.2|
+----+----+----+
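
Applied to the dataframe from the question, the same idea could look roughly like the sketch below. The helper names split_by_presence, cast_listed_columns and demo are made up for this illustration and are not part of PySpark, and the sketch assumes the source columns hold strings. It also reports names from the list that the dataframe does not contain, which the question's title asks about.

```python
def split_by_presence(df_columns, wanted):
    """Split `wanted` into names found in `df_columns` and names that are not."""
    available = set(df_columns)
    present = [c for c in wanted if c in available]
    missing = [c for c in wanted if c not in available]
    return present, missing


def cast_listed_columns(df, wanted, dtype="double"):
    """Cast every column of `df` named in `wanted`; leave the others unchanged."""
    # Imported lazily so split_by_presence stays usable without Spark installed.
    from pyspark.sql import functions as F

    present, _ = split_by_presence(df.columns, wanted)
    to_cast = set(present)
    return df.select(
        [F.col(c).cast(dtype) if c in to_cast else F.col(c) for c in df.columns]
    )


def demo():
    """Hypothetical driver: a few rows from the question, all stored as strings."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rows = [
        ("B", "12", "inactive", "1632733508"),
        ("A", "4", None, "1632733511"),
        ("D", "20", "450", "1632733513"),
    ]
    df = spark.createDataFrame(
        rows, ["ID", "temperature", "system_state", "timestamp"]
    )
    numerical_cols = ["temperature", "timestamp"]

    # Names in the list that the dataframe lacks (none for this data).
    _, missing = split_by_presence(df.columns, numerical_cols)
    print("not in dataframe:", missing)

    cast_listed_columns(df, numerical_cols).printSchema()
```

Calling demo() on a machine with Spark available should show temperature and timestamp as double in the schema, while ID and system_state stay strings. Names in the list that do not exist in the dataframe are simply skipped by the cast, so a typo in numerical_cols will not raise an error; checking the missing list is one way to catch that.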