how to select column iteratively in spark-CodePudding

If I have dataframe with 100s columns. How do I select iteratevly below columns.

One final dataframe output from below code may be:

|a      | id    | year|m2000 | m2001 | m2002 | .... | m2015|
|"hello"| 1    | 2001  | 0    | 0     | 0   | ... |   0  |
|"hello"| 1   | 2015  | 0    | 0     | 0   | ... |   0  |
|"hello"| 2   | 2002  | 0    | 0     | 0   | ... |   0  |
|"hello"| 2   | 2015  | 0    | 0     | 0   | ... |   0  |

but another dataframe may have more years so it will be like the below example

|a      | id    | year|m2000 | m2001 | m2002 | .... | m2019|
|"hello"| 1    | 2001  | 0    | 0     | 0   | ... |   0  |
|"hello"| 1   | 2015  | 0    | 0     | 0   | ... |   0  |
|"hello"| 2   | 2002  | 0    | 0     | 0   | ... |   0  |
|"hello"| 2   | 2015  | 0    | 0     | 0   | ... |   0  |

I can not use drop as I would have drop 100s cols so select is better in this use case.

I have tried the below but say in this example the range is changeable.

a=2000
b=2015
for i in range(a, b   1):
    df = df.withColumn("M"   str(i), lit(0))
    df = df.select("M"   str(i),"a","id","year")


df.show()

but it only shows the first year m2000

 a      | id    | year|m2000 | 
|"hello"| 1    | 2001  | 0  |  
|"hello"| 1   | 2015  | 0   | 
|"hello"| 2   | 2002  | 0   | 
|"hello"| 2   | 2015  | 0  |

where I want it to show

|a      | id    | year|m2000 | m2001 | m2002 | .... | m2015|
|"hello"| 1    | 2001  | 0    | 0     | 0   | ... |   0  |
|"hello"| 1   | 2015  | 0    | 0     | 0   | ... |   0  |
|"hello"| 2   | 2002  | 0    | 0     | 0   | ... |   0  |
|"hello"| 2   | 2015  | 0    | 0     | 0   | ... |   0  |

CodePudding user response：

You are not quite clear on what columns you need. If you want to select everything starting with m plus a, id, year, colRegex may be helpful.

df.select('a','id','year', df.colRegex("`^m200 . `")).show()

If you want to selectively select columns between 2000 and 2015 use list comprehension with the walrus operator as follows

df.select('a','id','year', *[c for x in range(2000,2015) if (c:='m' str(x))in (df.columns)]).show()

Please remember 2000 and 2015 can also be declared as variables and passed as follows

a=2000
 b=2015
df.select('a','id','year', *[c for x in range(a,b) if (c:='m' str(x))in (df.columns)]).show()