If I have dataframe with 100s columns. How do I select iteratevly below columns.
One final dataframe output from below code may be:
|a | id | year|m2000 | m2001 | m2002 | .... | m2015|
|"hello"| 1 | 2001 | 0 | 0 | 0 | ... | 0 |
|"hello"| 1 | 2015 | 0 | 0 | 0 | ... | 0 |
|"hello"| 2 | 2002 | 0 | 0 | 0 | ... | 0 |
|"hello"| 2 | 2015 | 0 | 0 | 0 | ... | 0 |
but another dataframe may have more years so it will be like the below example
|a | id | year|m2000 | m2001 | m2002 | .... | m2019|
|"hello"| 1 | 2001 | 0 | 0 | 0 | ... | 0 |
|"hello"| 1 | 2015 | 0 | 0 | 0 | ... | 0 |
|"hello"| 2 | 2002 | 0 | 0 | 0 | ... | 0 |
|"hello"| 2 | 2015 | 0 | 0 | 0 | ... | 0 |
I can not use drop as I would have drop 100s cols so select is better in this use case.
I have tried the below but say in this example the range is changeable.
a=2000
b=2015
for i in range(a, b 1):
df = df.withColumn("M" str(i), lit(0))
df = df.select("M" str(i),"a","id","year")
df.show()
but it only shows the first year m2000
a | id | year|m2000 |
|"hello"| 1 | 2001 | 0 |
|"hello"| 1 | 2015 | 0 |
|"hello"| 2 | 2002 | 0 |
|"hello"| 2 | 2015 | 0 |
where I want it to show
|a | id | year|m2000 | m2001 | m2002 | .... | m2015|
|"hello"| 1 | 2001 | 0 | 0 | 0 | ... | 0 |
|"hello"| 1 | 2015 | 0 | 0 | 0 | ... | 0 |
|"hello"| 2 | 2002 | 0 | 0 | 0 | ... | 0 |
|"hello"| 2 | 2015 | 0 | 0 | 0 | ... | 0 |
CodePudding user response:
You are not quite clear on what columns you need. If you want to select everything starting with m
plus a
, id
, year
, colRegex may be helpful.
df.select('a','id','year', df.colRegex("`^m200 . `")).show()
If you want to selectively select columns between 2000 and 2015 use list comprehension with the walrus operator as follows
df.select('a','id','year', *[c for x in range(2000,2015) if (c:='m' str(x))in (df.columns)]).show()
Please remember 2000 and 2015 can also be declared as variables and passed as follows
a=2000
b=2015
df.select('a','id','year', *[c for x in range(a,b) if (c:='m' str(x))in (df.columns)]).show()