loop over two variables to create multiple year columns


If I have the table

|a      | b     | c|
|-------|-------|--|
|"hello"|"world"| 1|

and the variables

start = 2000
end = 2015

How do I, in PySpark, add 16 columns, the first named m2000, the second m2001, and so on up to m2015, all filled with 0, so that the new DataFrame is

|a      | b     | c|m2000 | m2001 | m2002 | ... | m2015|
|-------|-------|--|------|-------|------|-----|------|
|"hello"|"world"| 1| 0    | 0     | 0     | ... |   0  |

I have tried the code below:

    df = df.select(
        '*',
        *["0".alias(f'm{i}') for i in range(2000, 2016)]
    )
    df.show()

but I get the error:

AttributeError: 'str' object has no attribute 'alias'

CodePudding user response:

You can simply use withColumn to add the relevant columns.

from pyspark.sql.functions import lit

df = spark.createDataFrame(data=[("hello","world",1)],schema=["a","b","c"])

df.show()

+-----+-----+---+
|    a|    b|  c|
+-----+-----+---+
|hello|world|  1|
+-----+-----+---+

for i in range(2000, 2016):
    df = df.withColumn("m" + str(i), lit(0))

df.show()

+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    a|    b|  c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world|  1|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
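As a side note, each withColumn call adds another projection to the query plan, so for a large number of columns a single select can be cheaper. A minimal sketch of the idea (the helper name is mine; the commented lines assume an existing `df` and SparkSession):

```python
def zero_year_columns(start, end):
    """Names of the new columns, m<start> .. m<end> (end inclusive)."""
    return [f"m{i}" for i in range(start, end + 1)]

names = zero_year_columns(2000, 2015)  # 16 names: m2000 .. m2015

# With a DataFrame `df` in scope, add them all in one pass:
# from pyspark.sql import functions as F
# df = df.select(df.columns + [F.lit(0).alias(n) for n in names])
```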

CodePudding user response:

You can use a one-liner:

df = df.select(df.columns + [F.lit(0).alias(f"m{i}") for i in range(2000, 2016)])

Full example:

from pyspark.sql import functions as F

df = spark.createDataFrame([["hello","world",1]], ["a","b","c"])
df = df.select(df.columns + [F.lit(0).alias(f"m{i}") for i in range(2000, 2016)])
df.show()

[Out]:
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    a|    b|  c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world|  1|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

CodePudding user response:

In pandas, you can do the following:

import pandas as pd

df = pd.Series({'a': 'Hello', 'b': 'World', 'c': 1}).to_frame().T
df[['m{}'.format(x) for x in range(2000, 2016)]] = 0
print(df)

I am not very familiar with the Spark syntax, but the approach should be near-identical.

What is happening: the term ['m{}'.format(x) for x in range(2000, 2016)] is a list comprehension that creates the list of desired column names. We assign the value 0 to these columns; since they do not yet exist, they are added.
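A self-contained version of the pandas approach, making the inclusive end explicit (the start/end values are taken from the question):

```python
import pandas as pd

start, end = 2000, 2015
df = pd.Series({'a': 'Hello', 'b': 'World', 'c': 1}).to_frame().T

# range's stop is exclusive, so end + 1 yields m2000 .. m2015 (16 names)
new_cols = ['m{}'.format(x) for x in range(start, end + 1)]

# Assigning 0 to a list of not-yet-existing columns adds them all at once
df[new_cols] = 0
```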
