If I have a table
|a | b | c|
|"hello"|"world"| 1|
and the variables
start = 2000
end = 2015
How do I, in PySpark, add 16 columns, the first named m2000, the second m2001, and so on up to m2015, all filled with 0, so that the new DataFrame is
|a | b | c|m2000 | m2001 | m2002 | ... | m2015|
|"hello"|"world"| 1| 0 | 0 | 0 | ... | 0 |
I have tried the code below:
df = df.select(
'*',
*["0".alias(f'm{i}') for i in range(2000, 2016)]
)
df.show()
but I get the error
AttributeError: 'str' object has no attribute 'alias'
CodePudding user response:
You can simply use withColumn in a loop to add the relevant columns.
from pyspark.sql.functions import col,lit
df = spark.createDataFrame(data=[("hello","world",1)],schema=["a","b","c"])
df.show()
+-----+-----+---+
|    a|    b|  c|
+-----+-----+---+
|hello|world|  1|
+-----+-----+---+
# range() excludes its stop value, so use 2016 to include m2015
for i in range(2000, 2016):
    df = df.withColumn("m" + str(i), lit(0))
df.show()
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    a|    b|  c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world|  1|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
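If you would rather drive the loop from the start and end variables in the question than hard-code the years, the names can be built first. This is only a sketch in plain Python (no Spark session needed); the names years and column_names are made up for illustration.

```python
start = 2000
end = 2015

# range() excludes its stop value, so add 1 to include `end` itself.
years = list(range(start, end + 1))

# Same naming scheme as in the loop: "m" followed by the year.
column_names = ["m" + str(year) for year in years]

print(len(column_names))  # 16
print(column_names[0])    # m2000
print(column_names[-1])   # m2015
```

Each entry of column_names can then be passed to withColumn together with lit(0), exactly as in the loop.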
CodePudding user response:
You can use a one-liner (with pyspark.sql.functions imported as F):
df = df.select(df.columns + [F.lit(0).alias(f"m{i}") for i in range(2000, 2016)])
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame([["hello","world",1]],["a","b","c"])
# Keep the existing columns and append one literal-0 column per year (2016 is exclusive)
df = df.select(df.columns + [F.lit(0).alias(f"m{i}") for i in range(2000, 2016)])
df.show()
[Out]:
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    a|    b|  c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world|  1|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
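As for why the attempt in the question failed: "0" is a plain Python string, and strings have no alias method, whereas F.lit(0) returns a Column, which does. A minimal sketch in plain Python (no Spark needed) that reproduces the error message from the question:

```python
# A Python str is not a Spark Column, so it has no .alias() method.
try:
    "0".alias("m2000")
except AttributeError as exc:
    message = str(exc)

print(message)  # 'str' object has no attribute 'alias'
```

Wrapping the value in lit(...) before calling alias is what makes the select version work.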
CodePudding user response:
In pandas, you can do the following:
import pandas as pd
df = pd.Series({'a': 'Hello', 'b': 'World', 'c': 1}).to_frame().T
df[['m{}'.format(x) for x in range(2000, 2016)]] = 0
print(df)
I am not very familiar with the Spark syntax, but the approach should be near-identical.
What is happening:
The expression ['m{}'.format(x) for x in range(2000, 2016)] is a list comprehension that creates the list of desired column names. We assign the value 0 to these columns. Since the columns do not yet exist, they are added.
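To see what that list comprehension produces on its own, here is a small sketch in plain Python. The variable names are made up for illustration; the f-string variant is the same spelling the question used.

```python
# str.format and an f-string build the same column names.
names_format = ['m{}'.format(x) for x in range(2000, 2016)]
names_fstring = [f"m{x}" for x in range(2000, 2016)]

print(names_format == names_fstring)  # True
print(names_format[:3])               # ['m2000', 'm2001', 'm2002']
print(names_format[-1])               # m2015
```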