I have the following pyspark dataframe
identification | p1 | p2 | p3 | p4 |
---|---|---|---|---|
1 | 1 | 0 | 0 | 1 |
2 | 0 | 1 | 1 | 0 |
3 | 0 | 0 | 0 | 1 |
I want to concatenate all columns from p1 to p4 in a way to gather the values of 1 and 0 in this way
identification | p1 | p2 | p3 | p4 | joined_column |
---|---|---|---|---|---|
1 | 1 | 0 | 0 | 1 | 1001 |
2 | 0 | 1 | 1 | 0 | 0110 |
3 | 0 | 0 | 0 | 1 | 0001 |
All columns from p1 to p4 are Integer, so I tried to cast them into string before concatenate all columns ( except the 'identification' column ) in this way:
from pyspark.sql.types import StringType
from pyspark.sql import functions as F
df_concat=df.withColumn('joined_column', F.concat([F.col(c).cast(StringType()) for c in df.columns if c!='identification']))
I get the following error :
TypeError: Invalid argument, not a string or column:
Is there any solution or other way to concatenate all columns of a pyspark dataframe into one string ?
CodePudding user response:
You have to use concat_ws
from pyspark.sql.functions import concat_ws
df = df.withColumn('joined_column',concat_ws('',df.p1, df.p2, df.p3, df.p4))
CodePudding user response:
you can use below code. If you will check your schema, the data in those columns may not be integer. You need to convert p1 to p4 as integer type
from pyspark.sql.types import IntegerType
df = df \
.withColumn("p1" , df["p1"].cast(IntegerType())) \
.withColumn("p2", df["p2"].cast(IntegerType())) \
.withColumn("p3" ,df["p3"].cast(IntegerType())) \
.withColumn("p4" , df["p4"] .cast(IntegerType()))
import pyspark
from pyspark.sql import functions as sf
df = df.withColumn('joined_column',sf.concat(sf.col('p1'),sf.col('p2'),sf.col('p3'),sf.col('p4')))
display(df)
CodePudding user response:
I will answer the question based on @samkart comment. We don't need to convert each column into a string, it will be converted automatically while concatenating.
from pyspark.sql import functions as F
df_concat=df.withColumn('joined_column', F.concat(*[F.col(c) for c in df.columns if c!='identification']))
This will be true independent of columns number and names