Home > Back-end >  pyspark concatonate multiple columns where the column name is in a list
pyspark concatonate multiple columns where the column name is in a list

Time:12-04

I am trying to concatonate multiple columns to just one column, but only if the column name is in a list. so issue = {'a','b','c'} is my list and would need to concatonate it as issue column with ; seperator.

I have tried: 1.

df_issue = df.withColumn('issue', concat_ws(';',map_values(custom.({issue}))))

Which returns invalid syntax error

2.

df_issue = df.withColumn('issue', lit(issue)) 

this just returnd a b c and not their value

Thank you

I have tried: 1.

df_issue = df.withColumn('issue', concat_ws(';',map_values(custom.({issue}))))

Which returns invalid syntax error

2.

df_issue = df.withColumn('issue', lit(issue)) 

this just returnd a b c and not their value

Thank you

CodePudding user response:

To concatenate multiple columns to one column, you can use the concat_ws() function in combination with the collect_list() function. The concat_ws() function concatenates the values of multiple columns into one column, using a specified separator. The collect_list() function collects the values of a specified column into a list.

Here is an example of how to use these functions to concatenate the values of the columns in the list issue into one column:

from pyspark.sql.functions import concat_ws, collect_list

df_issue = df.withColumn('issue', concat_ws(';', collect_list(issue)))

This code uses the concat_ws() and collect_list() functions to concatenate the values of the columns in the issue list into one column, using a semicolon as the separator. The resulting column will contain the concatenated values of the specified columns.

CodePudding user response:

You can simply use concat_ws:

from pyspark.sql import functions as F

columns_to_concat = ['a', 'b', 'c']
df.withColumn('issue', F.concat_ws(';', *columns_to_concat))

So, if your input DataFrame is:

 --- --- --- ---------- ---------- ----- 
|  a|  b|  c|     date1|     date2|value|
 --- --- --- ---------- ---------- ----- 
| k1| k2| k3|2022-11-11|2022-11-14|    5|
| k4| k5| k6|2022-11-15|2022-11-19|    5|
| k7| k8| k9|2022-11-15|2022-11-19|    5|
 --- --- --- ---------- ---------- ----- 

The previous code will produce:

 --- --- --- ---------- ---------- ----- -------- 
|  a|  b|  c|     date1|     date2|value|   issue|
 --- --- --- ---------- ---------- ----- -------- 
| k1| k2| k3|2022-11-11|2022-11-14|    5|k1;k2;k3|
| k4| k5| k6|2022-11-15|2022-11-19|    5|k4;k5;k6|
| k7| k8| k9|2022-11-15|2022-11-19|    5|k7;k8;k9|
 --- --- --- ---------- ---------- ----- -------- 
  • Related