I have two columns in my Spark dataframe:

Name_ls  Name_mg
Herry    null
null     Cong
Duck     Duck77
Tinh     Tin_Lee
Huong    null
null     Ngon
Lee      null

My requirement is to add a new column to the dataframe by combining the two columns above: the value of the new column should be whichever of the two old values is not null (taking Name_ls when both are present). How to do that in PySpark?

Expected output:

Name_ls  Name_mg  Name
Herry    null     Herry
null     Cong     Cong
Duck     Duck77   Duck
Tinh     Tin_Lee  Tinh
Huong    null     Huong
null     Ngon     Ngon
Lee      null     Lee
CodePudding user response:
You can use the coalesce function from pyspark.sql.functions:
from pyspark.sql import functions as f
df.withColumn("name",f.coalesce("name_ls","name_mg")).show()
+-------+-------+-----+
|name_ls|name_mg| name|
+-------+-------+-----+
|  Herry|   null|Herry|
|   null|   Cong| Cong|
|   Duck| Duck77| Duck|
|   Tinh|Tin_Lee| Tinh|
|  Huong|   null|Huong|
|   null|   Ngon| Ngon|
|    Lee|   null|  Lee|
+-------+-------+-----+
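Note that coalesce accepts any number of arguments and returns the first non-null value, so the same pattern extends if you ever need more fallbacks. A minimal sketch, assuming a hypothetical literal default when both columns are null:

from pyspark.sql import functions as f

# Hypothetical: fall back to the literal "unknown" if both columns are null
df.withColumn("name", f.coalesce("name_ls", "name_mg", f.lit("unknown"))).show()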
CodePudding user response:
You can do this with a when-otherwise expression. It first checks whether name_mg is null and, if so, takes name_ls; otherwise, if name_ls is not null, it takes name_ls; failing both, it falls back to name_mg.
Data Preparation
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

input_str = """
Herry null
null Cong
Duck Duck77
Tinh Tin_Lee
Huong null
null Ngon
Lee null
""".split()

# Replace the literal token 'null' with a real None
input_values = [x if x != 'null' else None for x in input_str]
cols = ['name_ls', 'name_mg']

n = len(input_values)
n_cols = 2

# Group the flat token list into (name_ls, name_mg) row tuples
input_list = [tuple(input_values[i:i + n_cols]) for i in range(0, n, n_cols)]

sparkDF = spark.createDataFrame(input_list, cols)
sparkDF.show()
sparkDF.show()
+-------+-------+
|name_ls|name_mg|
+-------+-------+
|  Herry|   null|
|   null|   Cong|
|   Duck| Duck77|
|   Tinh|Tin_Lee|
|  Huong|   null|
|   null|   Ngon|
|    Lee|   null|
+-------+-------+
When - Otherwise
sparkDF = sparkDF.withColumn(
    'name',
    F.when(F.col('name_mg').isNull(), F.col('name_ls'))
     .when(F.col('name_ls').isNotNull(), F.col('name_ls'))
     .otherwise(F.col('name_mg'))
)
sparkDF.show()
+-------+-------+-----+
|name_ls|name_mg| name|
+-------+-------+-----+
|  Herry|   null|Herry|
|   null|   Cong| Cong|
|   Duck| Duck77| Duck|
|   Tinh|Tin_Lee| Tinh|
|  Huong|   null|Huong|
|   null|   Ngon| Ngon|
|    Lee|   null|  Lee|
+-------+-------+-----+
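For this exact requirement the three branches collapse to the first answer's F.coalesce('name_ls', 'name_mg'): both when branches return name_ls whenever it is non-null, and otherwise falls back to name_mg. The explicit when-otherwise chain mainly pays off when the fallback rule differs per branch.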
CodePudding user response:
If you like Spark SQL statements as much as I do, you can consider the NVL function.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [
('Herry', None),
(None, 'Cong'),
('Duck', 'Duck77'),
('Tinh', 'Tin_Lee'),
('Huong', None),
(None, 'Ngon'),
('Lee', None)
]
schema = ['Name_ls', 'Name_mg']
df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView('tmp')
res_sql = """
select Name_ls,Name_mg,nvl(Name_ls, Name_mg) Name
from tmp
"""
res_df = spark.sql(res_sql)
res_df.show(truncate=False)
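This prints the same result as the other answers; in Spark SQL, nvl(a, b) behaves like coalesce(a, b) with two arguments:

+-------+-------+-----+
|Name_ls|Name_mg|Name |
+-------+-------+-----+
|Herry  |null   |Herry|
|null   |Cong   |Cong |
|Duck   |Duck77 |Duck |
|Tinh   |Tin_Lee|Tinh |
|Huong  |null   |Huong|
|null   |Ngon   |Ngon |
|Lee    |null   |Lee  |
+-------+-------+-----+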