How to concatenate two columns of spark dataframe with null values but get one value


I have two columns in my spark dataframe:

Name_ls  Name_mg
  Herry   null
  null    Cong   
  Duck    Duck77
  Tinh    Tin_Lee
  Huong   null
  null    Ngon
  Lee     null

My requirement is to add a new column to the dataframe from the above two columns,
where the value of the new column is whichever of the two old values is not null.
How can I do that in PySpark?
Expected output:
Name_ls  Name_mg       Name
  Herry   null         Herry
  null    Cong         Cong
  Duck    Duck77       Duck
  Tinh    Tin_Lee      Tinh
  Huong   null         Huong
  null    Ngon         Ngon
  Lee     null         Lee

CodePudding user response:

You can use the coalesce function from `pyspark.sql.functions`:

from pyspark.sql import functions as f

# coalesce returns the first non-null value among its arguments
df.withColumn("name", f.coalesce("name_ls", "name_mg")).show()

+-------+-------+-----+
|name_ls|name_mg| name|
+-------+-------+-----+
|  Herry|   null|Herry|
|   null|   Cong| Cong|
|   Duck| Duck77| Duck|
|   Tinh|Tin_Lee| Tinh|
|  Huong|   null|Huong|
|   null|   Ngon| Ngon|
|    Lee|   null|  Lee|
+-------+-------+-----+
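As a side note, coalesce accepts more than two arguments, so you can also supply a literal default for rows where both name columns are null; a minimal sketch (the fallback string "unknown" is just an illustrative placeholder):

from pyspark.sql import functions as f

# fall back to a literal when both name columns are null;
# "unknown" is an arbitrary placeholder value
df.withColumn("name", f.coalesce("name_ls", "name_mg", f.lit("unknown"))).show()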

CodePudding user response:

You can do this with a when-otherwise expression.

When - Otherwise: first check whether name_mg is null and, if so, take name_ls; otherwise, if name_ls is not null, take name_ls; in all remaining cases fall back to name_mg.

Data Preparation

input_str = """
  Herry   null
  null    Cong   
  Duck    Duck77
  Tinh    Tin_Lee
  Huong   null
  null    Ngon
  Lee     null
""".split()

input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))

cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "name_ls,name_mg".split(',')))
        
n = len(input_values)
n_cols = 2

input_list = [tuple(input_values[i:i n_cols]) for i in range(0,n,n_cols)]

sparkDF = sql.createDataFrame(input_list, cols)

sparkDF.show()

+-------+-------+
|name_ls|name_mg|
+-------+-------+
|  Herry|   null|
|   null|   Cong|
|   Duck| Duck77|
|   Tinh|Tin_Lee|
|  Huong|   null|
|   null|   Ngon|
|    Lee|   null|
+-------+-------+

When - Otherwise

sparkDF = sparkDF.withColumn(
    'name',
    F.when(F.col('name_mg').isNull(), F.col('name_ls'))
     .when(F.col('name_ls').isNotNull(), F.col('name_ls'))
     .otherwise(F.col('name_mg'))
)

sparkDF.show()

+-------+-------+-----+
|name_ls|name_mg| name|
+-------+-------+-----+
|  Herry|   null|Herry|
|   null|   Cong| Cong|
|   Duck| Duck77| Duck|
|   Tinh|Tin_Lee| Tinh|
|  Huong|   null|Huong|
|   null|   Ngon| Ngon|
|    Lee|   null|  Lee|
+-------+-------+-----+
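Incidentally, the first when branch is redundant here: taking name_ls whenever it is non-null already covers the name_mg-is-null case, so the same result comes from a single two-branch expression; a minimal equivalent sketch:

sparkDF = sparkDF.withColumn(
    'name',
    # take name_ls when present, otherwise fall back to name_mg
    F.when(F.col('name_ls').isNotNull(), F.col('name_ls'))
     .otherwise(F.col('name_mg'))
)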

CodePudding user response:

If you like Spark SQL statements as much as I do, you can consider the NVL function.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    ('Herry', None),
    (None, 'Cong'),
    ('Duck', 'Duck77'),
    ('Tinh', 'Tin_Lee'),
    ('Huong', None),
    (None, 'Ngon'),
    ('Lee', None)
]
schema = ['Name_ls', 'Name_mg']
df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView('tmp')
res_sql = """
    select Name_ls,Name_mg,nvl(Name_ls, Name_mg) Name
    from tmp
"""
res_df = spark.sql(res_sql)
res_df.show(truncate=False)
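For what it's worth, nvl is just the two-argument form of coalesce, so the same query can also be written with coalesce (handy when there are more than two fallback columns); an equivalent sketch:

res_df = spark.sql("""
    select Name_ls, Name_mg, coalesce(Name_ls, Name_mg) as Name
    from tmp
""")
res_df.show(truncate=False)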