I have two columns in my Spark dataframe:

Name_ls  Name_mg
Herry    null
null     Cong
Duck     Duck77
Tinh     Tin_Lee
Huong    null
null     Ngon
Lee      null

My requirement is to add a new column to the dataframe by combining the two columns above: the value of the new column should be whichever of the two old values is not null (taking Name_ls when both are present). How to do that in PySpark?

Expected output:

Name_ls  Name_mg  Name
Herry    null     Herry
null     Cong     Cong
Duck     Duck77   Duck
Tinh     Tin_Lee  Tinh
Huong    null     Huong
null     Ngon     Ngon
Lee      null     Lee
CodePudding user response:
You can use the coalesce function from pyspark.sql.functions:
from pyspark.sql import functions as f
df.withColumn("name",f.coalesce("name_ls","name_mg")).show()
+-------+-------+-----+
|name_ls|name_mg| name|
+-------+-------+-----+
|  Herry|   null|Herry|
|   null|   Cong| Cong|
|   Duck| Duck77| Duck|
|   Tinh|Tin_Lee| Tinh|
|  Huong|   null|Huong|
|   null|   Ngon| Ngon|
|    Lee|   null|  Lee|
+-------+-------+-----+
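Note that coalesce accepts any number of arguments and returns the first non-null value, so the same pattern extends if you ever need more fallbacks. A minimal sketch, assuming a hypothetical literal default when both columns are null:

from pyspark.sql import functions as f

# Hypothetical: fall back to the literal "unknown" if both columns are null
df.withColumn("name", f.coalesce("name_ls", "name_mg", f.lit("unknown"))).show()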
CodePudding user response:
You can do this with a when-otherwise expression. It first checks whether name_mg is null and, if so, takes name_ls; otherwise, if name_ls is not null, it takes name_ls; failing both, it falls back to name_mg.
Data Preparation
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

input_str = """
Herry null
null Cong
Duck Duck77
Tinh Tin_Lee
Huong null
null Ngon
Lee null
""".split()

# Replace the literal token 'null' with a real None
input_values = [x if x != 'null' else None for x in input_str]
cols = ['name_ls', 'name_mg']

n = len(input_values)
n_cols = 2

# Group the flat token list into (name_ls, name_mg) row tuples
input_list = [tuple(input_values[i:i + n_cols]) for i in range(0, n, n_cols)]

sparkDF = spark.createDataFrame(input_list, cols)
sparkDF.show()
sparkDF.show()
+-------+-------+
|name_ls|name_mg|
+-------+-------+
|  Herry|   null|
|   null|   Cong|
|   Duck| Duck77|
|   Tinh|Tin_Lee|
|  Huong|   null|
|   null|   Ngon|
|    Lee|   null|
+-------+-------+
When - Otherwise
sparkDF = sparkDF.withColumn(
    'name',
    F.when(F.col('name_mg').isNull(), F.col('name_ls'))
     .when(F.col('name_ls').isNotNull(), F.col('name_ls'))
     .otherwise(F.col('name_mg'))
)
sparkDF.show()
+-------+-------+-----+
|name_ls|name_mg| name|
+-------+-------+-----+
|  Herry|   null|Herry|
|   null|   Cong| Cong|
|   Duck| Duck77| Duck|
|   Tinh|Tin_Lee| Tinh|
|  Huong|   null|Huong|
|   null|   Ngon| Ngon|
|    Lee|   null|  Lee|
+-------+-------+-----+
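For this exact requirement the three branches collapse to the first answer's F.coalesce('name_ls', 'name_mg'): both when branches return name_ls whenever it is non-null, and otherwise falls back to name_mg. The explicit when-otherwise chain mainly pays off when the fallback rule differs per branch.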
CodePudding user response:
If you like Spark SQL statements as much as I do, you can consider the NVL function.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [
('Herry', None),
(None, 'Cong'),
('Duck', 'Duck77'),
('Tinh', 'Tin_Lee'),
('Huong', None),
(None, 'Ngon'),
('Lee', None)
]
schema = ['Name_ls', 'Name_mg']
df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView('tmp')
res_sql = """
select Name_ls,Name_mg,nvl(Name_ls, Name_mg) Name
from tmp
"""
res_df = spark.sql(res_sql)
res_df.show(truncate=False)
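This prints the same result as the other answers; in Spark SQL, nvl(a, b) behaves like coalesce(a, b) with two arguments:

+-------+-------+-----+
|Name_ls|Name_mg|Name |
+-------+-------+-----+
|Herry  |null   |Herry|
|null   |Cong   |Cong |
|Duck   |Duck77 |Duck |
|Tinh   |Tin_Lee|Tinh |
|Huong  |null   |Huong|
|null   |Ngon   |Ngon |
|Lee    |null   |Lee  |
+-------+-------+-----+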