Home > database >  withColumn is adding values only to the first row in the dataframe in pyspark
withColumn is adding values only to the first row in the dataframe in pyspark

Time:08-16

withColumn is adding values only to the first row in the dataframe in pyspark

from pyspark.sql import SparkSession
from pyspark.sql import functions as F, Window as W

columns = ["language","users_count"]
data = [(" ", 20000), ("Python", 100000), ("Scala", 3000)]

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)

columns = ["language","users_count"]
df = rdd.toDF(columns)
df.printSchema();

df.filter((F.trim(df.language) == '') | (df.users_count >= 1000)).withColumn("errors", F.when(F.trim(F.col("language")) == '', F.concat(F.lit("Invalid Language;")))).withColumn("errors", F.when(F.col("users_count") > 1000, F.concat(F.col("errors"), F.lit("Invalid Users_Count;")))).show(truncate=False)

I am trying filter rows based on the filter criteria and if any of the rows have whitespace in the 'language' column or if the value in 'users_count' column is greater than 1000, then the application will add an appropriate error message in 'errors' column like so, Invalid Language; Invalid Users_Count.

But this error message in 'errors' column appears only for first row, all the other rows have 'null' value in 'errors' column.

This is the output in databricks for the above code:

     -------- ----------- ------------------------------------- 
|language|users_count|errors                               |
 -------- ----------- ------------------------------------- 
|        |20000      |Invalid Language;Invalid Users_Count;|
|Python  |100000     |null                                 |
|Scala   |3000       |null                                 |
 -------- ----------- ------------------------------------- 

CodePudding user response:

.concat() does not work with a null column, you would have to use .concat_ws().

df.filter((F.trim(df.language) == '') | (df.users_count >= 1000)).withColumn("errors", F.when(F.trim(F.col("language")) == '', F.concat(F.lit("Invalid Language;")))).withColumn("errors", F.when(F.col("users_count") > 1000, F.concat_ws("", F.col("errors"), F.lit("Invalid Users_Count;")))).show(truncate=False)
  • Related