show first occurence(s) of a column


I want to use PySpark to create a new DataFrame that keeps only the first occurrence of each distinct value in the "VALUE" column. Would row_number() with a window work, or would Spark SQL be the better approach? I'm not sure of the best way to do this. The second table below is the output I want: if a VALUE is repeated, only the first row seen should be kept.

+--------+--------+--------+
|   VALUE|   DAY  |   Color|
+--------+--------+--------+
|20      |MON     |    BLUE|
|20      |TUES    |    BLUE|
|30      |WED     |    BLUE|
+--------+--------+--------+


+--------+--------+--------+
|   VALUE|   DAY  |   Color|
+--------+--------+--------+
|20      |MON     |    BLUE|
|30      |WED     |    BLUE|
+--------+--------+--------+

CodePudding user response:

It looks like you want to drop rows with a duplicated VALUE. If so, use dropDuplicates:

 df.dropDuplicates(['VALUE']).show()


+-----+---+-----+
|VALUE|DAY|Color|
+-----+---+-----+
|   20|MON| BLUE|
|   30|WED| BLUE|
+-----+---+-----+
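As a sanity check, the same first-occurrence-per-VALUE logic can be sketched in plain Python (no Spark required). This is only an illustration of what dropDuplicates(['VALUE']) keeps, assuming the rows are seen in their original order; note that on a multi-partition DataFrame, dropDuplicates does not guarantee which duplicate survives:

```python
# First-occurrence-per-VALUE, emulated in plain Python.
# Assumes rows are processed in their original order, which
# dropDuplicates does NOT guarantee across Spark partitions.
rows = [(20, "MON", "BLUE"), (20, "TUES", "BLUE"), (30, "WED", "BLUE")]

seen = set()
first_seen = []
for value, day, color in rows:
    if value not in seen:  # keep only the first row for each VALUE
        seen.add(value)
        first_seen.append((value, day, color))

print(first_seen)  # [(20, 'MON', 'BLUE'), (30, 'WED', 'BLUE')]
```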

CodePudding user response:

Here's how to do it with a window. This example uses salary; in your case you'd use 'DAY' for orderBy and 'VALUE' for partitionBy.

from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = [("James","Sales",3000),("Michael","Sales",4600),
      ("Robert","Sales",4100),("Maria","Finance",3000),
      ("Raman","Finance",3000),("Scott","Finance",3300),
      ("Jen","Finance",3900),("Jeff","Marketing",3000),
      ("Kumar","Marketing",2000)]

df = spark.createDataFrame(data,["Name","Department","Salary"])
df.show()
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
w2 = Window.partitionBy("department").orderBy(col("salary"))
df.withColumn("row",row_number().over(w2)) \
  .filter(col("row") == 1).drop("row") \
  .show()
+-----+----------+------+
| Name|Department|Salary|
+-----+----------+------+
|James|     Sales|  3000|
|Maria|   Finance|  3000|
|Kumar| Marketing|  2000|
+-----+----------+------+

Yes, you'd need to devise a way of ordering days, but the point is that it's possible and you picked the right tool. One warning I always give: a window pulls all the rows of each partition onto a single executor to do its work, so it isn't particularly efficient. On small datasets it will likely perform fine; on large ones it may take far too long to complete.
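Translated to the question's data, the window would be Window.partitionBy("VALUE").orderBy(day index), keeping the row where row_number() == 1. Since day names don't sort chronologically as strings, you'd map them to numbers first. Here's a plain-Python emulation of that per-partition logic (the DAY_ORDER mapping is an assumed weekday ranking, not from the original post):

```python
# Emulates Window.partitionBy("VALUE").orderBy(day index) + row_number() == 1.
# DAY_ORDER is a hypothetical weekday ranking used to make days sortable.
DAY_ORDER = {"MON": 1, "TUES": 2, "WED": 3, "THURS": 4, "FRI": 5}

rows = [(20, "MON", "BLUE"), (20, "TUES", "BLUE"), (30, "WED", "BLUE")]

first_per_value = {}
for value, day, color in rows:
    best = first_per_value.get(value)
    # keep the row with the smallest day index within each VALUE partition
    if best is None or DAY_ORDER[day] < DAY_ORDER[best[1]]:
        first_per_value[value] = (value, day, color)

print(sorted(first_per_value.values()))  # [(20, 'MON', 'BLUE'), (30, 'WED', 'BLUE')]
```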

CodePudding user response:

Here's how I'd do it without a window. It will likely perform better on large datasets, as it can use more of the cluster to do the work. In your case, 'VALUE' plays the role of Department and 'DAY' the role of Salary.

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = [("James","Sales",3000),("Michael","Sales",4600),
      ("Robert","Sales",4100),("Maria","Finance",3000),
      ("Raman","Finance",3000),("Scott","Finance",3300),
      ("Jen","Finance",3900),("Jeff","Marketing",3000),
      ("Kumar","Marketing",2000)]

df = spark.createDataFrame(data, ["Name","Department","Salary"])

# Pack every record into a struct; Salary goes first so the array sort
# orders on it before the other fields.
unGroupedDf = df.select(
    df["Department"],
    f.struct(
        df["Salary"].alias("Salary"),
        df["Department"].alias("Dept"),
        df["Name"].alias("Name")
    ).alias("record"))

unGroupedDf.groupBy("Department") \
    .agg(f.collect_list("record").alias("record")) \
    .select(
        f.reverse(                         # make the sort descending
            f.array_sort(f.col("record"))  # sort the array ascending
        )[0].alias("record")) \
    .select(f.col("record.*")) \
    .show()
# For the question's "first" row, you'd drop the reverse() and just take
# element [0] of the ascending sort instead.
+------+---------+-------+
|Salary|     Dept|   Name|
+------+---------+-------+
|  4600|    Sales|Michael|
|  3900|  Finance|    Jen|
|  3000|Marketing|   Jeff|
+------+---------+-------+
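The collect-and-sort approach above reduces, per group, to "gather all packed records, sort them, and take the largest". A plain-Python emulation of that reduction (no Spark needed) confirms the expected output:

```python
# Emulates groupBy + collect_list + array_sort + reverse + [0]:
# per department, sort the (Salary, Dept, Name) structs and keep the largest.
from collections import defaultdict

data = [("James","Sales",3000), ("Michael","Sales",4600),
        ("Robert","Sales",4100), ("Maria","Finance",3000),
        ("Raman","Finance",3000), ("Scott","Finance",3300),
        ("Jen","Finance",3900), ("Jeff","Marketing",3000),
        ("Kumar","Marketing",2000)]

groups = defaultdict(list)
for name, dept, salary in data:
    # Salary first, so it dominates the tuple comparison (like the struct)
    groups[dept].append((salary, dept, name))

top = {dept: sorted(records, reverse=True)[0] for dept, records in groups.items()}
for salary, dept, name in top.values():
    print(salary, dept, name)
```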