Home > Software engineering >  For count of records having more than 1 retrieve the least value
For count of records having more than 1 retrieve the least value

Time:10-20

I have a table in the database. I need to get records from the table where if my count of the id is greater than 1 then I have to choose the record having the minimum value(itm_num).

All the unique ids are retrieved. For id's having more than 1 as count, then retrieve itm_num which has the minimum value(sort by ascending).

Input :

Source  id    group     cd     itm_num
eu2  10404458 MELDING  DEF      0003
eu2  10404458 MELDING  DEF      0002
eu2  10404458 AANV     PLAN     0001
pda  10020520 AANVRAA  PLAN1    0001
pda  10020520 BGAAD    PLAN1    0007
pda  10020527 HYGGG    PLAN1    0002
sys  10020120 HYGGG    PLAN1    0002
pda  10020620 HYGGG    PLAN1    0002

Expected output :

  Source   id    group     cd     itm_num
    eu2  10404458 AANV     PLAN     0001
    pda  10020520 AANVRAA  PLAN1    0001
    pda  10020527 HYGGG    PLAN1    0002
    sys  10020120 HYGGG    PLAN1    0002
    pda  10020620 HYGGG    PLAN1    0002

I want this result in Pyspark(SQL will also help) Please help! Thanks.

CodePudding user response:

I would simply compute a row number and select the first one.

from pyspark.sql import functions as F, Window

df.withColumn(
    "rwnb",
    F.row_number().over(Window.partitionBy("id").orderBy("itm_num"))
).where("rwnb = 1").drop("rwnb ")

But if you do not need the whole line, a simple min groupby is enough :

df.groupBy("id").agg(F.min("itm_num"))

CodePudding user response:

You're using a reserved word as column name witch needs to surrounded by back ticks (MySQL). It's likely to be different with other DB's like postgres (""). But if you're working with that kind of naming you might already know. Don't forget to replace the table name "test" with yours.

select * from test where (id, itm_num) in 
(select id, min(itm_num) from test group by id) order by id asc
  • Related