I have a table in the database. I need to retrieve records from it such that, whenever an id appears more than once, only the row with the minimum itm_num is kept. In other words: return every unique id; for ids with a count greater than 1, keep the row with the smallest itm_num (ascending sort).
Input:
Source  id        group    cd     itm_num
eu2     10404458  MELDING  DEF    0003
eu2     10404458  MELDING  DEF    0002
eu2     10404458  AANV     PLAN   0001
pda     10020520  AANVRAA  PLAN1  0001
pda     10020520  BGAAD    PLAN1  0007
pda     10020527  HYGGG    PLAN1  0002
sys     10020120  HYGGG    PLAN1  0002
pda     10020620  HYGGG    PLAN1  0002
Expected output:
Source  id        group    cd     itm_num
eu2     10404458  AANV     PLAN   0001
pda     10020520  AANVRAA  PLAN1  0001
pda     10020527  HYGGG    PLAN1  0002
sys     10020120  HYGGG    PLAN1  0002
pda     10020620  HYGGG    PLAN1  0002
I want this result in PySpark (SQL will also help). Please help, thanks!
CodePudding user response:
I would simply compute a row number per id and select the first row.

from pyspark.sql import functions as F, Window

df.withColumn(
    "rwnb",
    F.row_number().over(Window.partitionBy("id").orderBy("itm_num"))
).where("rwnb = 1").drop("rwnb")
But if you do not need the whole row, a simple min with groupBy is enough:
df.groupBy("id").agg(F.min("itm_num"))
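The window logic above can also be checked without a Spark session. The sketch below is plain Python (an assumption for illustration, not part of the answer's code): it groups the sample rows by id and keeps the row with the smallest itm_num, mirroring row_number() over a partition ordered by itm_num followed by the rwnb = 1 filter.

```python
# Sample rows as (source, id, group, cd, itm_num) tuples,
# taken from the question's input table.
rows = [
    ("eu2", "10404458", "MELDING", "DEF",   "0003"),
    ("eu2", "10404458", "MELDING", "DEF",   "0002"),
    ("eu2", "10404458", "AANV",    "PLAN",  "0001"),
    ("pda", "10020520", "AANVRAA", "PLAN1", "0001"),
    ("pda", "10020520", "BGAAD",   "PLAN1", "0007"),
    ("pda", "10020527", "HYGGG",   "PLAN1", "0002"),
    ("sys", "10020120", "HYGGG",   "PLAN1", "0002"),
    ("pda", "10020620", "HYGGG",   "PLAN1", "0002"),
]

# For each id keep the row with the minimum itm_num,
# like row_number().over(partitionBy("id").orderBy("itm_num")) == 1.
best = {}
for row in rows:
    row_id, itm_num = row[1], row[4]
    if row_id not in best or itm_num < best[row_id][4]:
        best[row_id] = row

result = sorted(best.values(), key=lambda r: r[1])
for r in result:
    print(r)
```

Each of the five distinct ids survives exactly once, and the duplicated ids (10404458 and 10020520) keep their 0001 rows, matching the expected output.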
CodePudding user response:
You're using a reserved word as a column name, which needs to be surrounded by backticks in MySQL. The quoting is likely different in other databases such as Postgres (double quotes). But if you're working with that kind of naming you might already know this. Don't forget to replace the table name "test" with yours.
select * from test where (id, itm_num) in
    (select id, min(itm_num) from test group by id)
order by id asc
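The SQL answer can be tried out with Python's built-in sqlite3 module (my assumption for a quick demo; SQLite quotes the reserved column name "group" with double quotes rather than backticks, and supports the (id, itm_num) row-value IN comparison since version 3.15):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# "group" is a reserved word, so it is quoted; table name "test"
# follows the answer and should be replaced with the real one.
con.execute(
    'create table test (source text, id text, "group" text, cd text, itm_num text)'
)
con.executemany(
    "insert into test values (?, ?, ?, ?, ?)",
    [
        ("eu2", "10404458", "MELDING", "DEF",   "0003"),
        ("eu2", "10404458", "MELDING", "DEF",   "0002"),
        ("eu2", "10404458", "AANV",    "PLAN",  "0001"),
        ("pda", "10020520", "AANVRAA", "PLAN1", "0001"),
        ("pda", "10020520", "BGAAD",   "PLAN1", "0007"),
        ("pda", "10020527", "HYGGG",   "PLAN1", "0002"),
        ("sys", "10020120", "HYGGG",   "PLAN1", "0002"),
        ("pda", "10020620", "HYGGG",   "PLAN1", "0002"),
    ],
)

# The answer's query: per id, keep only the row whose itm_num
# equals the group's minimum.
rows = con.execute(
    """select * from test where (id, itm_num) in
           (select id, min(itm_num) from test group by id)
       order by id asc"""
).fetchall()
for r in rows:
    print(r)
```

This returns one row per id, sorted by id ascending, matching the expected output table.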