pandas groupby based on multi-columns, but keep the most repeated duplicates number on other column

Time:04-08

I have a table that looks like the one below (the first column is the index).

      lon         lat        output
4050 -47.812224 -19.043365 1890.283215
5149 -47.812224 -19.043365 1890.283215
7316 -47.812224 -19.043365 1890.283215
8406 -47.812224 -19.043365 1890.283215
511 -47.812014 -19.007094 1813.785728
1555 -47.812014 -19.007094 1813.785728
3764 -47.812014 -19.007094 1821.363582
4846 -47.812014 -19.007094 1813.785728
29 -47.811177 -19.008053 1763.091936
1114 -47.811177 -19.008053 1763.091936
3262 -47.811177 -19.008053 1763.091936
4357 -47.811177 -19.008053 1763.091936
1436 -47.774424 -19.008700 2172.781911
2557 -47.774424 -19.008700 2174.394848
4725 -47.774424 -19.008700 2172.781911
5840 -47.774424 -19.008700 2172.781911
5211 -47.774166 -19.043847 2897.092502
6313 -47.774166 -19.043847 2897.092502
8460 -47.774166 -19.043847 2897.092502
9543 -47.774166 -19.043847 2897.092502
1679 -47.773958 -19.007574 2179.670924
2770 -47.773958 -19.007574 2179.670924
4998 -47.773958 -19.007574 2179.670924
6088 -47.773958 -19.007574 2179.670924
1937 -47.773121 -19.008533 2236.769862
3004 -47.773121 -19.008533 2236.769862
5231 -47.773121 -19.008533 2236.769862
6332 -47.773121 -19.008533 2236.769862

I would like to drop the duplicates by grouping on lon and lat, keeping the most frequently repeated value in output.

For example:

lon lat output
-47.812224 -19.043365 1890.283215
-47.812014 -19.007094 1813.785728
-47.811177 -19.008053 1763.091936
-47.774424 -19.008700 2172.781911
-47.774166 -19.043847 2897.092502
-47.773958 -19.007574 2179.670924
-47.773121 -19.008533 2236.769862

Could anyone tell me how to do this?

CodePudding user response:

You can combine .groupby with Series.mode:

x = df.groupby(["lon", "lat"])["output"].apply(lambda x: x.mode()[0])
print(x.reset_index())

Prints:

         lon        lat       output
0 -47.812224 -19.043365  1890.283215
1 -47.812014 -19.007094  1813.785728
2 -47.811177 -19.008053  1763.091936
3 -47.774424 -19.008700  2172.781911
4 -47.774166 -19.043847  2897.092502
5 -47.773958 -19.007574  2179.670924
6 -47.773121 -19.008533  2236.769862
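The approach above can be sketched end-to-end on a small toy frame (the values below are shortened stand-ins for the question's data, not the real numbers). Note that when two values are tied for most frequent, mode() returns both and [0] picks the smaller one.

```python
import pandas as pd

# Toy frame: two (lon, lat) groups, each with one majority output value.
df = pd.DataFrame(
    {
        "lon": [-47.812014] * 4 + [-47.774424] * 4,
        "lat": [-19.007094] * 4 + [-19.008700] * 4,
        "output": [1813.7, 1813.7, 1821.3, 1813.7,
                   2172.7, 2174.3, 2172.7, 2172.7],
    }
)

# For each (lon, lat) group, keep the most frequent output value.
x = df.groupby(["lon", "lat"])["output"].apply(lambda s: s.mode()[0])
result = x.reset_index()
print(result)
```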

CodePudding user response:

As an alternative to Andrej's method of using .apply, we can pass the aggregation function directly to .groupby(...).agg.

While .apply does solve the problem, it tends to become slower on large datasets because the supplied function runs once per group in Python rather than as a vectorized operation.

Note that reset_index(inplace=True) returns None, so it must not be used at the end of a method chain; call reset_index() with no arguments instead:

%%timeit
df.groupby(['lat', 'lon'])['output'].agg(pd.Series.mode).reset_index()

One caveat: when a group has two equally frequent values, pd.Series.mode returns both of them, so the aggregated cell holds an array rather than a scalar.

Hope that helps!!
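For larger frames, a fully vectorized alternative (a sketch not taken from either answer above) is to count each (lon, lat, output) combination with DataFrame.value_counts, which sorts by count in descending order by default, and then keep the first row per (lon, lat) pair:

```python
import pandas as pd

# Toy frame: one (lon, lat) group where 1813.7 is the majority output.
df = pd.DataFrame(
    {
        "lon": [-47.812014] * 4,
        "lat": [-19.007094] * 4,
        "output": [1813.7, 1813.7, 1821.3, 1813.7],
    }
)

# Count each combination; value_counts sorts most-frequent first.
counts = df.value_counts(["lon", "lat", "output"]).reset_index(name="n")

# Keep the first (i.e. most frequent) output per (lon, lat) pair.
result = counts.drop_duplicates(["lon", "lat"])[["lon", "lat", "output"]]
print(result)
```

Unlike pd.Series.mode, this always yields a single scalar per group (ties are broken by the ordering value_counts happens to produce).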
