I have a table that looks like the one below.
index | lon | lat | output |
---|---|---|---|
4050 | -47.812224 | -19.043365 | 1890.283215 |
5149 | -47.812224 | -19.043365 | 1890.283215 |
7316 | -47.812224 | -19.043365 | 1890.283215 |
8406 | -47.812224 | -19.043365 | 1890.283215 |
511 | -47.812014 | -19.007094 | 1813.785728 |
1555 | -47.812014 | -19.007094 | 1813.785728 |
3764 | -47.812014 | -19.007094 | 1821.363582 |
4846 | -47.812014 | -19.007094 | 1813.785728 |
29 | -47.811177 | -19.008053 | 1763.091936 |
1114 | -47.811177 | -19.008053 | 1763.091936 |
3262 | -47.811177 | -19.008053 | 1763.091936 |
4357 | -47.811177 | -19.008053 | 1763.091936 |
1436 | -47.774424 | -19.008700 | 2172.781911 |
2557 | -47.774424 | -19.008700 | 2174.394848 |
4725 | -47.774424 | -19.008700 | 2172.781911 |
5840 | -47.774424 | -19.008700 | 2172.781911 |
5211 | -47.774166 | -19.043847 | 2897.092502 |
6313 | -47.774166 | -19.043847 | 2897.092502 |
8460 | -47.774166 | -19.043847 | 2897.092502 |
9543 | -47.774166 | -19.043847 | 2897.092502 |
1679 | -47.773958 | -19.007574 | 2179.670924 |
2770 | -47.773958 | -19.007574 | 2179.670924 |
4998 | -47.773958 | -19.007574 | 2179.670924 |
6088 | -47.773958 | -19.007574 | 2179.670924 |
1937 | -47.773121 | -19.008533 | 2236.769862 |
3004 | -47.773121 | -19.008533 | 2236.769862 |
5231 | -47.773121 | -19.008533 | 2236.769862 |
6332 | -47.773121 | -19.008533 | 2236.769862 |
I would like to drop the duplicates by grouping on lon and lat, keeping the most frequently repeated value of output. For example:
lon | lat | output |
---|---|---|
-47.812224 | -19.043365 | 1890.283215 |
-47.812014 | -19.007094 | 1813.785728 |
-47.811177 | -19.008053 | 1763.091936 |
-47.774424 | -19.008700 | 2172.781911 |
-47.774166 | -19.043847 | 2897.092502 |
-47.773958 | -19.007574 | 2179.670924 |
-47.773121 | -19.008533 | 2236.769862 |
Could anyone tell me how to do this?
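For reference, the table above can be loaded into a pandas DataFrame roughly like this (only the first few rows are included; the unnamed first column is assumed to be the DataFrame index rather than a regular column):
import pandas as pd

# sample of the data from the table above; the unnamed first column
# is assumed to be the DataFrame index
df = pd.DataFrame(
    {
        "lon": [-47.812224, -47.812224, -47.812014, -47.812014, -47.812014],
        "lat": [-19.043365, -19.043365, -19.007094, -19.007094, -19.007094],
        "output": [1890.283215, 1890.283215, 1813.785728, 1813.785728, 1821.363582],
    },
    index=[4050, 5149, 511, 1555, 3764],
)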
CodePudding user response:
You can combine `.groupby` with `Series.mode`:
x = df.groupby(["lon", "lat"])["output"].apply(lambda x: x.mode()[0])
print(x.reset_index())
Prints:
lon lat output
0 -47.812224 -19.043365 1890.283215
1 -47.812014 -19.007094 1813.785728
2 -47.811177 -19.008053 1763.091936
3 -47.774424 -19.008700 2172.781911
4 -47.774166 -19.043847 2897.092502
5 -47.773958 -19.007574 2179.670924
6 -47.773121 -19.008533 2236.769862
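Note that `Series.mode` returns every value tied for the highest count (sorted ascending), so `x.mode()[0]` silently keeps the smallest of them when there is a tie; in your sample each group has a single clear mode, so that does not matter here. If you prefer `.agg` over `.apply`, an equivalent sketch (assuming `df` is the DataFrame above):
# mode() returns the tied values sorted ascending; .iloc[0] picks the first,
# so ties are resolved in favour of the smallest output value
x = df.groupby(["lon", "lat"])["output"].agg(lambda s: s.mode().iloc[0])
print(x.reset_index())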
CodePudding user response:
We can use the `.groupby` aggregate method as an alternative to Andrej's approach of using `.apply` and computing the mode for each group. While `.apply` does solve the problem, it tends to become slower on large datasets due to the lack of vectorization. Also, `reset_index` can be called with `inplace=True` to modify the frame in place instead of returning a new copy:
%%timeit
# aggregate each (lon, lat) group with the mode; note that pd.Series.mode
# returns all tied values as an array when a group has more than one mode
df.groupby(['lon', 'lat']).agg(pd.Series.mode).reset_index(inplace=True)
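For a rough comparison on your own data, the `.apply`-based version from the first answer can be timed the same way (a sketch; the actual numbers depend on the data size and the number of groups):
%%timeit
# time the .apply-based mode-per-group approach for comparison
df.groupby(['lon', 'lat'])['output'].apply(lambda x: x.mode()[0]).reset_index()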
Hope that helps!!