It feels so straight forward but I haven't found the answer to my question yet. How does one group by proximity, or closeness, of two floats in pandas?
Ok, I could do this the loopy way but my data is big and I hope I can expand my pandas skills with your help and do this elegantly:
I have a column of times in nanoseconds in my DataFrame. I want to group these based on the proximity of their values to little clusters. Most of them will be two rows per cluster maybe up to five or six. I do not know the number of clusters. It will be a massive amount of very small clusters. I thought I could e.g. introduce a second index or just an additional column with 1 for all rows of the first cluster, 2 for the second and so forth so that groupby gets straight forward thereafter. something like:
t (ns) | cluster | |
---|---|---|
71 | 1524957248.4375 | 1 |
72 | 1524957265.625 | 1 |
699 | 14624846476.5625 | 2 |
700 | 14624846653.125 | 2 |
701 | 14624846661.287 | 2 |
1161 | 25172864926.5625 | 3 |
1160 | 25172864935.9375 | 3 |
Thanks for your help!
CodePudding user response:
Assuming you want to create the "cluster" column from the index based on the proximity of the successive values, you could use:
thresh = 1
df['cluster'] = df.index.to_series().diff().gt(thresh).cumsum().add(1)
using the "t (ns)":
thresh = 1
df['cluster'] = df['t (ns)'].diff().gt(thresh).cumsum().add(1)
output:
t (ns) cluster
71 1.524957e 09 1
72 1.524957e 09 1
699 1.462485e 10 2
700 1.462485e 10 2
701 1.462485e 10 2
1161 2.517286e 10 3
1160 2.517286e 10 3
CodePudding user response:
You can 'round' the t (ns)
column by floor dividing them with a threshold value and looking at their differences:
df[['t (ns)']].assign(
cluster=(df['t (ns)'] // 10E7)
.diff().gt(0).cumsum().add(1)
)
Or you can experiment with the number of clusters you try to organize your data:
bins=3
df[['t (ns)']].assign(
bins=pd.cut(
df['t (ns)'], bins=bins).cat.rename_categories(range(1, bins 1)
)
)