Grouping rows by proximity of floats in a python pandas dataframe-CodePudding

It feels so straight forward but I haven't found the answer to my question yet. How does one group by proximity, or closeness, of two floats in pandas?

Ok, I could do this the loopy way but my data is big and I hope I can expand my pandas skills with your help and do this elegantly:

I have a column of times in nanoseconds in my DataFrame. I want to group these based on the proximity of their values to little clusters. Most of them will be two rows per cluster maybe up to five or six. I do not know the number of clusters. It will be a massive amount of very small clusters. I thought I could e.g. introduce a second index or just an additional column with 1 for all rows of the first cluster, 2 for the second and so forth so that groupby gets straight forward thereafter. something like:

	t (ns)	cluster
71	1524957248.4375	1
72	1524957265.625	1
699	14624846476.5625	2
700	14624846653.125	2
701	14624846661.287	2
1161	25172864926.5625	3
1160	25172864935.9375	3

Thanks for your help!

CodePudding user response：

Assuming you want to create the "cluster" column from the index based on the proximity of the successive values, you could use:

thresh = 1
df['cluster'] = df.index.to_series().diff().gt(thresh).cumsum().add(1)

using the "t (ns)":

thresh = 1
df['cluster'] = df['t (ns)'].diff().gt(thresh).cumsum().add(1)

output:

            t (ns)  cluster
71    1.524957e 09        1
72    1.524957e 09        1
699   1.462485e 10        2
700   1.462485e 10        2
701   1.462485e 10        2
1161  2.517286e 10        3
1160  2.517286e 10        3

CodePudding user response：

You can 'round' the t (ns) column by floor dividing them with a threshold value and looking at their differences:

df[['t (ns)']].assign(
    cluster=(df['t (ns)'] // 10E7)
    .diff().gt(0).cumsum().add(1)
)

Or you can experiment with the number of clusters you try to organize your data:

bins=3
df[['t (ns)']].assign(
    bins=pd.cut(
        df['t (ns)'], bins=bins).cat.rename_categories(range(1, bins   1)
    )
)