Give unique identifiers to clusters containing the same value

Time:07-19

Say I have a DataFrame column of ones and zeros, and I want to group by clusters of consecutive rows where the value is 1. Using groupby on the column would ordinarily yield only two groups: a single group of zeros and a single group of ones.

import pandas as pd

df = pd.DataFrame([1,1,1,0,0,0,0,1,1,0,0,0,1,0,1,1,1],columns=['clusters'])

print(df)
    clusters
0          1
1          1
2          1
3          0
4          0
5          0
6          0
7          1
8          1
9          0
10         0
11         0
12         1
13         0
14         1
15         1
16         1

for k, g in df.groupby(by=df.clusters):
    print(k, g)

0     clusters
3          0
4          0
5          0
6          0
9          0
10         0
11         0
13         0
1     clusters
0          1
1          1
2          1
7          1
8          1
12         1
14         1
15         1
16         1

What I actually need is a new column that gives each cluster of consecutive 1s its own identifier and leaves the zeros as 0, so that we end up with:

    clusters  unique
0          1       1
1          1       1
2          1       1
3          0       0
4          0       0
5          0       0
6          0       0
7          1       2
8          1       2
9          0       0
10         0       0
11         0       0
12         1       3
13         0       0
14         1       4
15         1       4
16         1       4

Any help welcome. Thanks.

CodePudding user response:

Let us use ngroup, keyed on the running count of zeros for the rows that hold a 1:

m = df['clusters'].eq(0)  # True where the value is 0
df['unique'] = df.groupby(m.cumsum()[~m]).ngroup() + 1

    clusters  unique
0          1       1
1          1       1
2          1       1
3          0       0
4          0       0
5          0       0
6          0       0
7          1       2
8          1       2
9          0       0
10         0       0
11         0       0
12         1       3
13         0       0
14         1       4
15         1       4
16         1       4
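
To see why this works, here is a minimal sketch that breaks the one-liner into steps. The names run_key and labels are only illustrative. The trailing fillna(0).astype(int) is an addition, not part of the answer above: it assumes that rows without a group key may come back from ngroup as either -1 or NaN depending on the pandas version, and it should change nothing where they come back as -1.

```python
import pandas as pd

df = pd.DataFrame([1,1,1,0,0,0,0,1,1,0,0,0,1,0,1,1,1], columns=['clusters'])

m = df['clusters'].eq(0)      # True where the value is 0

# Running count of zeros, kept only for the rows that hold a 1.
# The count does not change inside a run of 1s, so each run shares one key.
run_key = m.cumsum()[~m]

# Number the distinct keys 0, 1, 2, ... and shift them to start at 1.
# The zero rows have no key; fill them with 0 (assumption: they may be NaN here).
labels = df.groupby(run_key).ngroup().add(1).fillna(0).astype(int)

df['unique'] = labels
print(df)
```

The key point is that the zero count only moves when a 0 is passed, so two runs of 1s separated by at least one 0 always get different keys.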

CodePudding user response:

Using a mask:

m = df['clusters'].eq(0)
df['unique'] = m.ne(m.shift()).mask(m, False).cumsum().mask(m, 0)

output:

    clusters  unique
0          1       1
1          1       1
2          1       1
3          0       0
4          0       0
5          0       0
6          0       0
7          1       2
8          1       2
9          0       0
10         0       0
11         0       0
12         1       3
13         0       0
14         1       4
15         1       4
16         1       4
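
The same idea can be spelled out with named intermediates, which may be easier to follow. This is only a sketch, not a replacement for the one-liner above: the helper name label_runs and the variables is_one and starts are hypothetical, and it assumes the column holds nothing but 0 and 1.

```python
import pandas as pd

def label_runs(s: pd.Series) -> pd.Series:
    """Label each run of consecutive 1s as 1, 2, 3, ...; rows holding 0 get 0."""
    is_one = s.eq(1)
    # A run starts where the current row is 1 and the previous row is not.
    starts = is_one & ~is_one.shift(fill_value=False)
    # Counting the starts gives a running cluster number;
    # rows that are not 1 are then reset to 0.
    return starts.cumsum().where(is_one, 0)

df = pd.DataFrame([1,1,1,0,0,0,0,1,1,0,0,0,1,0,1,1,1], columns=['clusters'])
df['unique'] = label_runs(df['clusters'])
print(df)
```

Using shift(fill_value=False) treats the row before the first one as "not 1", so a column that begins with a 1 still gets a cluster starting at the first row.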