Pandas - Identify non-unique rows, grouping any pairs except in particular case

This is an extension to this question.

I am trying to figure out a non-looping way to identify (an auto-incrementing int would be ideal) the non-unique groups of rows (a group can contain one or more rows) within each TDateID, GroupID combination, except that I need it to ignore the paired grouping if all of the rows have Structure = "s".

Here is an example DataFrame:

Index  Cents  Structure  SD_YF  TDateID  GroupID
10     182.5  s          2.1    0        0
11     182.5  s          2.1    0        0
12     153.5  s          1.05   0        1
13     153.5  s          1.05   0        1
14     43     p          11     1        2
15     43     p          11     1        2
4      152    s          21     1        2
5      152    s          21     1        2
21     53     s          13     2        3
22     53     s          13     2        3
24     252    s          25     2        3
25     252    s          25     2        3

In pandas form:

import pandas as pd

df = pd.DataFrame({
    'Index': [10, 11, 12, 13, 14, 15, 4, 5, 21, 22, 24, 25],
    'Cents': [182.5, 182.5, 153.5, 153.5, 43.0, 43.0,
              152.0, 152.0, 53.0, 53.0, 252.0, 252.0],
    'Structure': ['s', 's', 's', 's', 'p', 'p', 's', 's', 's', 's', 's', 's'],
    'SD_YF': [2.1, 2.1, 1.05, 1.05, 11.0, 11.0,
              21.0, 21.0, 13.0, 13.0, 25.0, 25.0],
    'TDateID': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    'GroupID': [0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
})

My ideal output would be:

Index  Cents  Structure  SD_YF  TDateID  GroupID  UniID
10     182.5  s          2.1    0        0        1
11     182.5  s          2.1    0        0        2
12     153.5  s          1.05   0        1        3
13     153.5  s          1.05   0        1        4
14     43     p          11     1        2        5
15     43     p          11     1        2        6
4      152    s          21     1        2        5
5      152    s          21     1        2        6
21     53     s          13     2        3        7
22     53     s          13     2        3        8
24     252    s          25     2        3        9
25     252    s          25     2        3        10

I want to draw attention to UniID 5, which shows how Index 14 and 4 are paired together; similarly, UniID 6 pairs Index 15 and 5. I hope that makes sense!

The following code worked great, except that it would need to be adapted to handle the "ignore the pairing when all rows in the grouping have Structure = 's'" part.

df['UniID'] = (df['GroupID']
               + df.groupby('GroupID').ngroup().add(1)
               + df.groupby(['GroupID', 'Cents', 'SD_YF']).cumcount()
               )

CodePudding user response:

Do the IDs need to be consecutive?

If the occurrence of "duplicate" rows is small, looping over just those groups might not be too bad.

First set an ID on all the pairs using the code you have (and add an indicator column marking the rows that belong to a group). Then select out all the grouped rows (using the indicator column) and iterate over the groups: if a group is all "s", reassign the IDs so that each of its rows gets a unique one. A sketch of that follows.
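
A minimal sketch of that approach, assuming pandas and the pairing logic from the question; the names pair_id and is_paired and the * 100 offset are illustrative choices, not from the original code:

import pandas as pd

# Pair ID within each TDateID/GroupID: rows at the same position of their
# (Cents, SD_YF) sub-group share an ID. The * 100 offset keeps IDs from
# different TDateID/GroupID combinations disjoint (assumes fewer than 100
# rows per sub-group).
df['pair_id'] = (df.groupby(['TDateID', 'GroupID']).ngroup() * 100
                 + df.groupby(['TDateID', 'GroupID', 'Cents', 'SD_YF']).cumcount())

# Indicator: this row shares its pair ID with at least one other row.
df['is_paired'] = df['pair_id'].duplicated(keep=False)

# Loop over just the paired rows; if every row in a pair group has
# Structure == "s", break the group apart with fresh unique IDs.
next_id = df['pair_id'].max() + 1
for pid, grp in df[df['is_paired']].groupby('pair_id'):
    if grp['Structure'].eq('s').all():
        df.loc[grp.index, 'pair_id'] = list(range(next_id, next_id + len(grp)))
        next_id += len(grp)

The IDs come out unique per logical group but not consecutive; if consecutive integers are required, pd.factorize(df['pair_id'])[0] + 1 compresses them in order of appearance (on the example data this reproduces the ideal UniID column exactly).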

CodePudding user response:

The tricky thing is to imagine how this should generalize. Here is my understanding: create a sequential count that ignores the "p" rows, then back-fill those.

m = df['Structure'].eq('s')
# number the "s" rows with a running count; a "p" row instead adds its
# position within its run of "p" rows, so it shares an ID with the
# matching "s" row that follows
df['UniID'] = m.cumsum() + (~m).cumsum().mask(m, 0)

Output:

    Index  Cents Structure  SD_YF  TDateID  GroupID  UniID
0      10  182.5         s   2.10        0        0      1
1      11  182.5         s   2.10        0        0      2
2      12  153.5         s   1.05        0        1      3
3      13  153.5         s   1.05        0        1      4
4      14   43.0         p  11.00        1        2      5
5      15   43.0         p  11.00        1        2      6
6       4  152.0         s  21.00        1        2      5
7       5  152.0         s  21.00        1        2      6
8      21   53.0         s  13.00        2        3      7
9      22   53.0         s  13.00        2        3      8
10     24  252.0         s  25.00        2        3      9
11     25  252.0         s  25.00        2        3     10
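
As a quick sanity check, the result can be compared against the ideal UniID column (the expected list below is simply copied from the desired output in the question):

assert df['UniID'].tolist() == [1, 2, 3, 4, 5, 6, 5, 6, 7, 8, 9, 10]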