Home > database >  Filter rows if value of a column exist in two or more rows using Pandas
Filter rows if value of a column exist in two or more rows using Pandas

Time:08-11

I have the following dataframe:

Input:

    col1           col2           Col3      
0   parent         200007653.0    2021-06-13 
1   parent         200005568.2    2021-06-13 
2   parent         200005246.0    2021-06-13 
3   parent         200008323.7    2021-06-14 
4   parent         200002324.1    2021-06-15 
6   parent         200009999.1    2021-06-13 
7   parent         200005246.0    2021-06-16 
8   son            200005246.0    2021-06-13 
9   daughter       200005246.0    2021-06-17
10  son            200007653.0    2021-06-13 
11  son            200003411.0    2021-06-13 
12  daughter       200005568.2    2021-06-16
13  daughter       200006666.0    2021-06-17
14  daughter       200006666.0    2021-06-17

What I am trying to do is that:

i) IF the SAME value for col2 appears in two or more columns then we mark new column col4 with a value "1", otherwise "0"

ii) Also, one of the column to have this value MUST be for col1=parent.

Desired output:

    col1           col2           Col3          Col4 
0   parent         200007653.0    2021-06-13    1
1   parent         200005568.2    2021-06-13    1
2   parent         200005246.0    2021-06-13    0
3   parent         200008323.7    2021-06-14    0
4   parent         200002324.1    2021-06-15    0
6   parent         200009999.1    2021-06-13    0
7   parent         200005246.0    2021-06-16    1
8   son            200005246.0    2021-06-13    1
9   daughter       200005246.0    2021-06-17    1
10  son            200007653.0    2021-06-13    1
11  son            200003411.0    2021-06-13    0
12  daughter       200005568.2    2021-06-16    1
13  daughter       200006666.0    2021-06-17    0
14  daughter       200006666.0    2021-06-17    0

Note: If you check last two entries (13 and 14). Despite these two rows having same value for col2 but atleast one of them is NOT parent, therefore we don't count it.

CodePudding user response:

Use masks:

# is one of the col1 parent?
m1 = df['col1'].eq('parent').groupby(df['col2']).transform('any')

# is col2 duplicated?
m2 = df['col2'].duplicated(keep=False)

# if both conditions are True, set up 1
df['col4'] = (m1&m2).astype(int)

Output:

        col1         col2        Col3  col4
0     parent  200007653.0  2021-06-13     1
1     parent  200005568.2  2021-06-13     1
2     parent  200005246.0  2021-06-13     1
3     parent  200008323.7  2021-06-14     0
4     parent  200002324.1  2021-06-15     0
6     parent  200009999.1  2021-06-13     0
7     parent  200005246.0  2021-06-16     1
8        son  200005246.0  2021-06-13     1
9   daughter  200005246.0  2021-06-17     1
10       son  200007653.0  2021-06-13     1
11       son  200003411.0  2021-06-13     0
12  daughter  200005568.2  2021-06-16     1
13  daughter  200006666.0  2021-06-17     0
14  daughter  200006666.0  2021-06-17     0p

CodePudding user response:

IIUC, it seems you are looking for ( actually its almost the same as mozway's answer):

df['col4'] = (
    df.groupby('col2')['col2'].transform(lambda x: len(x) > 1) & 
    df.groupby('col2')['col1'].transform(lambda x: (x=='parent').any())
).astype(int)

output:

        col1         col2        Col3  col4
0     parent  200007653.0  2021-06-13     1
1     parent  200005568.2  2021-06-13     1
2     parent  200005246.0  2021-06-13     1
3     parent  200008323.7  2021-06-14     0
4     parent  200002324.1  2021-06-15     0
5     parent  200009999.1  2021-06-13     0
6     parent  200005246.0  2021-06-16     1
7        son  200005246.0  2021-06-13     1
8   daughter  200005246.0  2021-06-17     1
9        son  200007653.0  2021-06-13     1
10       son  200003411.0  2021-06-13     0
11  daughter  200005568.2  2021-06-16     1
12  daughter  200006666.0  2021-06-17     0
13  daughter  200006666.0  2021-06-17     0
  • Related