I have the following dataframe:
Input:
col1 col2 Col3
0 parent 200007653.0 2021-06-13
1 parent 200005568.2 2021-06-13
2 parent 200005246.0 2021-06-13
3 parent 200008323.7 2021-06-14
4 parent 200002324.1 2021-06-15
6 parent 200009999.1 2021-06-13
7 parent 200005246.0 2021-06-16
8 son 200005246.0 2021-06-13
9 daughter 200005246.0 2021-06-17
10 son 200007653.0 2021-06-13
11 son 200003411.0 2021-06-13
12 daughter 200005568.2 2021-06-16
13 daughter 200006666.0 2021-06-17
14 daughter 200006666.0 2021-06-17
What I am trying to do is that:
i) IF the SAME value for col2 appears in two or more columns then we mark new column col4 with a value "1", otherwise "0"
ii) Also, one of the column to have this value MUST be for col1=parent.
Desired output:
col1 col2 Col3 Col4
0 parent 200007653.0 2021-06-13 1
1 parent 200005568.2 2021-06-13 1
2 parent 200005246.0 2021-06-13 0
3 parent 200008323.7 2021-06-14 0
4 parent 200002324.1 2021-06-15 0
6 parent 200009999.1 2021-06-13 0
7 parent 200005246.0 2021-06-16 1
8 son 200005246.0 2021-06-13 1
9 daughter 200005246.0 2021-06-17 1
10 son 200007653.0 2021-06-13 1
11 son 200003411.0 2021-06-13 0
12 daughter 200005568.2 2021-06-16 1
13 daughter 200006666.0 2021-06-17 0
14 daughter 200006666.0 2021-06-17 0
Note: If you check last two entries (13 and 14). Despite these two rows having same value for col2 but atleast one of them is NOT parent, therefore we don't count it.
CodePudding user response:
Use masks:
# is one of the col1 parent?
m1 = df['col1'].eq('parent').groupby(df['col2']).transform('any')
# is col2 duplicated?
m2 = df['col2'].duplicated(keep=False)
# if both conditions are True, set up 1
df['col4'] = (m1&m2).astype(int)
Output:
col1 col2 Col3 col4
0 parent 200007653.0 2021-06-13 1
1 parent 200005568.2 2021-06-13 1
2 parent 200005246.0 2021-06-13 1
3 parent 200008323.7 2021-06-14 0
4 parent 200002324.1 2021-06-15 0
6 parent 200009999.1 2021-06-13 0
7 parent 200005246.0 2021-06-16 1
8 son 200005246.0 2021-06-13 1
9 daughter 200005246.0 2021-06-17 1
10 son 200007653.0 2021-06-13 1
11 son 200003411.0 2021-06-13 0
12 daughter 200005568.2 2021-06-16 1
13 daughter 200006666.0 2021-06-17 0
14 daughter 200006666.0 2021-06-17 0p
CodePudding user response:
IIUC, it seems you are looking for ( actually its almost the same as mozway's answer):
df['col4'] = (
df.groupby('col2')['col2'].transform(lambda x: len(x) > 1) &
df.groupby('col2')['col1'].transform(lambda x: (x=='parent').any())
).astype(int)
output:
col1 col2 Col3 col4
0 parent 200007653.0 2021-06-13 1
1 parent 200005568.2 2021-06-13 1
2 parent 200005246.0 2021-06-13 1
3 parent 200008323.7 2021-06-14 0
4 parent 200002324.1 2021-06-15 0
5 parent 200009999.1 2021-06-13 0
6 parent 200005246.0 2021-06-16 1
7 son 200005246.0 2021-06-13 1
8 daughter 200005246.0 2021-06-17 1
9 son 200007653.0 2021-06-13 1
10 son 200003411.0 2021-06-13 0
11 daughter 200005568.2 2021-06-16 1
12 daughter 200006666.0 2021-06-17 0
13 daughter 200006666.0 2021-06-17 0