I have the following DataFrame in pandas:
A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90],
B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]
I want to create a new column and in that column I want to have values from column A based on the condition on column B. Conditions are if there is no ''txt'' in between two consecutive ''BW'', then I will have those on column C. But if there is ''txt'' between two consecutive ''BW'', I want to drop all those values. So the expected output should look like:
A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90],
B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]
C = [1,10,23, BW, 24,24,55, BW, nan, nan, nan, nan, nan, nan, BW, 43,BW]
I have no clue how to do it. Any help is much appreciated.
CodePudding user response:
I don't know if this is the most efficient way to do it, but you can create a new column called mask
from mapping the values in column B the following way: 'BW'
to True
, 'txt'
to False
and all other values to np.nan
.
Then if you forward fill the NaN from mask
, and backward fill the NaN from mask
and logically combine the results (set equal to True as long as one of the forward or backward filled columns is False), you can create a column called final_mask
where all of the values between consecutive BW containing txt are filled in with True.
You can then use .apply
to select the value of column A only when the final_mask
is False and column B isn't 'BW', select column B if final_mask
is False and column B is 'BW', and np.nan
otherwise.
import numpy as np
import pandas as pd
A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
B = [24,23,29, 'BW',49,59,72, 'BW',9,183,17, 'txt',2,49,'BW',479,'BW']
df = pd.DataFrame({'A':A,'B':B})
df["mask"] = df["B"].apply(lambda x: True if x == 'BW' else False if x == 'txt' else np.nan)
df["ffill"] = df["mask"].fillna(method="ffill")
df["bfill"] = df["mask"].fillna(method="bfill")
df["final_mask"] = (df["ffill"] == False) | (df["bfill"] == False)
df["C"] = df.apply(lambda x: x['A'] if (
(x['final_mask'] == False) & (x['B'] != 'BW'))
else x['B'] if ((x['final_mask'] == False) & (x['B'] == 'BW'))
else np.nan, axis=1
)
>>> df
A B mask ffill bfill final_mask C
0 1 24 NaN NaN True False 1
1 10 23 NaN NaN True False 10
2 23 29 NaN NaN True False 23
3 45 BW True True True False BW
4 24 49 NaN True True False 24
5 24 59 NaN True True False 24
6 55 72 NaN True True False 55
7 67 BW True True True False BW
8 73 9 NaN True False True NaN
9 26 183 NaN True False True NaN
10 13 17 NaN True False True NaN
11 96 txt False False False True NaN
12 53 2 NaN False True True NaN
13 23 49 NaN False True True NaN
14 24 BW True True True False BW
15 43 479 NaN True True False 43
16 90 BW True True True False BW
Dropping the columns we created along the way:
df.drop(columns=['mask','ffill','bfill','final_mask'])
A B C
0 1 24 1
1 10 23 10
2 23 29 23
3 45 BW BW
4 24 49 24
5 24 59 24
6 55 72 55
7 67 BW BW
8 73 9 NaN
9 26 183 NaN
10 13 17 NaN
11 96 txt NaN
12 53 2 NaN
13 23 49 NaN
14 24 BW BW
15 43 479 43
16 90 BW BW
CodePudding user response:
The following will work:
import pandas as pd
import numpy as np
df = df.assign(group = (df[~df['B'].between(BW,BW)].index.to_series().diff() > 1).cumsum())
df['C'] = np.where(df.group == df[df.B == txt].group.values[0], np.nan, df.B)
df['C'] = df['C'].astype('Int64')
df = df.drop('group', axis=1)
In [435]: df
Out[435]:
A B C
0 1 24 24
1 10 23 23
2 23 29 29
3 45 999 999
4 24 49 49
5 24 59 59
6 55 72 72
7 67 999 999
8 73 9 <NA>
9 26 183 <NA>
10 13 17 <NA>
11 96 -999 <NA>
12 53 2 <NA>
13 23 49 <NA>
14 24 999 999
15 43 479 479
16 90 999 999
You can achieve it like so, assuming BW
and txt
are specific values I just filled them with some random number to differentiate them
In [277]: BW = 999
In [278]: txt = -999
In [293]: A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
...: B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,49,BW,479,BW]
In [300]: df = pd.DataFrame({'A': A, 'B': B})
In [301]: df
Out[301]:
A B
0 1 24
1 10 23
2 23 29
3 45 999
4 24 49
5 24 59
6 55 72
7 67 999
8 73 9
9 26 183
10 13 17
11 96 -999
12 53 2
13 23 49
14 24 999
15 43 479
16 90 999
First lets split the different groups of values, here I am splitting them into unique groups where each group contains the values of B
that are between the value BW
and the next BW
.
In [321]: df = df.assign(group = (df[~df['B'].between(BW,BW)].index.to_series().diff() > 1).cumsum())
In [322]: df
Out[322]:
A B group
0 1 24 0.00000000
1 10 23 0.00000000
2 23 29 0.00000000
3 45 999 NaN
4 24 49 1.00000000
5 24 59 1.00000000
6 55 72 1.00000000
7 67 999 NaN
8 73 9 2.00000000
9 26 183 2.00000000
10 13 17 2.00000000
11 96 -999 2.00000000
12 53 2 2.00000000
13 23 49 2.00000000
14 24 999 NaN
15 43 479 3.00000000
16 90 999 NaN
Next with the use of np.where()
we can replace the values depending on the condition that you set.
In [360]: df['C'] = np.where(df.group == df[df.B == txt].group.values[0], np.nan, df.B)
In [432]: df
Out[432]:
A B C group
0 1 24 24.00000000 0.00000000
1 10 23 23.00000000 0.00000000
2 23 29 29.00000000 0.00000000
3 45 999 999.00000000 NaN
4 24 49 49.00000000 1.00000000
5 24 59 59.00000000 1.00000000
6 55 72 72.00000000 1.00000000
7 67 999 999.00000000 NaN
8 73 9 NaN 2.00000000
9 26 183 NaN 2.00000000
10 13 17 NaN 2.00000000
11 96 -999 NaN 2.00000000
12 53 2 NaN 2.00000000
13 23 49 NaN 2.00000000
14 24 999 999.00000000 NaN
15 43 479 479.00000000 3.00000000
16 90 999 999.00000000 NaN
Lastly just convert the float column to int and drop the group
column which we do not need anymore. If you want to maintain that the NaN values are np.nan
then ignore the conversion to Int64
.
In [396]: df.C = df.C.astype('Int64')
In [397]: df
Out[397]:
A B group C
0 1 24 0.00000000 1
1 10 23 0.00000000 10
2 23 29 0.00000000 23
3 45 999 NaN 45
4 24 49 1.00000000 24
5 24 59 1.00000000 24
6 55 72 1.00000000 55
7 67 999 NaN 67
8 73 9 2.00000000 <NA>
9 26 183 2.00000000 <NA>
10 13 17 2.00000000 <NA>
11 96 -999 2.00000000 <NA>
12 53 2 2.00000000 <NA>
13 23 49 2.00000000 <NA>
14 24 999 NaN 24
15 43 479 3.00000000 43
16 90 999 NaN 90
In [398]: df = df.drop('group', axis=1)
In [435]: df
Out[435]:
A B C
0 1 24 24
1 10 23 23
2 23 29 29
3 45 999 999
4 24 49 49
5 24 59 59
6 55 72 72
7 67 999 999
8 73 9 <NA>
9 26 183 <NA>
10 13 17 <NA>
11 96 -999 <NA>
12 53 2 <NA>
13 23 49 <NA>
14 24 999 999
15 43 479 479
16 90 999 999