Data frame segmentation and dropping

I have the following DataFrame in pandas:

A = [1, 10, 23, 45, 24, 24, 55, 67, 73, 26, 13, 96, 53, 23, 24, 43, 90]
B = [24, 23, 29, BW, 49, 59, 72, BW, 9, 183, 17, txt, 2, 49, BW, 479, BW]

I want to create a new column C that takes its values from column A, based on a condition on column B: if there is no 'txt' between two consecutive 'BW' markers, column C should keep the values from A (and the 'BW' markers themselves); but if there is a 'txt' between two consecutive 'BW', all of those values should be dropped. So the expected output should look like:

A = [1, 10, 23, 45, 24, 24, 55, 67, 73, 26, 13, 96, 53, 23, 24, 43, 90]
B = [24, 23, 29, BW, 49, 59, 72, BW, 9, 183, 17, txt, 2, 49, BW, 479, BW]
C = [1, 10, 23, BW, 24, 24, 55, BW, nan, nan, nan, nan, nan, nan, BW, 43, BW]

I have no clue how to do it. Any help is much appreciated.

CodePudding user response:

I don't know if this is the most efficient way to do it, but you can create a new column called mask by mapping the values in column B as follows: 'BW' to True, 'txt' to False, and all other values to np.nan.

Then forward-fill the NaNs in mask, backward-fill the NaNs in mask, and combine the two results logically (True whenever either the forward-filled or the backward-filled column is False). This produces a column final_mask in which all of the values between consecutive BW markers that enclose a txt are flagged True.

You can then use .apply to build C: select the value of column A when final_mask is False and column B isn't 'BW', select column B when final_mask is False and column B is 'BW', and np.nan otherwise.

import numpy as np
import pandas as pd

A = [1, 10, 23, 45, 24, 24, 55, 67, 73, 26, 13, 96, 53, 23, 24, 43, 90]
B = [24, 23, 29, 'BW', 49, 59, 72, 'BW', 9, 183, 17, 'txt', 2, 49, 'BW', 479, 'BW']
df = pd.DataFrame({'A': A, 'B': B})

# 'BW' -> True, 'txt' -> False, everything else -> NaN
df["mask"] = df["B"].apply(lambda x: True if x == 'BW' else False if x == 'txt' else np.nan)

# Propagate the markers in both directions; a False (txt) spreads until
# it hits the True of the enclosing BW on either side.
df["ffill"] = df["mask"].ffill()
df["bfill"] = df["mask"].bfill()
df["final_mask"] = (df["ffill"] == False) | (df["bfill"] == False)

# A where the row is kept, 'BW' where B is a marker, NaN where it is dropped
df["C"] = df.apply(lambda x: x['A'] if (
    (x['final_mask'] == False) & (x['B'] != 'BW'))
    else x['B'] if ((x['final_mask'] == False) & (x['B'] == 'BW'))
    else np.nan, axis=1
)

>>> df
     A    B   mask  ffill  bfill  final_mask    C
0    1   24    NaN    NaN   True       False    1
1   10   23    NaN    NaN   True       False   10
2   23   29    NaN    NaN   True       False   23
3   45   BW   True   True   True       False   BW
4   24   49    NaN   True   True       False   24
5   24   59    NaN   True   True       False   24
6   55   72    NaN   True   True       False   55
7   67   BW   True   True   True       False   BW
8   73    9    NaN   True  False        True  NaN
9   26  183    NaN   True  False        True  NaN
10  13   17    NaN   True  False        True  NaN
11  96  txt  False  False  False        True  NaN
12  53    2    NaN  False   True        True  NaN
13  23   49    NaN  False   True        True  NaN
14  24   BW   True   True   True       False   BW
15  43  479    NaN   True   True       False   43
16  90   BW   True   True   True       False   BW

Dropping the columns we created along the way:

df.drop(columns=['mask','ffill','bfill','final_mask'])

     A    B    C
0    1   24    1
1   10   23   10
2   23   29   23
3   45   BW   BW
4   24   49   24
5   24   59   24
6   55   72   55
7   67   BW   BW
8   73    9  NaN
9   26  183  NaN
10  13   17  NaN
11  96  txt  NaN
12  53    2  NaN
13  23   49  NaN
14  24   BW   BW
15  43  479   43
16  90   BW   BW
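
As a side note, if you'd rather avoid the row-wise .apply, the same selection can be written with np.select over the helper columns built above. This is a sketch of an equivalent (not necessarily faster) formulation:

# Vectorized equivalent of the .apply above (a sketch);
# uses the final_mask helper column built earlier.
keep = ~df["final_mask"]
df["C"] = np.select(
    [keep & (df["B"] != 'BW'),   # keep A's value
     keep & (df["B"] == 'BW')],  # keep the 'BW' marker
    [df["A"], df["B"]],
    default=np.nan,
)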

CodePudding user response:

The following will work:

import pandas as pd
import numpy as np

BW, txt = 999, -999   # stand-in sentinel values for the 'BW' and 'txt' markers
A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]
df = pd.DataFrame({'A': A, 'B': B})

# Number the stretches of rows between consecutive BW markers
df = df.assign(group = (df[~df['B'].between(BW,BW)].index.to_series().diff() > 1).cumsum())
# Take A (or the BW marker itself), except in the group that contains txt
df['C'] = np.where(df.group == df[df.B == txt].group.values[0], np.nan, np.where(df.B == BW, df.B, df.A))
df['C'] = df['C'].astype('Int64')
df = df.drop('group', axis=1)
In [435]: df
Out[435]: 
     A    B     C
0    1   24     1
1   10   23    10
2   23   29    23
3   45  999   999
4   24   49    24
5   24   59    24
6   55   72    55
7   67  999   999
8   73    9  <NA>
9   26  183  <NA>
10  13   17  <NA>
11  96 -999  <NA>
12  53    2  <NA>
13  23   49  <NA>
14  24  999   999
15  43  479    43
16  90  999   999

Here is how it works step by step. Since BW and txt stand for specific values, I filled them in with arbitrary sentinel numbers just to tell them apart:

In [277]: BW = 999

In [278]: txt = -999

In [293]: A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
     ...: B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]

In [300]: df = pd.DataFrame({'A': A, 'B': B})

In [301]: df
Out[301]: 
     A    B
0    1   24
1   10   23
2   23   29
3   45  999
4   24   49
5   24   59
6   55   72
7   67  999
8   73    9
9   26  183
10  13   17
11  96 -999
12  53    2
13  23   49
14  24  999
15  43  479
16  90  999

First let's split the values into groups: each group gets a unique number and contains the values of B that fall between one BW marker and the next.

In [321]: df = df.assign(group = (df[~df['B'].between(BW,BW)].index.to_series().diff() > 1).cumsum())

In [322]: df
Out[322]: 
     A    B  group
0    1   24    0.0
1   10   23    0.0
2   23   29    0.0
3   45  999    NaN
4   24   49    1.0
5   24   59    1.0
6   55   72    1.0
7   67  999    NaN
8   73    9    2.0
9   26  183    2.0
10  13   17    2.0
11  96 -999    2.0
12  53    2    2.0
13  23   49    2.0
14  24  999    NaN
15  43  479    3.0
16  90  999    NaN

Next, with np.where() we can fill in C according to your condition: take A's value (keeping the BW markers themselves from B), except in the group that contains txt, where C becomes NaN.

In [360]: df['C'] = np.where(df.group == df[df.B == txt].group.values[0], np.nan, np.where(df.B == BW, df.B, df.A))

In [432]: df
Out[432]: 
     A    B      C  group
0    1   24    1.0    0.0
1   10   23   10.0    0.0
2   23   29   23.0    0.0
3   45  999  999.0    NaN
4   24   49   24.0    1.0
5   24   59   24.0    1.0
6   55   72   55.0    1.0
7   67  999  999.0    NaN
8   73    9    NaN    2.0
9   26  183    NaN    2.0
10  13   17    NaN    2.0
11  96 -999    NaN    2.0
12  53    2    NaN    2.0
13  23   49    NaN    2.0
14  24  999  999.0    NaN
15  43  479   43.0    3.0
16  90  999  999.0    NaN

Lastly, convert the float column to the nullable Int64 type and drop the group column, which we no longer need. If you would rather keep the missing values as np.nan, skip the Int64 conversion.

In [396]: df.C = df.C.astype('Int64')

In [397]: df
Out[397]: 
     A    B  group     C
0    1   24    0.0     1
1   10   23    0.0    10
2   23   29    0.0    23
3   45  999    NaN   999
4   24   49    1.0    24
5   24   59    1.0    24
6   55   72    1.0    55
7   67  999    NaN   999
8   73    9    2.0  <NA>
9   26  183    2.0  <NA>
10  13   17    2.0  <NA>
11  96 -999    2.0  <NA>
12  53    2    2.0  <NA>
13  23   49    2.0  <NA>
14  24  999    NaN   999
15  43  479    3.0    43
16  90  999    NaN   999

In [398]: df = df.drop('group', axis=1)

In [435]: df
Out[435]: 
     A    B     C
0    1   24     1
1   10   23    10
2   23   29    23
3   45  999   999
4   24   49    24
5   24   59    24
6   55   72    55
7   67  999   999
8   73    9  <NA>
9   26  183  <NA>
10  13   17  <NA>
11  96 -999  <NA>
12  53    2  <NA>
13  23   49  <NA>
14  24  999   999
15  43  479    43
16  90  999   999
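
One caveat: df[df.B == txt].group.values[0] picks the group of the first txt only, which is fine here because txt occurs exactly once. If txt could show up in several stretches (an assumption beyond the example data), you could instead, before dropping group, blank out every group that contains it. A sketch:

# Hypothetical generalization: NaN out every group that contains txt,
# not just the first one (the example data only has one such group).
bad_groups = df.loc[df.B == txt, 'group'].unique()
df['C'] = np.where(df['group'].isin(bad_groups), np.nan,
                   np.where(df.B == BW, df.B, df.A))
df['C'] = df['C'].astype('Int64')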