How to mark start/end of a series of non-null and non-0 values in a column of a Pandas DataFrame?

Time:09-20

I have a dataset made up of several cycles, each with a series of data points.

I want to write a function that goes through the data and marks the start of each cycle's data with the letter A and the end with the letter B.

For the following dataset:

import pandas as pd

data = {'Cycle': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        'Value': [0, 19, 18, 14, 65, 0, 0, 0, 0, 0, 0, 1, 18, 12, 65, 0, 0, 0, 0, 0, 0, 19, 18, 14, 65, 54, 32, 0, 0, 0]}

df = pd.DataFrame(data)

I would like a result like the following, where rows that aren't POIs get a placeholder ("-", or 0 as shown here):

import pandas as pd

data = {'Cycle': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        'Value': [0, 19, 18, 14, 65, 0, 0, 0, 0, 0, 0, 1, 18, 12, 65, 0, 0, 0, 0, 0, 0, 19, 18, 14, 65, 54, 32, 0, 0, 0],
        'POI': [0, 'A', 0, 0, 'B', 0, 0, 0, 0, 0, 0, 'A', 0, 0, 'B', 0, 0, 0, 0, 0, 0, 'A', 0, 0, 0, 0, 'B', 0, 0, 0]}

df = pd.DataFrame(data)

I have tried using something like this:

data['POI2'] = data.groupby('Cycle').apply(
lambda g: np.select([g['Value'] == g['Value'].max(),
                     g['Value'] == g['Value'].min()],
                    ['A', 'B'], default='-')
).explode().set_axis(data.index, axis=0)

With no luck. Basically, within each cycle I need to skip the points that are NaN or 0 and then mark the start and end of the remaining data with an A and a B.

CodePudding user response:

Here's one way:

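# Turn zeros into NaN, then find the first and last valid (non-NaN) row label per cycle.
mp = df.where(df != 0).groupby('Cycle')['Value'].agg([pd.Series.first_valid_index,
                                                       pd.Series.last_valid_index])
df.loc[mp['first_valid_index'], 'POI'] = 'A'   # start of each cycle's data
df.loc[mp['last_valid_index'], 'POI'] = 'B'    # end of each cycle's data
df['POI'] = df['POI'].fillna(0)                # every other row gets 0
df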

Output:

    Cycle  Value POI
0       1      0   0
1       1     19   A
2       1     18   0
3       1     14   0
4       1     65   B
5       1      0   0
6       1      0   0
7       1      0   0
8       1      0   0
9       1      0   0
10      2      0   0
11      2      1   A
12      2     18   0
13      2     12   0
14      2     65   B
15      2      0   0
16      2      0   0
17      2      0   0
18      2      0   0
19      2      0   0
20      3      0   0
21      3     19   A
22      3     18   0
23      3     14   0
24      3     65   0
25      3     54   0
26      3     32   B
27      3      0   0
28      3      0   0
29      3      0   0
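
To see why the masking step matters, here is a minimal sketch on a single column. The Series `s` is made up for illustration and is not part of the answer above. `where` turns the zeros into NaN, so `first_valid_index` and `last_valid_index` return the labels of the first and last non-zero entries.

```python
import pandas as pd

# Illustrative data, not taken from the question.
s = pd.Series([0, 19, 18, 0])

# Replace zeros with NaN so that they count as "invalid".
masked = s.where(s != 0)

first = masked.first_valid_index()
last = masked.last_valid_index()
print(first, last)  # 1 2
```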

CodePudding user response:

Why another answer when there is already an accepted one by Scott Boston and another one by PaulS?

Because the code in this answer also works correctly in each of the following cases:

  • if a Cycle does not contain any data blocks with values
  • if more than one block of valid data occurs within a Cycle
  • if a data block within a Cycle consists of only one value

The current answers (2022-09-20 01:10 CET) by Scott Boston and PaulS fail to give the right result in at least two of these cases.

In other words, if a Cycle may contain no data, two or more blocks of valid data, or a block with only a single value, the code in the other answers (unless it has been updated in the meantime) can't be used, because it produces wrong markings or fails with an error.

To handle data blocks that contain only one value, the code below uses an 'AB' mark in addition to the 'A' and 'B' marks, so that a single value within a Cycle can be marked as both start and end.

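from itertools import groupby

import pandas as pd

def getPOI(df):
    # Pair each row's Cycle with its Value.
    lstCV = zip(df.Cycle, df.Value)
    lstPOI = []
    # Group consecutive rows that share the same Cycle and the same
    # validity flag (True if the value is non-zero and not NaN).
    for (c, v), g in groupby(lstCV, lambda cv:
                            (cv[0], cv[1] != 0 and not pd.isnull(cv[1]))):
        llg = sum(1 for item in g)  # length of the current run
        if not v:
            lstPOI.extend([0]*llg)
        else:
            # 'A' ... 'B' for runs of two or more values, 'AB' for a single value
            lstPOI.extend(['A'] + (llg-2)*[0] + ['B'] if llg > 1 else ['AB'])
    return lstPOI

df["POI"] = getPOI(df)
print(df.POI.to_list())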
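
The same run-based marking can probably also be expressed without an explicit Python loop. The following is only a sketch of that idea and not part of the code above: the helper name `mark_runs` is hypothetical, and it assumes the columns are called `Cycle` and `Value` as in the question. A row starts a run when it is valid and the previous row is either invalid or belongs to another cycle; the end of a run is detected the same way by looking at the next row.

```python
import pandas as pd

def mark_runs(df):
    """Hypothetical helper: mark start/end of each run of valid values."""
    # A value is valid if it is neither 0 nor NaN.
    valid = df["Value"].fillna(0).ne(0)
    cycle = df["Cycle"]

    # Validity and cycle membership of the neighbouring rows.
    prev_valid = valid.shift(1, fill_value=False)
    next_valid = valid.shift(-1, fill_value=False)
    same_prev = cycle.eq(cycle.shift(1))
    same_next = cycle.eq(cycle.shift(-1))

    start = valid & ~(prev_valid & same_prev)
    end = valid & ~(next_valid & same_next)

    # Object dtype so the column can hold both 0 and string marks.
    poi = pd.Series(0, index=df.index, dtype=object)
    poi[start & ~end] = "A"
    poi[end & ~start] = "B"
    poi[start & end] = "AB"   # run of length one
    return poi

# Illustrative data with single values and multiple blocks per cycle.
demo = pd.DataFrame({"Cycle": [1, 1, 1, 2, 2, 2, 2],
                     "Value": [7, 0, 7, 3, 4, 0, 5]})
print(mark_runs(demo).tolist())
```

Whether this is faster than the `itertools.groupby` loop depends on the data size; for small frames the loop is likely just as good and arguably easier to read.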

Let's compare the outcome of the solutions from the other answers with the outcome of the code above for the following example data:

data = {'Cycle': [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3],
        'Value': [7,0,0,0,7,7,7,0,7,7,0,0,7,0,0]}  
df = pd.DataFrame(data)

['AB',0,  0,  0,'AB','A','B', 0, 'A','B', 0,  0, 'AB', 0,  0 ] Claudio 
['A', 0,  0,  0, 'B','A', 0 , 0,  0 ,'B', 0,  0,  'B', 0,  0 ] Scott Boston
['A','A','A','A','B','A','0','0','0','B','B','B','A' ,'A','A'] PaulS

Notice that when a Cycle contains multiple data blocks, the code in the other answers marks only the start of the first block and the end of the last block in the Cycle. With a single value, the code by Scott Boston ends up with only a 'B', because the 'A' mark is overwritten, and the code by PaulS shows a strange behavior.

I don't cover the case of a Cycle without any values in the example data, because Scott Boston's code fails on it, raising a KeyError.
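
If cycles without any valid values are expected, one way to sidestep that failure is to drop the invalid rows before grouping, so that empty cycles simply never show up among the groups. This is a sketch of that variation and not Scott Boston's code; the data is made up, and like the original approach it still marks only the first and last valid row per cycle.

```python
import pandas as pd

# Illustrative data: cycle 2 contains no valid values at all.
df = pd.DataFrame({"Cycle": [1, 1, 1, 2, 2, 2],
                   "Value": [0, 5, 6, 0, 0, 0]})

# Keep only rows whose value is neither 0 nor NaN.
valid = df[df["Value"].fillna(0) != 0]
groups = valid.groupby("Cycle")

# Object dtype so the column can hold both 0 and the string marks.
df["POI"] = pd.Series(0, index=df.index, dtype=object)
df.loc[groups.head(1).index, "POI"] = "A"   # first valid row per cycle
df.loc[groups.tail(1).index, "POI"] = "B"   # last valid row per cycle
print(df["POI"].tolist())
```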

Here is the same comparison again, laid out so that the output is easier to compare with the values in the Cycle and Value columns:

                   Claudio   Scott   PaulS
    Cycle  Value     POI     POI     POI  
0       1      7      AB       A       A
1       1      0       0       0       A
2       1      0       0       0       A
3       1      0       0       0       A
4       1      7      AB       B       B
5       2      7       A       A       A
6       2      7       B       0       0
7       2      0       0       0       0
8       2      7       A       0       0
9       2      7       B       B       B
10      3      0       0       0       B
11      3      0       0       0       B
12      3      7      AB       B       A
13      3      0       0       0       A
14      3      0       0       0       A

CodePudding user response:

Another possible solution:

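import numpy as np
# 'A' where the running count of non-zero values equals 1,
# 'B' where the same count taken from the end of the cycle equals 1.
(df.groupby('Cycle', group_keys=False)
  .apply(lambda x: x.assign(POI = np.where(np.cumsum(x['Value'] != 0) == 1, 'A',
    np.where(np.cumsum(x.loc[::-1, 'Value'] != 0) == 1, 'B', 0)[::-1]))))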

Output:

    Cycle  Value POI
0       1      0   0
1       1     19   A
2       1     18   0
3       1     14   0
4       1     65   B
5       1      0   0
6       1      0   0
7       1      0   0
8       1      0   0
9       1      0   0
10      2      0   0
11      2      1   A
12      2     18   0
13      2     12   0
14      2     65   B
15      2      0   0
16      2      0   0
17      2      0   0
18      2      0   0
19      2      0   0
20      3      0   0
21      3     19   A
22      3     18   0
23      3     14   0
24      3     65   0
25      3     54   0
26      3     32   B
27      3      0   0
28      3      0   0
29      3      0   0
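
The `cumsum(...) == 1` test is true from the first non-zero value up to, but not including, the second one. That is fine when the non-zero values follow each other directly, but it may mark extra rows on sparse data. A small NumPy sketch with made-up arrays:

```python
import numpy as np

dense = np.array([0, 19, 18, 14])   # non-zero values follow each other directly
sparse = np.array([7, 0, 0, 7])     # zeros sit between the non-zero values

# Running count of non-zero entries, compared against 1.
dense_mask = np.cumsum(dense != 0) == 1
sparse_mask = np.cumsum(sparse != 0) == 1

print(dense_mask.tolist())   # [False, True, False, False]
print(sparse_mask.tolist())  # [True, True, True, False]
```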