Python dataframe: labelling (1-0) adjacent rows upon condition


I have a column, number, containing numbers and NaN values. I want to add a label column identifying with 1 and 0 the "zones" where we have a number: a zone includes the adjacent rows (the one above and the one below).

The result should look like below:

Number   Label
NaN      0
NaN      1
4        1
NaN      1
NaN      0
NaN      0
NaN      1
8.9      1
NaN      1
NaN      0
NaN      0
NaN      1
47       1

I came up with the following solution. But it's ugly, and it would not scale if I wanted to label more adjacent cells (e.g. 2 above and 2 below).

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 100)

#Generating our DataFrame and ensuring there are some NaN
df = pd.DataFrame(np.random.randn(100), columns=['number'])
df.loc[df.number<1] = np.nan

#diffusing the values on adjacent cells and summing
df['label'] = (df.number.fillna(0)
               + df.number.shift(1).fillna(0)
               + df.number.shift(-1).fillna(0))

#Replace values by 1
df.loc[df.label>0, 'label'] = 1
print(df)

Could anyone help me find a more elegant solution? Maybe with a nice df.apply, which I have so much difficulty using?

CodePudding user response:

I suggest using the convolution operation for this. It's really nice when you want to "mask" an array over a certain space. In your case, you want to lay the mask [..., 1, 1, 1, ...] on top of each non-NaN value. Here's my approach:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Number": [pd.NA,pd.NA,4,pd.NA,pd.NA,pd.NA,pd.NA,8.9,pd.NA,pd.NA,pd.NA,pd.NA,47,]})

PADDING_VALUE = 1 # how many adjacent rows to label on each side

def change_neighbors(np_array):
    len_of_neighbors = 2*PADDING_VALUE + 1 # imagine this is the length going from [1] -> [..., 1, 1, 1, ...]
    conv_arr = np.convolve(np_array, [1]*len_of_neighbors, "same") # "same" value makes it extrapolate zeros at boundaries.

    # need to account for overlaps when convolving. 
    # Some values might be "2" depending on closeness of non-nan chars
    conv_arr[conv_arr>1]=1 

    return conv_arr

df["Label"] = df["Number"].notnull()
df["Label"] = change_neighbors(df["Label"].values)

print(df)
# >>>    Number  Label
# >>> 0    <NA>      0
# >>> 1    <NA>      1
# >>> 2       4      1
# >>> 3    <NA>      1
# >>> 4    <NA>      0
# >>> 5    <NA>      0
# >>> 6    <NA>      1
# >>> 7     8.9      1
# >>> 8    <NA>      1
# >>> 9    <NA>      0
# >>> 10   <NA>      0
# >>> 11   <NA>      1
# >>> 12     47      1

Note that when you apply the convolution, some values can add up to be larger than 1 (if padding is especially big, or the "1-elements" are very close together). For this reason I set all values larger than one equal to one, but you might have another use-case for this. Hope this helps!
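To illustrate that overlap, here is a tiny standalone sketch (using a made-up toy mask, not the DataFrame above) showing how two nearby non-NaN markers make the convolved values exceed 1 before they are clipped back to 0/1:

import numpy as np

# Toy mask: two non-NaN markers only one row apart, so their 3-wide windows overlap
mask = np.array([0, 1, 1, 0, 0])
conv = np.convolve(mask, [1, 1, 1], "same")
print(conv)            # [1 2 2 1 0] -> overlapping windows add up to 2
print((conv > 0) * 1)  # [1 1 1 1 0] -> clipped back to a clean 0/1 label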

CodePudding user response:

Shift based method

(Spoiler alert: this was my first answer, but not my best. It is not the fastest. See the end of the post for a faster solution.)

As long as your condition remains "a number in the previous, current, or next row" (I mean, if you don't want to extend that to the k previous or k next rows), the shift method seems to be the fastest way. I am not sure about the fillna idea, though.

I would use a more direct approach

df['label'] = 1*(~df.number.isna() | ~df.number.shift(1).isna() | ~df.number.shift(-1).isna())
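If you did want to push the shift idea itself to k rows on each side (before moving on to the sliding-window approach below), one possible sketch is to OR the shifted masks in a loop. The WINDOW constant and the label_by_shifts helper are names I am introducing for illustration, not from the original question:

import numpy as np
import pandas as pd

WINDOW = 2  # hypothetical: label rows within 2 positions of a number

def label_by_shifts(s, k):
    # OR together the not-NaN mask shifted by -k..+k positions
    mask = s.notna()
    out = mask.copy()
    for i in range(1, k + 1):
        out |= mask.shift(i, fill_value=False) | mask.shift(-i, fill_value=False)
    return out.astype(int)

df = pd.DataFrame({"number": [np.nan, np.nan, 4, np.nan, np.nan, np.nan, 8.9]})
df["label"] = label_by_shifts(df["number"], WINDOW)
print(df)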

Sliding-Window-view

(Other spoiler: I thought this method was pertinent only in the general case, not in the specific case of a 3-row window (current row + 1 before + 1 after). But in fact, it is faster even for this case.)

With a variable window, you can use np.lib.stride_tricks.sliding_window_view to quickly have a view on adjacent values

def fillLabel(df):
    df['label'] = 0
    # boolean mask of non-NaN rows, seen through a sliding window of width 2*PADDING_VALUE + 1
    v = np.lib.stride_tricks.sliding_window_view(~np.isnan(df.number.values), (PADDING_VALUE*2 + 1,))
    # Note: PADDING_VALUE is the same as in Steinn Hauser Magnusson's answer
    label = np.any(v, axis=1)
    # the view is PADDING_VALUE rows shorter than the column at each end
    df.label.values[PADDING_VALUE:-PADDING_VALUE] = label

Note that sliding_window_view, as its name suggests, is a view, not a copy of the data. So even if you have 1 million rows and PADDING_VALUE is 10000, it won't fill your memory with a 2D array of tens of billions of cells. It is just a convenient way to look at the adjacent values of each row in a single expression.
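As a quick check of that claim (pure numpy, independent of the DataFrame), you can verify that the window view shares its memory with the base array instead of copying it:

import numpy as np

a = np.arange(1_000_000, dtype=float)
v = np.lib.stride_tricks.sliding_window_view(a, (20_001,))  # i.e. PADDING_VALUE = 10_000

print(v.shape)                 # (980000, 20001) logical "rows", yet no copy is made
print(np.shares_memory(a, v))  # True: v is only a strided view of a
print(v.strides)               # (8, 8): each step just moves one float64 forward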

I've recently used it in another answer where I explained it a little bit more.

Timeit results for the 4 methods so far: yours, Steinn Hauser Magnusson's, and my two.

Method                  Timing (ms)
Yours                   1.58
My shift one-liner      1.30
Steinn's convolution    0.58
Sliding window          0.28

So, I must confess that I wasn't expecting either the convolution or my 2nd method to be faster than the simple one-liner for the simple "3 rows" case. But this sliding-window-view function is so fast (because, again, it is just a view) that it is the fastest even in that case. It wins on both criteria: it is the fastest, and yet you can choose the window size.
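For completeness, here is a rough sketch of how such a comparison could be reproduced. The wrapper functions and the 10,000-row test frame are my own choices rather than anything from the posts above, and each timed call includes a DataFrame copy, so the absolute numbers will not match the table exactly:

import timeit
import numpy as np
import pandas as pd

def make_df(n=10_000):
    df = pd.DataFrame(np.random.randn(n), columns=["number"])
    df.loc[df.number < 1] = np.nan
    return df

def fillna_sum(df):  # the question's original approach
    df["label"] = (df.number.fillna(0)
                   + df.number.shift(1).fillna(0)
                   + df.number.shift(-1).fillna(0))
    df.loc[df.label > 0, "label"] = 1

def shift_oneliner(df):
    df["label"] = 1 * (~df.number.isna()
                       | ~df.number.shift(1).isna()
                       | ~df.number.shift(-1).isna())

def convolution(df):
    conv = np.convolve(df.number.notna().values, [1, 1, 1], "same")
    df["label"] = (conv > 0).astype(int)

def sliding_window(df):
    df["label"] = 0
    v = np.lib.stride_tricks.sliding_window_view(~np.isnan(df.number.values), (3,))
    df.label.values[1:-1] = np.any(v, axis=1)

base = make_df()
for name, fn in [("fillna sum", fillna_sum),
                 ("shift one-liner", shift_oneliner),
                 ("convolution", convolution),
                 ("sliding window", sliding_window)]:
    t = timeit.timeit(lambda: fn(base.copy()), number=100)
    print(f"{name:18s} {1000 * t / 100:.3f} ms")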
