Check the value in every row in a large dataframe Python-CodePudding

How do I check whether the row values is matching some conditions and modify the values? For example, if value1 and value2 are matched with the conditions, then I change them by another value.

I am using this method:

df['column'] = df['column'].replace(['matching1', 'matching2'], 'value1')
df['column'] = df['column'].replace(['matching3', 'matching4'], 'value2')
...
df['column'] = df['column'].replace(['matching999', 'matching1000'], 'value500')

But I need to handle many conditions so writing many commands like this may seem too clumsy since the whole dataframe has to be rechecked one time by every condition.

Is there any way to complete it faster? For example, the row that has been modified after the condition is met does not need to be checked again

CodePudding user response：

I can think of two options. I think both are more efficient than repeating replace, in terms of the number of scans. (Not benchmarked, though.)

Let's assume that we want to change 'Apple' and 'Orange' into 'value1', and 'Lemon' into 'value2'.

replace with dict

df = pd.DataFrame(['Apple','Apple','Orange','Lemon','Banana','Apple'], columns=['column'])

df['column'].replace({'Apple': 'value1', 'Orange': 'value1', 'Lemon': 'value2'}, inplace=True)

map

df = pd.DataFrame(['Apple','Apple','Orange','Lemon','Banana','Apple'], columns=['column'])

def func(x):
    if x == 'Apple' or x == 'Orange':
        return 'value1'
    elif x == 'Lemon':
        return 'value2'
    return x

df['column'] = df['column'].map(func)

One difference between the two is that the map option is more flexible, while the dict option is more succinct in some cases. For example, if you want to do case-insensitive match, you would want to use map.

CodePudding user response：

You can use np.select(), where you create a condition-list and a choice-list.

col = df['column']
condlist = [col='matching1',col='matching2',col='matching3',col='matching4',...,col='matchingn']
choicelistlist = ['value1','value1','value2','value3',...'valuei']
df['column'] = np.select(condlist,choicelist)

I have been led to believe up to this point that most numpy function, including np.select() are vectorized and about as efficient as you can get. I don't have a great sense of how that works myself, but seems to be the case.