Consider the following 2D NumPy array:
import numpy as np
daily = np.array([
['2022-01-01', 'AccountName1', 123456789, 'campaignname1', 111, 100, 1.1, 'group'],
['2022-01-01', 'AccountName1', 123456789, 'campaignname2', 222, 200, 2.2, 'group'],
['2022-01-01', 'AccountName1', 123456789, 'campaignname3', 333, 300, 3.3, 'group'],
['2022-01-02', 'AccountName1', 123456789, 'campaignname1', 111, 400, 4.4, 'group'],
['2022-01-02', 'AccountName1', 123456789, 'campaignname2', 222, 500, 5.5, 'group'],
['2022-01-02', 'AccountName1', 123456789, 'campaignname3', 333, 600, 6.6, 'group'],
['2022-01-03', 'AccountName1', 123456789, 'campaignname1', 111, 700, 7.7, 'group'],
['2022-01-03', 'AccountName1', 123456789, 'campaignname2', 222, 800, 8.8, 'group'],
['2022-01-03', 'AccountName1', 123456789, 'campaignname3', 333, 900, 9.9, 'group'],
], dtype = object)
daily
And here is a second, 1D NumPy array (this could be a list if needed):
campaigns = np.array([111, 333], dtype = object)
campaigns
What is the fastest way to replace the values in the last column, changing 'group' to 'new' or 'old' depending on whether the row's campaign ID (column 4) appears in campaigns? I was able to do it with a Python for loop and if statements, but that is very slow for the final goal. The final goal is to check several billion new/old combinations, so we need something very quick.
%%time
for x in daily:
    if x[4] in campaigns:
        x[7] = 'new'
    else:
        x[7] = 'old'
daily
And here is the expected result:
result = np.array([
['2022-01-01', 'AccountName1', 123456789, 'campaignname1', 111, 100, 1.1, 'new'],
['2022-01-01', 'AccountName1', 123456789, 'campaignname2', 222, 200, 2.2, 'old'],
['2022-01-01', 'AccountName1', 123456789, 'campaignname3', 333, 300, 3.3, 'new'],
['2022-01-02', 'AccountName1', 123456789, 'campaignname1', 111, 400, 4.4, 'new'],
['2022-01-02', 'AccountName1', 123456789, 'campaignname2', 222, 500, 5.5, 'old'],
['2022-01-02', 'AccountName1', 123456789, 'campaignname3', 333, 600, 6.6, 'new'],
['2022-01-03', 'AccountName1', 123456789, 'campaignname1', 111, 700, 7.7, 'new'],
['2022-01-03', 'AccountName1', 123456789, 'campaignname2', 222, 800, 8.8, 'old'],
['2022-01-03', 'AccountName1', 123456789, 'campaignname3', 333, 900, 9.9, 'new']
], dtype=object)
result
CodePudding user response:
The whole column at index 4:
In [58]: daily[:,4]
Out[58]: array([111, 222, 333, 111, 222, 333, 111, 222, 333], dtype=object)
We can match it against campaigns with np.in1d, and then use the resulting boolean mask to index the rows:
In [60]: np.in1d(daily[:,4],campaigns)
Out[60]: array([ True, False, True, True, False, True, True, False, True])
In [62]: mask = np.in1d(daily[:,4],campaigns)
In [63]: daily[mask,7]
Out[63]: array(['group', 'group', 'group', 'group', 'group', 'group'], dtype=object)
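As a side note on the membership test: as far as I know, recent NumPy releases recommend np.isin over np.in1d, and object-dtype arrays tend to be slow to compare. Below is a minimal sketch of the same mask built with np.isin, assuming the campaign IDs are all plain integers so that both arrays can be cast to int64 (that cast is my assumption, not something the question states). The names ids, ids_int and campaigns_int are made up for illustration.

```python
import numpy as np

# Illustrative data: the campaign-id column from the question and the ids to match.
ids = np.array([111, 222, 333, 111, 222, 333, 111, 222, 333], dtype=object)
campaigns = np.array([111, 333], dtype=object)

# Assumption: every id is an integer, so both arrays can be cast to int64.
# This avoids element-by-element comparisons of Python objects.
ids_int = ids.astype(np.int64)
campaigns_int = campaigns.astype(np.int64)

# Boolean mask: True where the row's id appears in campaigns.
mask = np.isin(ids_int, campaigns_int)
print(mask)
```

Whether the cast pays off for billions of checks is something to measure on your own data; the point is only that the mask does not have to be computed on object arrays.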
np.where lets us convert that mask to an array of strings:
In [67]: np.where(mask, 'new','old')
Out[67]:
array(['new', 'old', 'new', 'new', 'old', 'new', 'new', 'old', 'new'],
      dtype='<U3')
Which we can assign to column 7 (in IPython, _ holds the previous output):
In [68]: daily[:,7] = _
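Outside of an IPython session there is no _ to lean on, so here is a rough sketch of the steps above put together as a plain script. The helper name label_campaigns and its id_col / label_col parameters are hypothetical, chosen only for this example, and the sample table is a shortened copy of the one in the question.

```python
import numpy as np

def label_campaigns(table, campaigns, id_col=4, label_col=7):
    """Write 'new' or 'old' into label_col, depending on whether
    the value in id_col appears in campaigns. Modifies table in place."""
    mask = np.isin(table[:, id_col], campaigns)          # one vectorized membership test
    table[:, label_col] = np.where(mask, 'new', 'old')   # pick a label per row
    return table

# Shortened copy of the question's data (first three rows only).
daily = np.array([
    ['2022-01-01', 'AccountName1', 123456789, 'campaignname1', 111, 100, 1.1, 'group'],
    ['2022-01-01', 'AccountName1', 123456789, 'campaignname2', 222, 200, 2.2, 'group'],
    ['2022-01-01', 'AccountName1', 123456789, 'campaignname3', 333, 300, 3.3, 'group'],
], dtype=object)
campaigns = np.array([111, 333], dtype=object)

label_campaigns(daily, campaigns)
print(daily[:, 7])
```

The function mutates the array it is given, which matches what the original for loop does; pass a copy if the 'group' placeholders need to survive.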
I see lots of pandas questions about using np.where in the same sort of way.