I have a dataframe called df
which has the following columns header of data:
bureau0 | bureau1 | bureau2 | bureau3 | bureau4 | waysConstitution |
---|---|---|---|---|---|
4753900 | 4773300 | NaN | 4713002 | Three | 2135 |
4721103 | NaN | NaN | NaN | 4751201 | 2135 |
6204000 | 6201501 | 6319400 | NaN | 6201501 | 2062 |
7911200 | 8230001 | NaN | 8291100 | NaN | 2135 |
4713002 | NaN | NaN | 4751201 | 4753900 | 2054 |
In fact, the bureau column goes up to 100, but I think these 5 cols examples already help.
I also have another dataset, called top_bureau, which contains some bureau code. Like this:
top_bureau |
---|
7911200 |
4713002 |
4773300 |
5811500 |
6201501 |
A more lines in the top_bureau, but I think for example it should work.
What I would like to do is: Create a new column in df that returns the value "potential_costumer" if in any bureau column there is some code that has in top_bureau and the waysConstitution column has the value equal to "2135" or "2062".
Expected output:
bureau0 | bureau1 | bureau2 | bureau3 | bureau4 | waysConstitution | potential_customer |
---|---|---|---|---|---|---|
4753900 | 4773300 | NaN | 4713002 | Three | 2135 | Yes |
4721103 | NaN | NaN | NaN | 4751201 | 2135 | Yes |
6204000 | 6201501 | 6319400 | NaN | 6201501 | 2062 | Yes |
7911200 | 8230001 | NaN | 8291100 | NaN | 2135 | Yes |
4713002 | NaN | NaN | 4751201 | 4753900 | 2054 | No |
Reasoning example: As in line 1, in the bureau3 column the value "4713002" is in the top_bureau table and the "waysConstitution" column is equal to 2135 or 2062 so this customer can be flagged as a potential customer
How to make this iteration loop between all columns?
CodePudding user response:
IIUC, you can perform simple boolean indexing.
I am assuming df
the first dataframe and df2
the second one.
# is any row value in top_bureau?
m1 = df.filter(like='bureau').isin(df2['top_bureau'].unique()).any(1)
# is waysConstitution 2135 or 2062?
m2 = df['waysConstitution'].isin([2135, 2062])
# if both conditions are True, add a "potential_customer" string
df.loc[m1&m2, 'new'] = 'potential_customer'
output:
bureau0 bureau1 bureau2 bureau3 bureau4 waysConstitution new
0 4753900 4773300.0 NaN 4713002.0 Three 2135 potential_customer
1 4721103 NaN NaN NaN 4751201 2135 NaN
2 6204000 6201501.0 6319400.0 NaN 6201501 2062 potential_customer
3 7911200 8230001.0 NaN 8291100.0 NaN 2135 potential_customer
4 4713002 NaN NaN 4751201.0 4753900 2054 NaN
alternative output (Yes/No):
import numpy as np
m1 = df.filter(like='bureau').isin(df2['top_bureau'].unique()).any(1)
m2 = df['waysConstitution'].isin([2135, 2062])
df['potential_customer'] = np.where(m1&m2, 'Yes', 'No')
output:
bureau0 bureau1 bureau2 bureau3 bureau4 waysConstitution potential_customer
0 4753900 4773300.0 NaN 4713002.0 Three 2135 Yes
1 4721103 NaN NaN NaN 4751201 2135 No
2 6204000 6201501.0 6319400.0 NaN 6201501 2062 Yes
3 7911200 8230001.0 NaN 8291100.0 NaN 2135 Yes
4 4713002 NaN NaN 4751201.0 4753900 2054 No
CodePudding user response:
You can use iteritems. This will loop over everything. Since I dont see your expected output or your code so far I cant really know what you exactly want.
for (columnName, columnData) in df.iteritems():
print('Column Name : ', columnName)
print('Column Contents : ', columnData.values)
You can fill this with your condition needed to create a new column?
CodePudding user response:
There's a few ways to do it, as you can see by the solutions. Here's another:
import pandas as pd
import numpy as np
top_bureau = ['7911200','4713002','4773300','5811500','6201501']
columns = ['bureau0','bureau1','bureau2','bureau3','bureau4','waysConstitution']
data = [['4753900', '4773300', np.nan, '4713002', 'Three', '2135'],
['4721103', np.nan, np.nan, np.nan, '4751201', '2135'],
['6204000', '6201501', '6319400', np.nan, '6201501', '2062'],
['7911200', '8230001', np.nan, '8291100', np.nan, '2135'],
['123', np.nan, np.nan, '456', '789', '101112'],
['4713002', np.nan, np.nan, '4751201', '4753900', '2054']]
df = pd.DataFrame(data, columns=columns)
df['check'] = df[df.columns].apply(lambda x: list(x), axis=1)
potentialCustomerIdx = list(df[(pd.DataFrame(df.check.tolist()).isin(top_bureau).any(1).values) & (df['waysConstitution'].isin(["2135","2062"]))].index)
df['potential_customer'] = 'No'
df.loc[potentialCustomerIdx,'potential_customer'] = 'Yes'
df = df.drop(['check'],axis=1)
Output:
print(df)
bureau0 bureau1 bureau2 ... bureau4 waysConstitution potential_customer
0 4753900 4773300 NaN ... Three 2135 Yes
1 4721103 NaN NaN ... 4751201 2135 No
2 6204000 6201501 6319400 ... 6201501 2062 Yes
3 7911200 8230001 NaN ... NaN 2135 Yes
4 123 NaN NaN ... 789 101112 No
5 4713002 NaN NaN ... 4753900 2054 No
[6 rows x 7 columns]