I have a Pandas dataframe 'my_df'. I'm using np.select (numpy) to assign specific values based on a truth condition. For example, if my_df contains "one" then the cell for my new column 'new' receives a "1" and so on.
data = ['One', 'Two', ['One', 'Four'], 'One', ['Three', 'One', 'Four']]
my_df = pd.DataFrame(data, columns=['test'])
print(my_df)
| test | one | two | three | four |
|---------------------|-------|--------|--------|--------|
| [One] | True | False | False | False |
| [Two] | False | True | False | False |
| [One, Four] | True | False | False | True |
| [One] | True | False | False | False |
| [Three, One, Four] | True | False | True | True |
my_df["one"] = my_df["test"].apply(lambda x: 'One' in x)
my_df["two"] = my_df["test"].apply(lambda x: 'Two' in x)
my_df["three"] = my_df["test"].apply(lambda x: 'Three' in x)
my_df["four"] = my_df["test"].apply(lambda x: 'Four' in x)
conditions = [
(my_df["one"] == True),
(my_df["two"] == True),
(my_df["three"] == True),
(my_df["four"] == True),
]
values = ['1', '2', '3', '4']
my_df['new'] = np.select(conditions, values)
My code here works well where there's only one input per cell to check the truth value of (as above). However, it's possible that more than one string value is in a particular cell. For example, how do I have it so that:
- "one" is assigned '1' in the new column when it is the only string present, as above
- where there's multiple strings in a cell, if "one" is one of them, and 'three' isn't then '1' is assigned
- in all cases where 'Three' is in the cell, a 3 is assigned
I have a dataframe that contains a lot more strings than this, and a lot more potential combinations of strings in each cell and so I need to be able to set values based not only on single string inputs, but where there's multiple string inputs. I can't keep adding conditions chaining AND over and over as the potential number of combinations is far too large.
Expected output
| test | one | two | three | four | new |
|---------------------|-------|--------|--------|--------|------|
| [One] | True | False | False | False | 1 |
| [Two] | False | True | False | False | 2 |
| [One, Four] | True | False | False | True | 1 |
| [One] | True | False | False | False | 1 |
| [Three, One, Four] | True | False | True | True | 3 |
CodePudding user response:
Using str.contains
provides sufficient utility for this task without the need to invoke Numpy. The correct order of the conditions and (re-)assignment of the new
column does the trick. The code is
data = ['One', 'Two', ['One', 'Four'], 'One', ['Three', 'One', 'Four']]
df = pd.DataFrame(data, columns=['test'])
# assign 4 digit-columns
df["one"] = df["test"].apply(lambda x: 'One' in x)
df["two"] = df["test"].apply(lambda x: 'Two' in x)
df["three"] = df["test"].apply(lambda x: 'Three' in x)
df["four"] = df["test"].apply(lambda x: 'Four' in x)
# sequential re-assignment
df.loc[df['test'].map(str).str.contains('Two'), 'new'] = 2
df.loc[df['test'].map(str).str.contains('Four'), 'new'] = 4
df.loc[df['test'].map(str).str.contains('One'), 'new'] = 1
df.loc[df['test'].map(str).str.contains('Three'), 'new'] = 3
df['new'] = df['new'].astype(int)
and yields