Home > Software engineering >  Conditionally assigning values to a pandas data-frame cell that may contain several strings using np
Conditionally assigning values to a pandas data-frame cell that may contain several strings using np

Time:07-09

I have a Pandas dataframe 'my_df'. I'm using np.select (numpy) to assign specific values based on a truth condition. For example, if my_df contains "one" then the cell for my new column 'new' receives a "1" and so on.

data = ['One', 'Two', ['One', 'Four'], 'One', ['Three', 'One', 'Four']]
  
my_df = pd.DataFrame(data, columns=['test'])

print(my_df)

|                test |   one |    two |  three |   four |
|---------------------|-------|--------|--------|--------|
|               [One] |  True |  False |  False |  False |
|               [Two] | False |   True |  False |  False |
|         [One, Four] |  True |  False |  False |  True  |
|               [One] |  True |  False |  False |  False |
|  [Three, One, Four] |  True |  False |   True |  True  |

my_df["one"] = my_df["test"].apply(lambda x: 'One' in x)
my_df["two"] = my_df["test"].apply(lambda x: 'Two' in x)
my_df["three"] = my_df["test"].apply(lambda x: 'Three' in x)
my_df["four"] = my_df["test"].apply(lambda x: 'Four' in x)

conditions = [
    (my_df["one"] == True),
    (my_df["two"] == True),
    (my_df["three"] == True),
    (my_df["four"] == True),
]

values = ['1', '2', '3', '4']

my_df['new'] = np.select(conditions, values)

My code here works well where there's only one input per cell to check the truth value of (as above). However, it's possible that more than one string value is in a particular cell. For example, how do I have it so that:

  • "one" is assigned '1' in the new column when it is the only string present, as above
  • where there's multiple strings in a cell, if "one" is one of them, and 'three' isn't then '1' is assigned
  • in all cases where 'Three' is in the cell, a 3 is assigned

I have a dataframe that contains a lot more strings than this, and a lot more potential combinations of strings in each cell and so I need to be able to set values based not only on single string inputs, but where there's multiple string inputs. I can't keep adding conditions chaining AND over and over as the potential number of combinations is far too large.

Expected output

|                test |   one |    two |  three |   four |  new |
|---------------------|-------|--------|--------|--------|------|
|               [One] |  True |  False |  False |  False |    1 |
|               [Two] | False |   True |  False |  False |    2 |
|         [One, Four] |  True |  False |  False |  True  |    1 |
|               [One] |  True |  False |  False |  False |    1 |
|  [Three, One, Four] |  True |  False |   True |  True  |    3 |

CodePudding user response:

Using str.contains provides sufficient utility for this task without the need to invoke Numpy. The correct order of the conditions and (re-)assignment of the new column does the trick. The code is

data = ['One', 'Two', ['One', 'Four'], 'One', ['Three', 'One', 'Four']]
df = pd.DataFrame(data, columns=['test'])

# assign 4 digit-columns
df["one"]   = df["test"].apply(lambda x: 'One' in x)
df["two"]   = df["test"].apply(lambda x: 'Two' in x)
df["three"] = df["test"].apply(lambda x: 'Three' in x)
df["four"]  = df["test"].apply(lambda x: 'Four' in x)

# sequential re-assignment
df.loc[df['test'].map(str).str.contains('Two'), 'new']   = 2
df.loc[df['test'].map(str).str.contains('Four'), 'new']  = 4
df.loc[df['test'].map(str).str.contains('One'), 'new']   = 1
df.loc[df['test'].map(str).str.contains('Three'), 'new'] = 3
df['new'] = df['new'].astype(int)

and yields

enter image description here

  • Related