Update a column depending on whether another column's value is in a list

I'm working with a data set that needs some manual cleanup. One thing I need to do is assign a certain value in one column for some of my rows, whenever another column in that row holds a value that is present in a list I've defined.

Here is a reduced example of what I want to do:

import pandas as pd

to_be_changed = ['b', 'e', 'a']

df = pd.DataFrame({'col1': [1, 2, 2, 1, 2], 'col2': ['a', 'b', 'c', 'd', 'e']})

# set col1 to 3 in every row whose col2 label appears in to_be_changed

So the desired modified DataFrame would look like:

  col1 col2
0    3    a
1    3    b
2    2    c
3    1    d
4    3    e

My closest attempt to solving this is:

df = pd.DataFrame(np.where(df=='b' ,3,df)
  ,index=df.index,columns=df.columns)

Which produces:

  col1 col2
0    1    a
1    2    3
2    2    c
3    1    d
4    2    e

This changes col2 instead of col1, and obviously only in the rows with the hardcoded label 'b'.

I also tried:

df = pd.DataFrame(np.where(df in to_be_changed ,3,df)
  ,index=df.index,columns=df.columns)

But that produces an error:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_11084/574679588.py in <cell line: 4>()
      3 df = pd.DataFrame({'col1':[1,2,2,1,2],'col2':['a','b','c','d','e' ]})
      4 df = pd.DataFrame(
----> 5   np.where(df in to_be_changed ,3,df)
      6   ,index=df.index,columns=df.columns)
      7 df

~/.local/lib/python3.9/site-packages/pandas/core/generic.py in __nonzero__(self)
   1525     @final
   1526     def __nonzero__(self):
-> 1527         raise ValueError(
   1528             f"The truth value of a {type(self).__name__} is ambiguous. "
   1529             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Thanks for any help!

CodePudding user response：

Use numpy.where on the single column, with isin() as the condition:

import numpy as np
import pandas as pd

to_be_changed = ['b','e','a']

df = pd.DataFrame({'col1':[1,2,2,1,2],'col2':['a','b','c','d','e' ]})

df["col1"] = np.where(df["col2"].isin(to_be_changed), 3, df["col1"])

# output
   col1 col2
0     3    a
1     3    b
2     2    c
3     1    d
4     3    e
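
For anyone wondering why this works where `df in to_be_changed` raised the ValueError: isin() returns one boolean per row instead of a single True/False, and np.where then picks 3 where that boolean is True and the existing col1 value elsewhere. A minimal sketch on the same data (the name mask is only for illustration):

```python
import numpy as np
import pandas as pd

to_be_changed = ['b', 'e', 'a']
df = pd.DataFrame({'col1': [1, 2, 2, 1, 2], 'col2': ['a', 'b', 'c', 'd', 'e']})

# one boolean per row: True where the col2 label is in the list
mask = df['col2'].isin(to_be_changed)
print(mask.tolist())  # [True, True, False, False, True]

# take 3 where the mask is True, otherwise keep the existing col1 value
df['col1'] = np.where(mask, 3, df['col1'])
print(df['col1'].tolist())  # [3, 3, 2, 1, 3]
```

Only col1 is assigned, so col2 keeps its labels and its dtype.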

CodePudding user response：

Use boolean indexing:

# which rows need to be changed?
m = df['col2'].isin(to_be_changed)

# update those specifically
df.loc[m, ['col1']] = 3

output:

   col1 col2
0     3    a
1     3    b
2     2    c
3     1    d
4     3    e

CodePudding user response：

df.loc[(df["col2"]=="a")|(df["col2"]=="b")|(df["col2"]=="e"), "col1"]=3
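
For this particular list that selects the same rows as isin(), but every label has to be spelled out by hand, so it gets unwieldy as the list grows. A small sketch, assuming the same df and to_be_changed as in the question (the names chained and via_isin are just illustrative), that checks the two masks agree:

```python
import pandas as pd

to_be_changed = ['b', 'e', 'a']
df = pd.DataFrame({'col1': [1, 2, 2, 1, 2], 'col2': ['a', 'b', 'c', 'd', 'e']})

# mask built by chaining one comparison per label
chained = (df["col2"] == "a") | (df["col2"] == "b") | (df["col2"] == "e")

# mask built from the list directly
via_isin = df["col2"].isin(to_be_changed)

print(chained.equals(via_isin))  # True

# either mask can be used for the assignment
df.loc[via_isin, "col1"] = 3
print(df["col1"].tolist())  # [3, 3, 2, 1, 3]
```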

CodePudding user response：

You could also use pandas loc with the same isin() method:

import pandas as pd

to_be_changed = ['b','e','a']
df = pd.DataFrame({'col1':[1,2,2,1,2],'col2':['a','b','c','d','e' ]})

df.loc[df['col2'].isin(to_be_changed), 'col1'] = 3

produces the expected output:

   col1 col2
0     3    a
1     3    b
2     2    c
3     1    d
4     3    e

I find it useful because you can change several columns at once with the same condition.

import pandas as pd

to_be_changed = ['b','e','a']
df = pd.DataFrame({'col1':[1,2,2,1,2],'col2':['a','b','c','d','e' ],'col3':[5,6,7,8,9]})

df.loc[df['col2'].isin(to_be_changed), ['col1','col3']] = [3,0]

which gives you:

   col1 col2  col3
0     3    a     0
1     3    b     0
2     2    c     7
3     1    d     8
4     3    e     0

However, for large DataFrames np.where is probably faster, though I have not benchmarked it.
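
If you want to check that yourself, something along these lines should do. This is only a rough sketch: the 200,000-row frame, the random labels and the helper names with_where and with_loc are made up for the test, and the timings will depend on your machine and your pandas/NumPy versions, so no numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

to_be_changed = ['b', 'e', 'a']

# hypothetical test data: random col1 values in {1, 2}, labels drawn from 'a'..'e'
rng = np.random.default_rng(0)
n = 200_000
big = pd.DataFrame({
    'col1': rng.integers(1, 3, size=n),
    'col2': rng.choice(list('abcde'), size=n),
})

def with_where(df):
    # np.where variant: rebuild col1 from the mask
    out = df.copy()
    out['col1'] = np.where(out['col2'].isin(to_be_changed), 3, out['col1'])
    return out

def with_loc(df):
    # loc variant: overwrite only the selected rows
    out = df.copy()
    out.loc[out['col2'].isin(to_be_changed), 'col1'] = 3
    return out

# sanity check: both variants must produce the same col1 values
same = bool((with_where(big)['col1'] == with_loc(big)['col1']).all())
print("same result:", same)

# time each variant over a few runs (the copy is included in both timings)
print("np.where:", timeit.timeit(lambda: with_where(big), number=5))
print("loc:     ", timeit.timeit(lambda: with_loc(big), number=5))
```

Both functions copy the frame first so that each run starts from unmodified data; that cost is the same on both sides and does not favour either variant.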