I'm working with a data set that needs some manual cleanup. One thing I need to do is assign a certain value in one column to some of my rows if, in another column, that row has a value that is present in a list I've defined.
Here is a reduced example of what I want to do:
to_be_changed = ['b','e','a']
df = pd.DataFrame({'col1':[1,2,2,1,2],'col2':['a','b','c','d','e' ]})
# change col1 to 3 in all rows whose col2 label shows up in to_be_changed
So the desired modified DataFrame would look like:
col1 col2
0 3 a
1 3 b
2 2 c
3 1 d
4 3 e
My closest attempt to solving this is:
df = pd.DataFrame(np.where(df=='b' ,3,df)
,index=df.index,columns=df.columns)
Which produces:
col1 col2
0 1 a
1 2 3
2 2 c
3 1 d
4 2 e
This only changes col2, and obviously only the rows with the hardcoded label 'b'.
I also tried:
df = pd.DataFrame(np.where(df in to_be_changed ,3,df)
,index=df.index,columns=df.columns)
But that produces an error:
ValueError Traceback (most recent call last)
/tmp/ipykernel_11084/574679588.py in <cell line: 4>()
3 df = pd.DataFrame({'col1':[1,2,2,1,2],'col2':['a','b','c','d','e' ]})
4 df = pd.DataFrame(
----> 5 np.where(df in to_be_changed ,3,df)
6 ,index=df.index,columns=df.columns)
7 df
~/.local/lib/python3.9/site-packages/pandas/core/generic.py in __nonzero__(self)
1525 @final
1526 def __nonzero__(self):
-> 1527 raise ValueError(
1528 f"The truth value of a {type(self).__name__} is ambiguous. "
1529 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Thanks for any help!
CodePudding user response:
Use numpy.where together with isin, like this:
import numpy as np
import pandas as pd
to_be_changed = ['b','e','a']
df = pd.DataFrame({'col1':[1,2,2,1,2],'col2':['a','b','c','d','e' ]})
df["col1"] = np.where(df["col2"].isin(to_be_changed), 3, df["col1"])
# output
col1 col2
0 3 a
1 3 b
2 2 c
3 1 d
4 3 e
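As a side note on why the attempt in the question fails: as far as I understand it, `df in to_be_changed` makes Python compare the whole DataFrame against each list element and then ask for a single True/False, which is what raises the "truth value is ambiguous" error. `isin` instead returns one boolean per row. A minimal sketch using the data from the question (the names `mask` and `new_col1` are just for illustration):

```python
import numpy as np
import pandas as pd

to_be_changed = ['b', 'e', 'a']
df = pd.DataFrame({'col1': [1, 2, 2, 1, 2], 'col2': ['a', 'b', 'c', 'd', 'e']})

# Element-wise membership test: one boolean per row of col2.
mask = df['col2'].isin(to_be_changed)
print(mask.tolist())  # [True, True, False, False, True]

# np.where takes 3 where the mask is True and the old col1 value elsewhere.
new_col1 = np.where(mask, 3, df['col1'])
print(new_col1.tolist())  # [3, 3, 2, 1, 3]
```

Assigning `new_col1` back to `df["col1"]` gives the same result as the one-liner above.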
CodePudding user response:
Use boolean indexing:
# which rows need to be changed?
m = df['col2'].isin(to_be_changed)
# update those specifically
df.loc[m, ['col1']] = 3
output:
col1 col2
0 3 a
1 3 b
2 2 c
3 1 d
4 3 e
CodePudding user response:
df.loc[(df["col2"]=="a")|(df["col2"]=="b")|(df["col2"]=="e"), "col1"]=3
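The chained comparisons above hardcode the three labels. A small sketch, assuming the same example data as in the question, showing that `isin` builds the same mask directly from the list, so nothing has to be rewritten when the list changes (`chained` and `via_isin` are illustrative names):

```python
import pandas as pd

to_be_changed = ['b', 'e', 'a']
df = pd.DataFrame({'col1': [1, 2, 2, 1, 2], 'col2': ['a', 'b', 'c', 'd', 'e']})

# Mask built by hand, one comparison per label.
chained = (df['col2'] == 'a') | (df['col2'] == 'b') | (df['col2'] == 'e')

# Same mask built from the list itself.
via_isin = df['col2'].isin(to_be_changed)

print(chained.equals(via_isin))  # True

# Either mask can be used to update col1.
df.loc[via_isin, 'col1'] = 3
```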
CodePudding user response:
You could also use pandas loc (documentation), using the same isin() function:
import pandas as pd
to_be_changed = ['b','e','a']
df = pd.DataFrame({'col1':[1,2,2,1,2],'col2':['a','b','c','d','e' ]})
df.loc[df['col2'].isin(to_be_changed), 'col1'] = 3
produces the expected output:
col1 col2
0 3 a
1 3 b
2 2 c
3 1 d
4 3 e
I find it useful because you can change several columns at once under the same condition:
import pandas as pd
to_be_changed = ['b','e','a']
df = pd.DataFrame({'col1':[1,2,2,1,2],'col2':['a','b','c','d','e' ],'col3':[5,6,7,8,9]})
df.loc[df['col2'].isin(to_be_changed), ['col1','col3']] = [3,0]
which gives you:
col1 col2 col3
0 3 a 0
1 3 b 0
2 2 c 7
3 1 d 8
4 3 e 0
However, for large dataframes np.where is probably faster, though I didn't check.
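If you want to check this yourself, here is a rough timing sketch. The frame size, the random data, and the helper names `with_where` and `with_loc` are all my own assumptions for illustration, and the timings will depend on your machine and pandas version, so treat it as a starting point and not as a benchmark result:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # arbitrary size, chosen only for illustration
to_be_changed = ['b', 'e', 'a']
base = pd.DataFrame({
    'col1': rng.integers(1, 3, size=n),
    'col2': rng.choice(np.array(list('abcde')), size=n),
})

def with_where():
    # np.where variant: rebuild col1 from the mask.
    out = base.copy()
    out['col1'] = np.where(out['col2'].isin(to_be_changed), 3, out['col1'])
    return out

def with_loc():
    # loc variant: assign only to the masked rows.
    out = base.copy()
    out.loc[out['col2'].isin(to_be_changed), 'col1'] = 3
    return out

# Both timings include the cost of copying the frame.
t_where = timeit.timeit(with_where, number=20)
t_loc = timeit.timeit(with_loc, number=20)
print(f"np.where: {t_where:.4f}s, loc: {t_loc:.4f}s (20 runs each)")
```

Both functions should give identical col1 values, so whichever turns out faster for your data can be swapped in without changing the result.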