I've got a dataframe with hundreds of columns and millions of rows. I need to conditionally replace the values of selected columns by another value. what is the most efficient way to do this, if I know the index or names of the columns that need to be changed?
example below:
df = pd.DataFrame({'ID1':[0,1,2,3,4,5,6], 'ID2': [0,1,2,0,4,0,5], 'Value1':[0,1,6,0,4,7,0], 'Value2':[1,0,2,3,0,4,5] })
ID1 ID2 Value1 Value2
0 0 0 0 1
1 1 1 1 0
2 2 2 6 2
3 3 0 0 3
4 4 4 4 0
5 5 0 7 4
6 6 5 0 5
I want the values of Value1,Value2,..., ValueN which are larger than 0 to be replaced by 1. Note that ID1, ID2, ..., IDN should be excluded.
Desired Output:
ID1 ID2 Value1 Value2
0 0 0 0 1
1 1 1 1 0
2 2 2 1 1
3 3 0 0 1
4 4 4 1 0
5 5 0 1 1
6 6 5 0 1
dataframe has hundreds of columns and millions of rows.... so I'd like to do this as computationally efficient as possible.
CodePudding user response:
Depending on how many ValueN columns you have, you can first build a list of them:
cols = [x for x in df.columns if 'Value' in x]
An efficient way is using mask
:
df[cols] = df[cols].mask(df[cols] > 0, 1)
Alternatively, you can try np.where
:
df[cols] = np.where(df[cols] > 1, 0, df[cols])
CodePudding user response:
Or you can try this:
df[ df.iloc[0: ,2:n] >0 ] = 1
n is the maximum value of your column index number 1.
df[ df >0 ] = 1
can check if value in df >0, replace it with 1.
But you want first two column(ID1, ID2) remain the same, so you can use df.iloc[0: ,2:n]
extract all row and column[2] to [n],
Reference:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
How to set value of first row of pandas dataframe meeting condition?