Suppose I have a dataframe like this:
A
0 -1
1 -1
2 1
3 0
4 -2
5 3
6 0
I want to add a new column B that holds an incrementing value in every row where column A
is negative, and NaN elsewhere:
A B
0 -1 1
1 -1 2
2 1 nan
3 0 nan
4 -2 3
5 3 nan
6 0 nan
Here is my code:
ids = iter(range(1, np.sum((df['A'] < 0).values.ravel()) + 1))
df['B'] = np.where(df['A'] < 0, next(ids), np.nan)
Unfortunately what I get is
A B
0 -1 1
1 -1 1
2 1 nan
3 0 nan
4 -2 1
5 3 nan
6 0 nan
I also tried with a generator:
def id_generator(max_id):
    curr = 1
    while curr <= max_id:
        yield curr
        curr += 1
df['B'] = np.where(df['A'] < 0, next(id_generator(np.sum((df['A'] < 0).values.ravel()))), np.nan)
As with the previous attempt, I get this dataframe:
A B
0 -1 1
1 -1 1
2 1 nan
3 0 nan
4 -2 1
5 3 nan
6 0 nan
It seems that it creates an iterator/generator for every row it processes, and for that reason the id is always 1.
The only solution I have found uses an intermediate dataframe:
index = df[df['A'] < 0].index
new_df = pd.DataFrame(data=[x + 1 for x in range(len(index))], columns=['B'], index=index)
df = df.join(new_df)
So my question is: is there a way to stick with the np.where solution instead of creating a temporary dataframe? Or are there any pandas built-in functions that do this?
CodePudding user response:
Use DataFrame.loc with a boolean mask to set the values from a range:
m = df['A'] < 0
df.loc[m, 'B'] = range(1, m.sum() + 1)
print (df)
A B
0 -1 1.0
1 -1 2.0
2 1 NaN
3 0 NaN
4 -2 3.0
5 3 NaN
6 0 NaN
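For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same mask-and-assign idea. It uses np.arange instead of range, and the extra column name B_int is only an illustrative assumption (not part of the question) to show one way of keeping integer ids despite the NaN rows:

```python
import numpy as np
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({'A': [-1, -1, 1, 0, -2, 3, 0]})

# Boolean mask of the rows that should receive a counter value.
m = df['A'] < 0

# Assign 1..n to the masked rows only; the remaining rows of the new column are NaN.
df.loc[m, 'B'] = np.arange(1, m.sum() + 1)

# Optional, hypothetical extra column: a nullable integer dtype keeps 1, 2, 3
# as integers and shows missing values as <NA> instead of forcing floats.
df['B_int'] = df['B'].astype('Int64')

print(df)
```

Column B ends up as float because NaN is a float value; the nullable Int64 dtype is one possible way around that if integer ids matter.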
CodePudding user response:
Almost anything can be done using numpy and pandas functions directly, so try to avoid iterators and generators.
I have a solution that takes a couple of lines.
First, create a dataframe for your example:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'A': [-1,2,-1,0,-1,-1,2,3,5,-1]})
You can use cumsum() (cumulative sum) to count how many negative numbers you've encountered so far and assign the result to a new column:
df['B'] = (df['A'] < 0).cumsum()
A B
0 -1 1
1 2 1
2 -1 2
3 0 2
4 -1 3
5 -1 4
6 2 4
7 3 4
8 5 4
9 -1 5
This still leaves a value in the rows where column A is not negative, so replace those with NaN:
df.loc[df['A'] >= 0, 'B'] = np.nan
You end up with:
A B
0 -1 1.0
1 2 NaN
2 -1 2.0
3 0 NaN
4 -1 3.0
5 -1 4.0
6 2 NaN
7 3 NaN
8 5 NaN
9 -1 5.0
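As for sticking with np.where: the attempts in the question most likely return 1 everywhere because Python evaluates next(ids) once, before np.where is called, so np.where only ever receives the scalar 1 and broadcasts it to every matching row. A sketch that keeps np.where but passes it a whole array (the cumulative sum of the mask) could look like this. The column name B2 is a hypothetical addition, used only to compare against the pandas-only Series.where variant:

```python
import numpy as np
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({'A': [-1, -1, 1, 0, -2, 3, 0]})

# True where A is negative.
m = df['A'] < 0

# The running count of True values gives the incremental id;
# np.where keeps it only on negative rows and fills NaN elsewhere.
df['B'] = np.where(m, m.cumsum(), np.nan)

# Same result in one step with pandas only: Series.where keeps values
# where the condition holds and replaces the rest with NaN.
df['B2'] = m.cumsum().where(m)

print(df)
```

Both variants avoid the temporary dataframe, and both produce a float column because of the NaN entries.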