I have a pandas dataframe with historical stock prices. The issue is that some columns contain prices up to a certain date, then some blank rows, and finally price data again. This is an example:
date Stock A
6/30/1990 0.19
7/31/1990 0.19
8/31/1990 0.25
9/30/1990 0.34
10/31/1990 NaN
11/30/1990 NaN
12/31/1990 NaN
1/31/1991 NaN
2/28/1991 NaN
3/31/1991 NaN
4/30/1991 20.88
5/31/1991 18.25
6/30/1991 17.00
7/31/1991 17.25
8/31/1991 17.50
What I am trying to do is replace every value above the last NaN in this column with NaN. In this case, that means replacing the values in the first 4 rows with NaN, so that the final result is the following:
date Stock A
6/30/1990 NaN
7/31/1990 NaN
8/31/1990 NaN
9/30/1990 NaN
10/31/1990 NaN
11/30/1990 NaN
12/31/1990 NaN
1/31/1991 NaN
2/28/1991 NaN
3/31/1991 NaN
4/30/1991 20.88
5/31/1991 18.25
6/30/1991 17.00
7/31/1991 17.25
8/31/1991 17.50
I have tried to do it manually in Excel, since I am fairly new to Python, but given the number of columns, the process is taking too long.
I assume the solution involves iterating over all the columns, checking whether the condition described above holds, and then modifying those columns.
What could be a possible solution?
CodePudding user response:
This should do it for you.
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': {0: '6/30/1990',
1: '7/31/1990',
2: '8/31/1990',
3: '9/30/1990',
4: '10/31/1990',
5: '11/30/1990',
6: '12/31/1990',
7: '1/31/1991',
8: '2/28/1991',
9: '3/31/1991',
10: '4/30/1991',
11: '5/31/1991',
12: '6/30/1991',
13: '7/31/1991',
14: '8/31/1991'},
'Stock': {0: np.nan,
1: 18.25,
2: np.nan,
3: np.nan,
4: 18.25,
5: np.nan,
6: np.nan,
7: np.nan,
8: np.nan,
9: np.nan,
10: 20.88,
11: 18.25,
12: 17.0,
13: 17.25,
14: 17.5}})
# Find the position of the last NaN in the column
idx = np.where(df['Stock'].isnull())[0][-1]
# Set every row above that position to NaN (positional, so it does not depend on the index labels)
df['Stock'] = np.where(np.arange(len(df)) < idx, np.nan, df['Stock'])
Output df:
date Stock
0 6/30/1990 NaN
1 7/31/1990 NaN
2 8/31/1990 NaN
3 9/30/1990 NaN
4 10/31/1990 NaN
5 11/30/1990 NaN
6 12/31/1990 NaN
7 1/31/1991 NaN
8 2/28/1991 NaN
9 3/31/1991 NaN
10 4/30/1991 20.88
11 5/31/1991 18.25
12 6/30/1991 17.00
13 7/31/1991 17.25
14 8/31/1991 17.50
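If you want to do it on all your columns, you can use a simple for loop. Skip the date column and any column without NaNs, otherwise the [-1] lookup raises an IndexError.
for col in df.columns.drop('date'):
    nan_pos = np.where(df[col].isnull())[0]  # positions of all NaNs in this column
    if nan_pos.size:
        df[col] = np.where(np.arange(len(df)) < nan_pos[-1], np.nan, df[col])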
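If the loop gets slow with many columns, the same idea can probably be done in one shot with a boolean mask. This is only a minimal sketch: the sample frame, the column names "Stock A" / "Stock B", and the variable names are made up for illustration, and it assumes the date lives in a regular column called "date".

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one column with two gaps, one column without any NaN
df = pd.DataFrame({
    "date": ["6/30/1990", "7/31/1990", "8/31/1990",
             "9/30/1990", "10/31/1990", "11/30/1990"],
    "Stock A": [0.19, np.nan, 0.25, np.nan, 20.88, 18.25],
    "Stock B": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

price_cols = df.columns.drop("date")

# True for every row at or above the last NaN of each column:
# reverse the rows, take a running "seen a NaN so far", then reverse back.
at_or_above_last_nan = df[price_cols].isna().iloc[::-1].cummax().iloc[::-1]

# mask() puts NaN where the condition is True and keeps the other cells
df[price_cols] = df[price_cols].mask(at_or_above_last_nan)

print(df)
```

A column with no NaN at all gets an all-False mask, so it is left untouched and needs no special case.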
CodePudding user response:
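Try this. isnull().idxmax() returns the label of the first NaN, so reverse the column to anchor on the last one:
last_nan = df['Stock'].isnull().iloc[::-1].idxmax()
df.loc[:last_nan, 'Stock'] = None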
print(df)
>>>
date Stock
0 6/30/1990 NaN
1 7/31/1990 NaN
2 8/31/1990 NaN
3 9/30/1990 NaN
4 10/31/1990 NaN
5 11/30/1990 NaN
6 12/31/1990 NaN
7 1/31/1991 NaN
8 2/28/1991 NaN
9 3/31/1991 NaN
10 4/30/1991 20.88
11 5/31/1991 18.25
12 6/30/1991 17.00
13 7/31/1991 17.25
14 8/31/1991 17.50
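One thing worth checking with the idxmax() approach is which NaN it anchors on. On a boolean Series, idxmax() gives the label of the first True, which matters as soon as a column has more than one gap. A small sketch of the difference, using a made-up Series (the values and variable names are assumptions, not the asker's data):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two separate gaps
s = pd.Series([0.19, np.nan, 0.25, np.nan, np.nan, 20.88, 18.25], name="Stock")

is_nan = s.isna()
first_nan = is_nan.idxmax()             # label of the first NaN
last_nan = is_nan.iloc[::-1].idxmax()   # label of the last NaN

# idxmax() still returns a label when the mask is all False,
# so only overwrite when the column really contains a NaN.
if is_nan.any():
    s.loc[:last_nan] = np.nan  # .loc slicing is inclusive of last_nan

print(first_nan, last_nan)
print(s)
```

The is_nan.any() check is there because, without it, a column with no gaps would be wiped out entirely.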