How to replace gap rows in pandas

I have a pandas dataframe with historical stock prices. The issue is that some columns contain prices up to a certain date, then some blank rows, and finally price data again. This is an example:

date        Stock A
6/30/1990   0.19
7/31/1990   0.19
8/31/1990   0.25
9/30/1990   0.34
10/31/1990  NaN
11/30/1990  NaN
12/31/1990  NaN
1/31/1991   NaN
2/28/1991   NaN
3/31/1991   NaN
4/30/1991   20.88
5/31/1991   18.25
6/30/1991   17.00
7/31/1991   17.25
8/31/1991   17.50

What I am trying to do in this column is replace every row above the last NaN value with NaN. In this case, the values in the first 4 rows should become NaN, so that the final result is the following:

date        Stock A
6/30/1990   NaN
7/31/1990   NaN
8/31/1990   NaN
9/30/1990   NaN
10/31/1990  NaN
11/30/1990  NaN
12/31/1990  NaN
1/31/1991   NaN
2/28/1991   NaN
3/31/1991   NaN
4/30/1991   20.88
5/31/1991   18.25
6/30/1991   17.00
7/31/1991   17.25
8/31/1991   17.50

I have tried to do it manually in Excel, since I am fairly new to Python, but given the number of columns, the process is taking too long.

The solution would probably involve iterating over all the columns, checking whether the condition described above holds, and then modifying those columns.

What could be a possible solution?

CodePudding user response:

This should do it for you.

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': {0: '6/30/1990',
  1: '7/31/1990',
  2: '8/31/1990',
  3: '9/30/1990',
  4: '10/31/1990',
  5: '11/30/1990',
  6: '12/31/1990',
  7: '1/31/1991',
  8: '2/28/1991',
  9: '3/31/1991',
  10: '4/30/1991',
  11: '5/31/1991',
  12: '6/30/1991',
  13: '7/31/1991',
  14: '8/31/1991'},
 'Stock': {0: np.nan,
  1: 18.25,
  2: np.nan,
  3: np.nan,
  4: 18.25,
  5: np.nan,
  6: np.nan,
  7: np.nan,
  8: np.nan,
  9: np.nan,
  10: 20.88,
  11: 18.25,
  12: 17.0,
  13: 17.25,
  14: 17.5}})

# find the position of the last NaN in the column
idx = np.where(df['Stock'].isnull())[0][-1]

# use np.where() to set every row above that position to NaN
df['Stock'] = np.where(df.index < idx , np.nan , df['Stock'])

Output df:

    date        Stock
0   6/30/1990   NaN
1   7/31/1990   NaN
2   8/31/1990   NaN
3   9/30/1990   NaN
4   10/31/1990  NaN
5   11/30/1990  NaN
6   12/31/1990  NaN
7   1/31/1991   NaN
8   2/28/1991   NaN
9   3/31/1991   NaN
10  4/30/1991   20.88
11  5/31/1991   18.25
12  6/30/1991   17.00
13  7/31/1991   17.25
14  8/31/1991   17.50

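If you want to do it on all your columns you can use a simple for loop. Skip the date column and any column without NaN values, since there is no last NaN to look up there.

for col in df.columns.drop('date'):
    nan_positions = np.where(df[col].isnull())[0]
    if len(nan_positions) == 0:
        continue
    df[col] = np.where(df.index < nan_positions[-1], np.nan, df[col])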
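If the number of columns is large, a loop-free variant may be worth a try. The following is only a minimal sketch under some assumptions: the sample frame, the column names "Stock A" and "Stock B", and the variable names price_cols and at_or_above_last_nan are made up for illustration. It counts NaNs from the bottom of each column upwards and blanks every cell where that count is above zero.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: "Stock A" has two gaps, "Stock B" has none.
df = pd.DataFrame({
    "date": ["1/31/1991", "2/28/1991", "3/31/1991", "4/30/1991", "5/31/1991"],
    "Stock A": [0.19, np.nan, 0.25, np.nan, 20.88],
    "Stock B": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Work on every column except the date column.
price_cols = df.columns.drop("date")
is_nan = df[price_cols].isna()

# Count NaNs from the bottom up: a row sits at or above the last NaN
# exactly when this reversed running count is greater than zero.
at_or_above_last_nan = is_nan.iloc[::-1].cumsum().iloc[::-1] > 0

# mask() replaces the flagged cells with NaN and leaves the rest untouched.
df[price_cols] = df[price_cols].mask(at_or_above_last_nan)
print(df)
```

Columns without any NaN are left unchanged, so no special case is needed for them.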

CodePudding user response:

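Try this. Note that isnull().idxmax() on its own returns the label of the first NaN, so the mask is reversed to find the last one (this assumes the column contains at least one NaN):

last_nan = df.Stock.isnull()[::-1].idxmax()
df.loc[:last_nan, 'Stock'] = None
print(df)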
>>>
    date        Stock
0   6/30/1990   NaN
1   7/31/1990   NaN
2   8/31/1990   NaN
3   9/30/1990   NaN
4   10/31/1990  NaN
5   11/30/1990  NaN
6   12/31/1990  NaN
7   1/31/1991   NaN
8   2/28/1991   NaN
9   3/31/1991   NaN
10  4/30/1991   20.88
11  5/31/1991   18.25
12  6/30/1991   17.00
13  7/31/1991   17.25
14  8/31/1991   17.50
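Because .loc slices by label, the same idea should also work when the dates are the index rather than a regular column. Below is a hedged sketch, not a definitive implementation: the helper name blank_before_last_nan and the sample data are hypothetical, and it assumes the index is sorted by date. It also guards against columns that contain no NaN at all, where idxmax() on an all-False mask would otherwise point at an arbitrary row.

```python
import numpy as np
import pandas as pd


def blank_before_last_nan(prices: pd.Series) -> pd.Series:
    """Return a copy in which every value up to the last NaN is NaN."""
    is_nan = prices.isna()
    if not is_nan.any():
        # No gap in this column, so there is nothing to blank out.
        return prices.copy()
    # Reversing the mask makes idxmax() return the label of the last NaN.
    last_nan_label = is_nan.iloc[::-1].idxmax()
    result = prices.copy()
    # Label-based slicing with .loc includes the end label.
    result.loc[:last_nan_label] = np.nan
    return result


# Hypothetical sample data with the dates as a sorted DatetimeIndex.
dates = pd.to_datetime(
    ["6/30/1990", "7/31/1990", "8/31/1990", "9/30/1990", "10/31/1990"]
)
prices = pd.DataFrame(
    {
        "Stock A": [0.19, np.nan, 0.25, np.nan, 20.88],
        "Stock B": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    index=dates,
)

# apply() calls the helper once per column.
cleaned = prices.apply(blank_before_last_nan)
print(cleaned)
```

The helper returns a new Series instead of modifying its input, so the original frame stays available for comparison.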