How to mask a dataframe, but keep the index?

Time:06-06

I have almost a year's worth of minute data for a particular stock ticker, which I have masked to return a Pandas Series of booleans indicating whether one EMA has crossed another, e.g.:

print(entries)
print(type(entries))
print(f'len of entries {len(entries)}')

Output:

datetime
2021-05-03 08:00:00    False
2021-05-03 08:03:00    False
2021-05-03 08:04:00    False
2021-05-03 08:06:00    False
2021-05-03 08:08:00    False
2021-05-03 08:09:00    False
2021-05-03 08:11:00    False
2021-05-03 08:29:00    False
2021-05-03 08:34:00    False
2021-05-03 08:41:00    False
2021-05-03 08:52:00    False
2021-05-03 08:55:00    False
2021-05-03 09:07:00    False
2021-05-03 09:13:00    False
2021-05-03 09:14:00    False
2021-05-03 09:26:00    False
2021-05-03 09:27:00    False
2021-05-03 09:34:00    False
2021-05-03 09:44:00    False
2021-05-03 09:57:00    False
2021-05-03 10:14:00    False
2021-05-03 10:31:00    False
2021-05-03 10:38:00    False
2021-05-03 10:41:00    False
...
2022-03-02 23:49:00    False
dtype: bool
<class 'pandas.core.series.Series'>
len of entries 141615

I'm then selecting the current day I'm interested in, restricted to the hours between 09:30 and 16:00:

    display(df_market_hours)
    print(type(df_market_hours))
    print(f'len of df_market_hours {len(df_market_hours)}')

Output:

                      volume     open    close   high      low      time        date
datetime
2021-08-13 09:30:00    200.0  15.0000  15.0000  15.00  15.0000  09:30:00  2021-08-13
2021-08-13 10:01:00    310.0  15.0000  15.0000  15.00  15.0000  10:01:00  2021-08-13
2021-08-13 10:26:00   1031.0  14.9000  14.8900  14.90  14.8900  10:26:00  2021-08-13
2021-08-13 10:29:00    223.0  14.8900  14.8900  14.89  14.8900  10:29:00  2021-08-13
2021-08-13 11:00:00    285.0  14.8900  14.8900  14.89  14.8900  11:00:00  2021-08-13
...                      ...      ...      ...    ...      ...       ...         ...
2021-08-13 15:55:00  11489.0  18.0200  18.0100  18.04  17.9700  15:55:00  2021-08-13
2021-08-13 15:56:00  28828.0  18.0400  17.9700  18.18  17.9300  15:56:00  2021-08-13
2021-08-13 15:57:00   5720.0  17.9600  18.0412  18.05  17.9600  15:57:00  2021-08-13
2021-08-13 15:58:00  33329.0  18.0401  18.1800  18.31  18.0294  15:58:00  2021-08-13
2021-08-13 15:59:00  51872.0  18.1400  18.1650  18.18  18.0600  15:59:00  2021-08-13
184 rows × 7 columns

<class 'pandas.core.frame.DataFrame'>
len of df_market_hours 184

What I'm trying to do is return a new Series containing only the values of entries whose timestamps appear in df_market_hours, so that it has the same index and length as df_market_hours. How do I achieve this?

What I've tried so far:

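Getting the index of the current session time:

# Note: include_end was deprecated in pandas 1.4 and removed in 2.0; on newer versions use inclusive='left' instead
df_market_hours_idx = df_current_day.between_time('9:30', '16:00', include_end=False).index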
df_market_hours = df_current_day.loc[df_market_hours_idx]

And then trying to create a mask with it:

df_copy = entries[entries == df_market_hours.index]

When I do that, I get the following error:

ValueError: ('Lengths must match to compare', (141615,), (54991,))

I think I need to use reindex and fill_value, as shown here, but I'm struggling to figure out how this applies to my use case.

CodePudding user response:

As I understand it, you want to filter the entries Series by the index of df_market_hours:

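# Index.isin accepts another index directly, so no conversion to a list is needed
considered_index = df_market_hours.index
# True where a timestamp of entries also appears in df_market_hours
mask = entries.index.isin(considered_index)
df_copy = entries[mask]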
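
To make the idea concrete, here is a minimal sketch on made-up data. The timestamps, values and the single "close" column are assumptions for illustration only; only the names entries and df_market_hours come from the question.

```python
import pandas as pd

# Synthetic stand-in for the full boolean Series (values are made up).
full_index = pd.date_range("2021-08-13 09:28", periods=6, freq="min")
entries = pd.Series([False, True, False, False, True, False], index=full_index)

# Synthetic stand-in for df_market_hours: it covers only part of that index.
df_market_hours = pd.DataFrame({"close": [15.0, 14.9, 14.89]}, index=full_index[2:5])

# The mask is built over entries' own index, so its length always matches entries.
mask = entries.index.isin(df_market_hours.index)
filtered = entries[mask]

print(filtered)
print(len(filtered), len(df_market_hours))
```

Note that isin only keeps timestamps present in both objects. If df_market_hours contained a timestamp missing from entries, the result would be shorter than df_market_hours.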

CodePudding user response:

I'm not sure I fully understand the question, but if you want to keep the index of the original pd.DataFrame while extracting a pd.Series, maybe you can use the index argument of the pd.Series() constructor.

EDIT:

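What you can do is create a list of the timestamps in the index of entries, then filter using the isin() function to extract the relevant rows of df_market_hours:

# list(entries) would give the boolean values, so take the timestamps from the index instead
lst_entries = list(entries.index)
# datetime is the index of df_market_hours, not a column, so filter on the index
masked_df_market_hours = df_market_hours[df_market_hours.index.isin(lst_entries)]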
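
The reindex and fill_value idea mentioned in the question could look roughly like the sketch below. The data is made up, and filling missing timestamps with False is an assumption about what "no crossover recorded" should mean in this use case.

```python
import pandas as pd

# Made-up boolean Series; 16:05 falls outside the session, 10:26 is missing.
entries = pd.Series(
    [False, True, False],
    index=pd.to_datetime(["2021-08-13 09:30", "2021-08-13 10:01", "2021-08-13 16:05"]),
)

# Made-up session frame with one timestamp (10:26) that entries does not have.
df_market_hours = pd.DataFrame(
    {"close": [15.0, 15.0, 14.89]},
    index=pd.to_datetime(["2021-08-13 09:30", "2021-08-13 10:01", "2021-08-13 10:26"]),
)

# reindex conforms entries to the target index: extra timestamps are dropped,
# and timestamps missing from entries are filled with fill_value.
aligned = entries.reindex(df_market_hours.index, fill_value=False)

print(aligned)
```

Unlike the isin approach, reindex always returns a Series with exactly the index of df_market_hours, which matches the "same index and shape" requirement even when some timestamps are absent from entries.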