Apologies in advance for bad description—I'm struggling to articulate this well. Basically, I have the following data in a Pandas dataframe:
id symbols symbol_dates
1 ABC 20070103:29991231
2 DEF;GH 20100307:20141215;20141216:29991231
3 IJ;KLM;NO 20040107:20051105;20051106:20180316;20180317:29991231
4 PQ 20080103:20090613
5 RST;UV 20080206:20150603;20150604:29991231
6 WXY 20070103:20130516
In the above, the data in 'symbols' and 'symbol_dates' are delimited by ';'. So for example, for id 2, the symbol for the date range of '20141216:29991231' is 'GH'.
I want to be able to input a date in Python, and then return a new dataframe that chooses the correct symbol for the inputted date. If the inputted date does not exist in any date range, then drop the row. For example, an input date of '20141220' should result in the following new dataframe:
id symbol
1 ABC
2 GH
3 KLM
5 RST
I'm stuck on where to even begin with this—any ideas would be appreciated.
CodePudding user response:
# Split on ';':
df.symbols = df.symbols.str.split(';')
df.symbol_dates = df.symbol_dates.str.split(';')
# Explode the columns:
df = df.explode(['symbols', 'symbol_dates'])
# Split on ':' and expand:
df[['date_start', 'date_end']] = df.symbol_dates.str.split(':', expand=True)
# Drop now unneeded column:
df = df.drop('symbol_dates', axis=1)
# Convert to timestamps... you have some invalid ones.
for col in ['date_start', 'date_end']:
df[col] = pd.to_datetime(df[col], format='%Y%M%d', errors='coerce')
# Fill with infinity:
df = df.fillna(np.inf)
print(df)
Output:
id symbols date_start date_end
0 1 ABC 2007-01-03 inf
1 2 DEF 2010-03-07 2014-01-15 00:12:00
1 2 GH 2014-12-16 inf
2 3 IJ 2004-01-07 2005-01-05 00:11:00
2 3 KLM 2005-11-06 2018-01-16 00:03:00
2 3 NO 2018-03-17 inf
3 4 PQ 2008-01-03 2009-01-13 00:06:00
4 5 RST 2008-02-06 2015-01-03 00:06:00
4 5 UV 2015-06-04 inf
5 6 WXY 2007-01-03 2013-01-16 00:05:00
CodePudding user response:
You can use the following:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
'symbols': ['ABC', 'DEF;GH', 'IJ;KLM;NO', 'PQ', 'RST;UV', 'WXY',],
'symbol_dates': [
'20070103:29991231',
'20100307:20141215;20141216:29991231',
'20040107:20051105;20051106:20180316;20180317:29991231',
'20080103:20090613',
'20080206:20150603;20150604:29991231',
'20070103:20130516',
]
})
# Get each symbol and date range on its own row
df = df.apply(lambda x: x.str.split(';').explode())
#Split the range into start and end columns with a date type
df['symbol_date_start'] = pd.to_datetime(df['symbol_dates'].str.split(':').str[0])
df['symbol_date_end'] = pd.to_datetime(df['symbol_dates'].str.split(':').str[1], errors='coerce')
# Choose your filter date
filter_date = datetime(2014, 12, 20)
# Filter based on filter date
df = df.loc[(df['symbol_date_start'] <= filter_date) & (df['symbol_date_end'].isnull()) | (df['symbol_date_end'] >= filter_date)]
Output
symbols symbol_dates symbol_date_start symbol_date_end
0 ABC 20070103:29991231 2007-01-03 NaT
1 GH 20141216:29991231 2014-12-16 NaT
2 KLM 20051106:20180316 2005-11-06 2018-03-16
4 RST 20080206:20150603 2008-02-06 2015-06-03