Pandas choose a single value from delimited list in column based on date-CodePudding

Apologies in advance for bad description—I'm struggling to articulate this well. Basically, I have the following data in a Pandas dataframe:

id symbols   symbol_dates
1  ABC       20070103:29991231
2  DEF;GH    20100307:20141215;20141216:29991231
3  IJ;KLM;NO 20040107:20051105;20051106:20180316;20180317:29991231
4  PQ        20080103:20090613
5  RST;UV    20080206:20150603;20150604:29991231
6  WXY       20070103:20130516

In the above, the data in 'symbols' and 'symbol_dates' are delimited by ';'. So for example, for id 2, the symbol for the date range of '20141216:29991231' is 'GH'.

I want to be able to input a date in Python, and then return a new dataframe that chooses the correct symbol for the inputted date. If the inputted date does not exist in any date range, then drop the row. For example, an input date of '20141220' should result in the following new dataframe:

id symbol
1 ABC
2 GH
3 KLM
5 RST

I'm stuck on where to even begin with this—any ideas would be appreciated.

CodePudding user response：

# Split on ';':
df.symbols = df.symbols.str.split(';')
df.symbol_dates = df.symbol_dates.str.split(';')

# Explode the columns:
df = df.explode(['symbols', 'symbol_dates'])

# Split on ':' and expand:
df[['date_start', 'date_end']] = df.symbol_dates.str.split(':', expand=True)

# Drop now unneeded column:
df = df.drop('symbol_dates', axis=1)

# Convert to timestamps... you have some invalid ones.
for col in ['date_start', 'date_end']:
    df[col] = pd.to_datetime(df[col], format='%Y%M%d', errors='coerce')

# Fill with infinity:
df = df.fillna(np.inf)
print(df)

Output:

   id symbols date_start             date_end
0   1     ABC 2007-01-03                  inf
1   2     DEF 2010-03-07  2014-01-15 00:12:00
1   2      GH 2014-12-16                  inf
2   3      IJ 2004-01-07  2005-01-05 00:11:00
2   3     KLM 2005-11-06  2018-01-16 00:03:00
2   3      NO 2018-03-17                  inf
3   4      PQ 2008-01-03  2009-01-13 00:06:00
4   5     RST 2008-02-06  2015-01-03 00:06:00
4   5      UV 2015-06-04                  inf
5   6     WXY 2007-01-03  2013-01-16 00:05:00

CodePudding user response：

You can use the following:

import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'symbols': ['ABC', 'DEF;GH', 'IJ;KLM;NO', 'PQ', 'RST;UV', 'WXY',],
    'symbol_dates': [
        '20070103:29991231',
        '20100307:20141215;20141216:29991231',
        '20040107:20051105;20051106:20180316;20180317:29991231',
        '20080103:20090613',
        '20080206:20150603;20150604:29991231',
        '20070103:20130516',
    ]
})
# Get each symbol and date range on its own row 
df = df.apply(lambda x: x.str.split(';').explode())

#Split the range into start and end columns with a date type
df['symbol_date_start'] = pd.to_datetime(df['symbol_dates'].str.split(':').str[0])
df['symbol_date_end'] = pd.to_datetime(df['symbol_dates'].str.split(':').str[1], errors='coerce')

# Choose your filter date
filter_date = datetime(2014, 12, 20)

# Filter based on filter date
df = df.loc[(df['symbol_date_start'] <= filter_date) & (df['symbol_date_end'].isnull()) | (df['symbol_date_end'] >= filter_date)]

Output

  symbols       symbol_dates symbol_date_start symbol_date_end
0     ABC  20070103:29991231        2007-01-03             NaT
1      GH  20141216:29991231        2014-12-16             NaT
2     KLM  20051106:20180316        2005-11-06      2018-03-16
4     RST  20080206:20150603        2008-02-06      2015-06-03