I have a dataframe of several different dates of financial information, which looks like this.
symbol split_date date_start date_end marketCap Date Adj Close Volume
0 USAU 2020-03-20 2020-02-28 2020-05-01 45773136 2020-02-28 6.300000190734862 22010.0
1 USAU 2020-03-20 2020-02-28 2020-05-01 45773136 2020-03-02 6.0999999046325675 7300.0
2 USAU 2020-03-20 2020-02-28 2020-05-01 45773136 2020-03-03 6.5 9700.0
3 USAU 2020-03-20 2020-02-28 2020-05-01 45773136 2020-03-04 7.0 13340.0
4 USAU 2020-03-20 2020-02-28 2020-05-01 45773136 2020-03-05 6.9000000953674325 5510.0
5 USAU 2020-03-20 2020-02-28 2020-05-01 45773136 2020-03-06 6.4000000953674325 5790.0
6 USAU 2020-03-20 2020-02-28 2020-05-01 45773136 2020-03-09 6.300000190734862 10080.0
7 USAU 2020-03-20 2020-02-28 2020-05-01 45773136 2020-03-10 6.0 14630.0
8 USAU 2020-03-20 2020-02-28 2020-05-01 45773136 2020-03-11 5.300000190734863 20160.0
9 USAU 2020-03-20 2020-02-28 2020-05-01 45773136 2020-03-12 4.699999809265137 17390.0
In this dataframe, I have 577 different symbols (more than just USAU
). All dates in Date
column are different ranges, as they're all centered around a specific event happening.
I am trying to create a column which, for each ticker in the dataset with unique date range, assigns an integer (1-10) to the dates for grouping.
Pseudocode:
for symbol in symbols_in_dataframe:
convert date of symbol to integer
# IE
# Symbol: A YYYY/MM/01 -> 1
# Symbol: A YYYY/MM/02 -> 2
# .....
# Symbol: A YYYY/MM/10 -> 10
This loops through each symbol.
# Symbol: B YYYY/MM/01 -> 1
# Symbol: B YYYY/MM/02 -> 2
# .....
# Symbol: B YYYY/MM/10 -> 10
I gave this a shot but this is not the outcome I am wanting.
def to_integer(dt_time):
return 10000*dt_time.year 100*dt_time.month dt_time.day
blank_df = pd.dataframe()
for symbol in symbols_in_dataframe:
blank_df.map(symbol, to_integer)
Any help here would be greatly appreciated.
EDIT:
Expected output of dataframe:
symbol split_date date_start date_end marketCap Date Adj Close Volume Num_seq
0 USAU 2020-03-20 2020-02-28 2020-05-01 45773136 2020-02-28 6.3 22010.0 1
1 USAU 2020-03-20 2020-02-28 2020-05-01 45773136 2020-03-02 6.1 7300.0 2
2 USAU 2020-03-20 2020-02-28 2020-05-01 45773136 2020-03-03 6.5 9700.0 3
44 INLX 2020-03-20 2020-02-28 2020-05-01 19112184 2020-02-21 4.50 0.0 1
45 INLX 2020-03-20 2020-02-28 2020-05-01 19112184 2020-02-24 4.00 600.0 2
46 INLX 2020-03-20 2020-02-28 2020-05-01 19112184 2020-02-25 4.00 20.0 3
95 OMWS 2020-03-19 2020-02-27 2020-04-30 92829984 2021-04-12 0.25 1000.0 1
96 OMWS 2020-03-19 2020-02-27 2020-04-30 92829984 2021-04-14 0.25 6502.0 2
97 OMWS 2020-03-19 2020-02-27 2020-04-30 92829984 2021-04-16 0.25 0.0 3
Essentially, for each instance of symbol
, I am trying to assign a count of each symbol in a the dataframe, here seen under Num_seq
for number in sequence, if that makes sense.
You should be able to use the above dataframe as a reference.
EDIT with code sample from comment:
It seems it's making duplicates and sequencing on every other index entry on dataframe in an alternating format. I'm not entirely sure what is being edited with the code, though, as split_date
, date_start
, and date_end
should not be changed.
symbol split_date date_start date_end marketCap Date Adj Close Volume seq
44 INLX 2020-03-20 2020-02-28 2020-05-01 19112184 2020-02-21 4.500000 0.0 1
68 INLX 2020-03-13 2020-02-21 2020-04-24 19112184 2020-02-21 4.500000 0.0 1
45 INLX 2020-03-20 2020-02-28 2020-05-01 19112184 2020-02-24 4.000000 600.0 2
69 INLX 2020-03-13 2020-02-21 2020-04-24 19112184 2020-02-24 4.000000 600.0 2
46 INLX 2020-03-20 2020-02-28 2020-05-01 19112184 2020-02-25 4.000000 20.0 3
70 INLX 2020-03-13 2020-02-21 2020-04-24 19112184 2020-02-25 4.000000 20.0 3
47 INLX 2020-03-20 2020-02-28 2020-05-01 19112184 2020-02-26 4.000000 0.0 4
71 INLX 2020-03-13 2020-02-21 2020-04-24 19112184 2020-02-26 4.000000 0.0 4
CodePudding user response:
Here's one way to do it using groupby
and cumsum
to assign increasing numbers to each unique Date in each unique symbol. Not sure if this is exactly what you want or not, please comment if I misinterpreted
Here is an example with a similar dataset to the one you posted
df = pd.DataFrame({
'symbol':['a','a','a','b','b','b','c','c'],
'values':[8,9,9,9,10,11,12,13]
})
df = df.sort_values(['symbol','values']) #if table is not already sorted. for you it would be ['symbol','Date']
df['x'] = (
df.groupby('symbol')['values'] #group by symbol
.transform(lambda v: (~v.duplicated()).cumsum()) #assign numbers, avoiding duplicate counting
)
print(df)
Output:
symbol values x
0 a 8 1
1 a 9 2
2 a 9 2
3 b 9 1
4 b 10 2
5 b 11 3
6 c 12 1
7 c 13 2