Home > database >  convert range of dates into sequence of integers
convert range of dates into sequence of integers

Time:02-16

I have a dataframe of several different dates of financial information, which looks like this.

    symbol  split_date  date_start  date_end    marketCap   Date        Adj Close        Volume
0   USAU    2020-03-20  2020-02-28  2020-05-01  45773136    2020-02-28  6.300000190734862 22010.0
1   USAU    2020-03-20  2020-02-28  2020-05-01  45773136    2020-03-02  6.0999999046325675   7300.0
2   USAU    2020-03-20  2020-02-28  2020-05-01  45773136    2020-03-03  6.5 9700.0
3   USAU    2020-03-20  2020-02-28  2020-05-01  45773136    2020-03-04  7.0 13340.0
4   USAU    2020-03-20  2020-02-28  2020-05-01  45773136    2020-03-05  6.9000000953674325  5510.0
5   USAU    2020-03-20  2020-02-28  2020-05-01  45773136    2020-03-06  6.4000000953674325  5790.0
6   USAU    2020-03-20  2020-02-28  2020-05-01  45773136    2020-03-09  6.300000190734862   10080.0
7   USAU    2020-03-20  2020-02-28  2020-05-01  45773136    2020-03-10  6.0 14630.0
8   USAU    2020-03-20  2020-02-28  2020-05-01  45773136    2020-03-11  5.300000190734863   20160.0
9   USAU    2020-03-20  2020-02-28  2020-05-01  45773136    2020-03-12  4.699999809265137   17390.0

In this dataframe, I have 577 different symbols (more than just USAU). All dates in Date column are different ranges, as they're all centered around a specific event happening.

I am trying to create a column which, for each ticker in the dataset with unique date range, assigns an integer (1-10) to the dates for grouping.

Pseudocode:

for symbol in symbols_in_dataframe:
    convert date of symbol to integer

# IE
# Symbol: A YYYY/MM/01 -> 1
# Symbol: A YYYY/MM/02 -> 2
# .....
# Symbol: A YYYY/MM/10 -> 10
    This loops through each symbol.

# Symbol: B YYYY/MM/01 -> 1
# Symbol: B YYYY/MM/02 -> 2
# .....
# Symbol: B YYYY/MM/10 -> 10

I gave this a shot but this is not the outcome I am wanting.

def to_integer(dt_time):
    return 10000*dt_time.year   100*dt_time.month   dt_time.day

blank_df = pd.dataframe()
for symbol in symbols_in_dataframe:
    blank_df.map(symbol, to_integer)

Any help here would be greatly appreciated.


EDIT:

Expected output of dataframe:

    symbol  split_date  date_start  date_end    marketCap   Date    Adj Close   Volume   Num_seq
0   USAU    2020-03-20  2020-02-28  2020-05-01  45773136    2020-02-28  6.3     22010.0    1
1   USAU    2020-03-20  2020-02-28  2020-05-01  45773136    2020-03-02  6.1     7300.0     2
2   USAU    2020-03-20  2020-02-28  2020-05-01  45773136    2020-03-03  6.5     9700.0     3
44  INLX    2020-03-20  2020-02-28  2020-05-01  19112184    2020-02-21  4.50    0.0        1
45  INLX    2020-03-20  2020-02-28  2020-05-01  19112184    2020-02-24  4.00    600.0      2
46  INLX    2020-03-20  2020-02-28  2020-05-01  19112184    2020-02-25  4.00     20.0      3
95  OMWS    2020-03-19  2020-02-27  2020-04-30  92829984    2021-04-12  0.25    1000.0     1
96  OMWS    2020-03-19  2020-02-27  2020-04-30  92829984    2021-04-14  0.25    6502.0     2
97  OMWS    2020-03-19  2020-02-27  2020-04-30  92829984    2021-04-16  0.25    0.0        3

Essentially, for each instance of symbol, I am trying to assign a count of each symbol in a the dataframe, here seen under Num_seq for number in sequence, if that makes sense.

You should be able to use the above dataframe as a reference.


EDIT with code sample from comment:

It seems it's making duplicates and sequencing on every other index entry on dataframe in an alternating format. I'm not entirely sure what is being edited with the code, though, as split_date, date_start, and date_end should not be changed.

    symbol  split_date  date_start  date_end    marketCap   Date    Adj Close   Volume  seq
44  INLX    2020-03-20  2020-02-28  2020-05-01  19112184    2020-02-21  4.500000    0.0     1
68  INLX    2020-03-13  2020-02-21  2020-04-24  19112184    2020-02-21  4.500000    0.0     1
45  INLX    2020-03-20  2020-02-28  2020-05-01  19112184    2020-02-24  4.000000    600.0   2
69  INLX    2020-03-13  2020-02-21  2020-04-24  19112184    2020-02-24  4.000000    600.0   2
46  INLX    2020-03-20  2020-02-28  2020-05-01  19112184    2020-02-25  4.000000    20.0    3
70  INLX    2020-03-13  2020-02-21  2020-04-24  19112184    2020-02-25  4.000000    20.0    3
47  INLX    2020-03-20  2020-02-28  2020-05-01  19112184    2020-02-26  4.000000    0.0     4
71  INLX    2020-03-13  2020-02-21  2020-04-24  19112184    2020-02-26  4.000000    0.0     4

CodePudding user response:

Here's one way to do it using groupby and cumsum to assign increasing numbers to each unique Date in each unique symbol. Not sure if this is exactly what you want or not, please comment if I misinterpreted

Here is an example with a similar dataset to the one you posted

df = pd.DataFrame({
    'symbol':['a','a','a','b','b','b','c','c'],
    'values':[8,9,9,9,10,11,12,13]
})

df = df.sort_values(['symbol','values']) #if table is not already sorted. for you it would be ['symbol','Date']

df['x'] = (
    df.groupby('symbol')['values'] #group by symbol
      .transform(lambda v: (~v.duplicated()).cumsum()) #assign numbers, avoiding duplicate counting
)
print(df)

Output:

  symbol  values  x
0      a       8  1
1      a       9  2
2      a       9  2
3      b       9  1
4      b      10  2
5      b      11  3
6      c      12  1
7      c      13  2
  • Related