Here's the thing.
I'm building a dataframe so my students can use it in an exercise. This is what I was looking for:
id date n
0 2022-01-01 10
0 2022-01-02 30
0 2022-01-03 40
.
.
.
1 2022-01-01 0
1 2022-01-02 5
1 2022-01-03 16
.
.
.
2 2022-01-01 99
2 2022-02-01 100
2 2022-03-01 3300
It's a pretty simple idea. I have a panel data in which the first column is the customer id, the second one is the purchase date and n is the amount of things purchased. However, I couldn't get the second column in a fine order.
This is the farthest I got:
ids = pd.Series(range(100)).repeat(10).reset_index(drop=True)
dates = pd.date_range(start = pd.Timestamp('2022-01-01'),end = pd.Timestamp('2022-01-10')).repeat(100)
purchases = pd.Series(np.random.randint(low=0, high=500, size=100*10))
df = pd.DataFrame(data={'ids':ids, 'dates':dates, 'purchases': purchases}, index=None)
As you can see, I couldn't get it right. This is the result:
ids dates n
0 2022-01-01 10
0 2022-01-01 30
0 2022-01-01 40
.
.
.
Any ideas on how can I change the column dates in order to start with 2022-01-01, but in the second row we need to get 2022-01-02. So I get the correct order once and repeat.
This looks like something really easy, but I couldn't find the answer.
CodePudding user response:
This'll do what it seems like you want:
start_date = pd.Timestamp("2022-01-01")
purchase_low = 0
purchase_high = 500
num_ids = 100
num_days = 10
ids = np.repeat(np.arange(num_ids), num_days)
dates = np.tile(pd.date_range("2022-01-01", periods=num_days), num_ids)
purchases = np.random.randint(purchase_low, purchase_high, num_ids * num_days)
df = pd.DataFrame({"ids": ids, "dates": dates, "purchases": purchases})
Output:
ids dates purchases
0 0 2022-01-01 193
1 0 2022-01-02 362
2 0 2022-01-03 218
3 0 2022-01-04 395
4 0 2022-01-05 172
.. ... ... ...
995 99 2022-01-06 454
996 99 2022-01-07 342
997 99 2022-01-08 356
998 99 2022-01-09 423
999 99 2022-01-10 104
[1000 rows x 3 columns]
The issue with your approach was that .repeat()
repeats each element n
times, whereas what you needed to do with the dates was repeat each sequence of dates n
times. Hence why I use .repeat
for ids
, but .tile
for dates
.