I'm pretty new to Python and I would like to create a DataFrame from scratch that will look basically like this:
| date | time_gap |
| ------------- | -------------- |
| 2018/12/05 | 0 |
| 2018/12/05 | 1 |
| 2018/12/05 | ... |
| 2018/12/05 | 97 |
| 2018/12/06 | 0 |
| 2018/12/06 | ... |
| 2018/12/06 | 97 |
| 2018/12/07 | 0 |
| 2018/12/07 | ... |
| 2018/12/07 | 97 |
...
So the first column repeats each date 98 times, while the second column counts from 0 to 97 for each date. I've tried this:
import pandas as pd
import datetime as dt
start_date = dt.datetime.strptime("2018/12/05", "%Y/%m/%d")
end_date = dt.datetime.strptime("2018/12/08", "%Y/%m/%d")
df_dates = pd.date_range(start_date, end_date).to_list()
dict_generated = {}
for date in df_dates:
    for n in range(0, 98):
        dict_generated["date"] = date
        dict_generated["time_gap"] = [x for x in range(0,98)]
df_from_dict = pd.DataFrame.from_dict(dict_generated)
But it only generated the rows for the last date in my list of dates. Besides, I've been told a nested for loop could be pretty slow to run (should the list of dates be quite long). Is there a quicker and more pythonic way to do this?
Thank you for your help!
CodePudding user response:
Try using pd.date_range to create the dates and np.arange to create the values:
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': pd.date_range("2018/12/05", "2018/12/08").repeat(98),
'time_gap': np.resize(np.arange(0, 98), 98*4)})
# the date range extends four days so we do 98*4 in the resize
date time_gap
0 2018-12-05 0
1 2018-12-05 1
2 2018-12-05 2
3 2018-12-05 3
4 2018-12-05 4
.. ... ...
387 2018-12-08 93
388 2018-12-08 94
389 2018-12-08 95
390 2018-12-08 96
391 2018-12-08 97
Here is an updated solution wrapped in a function:
def create_df(start: str, end: str, repeat: int) -> pd.DataFrame:
    dates = pd.date_range(start, end)  # create the date range
    df = pd.DataFrame({
        # repeat each date by the number given in the param repeat
        'date': dates.repeat(repeat),
        # cycle 0..repeat-1 once per date
        'time_gap': np.resize(np.arange(0, repeat), repeat*len(dates))})
    return df
create_df('2022-01-01', '2022-01-06', 123)
date time_gap
0 2022-01-01 0
1 2022-01-01 1
2 2022-01-01 2
3 2022-01-01 3
4 2022-01-01 4
... ... ...
733 2022-01-06 118
734 2022-01-06 119
735 2022-01-06 120
736 2022-01-06 121
737 2022-01-06 122
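We use np.resize because np.arange(0, n) on its own only creates a single array from 0 to n-1. When the requested size is larger than the input, np.resize fills the new array with repeated copies of it, so resizing to n*x gives the sequence 0 to n-1 repeated x times.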
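To make that concrete, here is a small sketch with 3 values and 2 repetitions instead of 98 and 4 (the sizes are chosen only for illustration). It also shows np.tile, which the answer above does not use but which gives the same cyclic repetition, and contrasts both with np.repeat, which is the behaviour the date column needs.

```python
import numpy as np

n, x = 3, 2  # illustrative sizes only: 3 values per date, 2 dates

base = np.arange(0, n)           # a single run: 0, 1, 2
cycled = np.resize(base, n * x)  # the run repeated end to end: 0, 1, 2, 0, 1, 2
tiled = np.tile(base, x)         # same result as np.resize in this case
repeated = np.repeat(base, x)    # each element repeated in place: 0, 0, 1, 1, 2, 2

print(cycled)
print(tiled)
print(repeated)
```

np.tile takes the number of repetitions rather than the final length, so it may read more directly when the number of dates is already known.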
CodePudding user response:
Try this:
import datetime as dt
import pandas as pd
start_date = dt.datetime.strptime("2018/12/05", "%Y/%m/%d")
end_date = dt.datetime.strptime("2018/12/08", "%Y/%m/%d")
# name the columns so the merged frame gets 'date' and 'time_gap'
df_dates = pd.DataFrame({'date': pd.date_range(start_date, end_date)})
time_gaps = pd.DataFrame({'time_gap': range(0, 98)})
# a cross join pairs every date with every time_gap value
df = pd.merge(df_dates, time_gaps, how='cross')
print(df)
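A minimal sketch of the same cross join on a smaller input (2 dates and 3 gaps), so the whole result is easy to inspect. The column names date and time_gap are taken from the question; how='cross' assumes pandas 1.2 or later.

```python
import pandas as pd

# Small illustrative inputs: 2 dates and 3 time gaps
df_dates = pd.DataFrame({"date": pd.date_range("2018-12-05", "2018-12-06")})
df_gaps = pd.DataFrame({"time_gap": range(3)})

# A cross join pairs every row of the left frame with every row of the right
df_small = pd.merge(df_dates, df_gaps, how="cross")
print(df_small)
```

The result has len(df_dates) * len(df_gaps) rows, with the dates in their original order and the gaps cycling within each date.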
CodePudding user response:
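This solution is like It_is_Chris's, but without using NumPy:
import datetime as dt
import pandas as pd
start_date = dt.datetime.strptime("2018/12/05", "%Y/%m/%d")
# 4 days starting at start_date, each repeated 98 times; the 0..97 list is repeated once per day
dates = pd.DataFrame({'date': pd.date_range(start_date, periods=4).strftime("%Y/%m/%d").repeat(98),
                      'time_gap': list(range(0, 98)) * 4})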
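If the number of days is not known in advance, the hardcoded 4 can be derived from the date range itself. The sketch below is one way to do that without NumPy; the function name build_frame and its parameters are made up for illustration.

```python
import pandas as pd

def build_frame(start: str, end: str, n_gaps: int) -> pd.DataFrame:
    """Return one row per (date, time_gap) pair for every date in [start, end]."""
    days = pd.date_range(start, end)  # inclusive on both ends
    return pd.DataFrame({
        # each date appears n_gaps times in a row
        "date": days.strftime("%Y/%m/%d").repeat(n_gaps),
        # the 0..n_gaps-1 sequence is repeated once per date
        "time_gap": list(range(n_gaps)) * len(days),
    })

frame = build_frame("2018/12/05", "2018/12/08", 98)
print(frame.shape)
```

Note that the date column holds strings here because of strftime; dropping that call keeps it as datetime64 values instead.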