I want to create a pandas DataFrame with five columns: date, house type, neighborhood, realtor and zip code.
- A date range of August 1, 2023 to September 1, 2023
- There are 4 house types
- 6 realtors
- Only one neighborhood
- Only one zipcode So each the date needs to be repeated 16 times for each house type, in this only neighborhood, for each realtor and for one unique zip code. So there would be 16 rows per day.
I can add the consecutive dates to the date column, however, I have trouble in adding the rest of the data, which I think should be nested in for loops. So far, I have these lines:
import pandas as pd
from datetime import date,timedelta
delta = timedelta(days=1)
start = date(2023,8,1)
end=date(2023,9,1)
loop_date = start
df = pd.DataFrame(columns={"Date", "House type", "Neighborhood", "Realtor", "Zipcode"})
while loop_date<=end:
df = df.append({"Date":loop_date},ignore_index=True)
loop_date =delta
The intended output for each day would be the following:
Date | House type | Neighborhood | Realtor | Zipcode |
---|---|---|---|---|
8/1/23 | 1 | a | Alex | 10018 |
8/1/23 | 2 | a | Alex | 10018 |
8/1/23 | 3 | a | Alex | 10018 |
8/1/23 | 4 | a | Alex | 10018 |
8/1/23 | 1 | a | Maggie | 10018 |
8/1/23 | 2 | a | Maggie | 10018 |
8/1/23 | 3 | a | Maggie | 10018 |
8/1/23 | 4 | a | Maggie | 10018 |
8/1/23 | 1 | a | Ginger | 10018 |
8/1/23 | 2 | a | Ginger | 10018 |
8/1/23 | 3 | a | Ginger | 10018 |
8/1/23 | 4 | a | Ginger | 10018 |
8/1/23 | 1 | a | Pat | 10018 |
8/1/23 | 2 | a | Pat | 10018 |
8/1/23 | 3 | a | Pat | 10018 |
8/1/23 | 4 | a | Pat | 10018 |
Any help would be much appreciated, thank you!
CodePudding user response:
It sounds like a Cartesian product. I think product
in itertools
can help:
from itertools import product
list(product(['a', 'b', 'c'], ['Alex', 'Maggie', 'Ginger', 'Pat'], range(3)))
[('a', 'Alex', 0),
('a', 'Alex', 1),
('a', 'Alex', 2),
('a', 'Maggie', 0),
('a', 'Maggie', 1),
('a', 'Maggie', 2),
('a', 'Ginger', 0),
('a', 'Ginger', 1),
('a', 'Ginger', 2),
('a', 'Pat', 0),
('a', 'Pat', 1),
('a', 'Pat', 2),
('b', 'Alex', 0),
('b', 'Alex', 1),
('b', 'Alex', 2),
('b', 'Maggie', 0),
...
]
CodePudding user response:
You may create your dataframe using two nested loops:
- First
while
loop iterates over the dates (you already got that one covered) - Second
for
loop iterates over a list of realtors
You can simply create temporary dataframes df_temp
containing the date, realtor and house types (read from a numpy array), append them to a list and finally concatenate this list of dataframes into your output dataframe df
:
import numpy as np
import pandas as pd
from datetime import date,timedelta
delta = timedelta(days=1)
start = date(2023,8,1)
end=date(2023,9,1)
loop_date = start
realtors = ['Alex', 'Maggie', 'Ginger', 'Pat']
list_of_dataframes = []
while loop_date<=end:
for realtor in realtors:
df_temp = pd.DataFrame(columns={"Date", "House type", "Realtor"})
df_temp['House type'] = pd.Series(np.arange(1,5))
df_temp['Date'] = loop_date
df_temp['Realtor'] = realtor
list_of_dataframes.append(df_temp)
loop_date =delta
df = pd.concat(list_of_dataframes)
Once this is done, the only thing left to do is to populate columns Neighborhood
and Zipcode
!
CodePudding user response:
try this:
import pandas as pd
from itertools import product
house_types = [1,2,3,4]
realtors = [*'ABCDEF']
neighborhood = ['a']
zip_code = ['10018']
daily_data = product(house_types, realtors, neighborhood, zip_code)
dates = pd.date_range('2023-8-1', '2023-9-1')
data = product(dates, daily_data)
idx, data = zip(*data)
cols = ["House_type", "Realtor", "Neighborhood", "Zipcode"]
df = pd.DataFrame(data, index=idx, columns=cols).rename_axis('Date')
print(df)
>>>
House_type Realtor Neighborhood Zipcode
Date
2023-08-01 1 A a 10018
2023-08-01 1 B a 10018
2023-08-01 1 C a 10018
2023-08-01 1 D a 10018
2023-08-01 1 E a 10018
... ... ... ... ...
2023-09-01 4 B a 10018
2023-09-01 4 C a 10018
2023-09-01 4 D a 10018
2023-09-01 4 E a 10018
2023-09-01 4 F a 10018