Basically, I've got a dataframe with loads of data about real estate. Every day, new data is added for each property, most importantly its price, the region it is located in, and the date it was added to the dataframe. For every region I want to calculate how the price develops day by day. I get my dataframe from the database like this:
data1 = pd.read_sql_query(
    "SELECT REAL_ESTATE.UNIQUE_RE_NUMBER, REAL_ESTATE.TYP_ID, ADDRESS.ADDRSS, ADDRESS.LOCATION, PRICE.RE_PRICE, MAX(PRICE.UPDATE_DATE) AS UPDATE_DATE, HOUSEINFO.RE_POLOHA, HOUSEINFO.RE_DRUH, HOUSEINFO.RE_TYP, HOUSEINFO.RE_UPLOCHA "
    "FROM REAL_ESTATE INNER JOIN ADDRESS, PRICE, HOUSEINFO ON REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=HOUSEINFO.INF_ID GROUP BY REAL_ESTATE.ID ",
    conn)
data2 = pd.read_sql_query(
    "SELECT REAL_ESTATE.UNIQUE_RE_NUMBER, REAL_ESTATE.TYP_ID, ADDRESS.ADDRSS, ADDRESS.LOCATION, PRICE.RE_PRICE, MAX(PRICE.UPDATE_DATE) AS UPDATE_DATE, FLATINFO.RE_DISPOZICE, FLATINFO.RE_DRUH, FLATINFO.RE_PPLOCHA "
    "FROM REAL_ESTATE INNER JOIN ADDRESS, PRICE, FLATINFO ON REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=FLATINFO.INF_ID GROUP BY REAL_ESTATE.ID ",
    conn)
data3 = pd.read_sql_query(
    "SELECT REAL_ESTATE.UNIQUE_RE_NUMBER, REAL_ESTATE.TYP_ID, ADDRESS.ADDRSS, ADDRESS.LOCATION, PRICE.RE_PRICE, MAX(PRICE.UPDATE_DATE) AS UPDATE_DATE, LANDINFO.RE_PLOCHA, LANDINFO.RE_DRUH, LANDINFO.RE_SITE, LANDINFO.RE_KOMUNIKACE "
    "FROM REAL_ESTATE INNER JOIN ADDRESS, PRICE, LANDINFO ON REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=LANDINFO.INF_ID GROUP BY REAL_ESTATE.ID ",
    conn)
df = [data1, data2, data3]
dff = pd.concat(df)
dff = dff.reset_index(drop=True)
To calculate the average I have this command:
dff['LOC_DATE_AVG'] = dff.groupby(['LOCATION', 'UPDATE_DATE'])['RE_PRICE'].transform('mean')
This only shows the average price of the listings added on each day, but I would like the cumulative average over every real estate record added up to a particular date. So when I have data from 1.1.2021, 2.1.2021 and 3.1.2021 and I ask for the average as of 2.1.2021, it should be calculated from the 1.1.2021 records together with the 2.1.2021 records. Is it possible?
CodePudding user response:
It's a bit hard to say without an MRE (minimal reproducible example). Please add one.
You could try:
dff["UPDATE_DATE"] = pd.to_datetime(dff["UPDATE_DATE"]) # Just to make sure
result = (dff[dff["UPDATE_DATE"] <= pd.Timestamp(year=2021, month=1, day=2)]
.groupby("LOCATION")["RE_PRICE"]
.mean())
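If you need that number for several cutoff dates, you could wrap the same filter in a small helper (just a sketch; avg_up_to is a name I made up, and it assumes dff has the LOCATION, UPDATE_DATE and RE_PRICE columns from your query):
def avg_up_to(frame, cutoff):
    """Mean RE_PRICE per LOCATION over all rows dated on or before `cutoff`."""
    cutoff = pd.Timestamp(cutoff)
    return (frame[frame["UPDATE_DATE"] <= cutoff]
            .groupby("LOCATION")["RE_PRICE"]
            .mean())

avg_up_to(dff, "2021-01-02")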
Regarding your comment: with the sample dataframe (an MRE :))
df = pd.DataFrame(
    {
        "LOCATION": ["A", "A", "A", "B", "B"],
        "UPDATE_DATE": ["2021-01-01", "2021-01-02", "2021-01-03",
                        "2021-01-01", "2021-01-02"],
        "RE_PRICE": [1, 2, 3, 1, 2],
    }
)
df["UPDATE_DATE"] = pd.to_datetime(df["UPDATE_DATE"])
  LOCATION UPDATE_DATE  RE_PRICE
0        A  2021-01-01         1
1        A  2021-01-02         2
2        A  2021-01-03         3
3        B  2021-01-01         1
4        B  2021-01-02         2
this
def cum_mean(sdf):
    return pd.DataFrame(
        sdf.query("UPDATE_DATE <= @day")["RE_PRICE"].mean()
        for day in sdf["UPDATE_DATE"].values
    )
df["CUM_MEAN"] = df.groupby("LOCATION").apply(cum_mean).reset_index(drop=True)
produces
  LOCATION UPDATE_DATE  RE_PRICE  CUM_MEAN
0        A  2021-01-01         1       1.0
1        A  2021-01-02         2       1.5
2        A  2021-01-03         3       2.0
3        B  2021-01-01         1       1.0
4        B  2021-01-02         2       1.5
If the UPDATE_DATE column is sorted in ascending order (groupwise is enough), you could also do
grouped = df.groupby("LOCATION")
df["CUM_MEAN"] = grouped["RE_PRICE"].cumsum() / (grouped.cumcount() 1)
which would probably be faster than the other version.
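Another option along the same lines (a sketch under the same assumption that dates are sorted ascending within each LOCATION) is pandas' expanding window, which takes the mean over all rows seen so far in each group:
df["CUM_MEAN"] = (
    df.groupby("LOCATION")["RE_PRICE"]
      .expanding()
      .mean()
      .droplevel(0)  # drop the LOCATION level so the result aligns with df's index
)
This should give the same CUM_MEAN column as above.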