I'm trying to calculate the average expenses of a particular site over a given period during which a client was active. The data held in 2 separate tables
df1=
ClientID | DateStart | DateEnd | Site |
---|---|---|---|
1 | 2020-02-13 | 2020-02-15 | Albury |
2 | 2020-02-14 | 2020-02-15 | wodonga |
df2=
Site | Date | Expense |
---|---|---|
Albury | 2020-02-13 | 4 |
Albury | 2020-02-14 | 5 |
Albury | 2020-02-15 | 7 |
Albury | 2020-02-16 | 4 |
Albury | 2020-02-17 | 4 |
Wodonga | 2020-02-14 | 100 |
Wodonga | 2020-02-15 | 120 |
I was hoping for this
ClientID | DateStart | DateEnd | Site | Avg_exp |
---|---|---|---|---|
1 | 2020-02-13 | 2020-02-16 | Albury | 5.3 |
2 | 2020-02-14 | 2020-02-15 | wodonga | 110 |
I'm using SQLDF in R, and have come up with following steps:
sqldf("select avg(Expense) from df1,df2 WHERE df2.Site= 'Albury' AND df2.Date >= df1.DateStart AND df2.Date <= df1.DateEnd")
This gives me the average expense for Albury for the period that Client 1 was active - which is what i would like to attach to ClientID 1.
To do this, I try to add a new column in df1 that returns the average expense if df1.Site = df2.Site
df1$Avg_exp = sqldf("select avg(Expense) from df1,df2 WHERE df2.Site= df1.Site AND df2.Date >= df1.DateStart AND df2.Date <= df1.DateEnd")
This just calculates the Average expense for the entire of df2.
I've also tried just putting out a new df where I have grouped the average expense by clientID, with the aim of then just joining it back to DF1
df3=sqldf("select avg(Expense) from df1,df2 WHERE df2.Site= df1.Site AND df2.Date >= df1.DateStart AND df2.Date <= df1.DateEnd GROUP BY df1.ClientID")
But I get the Error "Parameter 5 does not have length 2"
Any suggestions?
Thanks!
CodePudding user response:
What you actually need is to use a left join for the two tables. So you would get something like that:
SELECT df1.ClientId, df1.DateStart, df1.DateEnd, df1.Site, AVG(df1.Expense) as Avg_exp
FROM df1
LEFT JOIN df2 ON df2.Site = df1.Site
WHERE df2.Date BETWEEN df1.DateStart AND df1.DateEnd
GROUP BY df1.ClientId, df1.DateStart, df1.DateEnd, df1.Site
Doing the above with a comparison over strings may not give you the expected result though. For example, 'wodonga' is not the same as 'Wodonga'.