I have a posts DataFrame and an authors DataFrame, and I need to calculate the total favCount of the authors of the posts on each day.
posts_columns = ["postId", "authorId", "date"]
posts = [("1", "1", "2020-10-10"),
         ("2", "2", "2020-10-10"),
         ("3", "2", "2020-10-10"),
         ("4", "2", "2020-10-11"),
         ("5", "3", "2020-10-11"),
         ("6", "3", "2020-10-11")]
authors_columns = ["authorId", "favCount"]
authors = [("1", "5"),
           ("2", "3"),
           ("3", "12")]
After (inner) joining the Posts and Authors DataFrames on posts.authorId == authors.authorId, I got this:
+----------+--------+---------------+
|      date|authorId|       favCount|
+----------+--------+---------------+
|2020-10-10|       1|              5|
|2020-10-10|       2|              3|
|2020-10-10|       2|              3|
|2020-10-11|       2|              3|
|2020-10-11|       3|             12|
|2020-10-11|       3|             12|
+----------+--------+---------------+
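The join that produced this table is essentially the following (a sketch on the illustrative posts_df and authors_df from above):

posts_df.join(authors_df, "authorId", "inner") \
    .select("date", "authorId", "favCount") \
    .show()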
Now I want to calculate the sum of favCount per day; the final result should look like this:
+----------+-------------+
|      date|sum(favCount)|
+----------+-------------+
|2020-10-10|            8|
|2020-10-11|           15|
+----------+-------------+
On Oct 10th I have two distinct authors (1 and 2) with a total of 8 favCount (5 + 3).
On Oct 11th I have two distinct authors (2 and 3) with a total of 15 favCount (3 + 12).
P.S: I don't want to count favCount for duplicated authors; each author's favCount should be counted only once per day.
P.S(2): I'm using PySpark and DataFrames, but I don't mind answers in Pandas or even SQL.
CodePudding user response:
Considering df1 as the Posts DataFrame and df2 as the Authors DataFrame:
df2['favCount'] = df2['favCount'].astype(int)  # favCount is given as strings in the sample data
result = df1.merge(df2, how='inner').drop_duplicates(subset=['date', 'authorId'])
final = result.groupby('date')['favCount'].sum()
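Running that end to end against the sample data (a sketch; df1 and df2 are built with pd.DataFrame from the question's lists) should print something like:

print(final)
# date
# 2020-10-10     8
# 2020-10-11    15
# Name: favCount, dtype: int64

If you want it back as a two-column frame like the desired output, final.reset_index() does that.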
CodePudding user response:
If you want to try Spark, you can try this: drop duplicate (date, authorId) pairs first, then aggregate per day.
Scala code:
import org.apache.spark.sql.functions.sum
df1.join(df2, Seq("authorId"), "inner").dropDuplicates("date", "authorId").groupBy("date").agg(sum("favCount"))
or Python:
import pyspark.sql.functions as F
df1.join(df2, ["authorId"], "inner").dropDuplicates(["date", "authorId"]).groupBy("date").agg(F.sum("favCount"))
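Putting it together as a self-contained PySpark sketch (assuming an active SparkSession named spark and the sample lists from the question; favCount is cast to int because the sample stores it as strings):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(posts, posts_columns)      # Posts
df2 = spark.createDataFrame(authors, authors_columns)  # Authors

(df1.join(df2, ["authorId"], "inner")
    .dropDuplicates(["date", "authorId"])   # count each author once per day
    .groupBy("date")
    .agg(F.sum(F.col("favCount").cast("int")).alias("sum(favCount)"))
    .orderBy("date")
    .show())
# +----------+-------------+
# |      date|sum(favCount)|
# +----------+-------------+
# |2020-10-10|            8|
# |2020-10-11|           15|
# +----------+-------------+

dropDuplicates on (date, authorId) is what enforces the "once per author per day" rule; summing distinct favCount values instead would break if two different authors happened to share the same favCount.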