I have a posts DataFrame and an authors DataFrame, and I need to calculate the total favCount of the authors of the posts on each day.
posts_columns = ["postId", "authorId", "date"]
posts = [("1", "1", "2020-10-10"),
         ("2", "2", "2020-10-10"),
         ("3", "2", "2020-10-10"),
         ("4", "2", "2020-10-11"),
         ("5", "3", "2020-10-11"),
         ("6", "3", "2020-10-11")]
authors_columns = ["authorId", "favCount"]
authors = [("1", "5"),
           ("2", "3"),
           ("3", "12")]
After (inner) joining the Posts and Authors DataFrames on posts.authorId == authors.authorId, I got this:
+----------+--------+---------------+
|      date|authorId|       favCount|
+----------+--------+---------------+
|2020-10-10|       1|              5|
|2020-10-10|       2|              3|
|2020-10-10|       2|              3|
|2020-10-11|       2|              3|
|2020-10-11|       3|             12|
|2020-10-11|       3|             12|
+----------+--------+---------------+
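The join that produced this table is essentially the following (a sketch on the illustrative posts_df and authors_df from above):

posts_df.join(authors_df, "authorId", "inner") \
    .select("date", "authorId", "favCount") \
    .show()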
Now I want to calculate the sum of favCount per day; the final result should look like this:
+----------+-------------+
|      date|sum(favCount)|
+----------+-------------+
|2020-10-10|            8|
|2020-10-11|           15|
+----------+-------------+
On Oct 10th I have two distinct authors (1 and 2) with a total of 8 favCount (5 + 3).
On Oct 11th I have two distinct authors (2 and 3) with a total of 15 favCount (3 + 12).
P.S: I don't want to count favCount for duplicated authors; each author's favCount should be counted only once per day.
P.S(2): I'm using PySpark and DataFrames, but I don't mind answers in Pandas or even SQL.
CodePudding user response:
Considering df1 as the Posts DataFrame and df2 as the Authors DataFrame:
df2['favCount'] = df2['favCount'].astype(int)  # favCount is given as strings in the sample data
result = df1.merge(df2, how='inner').drop_duplicates(subset=['date', 'authorId'])
final = result.groupby('date')['favCount'].sum()
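Running that end to end against the sample data (a sketch; df1 and df2 are built with pd.DataFrame from the question's lists) should print something like:

print(final)
# date
# 2020-10-10     8
# 2020-10-11    15
# Name: favCount, dtype: int64

If you want it back as a two-column frame like the desired output, final.reset_index() does that.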
CodePudding user response:
If you want to try Spark, you can try this: drop duplicate (date, authorId) pairs first, then aggregate per day.
Scala code:
import org.apache.spark.sql.functions.sum
df1.join(df2, Seq("authorId"), "inner").dropDuplicates("date", "authorId").groupBy("date").agg(sum("favCount"))
or Python:
import pyspark.sql.functions as F
df1.join(df2, ["authorId"], "inner").dropDuplicates(["date", "authorId"]).groupBy("date").agg(F.sum("favCount"))
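Putting it together as a self-contained PySpark sketch (assuming an active SparkSession named spark and the sample lists from the question; favCount is cast to int because the sample stores it as strings):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(posts, posts_columns)      # Posts
df2 = spark.createDataFrame(authors, authors_columns)  # Authors

(df1.join(df2, ["authorId"], "inner")
    .dropDuplicates(["date", "authorId"])   # count each author once per day
    .groupBy("date")
    .agg(F.sum(F.col("favCount").cast("int")).alias("sum(favCount)"))
    .orderBy("date")
    .show())
# +----------+-------------+
# |      date|sum(favCount)|
# +----------+-------------+
# |2020-10-10|            8|
# |2020-10-11|           15|
# +----------+-------------+

dropDuplicates on (date, authorId) is what enforces the "once per author per day" rule; summing distinct favCount values instead would break if two different authors happened to share the same favCount.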