Pandas how to merge duplicate rows into one but without modifying string columns


I ran into problems processing data in an xlsx file. My leader told me to first merge rows that share the same uid, and then to get the sums of "content post volum", "exploration volum" and "Views" grouped by the column "Account type", which contains the types "content" and "social".

For example, there are two rows recording the data of uid 1680260000; merge them into one. The last three columns should be added, turning into 5364, 3710029, and 300408. But don't modify the first three columns: the user_level should stay F2 (levels are ordered F1, F2, F3, F4, ...; prefer the front value).

uid          user_level    Account type   content post volum  exploration volum    Views
1680260000  F2                content          112                   934318            118
2209220000  F1                social           628                  147623896     20160351
1680260000  F5                content          5252                  2775711        300290
5390800000  F3                content          127                   1235530          8554
6017200000  F2                social           142                    649046         43144
7054610000  F2                social           23                    1226520        232074
1682390000  F2                content          1162                 18025639           722
3136670000  F2                content          6                    2123571         189379
3136670000  F6                content          0                         6              0
5393860000  F2                social           60                    3246476           17
6017200000  F3                content          677                   8855471        277229
6017200000  F2                social           737                  11854463             0
1685250000  F2                content          96                    2211002          5942

The expected result:
uid          user_level    Account type   content post volum    exploration volum    Views
1680260000  F2                content         5364                  3710029        300408
2209220000  F1                social           628                  147623896     20160351
5390800000  F3                content          127                   1235530          8554
6017200000  F2                social           1556                  21358980       320373
7054610000  F2                social           23                    1226520        232074
1682390000  F2                content          1162                 18025639          722
3136670000  F2                content          6                    2123577         189379
5393860000  F2                social           60                    3246476           17
1685250000  F2                content          96                    2211002          5942
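
For reproduction, the table above can be rebuilt as a DataFrame (column names copied verbatim, including the "volum" spellings):

import pandas as pd

df = pd.DataFrame({
    "uid": [1680260000, 2209220000, 1680260000, 5390800000, 6017200000,
            7054610000, 1682390000, 3136670000, 3136670000, 5393860000,
            6017200000, 6017200000, 1685250000],
    "user_level": ["F2", "F1", "F5", "F3", "F2", "F2", "F2", "F2", "F6",
                   "F2", "F3", "F2", "F2"],
    "Account type": ["content", "social", "content", "content", "social",
                     "social", "content", "content", "content", "social",
                     "content", "social", "content"],
    "content post volum": [112, 628, 5252, 127, 142, 23, 1162, 6, 0, 60,
                           677, 737, 96],
    "exploration volum": [934318, 147623896, 2775711, 1235530, 649046,
                          1226520, 18025639, 2123571, 6, 3246476, 8855471,
                          11854463, 2211002],
    "Views": [118, 20160351, 300290, 8554, 43144, 232074, 722, 189379, 0,
              17, 277229, 0, 5942],
})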

Now the problem is that if I use df.groupby("uid").sum(), the string columns ("user_level" and "Account type") get concatenated too. This isn't what I want, because later I need to extract rows based on them; for example, after merging the rows with duplicated uid, I need the rows whose user_level is in ["F0", "F1", "F2"]. But groupby turns those cells into "F1F3", "F4F1", which is hard to filter. I did try to split the string when extracting, such as

file[file.Account_social_type.str.split("F").isin(["1", "2", "3"])]

ps: after .str.split("F"), "F1F3" turns into the list ['', '1', '3']

but this doesn't work: .isin() compares each cell's whole list against the given values instead of checking the elements inside it, so the mask never matches!
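For reference, a per-cell membership test can be written with a regex via .str.contains, or element-wise with .apply (a sketch, reusing the Account_social_type column name from the snippet above):

# keep rows whose level string contains F1, F2 or F3 anywhere
file[file.Account_social_type.str.contains(r"F[123]")]

# or, test the split lists element by element
mask = file.Account_social_type.str.split("F").apply(
    lambda parts: any(p in {"1", "2", "3"} for p in parts))
file[mask]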
So in the end, I used a clumsy method. First I use

taruid = file[file.uid.duplicated(keep="first")].uid.to_list()
# a uid occurring three or more times is flagged more than once by
# duplicated(keep="first"), so the list still contains repeats
taruid = list(set(taruid))

to get all the repeated uids.
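The same list can be built in one step, since keep=False flags every occurrence of a duplicated uid and unique() removes the repeats:

taruid = file.loc[file.uid.duplicated(keep=False), "uid"].unique().tolist()

Then use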

def changeOne(rows: pd.DataFrame):
    # sort so the front user_level comes first, then store each
    # column's sum (divided by 92) in the first row
    rows = rows.sort_values(by="F_level")
    rows.content_post_volume.iloc[0] = rows["content_post_volume"].sum() / 92
    rows.views.iloc[0] = rows["views"].sum() / 92
    rows.exploration_volume.iloc[0] = rows["exploration_volume"].sum() / 92
    return rows

replaceOne: pd.DataFrame = pd.DataFrame()
for items in taruid:
    goals = file[file.uid == items]
    replaceOne = replaceOne.append(changeOne(goals.copy()).iloc[0])

to compute the sums of the specified columns and store them in the first row of each group.
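Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; collecting the pieces in a list and calling pd.concat once is the usual replacement, and it also avoids copying the growing frame on every iteration (a sketch reusing the names above):

pieces = [changeOne(file[file.uid == u].copy()).iloc[[0]] for u in taruid]
replaceOne = pd.concat(pieces, ignore_index=True)

Finally I use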

file = file.drop_duplicates(subset = "uid", keep=False)
# drop all repeated rows
file = pd.concat([file, replaceOne], axis=0, ignore_index=True)

to get the final integrated data. The flaws are obvious: about 1,500 rows take nearly 3 s. There must be a much easier and more efficient way to solve this with groupby or some advanced pandas function.
WHAT I WANT to ask is: how can we merge/sum duplicated (or otherwise specified) rows without modifying the string columns, or any columns I specify?
I spent half a day trying to optimize this and failed; I'd really appreciate it if you guys can figure it out.

CodePudding user response:

The expected format is unclear, but you can use different functions to aggregate the data.

Let's form comma-separated strings of the unique values for "user_level" and "Account type":

string_agg = lambda s: ','.join(dict.fromkeys(s))

out = (df.groupby('uid', as_index=False)
         .agg({'user_level': string_agg, 'Account type': string_agg,
               'content post volum': 'sum',
               'exploration volum': 'sum', 'Views': 'sum'})
      )
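
Here dict.fromkeys keeps the first occurrence of each value in order, so for uid 1680260000 the levels collapse to a readable string:

string_agg(df.loc[df.uid == 1680260000, 'user_level'])   # -> 'F2,F5'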

Output:

          uid user_level    Account type  content post volum  exploration volum     Views
0  1680260000      F2,F5         content                5364            3710029    300408
1  1682390000         F2         content                1162           18025639       722
2  1685250000         F2         content                  96            2211002      5942
3  2209220000         F1          social                 628          147623896  20160351
4  3136670000      F2,F6         content                   6            2123577    189379
5  5390800000         F3         content                 127            1235530      8554
6  5393860000         F2          social                  60            3246476        17
7  6017200000      F2,F3  social,content                1556           21358980    320373
8  7054610000         F2          social                  23            1226520    232074
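
If the goal is exactly the question's expected output (keep only the front user_level and a single Account type), sorting first and aggregating with 'first' should do it (a sketch; note that F1 < F2 < ... only holds lexicographically while the levels stay single-digit):

out = (df.sort_values('user_level')
         .groupby('uid', as_index=False)
         .agg({'user_level': 'first', 'Account type': 'first',
               'content post volum': 'sum',
               'exploration volum': 'sum', 'Views': 'sum'}))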

For the demo, here is an alternative aggregation function to count the duplicates:

from collections import Counter
string_agg = lambda s: ','.join([f'{k}({v})' if v>1 else k for k,v in Counter(s).items()])
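
For example, on the levels of uid 6017200000:

string_agg(['F2', 'F3', 'F2'])   # -> 'F2(2),F3'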

Output:

          uid user_level       Account type  content post volum  exploration volum     Views
0  1680260000      F2,F5         content(2)                5364            3710029    300408
1  1682390000         F2            content                1162           18025639       722
2  1685250000         F2            content                  96            2211002      5942
3  2209220000         F1             social                 628          147623896  20160351
4  3136670000      F2,F6         content(2)                   6            2123577    189379
5  5390800000         F3            content                 127            1235530      8554
6  5393860000         F2             social                  60            3246476        17
7  6017200000   F2(2),F3  social(2),content                1556           21358980    320373
8  7054610000         F2             social                  23            1226520    232074

CodePudding user response:

df = df.groupby(['uid', 'Account type', 'user_level'])[['content post volum', 'exploration volum', 'Views']].sum()
df2 = df.groupby(['uid', 'Account type'])[['content post volum', 'exploration volum', 'Views']].sum()
df = df.reset_index()
df = df.drop_duplicates('uid')
df = df[['uid', 'Account type', 'user_level']]
df = df.merge(df2, on=['uid'])
df

Output:

(the result table was posted as an image in the original answer)
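
One caveat: after reset_index, drop_duplicates('uid') keeps whichever row the grouped ordering puts first (it sorts by Account type before user_level), so to be sure the front user_level survives, sort before dropping:

df = df.sort_values('user_level').drop_duplicates('uid')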
