Home > Software design >  How to sum rows that start with the same string
How to sum rows that start with the same string

Time:01-29

I used pandas to clean a csv:

import pandas as pd 
import numpy as np
df = pd.read_csv(r'C:\Users\Leo90\Downloads\data-export.csv',encoding='utf-8', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
df=df.iloc[:,[0,1,2,3,4,5,6,7]]
df=df.replace(to_replace='None',value=np.nan).dropna()
df=df.reset_index(drop=True)
columnNames = df.iloc[0] 
df = df[1:] 
df.columns = columnNames
df.groupby('path').head()

The processed data like the screenshot below

Original data

I want to use python to make this dataframe look like this

enter image description here

I know I could use str.contain to match these strings but it can only return bool, so I can't sum the A&B columns. Are there are any solutions for this problem?

I tried str.contain to match these strings but I can't sum A&B.

CodePudding user response:

Use groupby.sum based on the first 12 chars of the path column:

start = df['path'].str[:12]  # first 12 chars of `path`
out = df.groupby(start).sum()

#               A   B
# path               
# google.com/A  5  10
# google.com/B  4   8
# google.com/C  4   8

Or based on a str.extract regex that captures up to the 2nd slash:

start = df['path'].str.extract(r'^(.*/.*)/')[0]  # up to the 2nd slash of `path`
out = df.groupby(start).sum()

#               A   B
# path               
# google.com/A  5  10
# google.com/B  4   8
# google.com/C  4   8

CodePudding user response:

Group by the path whilst disregarding the final subpath and aggregate (sum) the other columns.

df['path'] = df["path"].apply(lambda x: "/".join(x.split("/")[:-1]))
df.groupby("path").sum()
  • Related