Home > OS >  Sum of multiple data type columns in pandas
Sum of multiple data type columns in pandas

Time:12-08

Consider:

Name   Location code  ID  Details  Column5        Code
Kirsh  HD12           76  Age:25   127.3          Yes
Kirsh  HD12           76  Age:25   12758.56       Yes
Kirsh  HD12           76  Age:25   No data found  No
John   HD12           87  Age:25;  76.45          Yes
John   HD12           87  Age:25;  No data found  No
John   HD12           87  Age:25;                 No
John   HD12           87  Age:25;  16578.56       Yes
Ward   HD12           87  Age:25;                 No
Ward   HD12           87  Age:25;                 No

I want this as:

Name   Location code  ID  Details  Column5    Code
Kirsh  HD12           76  Age:25   12962.31   Yes
John   HD12           87  Age:25;  16655.01   No
Ward   HD12           87  Age:25;             No

CodePudding user response:

import pandas as pd

data = {'Name': ['Kirsh', 'Kirsh', 'Kirsh', 'John', 'John', 'John', 'John', 'Ward', 'Ward'],
        'Location code': ['HD12', 'HD12', 'HD12', 'HD12', 'HD12', 'HD12', 'HD12', 'HD12', 'HD12'],
        'ID': [76, 76, 76, 87, 87, 87, 87, 87, 87],
        'Details': ['Age:25', 'Age:25', 'Age:25', 'Age:25;', 'Age:25;', 'Age:25;', 'Age:25;', 'Age:25;', 'Age:25;'],
        'Column5': [127.3, 12758.56, 'No data found', 76.45, 'No data found', 'No', 16578.56, 'No', 'No'],
        'Code': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'No']
        }

df = pd.DataFrame(data)

# Sum the numerical values
summation = df.loc[(df['Column5'] != 'No data found') &
                   (df['Code'] != 'No'), 'Column5'].sum()

# Create a new column with the summation
df['Column5_Sum'] = summation

# Group the data by name and location code
grouped_data = df.groupby(['Name', 'Location code', 'ID', 'Details']).agg(
    {'Column5': 'first', 'Code': 'first', 'Column5_Sum': 'first'}).reset_index()

# Drop the temporary column
grouped_data.drop(columns='Column5_Sum', inplace=True)

print(grouped_data)

CodePudding user response:

Here is the solution. Please replace those "No data found" with either 0 of nan:

import pandas as pd
import numpy as np

data = {
  "Name": ["Kirsh", "Kirsh", "Kirsh", "John", "John", "John", "John", "Ward", "Ward"],
  "Location": ["HD12", "HD12", "HD12","HD12", "HD12", "HD12", "HD12","HD12", "HD12"],
  "Code": [76, 76, 76, 87, 87, 87,87, 87,87],
    "Id": ["Age:25", "Age:25", "Age:25", "Age:25", "Age:25", "Age:25", "Age:25", "Age:25", "Age:25"],
    "Details":[127.3, 12758.56, np.nan, 76.45, np.nan, np.nan, 16578.56, np.nan, np.nan],
    "Column5": ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No"]

}

df = pd.DataFrame(data)
df['Details'] = df.groupby(['Name'])['Details'].transform('sum')

gk = df.groupby("Name")
gk.first()

Please see https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/ for groupby information

Also, Please see the comment of @John Stud, he already answered the best solution.

  • Related