Summing up all repeated values in a dataset

Time:12-01

I have a dataset in which one column holds a person's name and another holds the amount that person was paid for a given service. I'd like to build a list of all the names, ordered by the total amount each person was paid, regardless of the service they performed. Example:

Ann     100
John    200
Matt    150
John    150
John    150
Ann     300
Erik    150

===========
John    500
Ann     400
Matt    150
Erik    150

I figure this involves finding every repeated instance of a person's name, collecting the amounts paid from the other column, and summing them. The problem is that the list is too big to check individual names. That is, I can't define a particular string for each name to be checked; I'd like the program to find the repeated instances by itself and return the ordered list in the manner I described. Is there any way to do this? I know a bit of Python and R, so any method described in these languages would be particularly helpful.

CodePudding user response:

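If you load your dataset into a pandas DataFrame, this is easily done with groupby. Note that groupby returns the groups sorted by name, so add sort_values to order them by total:


import pandas as pd

# names and paid are the two columns of your dataset, as equal-length lists
df = pd.DataFrame({'name': names, 'paid': paid})

# Sum the payments per person, then order from highest to lowest total
total_pay = df.groupby('name')['paid'].sum().sort_values(ascending=False)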
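
If pandas is not available, the same group-and-sum idea can be sketched with the standard library alone. This is only an illustration: the rows list below is the sample data from the question typed in by hand, and total_by_name is a hypothetical helper name, not part of any library.

```python
from collections import defaultdict

# Sample rows copied from the question: (name, amount paid)
rows = [
    ("Ann", 100),
    ("John", 200),
    ("Matt", 150),
    ("John", 150),
    ("John", 150),
    ("Ann", 300),
    ("Erik", 150),
]

def total_by_name(rows):
    """Return (name, total) pairs ordered from highest to lowest total."""
    totals = defaultdict(int)
    for name, amount in rows:
        totals[name] += amount  # repeated names accumulate automatically
    # sorted() is stable, so people with equal totals keep first-appearance order
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranked = total_by_name(rows)
for name, total in ranked:
    print(f"{name:<8}{total}")
```

No name has to be spelled out in advance: each new name becomes a dictionary key the first time it appears, and every later occurrence adds to the same running total.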

CodePudding user response:

Using the mtcars data in R, a base R way would be to use the tapply() function, passing it the vector of values and the vector of names (hp and cyl respectively here):

tapply(mtcars$hp, mtcars$cyl, sum)

I'd personally use a data.table approach. The data.table is in essence a data.frame on steroids: it can do much more than the base R data.frame, and does so far more efficiently. Variables can be referred to by name (e.g. cyl instead of mtcars$cyl) from within the square brackets. The notation dt[i, j, by] describes the basic functionality: i is used to sort or subset rows (dtcars[order(mpg)]), j is used to select or process variables (dtcars[, mean(mpg)]), and by (or keyby, which also sorts the output by the grouping variable) makes the j-operation run once per group. Note you'll first need to convert the data.frame to a data.table using setDT() or data.table().

library(data.table)
dtcars <- data.table(mtcars, keep.rownames=TRUE)
dtcars[, sum(hp), keyby=cyl]

Or you could use a tidyverse approach, piping data from statement to statement.

library(tidyverse)
mtcars %>% 
  group_by(cyl) %>%
  summarize(sum(hp))