I have a dataset in which one column holds the name of a person and another column holds the amount that person was paid for a given service. I'd like to build a list with the names of all people, ordered by the total amount they were paid, regardless of the service they performed. Example:
Ann 100
John 200
Matt 150
John 150
John 150
Ann 300
Erik 150
===========
John 500
Ann 400
Matt 150
Erik 150
I figured this involves finding all repeated instances of each person's name, collecting the values paid from the other column, and summing them. The problem is that the list is too big to check individual names. That is, I can't define a particular string for each name to be checked; rather, I'd like the program to find the repeated instances by itself and return the ordered list in the manner I described. Is there any way to do this? I know a bit of Python and R, so any method described in these languages would be particularly helpful.
CodePudding user response:
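If you load your dataset into a pandas DataFrame, this is easily done with groupby, followed by sort_values to order the totals:
import pandas as pd
# names and paid are sequences holding the two columns of your dataset
df = pd.DataFrame({'name': names, 'paid': paid})
# sum the payments per person, then order from highest to lowest total
total_pay = df.groupby('name')['paid'].sum().sort_values(ascending=False)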
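For comparison, the same group-and-sum idea can be sketched without pandas, using only the standard library. This is a minimal sketch, assuming the data is available as (name, amount) pairs; the names payments and totals_by_name are made up for illustration, and the rows are the sample from the question.

```python
from collections import Counter

# Sample rows copied from the question: (name, amount paid)
payments = [
    ("Ann", 100), ("John", 200), ("Matt", 150), ("John", 150),
    ("John", 150), ("Ann", 300), ("Erik", 150),
]

def totals_by_name(rows):
    """Sum the amounts per name; return (name, total) pairs, largest total first."""
    totals = Counter()
    for name, amount in rows:
        totals[name] += amount  # missing names start at 0
    # most_common() orders by total, descending; ties keep first-seen order
    return totals.most_common()

ranked = totals_by_name(payments)
for name, total in ranked:
    print(name, total)
```

No name has to be listed in advance: each distinct name becomes a key the first time it is seen, which is the "figure out the repeated instances by itself" behaviour asked for.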
CodePudding user response:
Using the mtcars data in R, a base R way would be to use the tapply() function and isolate vectors of your name and value data (cyl and hp respectively here):
tapply(mtcars$hp, mtcars$cyl, sum)
I'd personally use a data.table approach. The data.table is in essence a data.frame on steroids: it can do way more than the base R data.frame, and far more efficiently. Variables can be referred to by name (e.g. cyl instead of mtcars$cyl) from within the square brackets. The notation dt[i, j, by] describes the basic functionality: i is used to sort or subset data (dtcars[order(mpg)]), j is used to select or process variables (dtcars[, mean(mpg)]), and by (or keyby, which also sorts the output by the grouping variable) allows the j-operation to be done by grouping variables. Note you'll first need to convert the data.frame to a data.table using setDT() or data.table().
library(data.table)
dtcars <- data.table(mtcars, keep.rownames=TRUE)
dtcars[, sum(hp), keyby=cyl]
Or you could use a tidyverse approach, piping data from statement to statement.
library(tidyverse)
mtcars %>%
  group_by(cyl) %>%
  summarize(sum(hp))