I am trying to do a simple group_by and count on a customer dataset to find out how many orders each customer has placed.
The code is something like this:
customer_grouped <- customer_dataset %>% group_by(customer_name) %>% summarise(n=n())
However, I noticed that the same customer's name can be written in different ways. The result looks like this:
customer_name | n |
---|---|
Will Smith | 5 |
Will smith | 3 |
will Smith | 3 |
will smith | 15 |
will smith the actor | 15 |
Will Smith the actor, 1990 | 15 |
I know all of them are the same customer, i.e. Will Smith; the staff have just entered his name in different formats. How can I use group_by to find the number of orders for each customer?
CodePudding user response:
Ideally, you can avoid this step if there is a unique customer identifier. De-duplicating customer lists is a whole industry, so there is unfortunately no general-purpose code solution. There are many edge cases, including capitalization (sometimes meaningful for distinctions), misspellings, middle names, cultural variations in name order, punctuation, aliases, name changes over time, honorifics, and so on. Many people also share their name with others.
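As a rough illustration of the identifier route, here is a minimal sketch. It assumes the data has a customer_id column, which is hypothetical and does not appear in the question, and the rows are made up for the example.

```r
library(dplyr)

# Hypothetical data: customer_id is an assumed column, not part of the question.
customer_dataset <- tibble(
  customer_id   = c(101, 101, 101, 202),
  customer_name = c("Will Smith", "will smith", "Will Smith the actor", "Jada Smith")
)

# Orders per customer, grouped on the identifier instead of the free-text name
orders_per_customer <- customer_dataset %>% count(customer_id)

# Number of distinct customers
n_customers <- n_distinct(customer_dataset$customer_id)
```

Grouping on the identifier sidesteps the spelling variations entirely, since the name column is never used as a key.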
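For your example data, you could convert to lower case and take the first two words, like this:
library(dplyr)
library(stringr)
# Lower-case the name, keep words 1 through 2, then count rows per cleaned name
customer_dataset %>%
count(customer_name = customer_name %>% str_to_lower() %>% word(1, 2))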
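To see the effect on the names from the question, here is a self-contained sketch. The dataset is a hypothetical reconstruction with one row per order, built from the counts in the question's table.

```r
library(dplyr)
library(stringr)

# Hypothetical reconstruction of the question's data: one row per order.
customer_dataset <- tibble(
  customer_name = rep(
    c("Will Smith", "Will smith", "will Smith", "will smith",
      "will smith the actor", "Will Smith the actor, 1990"),
    times = c(5, 3, 3, 15, 15, 15)
  )
)

# Normalize the name inside count(), then count rows per normalized name
customer_grouped <- customer_dataset %>%
  count(customer_name = customer_name %>% str_to_lower() %>% word(1, 2))

customer_grouped
```

All six variants collapse into the single key "will smith". Note that this rule would also merge two different customers who happen to share a first and last name, which is one of the edge cases mentioned above.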
By the way, count(x) is a shortcut for the common pattern group_by(x) %>% summarize(n = n()), and both count and group_by allow you to perform manipulations on the grouping variable before it is grouped.
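A small sketch of that equivalence, using made-up names for illustration; both forms transform the grouping variable inline:

```r
library(dplyr)
library(stringr)

# Made-up sample data for illustration
customer_dataset <- tibble(
  customer_name = c("Will Smith", "will smith", "WILL SMITH", "Jada Smith")
)

# Long form: transform the grouping variable inside group_by()
via_group_by <- customer_dataset %>%
  group_by(customer_name = str_to_lower(customer_name)) %>%
  summarise(n = n())

# Short form: the same transformation inside count()
via_count <- customer_dataset %>%
  count(customer_name = str_to_lower(customer_name))
```

Both results should contain one row per lower-cased name with its row count.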