Using group_by in a dataset where the values of a variable are written in different ways

Time:06-09

I am trying to do a simple group_by and count on a customer database to find out how many customers are buying our products.

The code is something like this:

customer_grouped <- customer_dataset %>% group_by(customer_name) %>% summarise(n=n())

However, I noticed that the same customer can have his name written in different ways. The result looks like this:

customer_name n
Will Smith 5
Will smith 3
will Smith 3
will smith 15
will smith the actor 15
Will Smith the actor, 1990 15

I know all of them are the same customer, i.e. Will Smith; the staff have simply entered his name in different formats. How can I use group_by to find the number of orders for each customer?

CodePudding user response:

Ideally, you can avoid this step if there is a unique customer identifier. De-duping customer lists is a whole industry, so unfortunately there is no general-purpose code solution. There are many edge cases: capitalization (sometimes meaningful for distinctions), misspellings, middle names, cultural variations in name order, punctuation, aliases, names that change over time, honorifics, and so on. Many people also share a name with others.

For your example data, you could convert to lower case and take the first two words, like

library(dplyr)    # provides %>% and count()
library(stringr)  # provides str_to_lower() and word()

customer_dataset %>% 
  count(customer_name = customer_name %>% str_to_lower() %>% word(1, 2))

By the way, count(x) is a shortcut for the common pattern group_by(x) %>% summarize(n = n()), and both count and group_by let you transform the grouping variable before it is grouped.
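To make that equivalence concrete, here is a minimal sketch that rebuilds the question's example as one row per order and runs both forms. The helper normalise_name and the object names name_counts, by_count and by_group are made up for illustration, and the data frame is reconstructed from the counts shown in the question, so treat it as an assumption about what the real data looks like.

```r
library(dplyr)
library(stringr)

# Order counts per spelling, copied from the table in the question.
name_counts <- c(
  "Will Smith" = 5,
  "Will smith" = 3,
  "will Smith" = 3,
  "will smith" = 15,
  "will smith the actor" = 15,
  "Will Smith the actor, 1990" = 15
)

# Assumed shape of the raw data: one row per order.
customer_dataset <- tibble(
  customer_name = rep(names(name_counts), times = name_counts)
)

# Hypothetical helper: lower-case the name, then keep the first two words.
normalise_name <- function(x) word(str_to_lower(x), 1, 2)

# count() with the grouping variable transformed inline.
by_count <- customer_dataset %>%
  count(customer_name = normalise_name(customer_name))

# The longer group_by() + summarise() form of the same thing.
by_group <- customer_dataset %>%
  group_by(customer_name = normalise_name(customer_name)) %>%
  summarise(n = n())

print(by_count)
print(by_group)
```

Both results collapse the six spellings into a single row, "will smith", with n = 56 (5 + 3 + 3 + 15 + 15 + 15). Keeping the first two words only works because every variant here starts with the same given name and surname; it would merge two different customers named Will Smith and would split names with a middle name or a leading honorific.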
