I want to sum sales of similar country-CodePudding

I have this data in a spreadsheet

Country       Sales
Spain 1       1000
Spain 2       200
France        300
Nigeria 1     500
Nigeria 2     700

I want the sum of this country's sales stored a seperate dataframe.

I tried using the dplyr function but the result is not what i want

Here is the output I want please

Country   Sum_of_sales

Spain     1200
France    300
Nigeria   1200

Is there a way I can run this on R that will give me this above output stored in a separate dataframe please.

CodePudding user response：

Strip the trailing numbers (and spaces) off the country labels, then do a normal grouped sum:

library(dplyr)
df %>%
  mutate(Country= gsub(pattern = " *[0-9] ", replacement = "", x = Country)) %>%
  group_by(Country) %>%
  summarize(Sum_of_Sales = sum(Sales))
# # A tibble: 3 × 2
#   Country Sum_of_Sales
#   <chr>          <int>
# 1 France           300
# 2 Nigeria         1200
# 3 Spain           1200

Using this sample input:

df = read.table(text = "Country       Sales
'Spain 1'       1000
'Spain 2'       200
'France'        300
'Nigeria 1'     500
'Nigeria 2'     700", header = T)

CodePudding user response：

If you wanna keep it tidyverse, you can use extract:

library(tidyr)
library(dplyr)
df %>% 
  extract(Country, "Country") %>% 
  group_by(Country) %>% 
  summarise(Sum_of_sales = sum(Sales))

# A tibble: 3 × 2
  Country Sum_of_sales
  <chr>          <int>
1 France           300
2 Nigeria         1200
3 Spain           1200

For more complex cases, you can use:

extract(Country, "Country", "([A-Za-z -]*)\\s*[0-9]*")

CodePudding user response：

An option with str_remove

library(dplyr)
library(stringr)
df %>% 
  group_by(Country = str_remove(Country, "\\s \\d ")) %>% 
  summarise(Sum_of_sales = sum(Sales))

-output

# A tibble: 3 × 2
  Country Sum_of_sales
  <chr>          <int>
1 France           300
2 Nigeria         1200
3 Spain           1200

CodePudding user response：

library(dplyr)

df <- data.frame(
  Country = c("Spain","Spain","France","Nigeria","Nigeria"),
  Sales = c(1000,200,300,500,700)
)    

df_by_country <-
  df %>% 
  group_by(Country) %>% 
  summarise(Sales = sum(Sales,na.rm = TRUE))

df_by_country

# A tibble: 3 x 2
  Country Sales
  <chr>   <dbl>
1 France    300
2 Nigeria  1200
3 Spain    1200