Removing duplicate words in R and cleaning similar words in R

Time:08-21

I have this table in R. In the measure column, the values "Zero-Loss Condensate Drain", "Zero Loss Condensate Drain" and "Zero-Loss Condensate Drains" mean the same thing, as do "Wi-Fi Thermostats", "Wi-Fi Thermostat" and "Wi-Fi thermostats", but R treats them as different values and counts them separately. I want the Wi-Fi thermostat variants to be treated as one value with a single count of 4, rather than separate counts of 1, 1 and 2. I want the same result for the Zero-Loss Condensate Drain variants.

measure Freq
Thermostatic Radiator Valves (TRVs) 45
Smart Thermostatic Radiator Enclosure 42
Smart Thermostats 4
Thermostatic radiator valves 3
Wi-Fi Enabled Thermostats 2
Wi-Fi Thermostats 1
Smart Thermostat 2
Thermostatic and Float Steam Traps 1
Thermostatic Radiator Valves 2
Dual Fuel Thermostat 1
Programmable Setback Thermostats 1
Wi-Fi Thermostat 1
Wi-Fi thermostats 2
Zero-Loss Condensate Drain 1
Zero Loss Condensate Drain 1
Zero-Loss Condensate Drains 2

CodePudding user response:

You need to tidy up your measure values before summarizing:

library(tidyverse)
df %>%
  # tidy up values in `measure`:
  mutate(
    # get rid of plural -s:
    measure = str_replace(measure, "(?<=thermostat|Drain)s", ""),
    # capitalize "thermostat"
    measure = str_replace(measure, "thermostat", "Thermostat"),
    # remove hyphen:
    measure = str_replace(measure, "(?<=Zero)-(?=Loss)", " ")) %>%
  # for each `measure` value...:
  group_by(measure) %>%
  # ...give frequency:
  summarise(Frequ = n())
# A tibble: 2 × 2
  measure                    Frequ
  <chr>                      <int>
1 Wi-Fi Thermostat               2
2 Zero Loss Condensate Drain     3

Data:

df <- data.frame(
  measure = c("Wi-Fi Thermostat", "Wi-Fi thermostats",
  "Zero-Loss Condensate Drain","Zero Loss Condensate Drain","Zero-Loss Condensate Drains")
)
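
Note that the code above counts rows with n(), while the table in the question already has a Freq column. A minimal sketch of the same tidy-up that sums Freq instead is shown below. It assumes a data frame called df_freq (a hypothetical name) holding only the Wi-Fi and Zero-Loss rows from the question, and it matches the plural -s case-insensitively so that "Thermostats" and "thermostats" are both handled:

```r
library(dplyr)
library(stringr)

# Assumed input: the Wi-Fi and Zero-Loss rows of the question's table
df_freq <- data.frame(
  measure = c("Wi-Fi Thermostats", "Wi-Fi Thermostat", "Wi-Fi thermostats",
              "Zero-Loss Condensate Drain", "Zero Loss Condensate Drain",
              "Zero-Loss Condensate Drains"),
  Freq = c(1, 1, 2, 1, 1, 2),
  stringsAsFactors = FALSE
)

result <- df_freq %>%
  mutate(
    # drop a trailing plural -s after "thermostat" or "drain", ignoring case
    measure = str_replace(measure,
                          regex("(thermostat|drain)s$", ignore_case = TRUE),
                          "\\1"),
    # capitalize "thermostat"
    measure = str_replace(measure, "thermostat", "Thermostat"),
    # replace the hyphen in "Zero-Loss" with a space
    measure = str_replace(measure, "(?<=Zero)-(?=Loss)", " ")
  ) %>%
  group_by(measure) %>%
  # add up the existing frequencies instead of counting rows
  summarise(Freq = sum(Freq))

print(result)
```

With these rows, both cleaned-up values end up with a total of 4, which is the count asked for in the question.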

CodePudding user response:

For this example we could do:

library(dplyr)
library(stringr)

df %>% 
  mutate(helper = toupper(measure),
         helper = ifelse(str_ends(helper, 'S'), substring(helper,1, nchar(helper)-1), helper),
         helper = str_replace(helper, '\\-', ' ')) %>% 
  group_by(helper) %>% 
  mutate(measure = first(measure)) %>% 
  group_by(measure) %>% 
  summarise(Freq = sum(Freq)) %>% 
  arrange(-Freq)
   measure                                Freq
   <chr>                                 <dbl>
 1 Thermostatic Radiator Valves (TRVs)      45
 2 Smart Thermostatic Radiator Enclosure    42
 3 Smart Thermostats                         6
 4 Thermostatic radiator valves              5
 5 Wi-Fi Thermostats                         4
 6 Zero-Loss Condensate Drain                4
 7 Wi-Fi Enabled Thermostats                 2
 8 Dual Fuel Thermostat                      1
 9 Programmable Setback Thermostats          1
10 Thermostatic and Float Steam Traps        1
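
The helper-key idea can also be sketched in base R without dplyr or stringr. The function name normalize_measure and the data frame df_freq are hypothetical names used only for illustration, and the sample rows are a subset of the question's table. Unlike the str_replace call above, gsub replaces every hyphen rather than only the first one:

```r
# Hypothetical helper: build a comparison key for a measure name
normalize_measure <- function(x) {
  key <- toupper(x)                         # ignore case
  key <- gsub("-", " ", key, fixed = TRUE)  # treat every hyphen as a space
  key <- sub("S$", "", key)                 # drop one trailing plural S
  trimws(key)
}

# Assumed input: a subset of the question's table
df_freq <- data.frame(
  measure = c("Wi-Fi Thermostat", "Wi-Fi thermostats",
              "Zero-Loss Condensate Drain", "Zero Loss Condensate Drain",
              "Zero-Loss Condensate Drains"),
  Freq = c(1, 2, 1, 1, 2),
  stringsAsFactors = FALSE
)

df_freq$key <- normalize_measure(df_freq$measure)

# label each group with the first spelling that produced its key
df_freq$label <- df_freq$measure[match(df_freq$key, df_freq$key)]

# sum the frequencies per label and sort in descending order
totals <- aggregate(Freq ~ label, data = df_freq, FUN = sum)
totals <- totals[order(-totals$Freq), ]

print(totals)
```

One caveat with this kind of key: stripping any trailing S also shortens words that are not plurals (for example a name ending in "Glass"), so the rule should be checked against the real values before relying on it.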