Efficient way to clean data in R


The input is posted as an image (not reproduced here); rows 3 and 5 have an incorrect format. What I want is:

sale_date produst_model store_code
20210208 ASUS_DE552 AAE_08072
20210305 ASUS_AC693 AAE_08072
20210107 ASUS_DE551 AAR_7461
20210325 ASUS_DB341 CMHT_654
20210227 ASUS_HG0982 BR_981

If this table has 20,000 rows, is there a more efficient way to check that every row matches the rule?

CodePudding user response:

From looking at the data posted, my hunch is that the strings in the three columns were at some point extracted from a composite string such as 20210227_ASUS_HG0982_BR_981, but the extraction seems to have gone wrong in some places. If this assumption is correct, I would recommend going back to the original strings and fixing the extraction, for example like this using the extract function:

library(tidyverse)
data.frame(original) %>%
  extract(original,
          into = c("sale_date", "produst_model", "store_code"),
          regex = "(\\d+)_(\\w+\\d+)_(\\w+)")
#>   sale_date produst_model store_code
#> 1  20210227   ASUS_HG0982     BR_981

Data:

original = "20210227_ASUS_HG0982_BR_981"

Obviously, the regex here is based only on a single string and will likely have to be adapted as soon as you have more strings.
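As a rough illustration of what that might look like with more strings, here is a minimal sketch using the same call on a small vector of composite strings. Apart from the string posted above, the example strings and the helper object parsed are made up for this sketch:

library(tidyverse)

# the first string is the one posted above; the other two are hypothetical,
# and the last one is deliberately missing its date part
original <- c("20210227_ASUS_HG0982_BR_981",
              "20210208_ASUS_DE552_AAE_08072",
              "ASUS_DB341_CMHT_654")

parsed <- data.frame(original) %>%
  extract(original,
          into = c("sale_date", "produst_model", "store_code"),
          regex = "(\\d+)_(\\w+\\d+)_(\\w+)",
          remove = FALSE)   # keep the original string for inspection

# rows the regex cannot parse come back as NA in every extracted column,
# which makes malformed rows easy to flag even in a 20,000-row table
parsed %>% filter(is.na(sale_date))

A single vectorized extract call like this should be quick for 20,000 rows, so the efficiency concern in the question is unlikely to be an issue.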

Tags: r