Efficient way to clean data in R


The input is posted as an image (not reproduced here); rows 3 and 5 have an incorrect format. What I want is:

sale_date produst_model store_code
20210208 ASUS_DE552 AAE_08072
20210305 ASUS_AC693 AAE_08072
20210107 ASUS_DE551 AAR_7461
20210325 ASUS_DB341 CMHT_654
20210227 ASUS_HG0982 BR_981

If this table has 20,000 rows, is there a more efficient way to check that every row matches the rule?

CodePudding user response:

From looking at the data posted, my hunch is that the strings in the three columns were at some point extracted from a composite string such as 20210227_ASUS_HG0982_BR_981, but the extraction seems to have gone wrong in some places. If this assumption is correct, I would recommend going back to the original strings and fixing the extraction, for example like this using the extract function:

library(tidyverse)
data.frame(original) %>%
  extract(original,
          into = c("sale_date", "produst_model", "store_code"),
          regex = "(\\d+)_(\\w+\\d+)_(\\w+)")
#>   sale_date produst_model store_code
#> 1  20210227   ASUS_HG0982     BR_981

Data:

original = "20210227_ASUS_HG0982_BR_981"

Obviously, the regex here is based only on a single string and will likely have to be adapted as soon as you have more strings.
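As a rough illustration of what that might look like with more strings, here is a minimal sketch using the same call on a small vector of composite strings. Apart from the string posted above, the example strings and the helper object parsed are made up for this sketch:

library(tidyverse)

# the first string is the one posted above; the other two are hypothetical,
# and the last one is deliberately missing its date part
original <- c("20210227_ASUS_HG0982_BR_981",
              "20210208_ASUS_DE552_AAE_08072",
              "ASUS_DB341_CMHT_654")

parsed <- data.frame(original) %>%
  extract(original,
          into = c("sale_date", "produst_model", "store_code"),
          regex = "(\\d+)_(\\w+\\d+)_(\\w+)",
          remove = FALSE)   # keep the original string for inspection

# rows the regex cannot parse come back as NA in every extracted column,
# which makes malformed rows easy to flag even in a 20,000-row table
parsed %>% filter(is.na(sale_date))

A single vectorized extract call like this should be quick for 20,000 rows, so the efficiency concern in the question is unlikely to be an issue.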

Tags: r