I have a dataset in which my unit of analysis and variable are together in the same column. I would like to split this up into two variables, but not sure how to because it involves regex.
Say I have the following data:
de <- data.frame(year = c(2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012),
company = c("Company A - Variable 1", "Company A - Variable 2", "Company A - Variable 3",
"CompanyB - Variable 1", "CompanyB - Variable 2", "CompanyB - Variable 3",
"Compan- C - Variable 1", "Compan- C - Variable 2", "Compan- C - Variable 3",
"Company A - Variable 1", "Company A - Variable 2", "Company A - Variable 3",
"CompanyB - Variable 1", "CompanyB - Variable 2", "CompanyB - Variable 3",
"Compan- C - Variable 1", "Compan- C - Variable 2", "Compan- C - Variable 3"),
score = c(1,3,5,7,9,11,13,15,17,2,4,6,8,10,12,14,16,18))
Which looks like:
year company score
<dbl> <chr> <dbl>
2010 Company A - Variable 1 1
2010 Company A - Variable 2 3
2010 Company A - Variable 3 5
2010 CompanyB - Variable 1 7
2010 CompanyB - Variable 2 9
2010 CompanyB - Variable 3 11
2010 Compan- C - Variable 1 13
2010 Compan- C - Variable 2 15
2010 Compan- C - Variable 3 17
2012 Company A - Variable 1 2
2012 Company A - Variable 2 4
2012 Company A - Variable 3 6
2012 CompanyB - Variable 1 8
2012 CompanyB - Variable 2 10
2012 CompanyB - Variable 3 12
2012 Compan- C - Variable 1 14
2012 Compan- C - Variable 2 16
2012 Compan- C - Variable 3 18
The company variable is a combination of company names (in all different forms and shapes), always followed by " - " and then the variable name. I want to wrangle the data so that I arrive at the following structure, splitting up the company and the variable into new columns:
year company variable score
2010 Company A Variable 1 1
2010 Company A Variable 2 3
2010 Company A Variable 3 5
2010 CompanyB Variable 1 7
2010 CompanyB Variable 2 9
2010 CompanyB Variable 3 11
2010 Compan- C Variable 1 13
2010 Compan- C Variable 2 15
2010 Compan- C Variable 3 17
2012 Company A Variable 1 2
2012 Company A Variable 2 4
2012 Company A Variable 3 6
2012 CompanyB Variable 1 8
2012 CompanyB Variable 2 10
2012 CompanyB Variable 3 12
2012 Compan- C Variable 1 14
2012 Compan- C Variable 2 16
2012 Compan- C Variable 3 18
I guess then use something to extract the text before " - " (the company name) and after " - " (the variable name). Then sort of split this up into two different variables: company and variable. Any elegant solution would be much appreciated. Thanks
CodePudding user response:
You can use tidyr::separate
with a lookahead regex to split at the blank space before the string "Variable".
library(tidyr)
separate(de, company, sep = "(?=Variable)", into = c('Company', 'Variable'))
#> year Company Variable score
#> 1 2010 Company A - Variable 1 1
#> 2 2010 Company A - Variable 2 3
#> 3 2010 Company A - Variable 3 5
#> 4 2010 CompanyB - Variable 1 7
#> 5 2010 CompanyB - Variable 2 9
#> 6 2010 CompanyB - Variable 3 11
#> 7 2010 Compan- C - Variable 1 13
#> 8 2010 Compan- C - Variable 2 15
#> 9 2010 Compan- C - Variable 3 17
#> 10 2012 Company A - Variable 1 2
#> 11 2012 Company A - Variable 2 4
#> 12 2012 Company A - Variable 3 6
#> 13 2012 CompanyB - Variable 1 8
#> 14 2012 CompanyB - Variable 2 10
#> 15 2012 CompanyB - Variable 3 12
#> 16 2012 Compan- C - Variable 1 14
#> 17 2012 Compan- C - Variable 2 16
#> 18 2012 Compan- C - Variable 3 18