Remove everything in a string after the first " - " (multiple " - ")

Time:08-17

I am struggling to keep only the part before the first " - ". When I try this regex on regex101.com I get the expected output, but when I run it in R I get a different output.

authors <- sub("\\s-\\s.*", "", authors)

Input:

[1] "T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020 - annualreviews.org"         
[2] "L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011 - Elsevier"           
[3] "CD Thomas - Diversity and Distributions, 2010 - Wiley Online Library"   

Expected output:

 [1] "T Dietz, RL Shwom, CT Whitley"       
 [2] "L Berrang-Ford, JD Ford, J Paterson"
 [3] "CD Thomas"

Actual output:

 [1] "T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020"       
 [2] "L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011"
 [3] "CD Thomas - Diversity and Distributions, 2010" 

Thanks in advance!

CodePudding user response:

When I run this code alone, I get your expected output:

authors <- c("T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020 - annualreviews.org",      
"L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011 - Elsevier",           
"CD Thomas - Diversity and Distributions, 2010 - Wiley Online Library")

sub("\\s-\\s.*", "", authors)

#[1] "T Dietz, RL Shwom, CT Whitley"       "L Berrang-Ford, JD Ford, J Paterson" "CD Thomas"  

One possible cause is that you assign the result back to authors every time you call sub(), which overwrites the original vector. If you did that while developing the regex, an earlier attempt may already have modified authors, and you may have forgotten to reset it to the original input before trying the final pattern.
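A minimal sketch of one way to avoid that kind of overwrite while experimenting: keep the input in one variable and write each result to another. The names authors_raw and authors_clean are hypothetical and not from the question.

```r
# authors_raw (hypothetical name) holds the untouched input and is never reassigned
authors_raw <- c(
  "T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020 - annualreviews.org",
  "L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011 - Elsevier",
  "CD Thomas - Diversity and Distributions, 2010 - Wiley Online Library"
)

# Each attempt reads from authors_raw and writes to a separate variable,
# so a failed pattern cannot affect the next attempt
authors_clean <- sub("\\s-\\s.*", "", authors_raw)

print(authors_clean)
```

With this layout, rerunning the sub() line with a different pattern always starts from the same input.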

CodePudding user response:

You can use this regex and replace each match with an empty string, for example with a text editor's regex find-and-replace:

Regex

 -(.*?)$
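The same pattern can also be tried directly in R. This is only a sketch: perl = TRUE is an assumption added here so that the lazy quantifier is handled by the PCRE engine, and the variable name trimmed is hypothetical.

```r
authors <- c(
  "T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020 - annualreviews.org",
  "L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011 - Elsevier",
  "CD Thomas - Diversity and Distributions, 2010 - Wiley Online Library"
)

# The match starts at the first " -"; the lazy .*? then has to grow
# until $ (end of string) can match, so the whole tail is removed
trimmed <- sub(" -(.*?)$", "", authors, perl = TRUE)

print(trimmed)
```

The hyphen in "Berrang-Ford" is left alone because it is not preceded by a space.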

CodePudding user response:

You can also split each string on your delimiter (" -") and take the first element:

sapply(strsplit(authors, " -", fixed = TRUE), `[[`, 1)
[1] "T Dietz, RL Shwom, CT Whitley"       "L Berrang-Ford, JD Ford, J Paterson"
[3] "CD Thomas" 

You can also use a regex to remove the delimiter and everything after it. The match starts at the first " -", and because .* is greedy it extends as far as possible, here to the end of the string:

stringr::str_remove(authors, " -.*")
[1] "T Dietz, RL Shwom, CT Whitley"       "L Berrang-Ford, JD Ford, J Paterson"
[3] "CD Thomas"
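To illustrate how greediness interacts with where the match starts, here is a small sketch that compares two patterns. The second pattern is not taken from the question; it is shown only because it cuts at the last " - " instead of the first. The variable names first_cut and last_cut are hypothetical, and perl = TRUE is an assumption made so both calls use the same engine.

```r
authors <- c(
  "T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020 - annualreviews.org",
  "L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011 - Elsevier",
  "CD Thomas - Diversity and Distributions, 2010 - Wiley Online Library"
)

# The match begins at the first " - " and the greedy .* runs to the end,
# so everything after the first delimiter is dropped
first_cut <- sub(" - .*", "", authors, perl = TRUE)

# A greedy group in front of the delimiter consumes as much as it can,
# so the " - " that ends up matched is the last one in the string
last_cut <- sub("(.*) - .*", "\\1", authors, perl = TRUE)

print(first_cut)
print(last_cut)
```

If a result keeps the middle segment, as in the "Actual output" above, it may be worth checking whether a greedy prefix of this kind was part of the pattern that produced it.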