I've got a dataframe with a column full of cells that look like this:
"***ORDER LIST***\nCustomer: Lucille\nitem1: apples\nitem2: oranges"
"***ORDER LIST***\nCustomer: Frank and Sally\nitem1: wine\nitem2: milk"
"***ORDER LIST***\n\n\nitem1: wine\nitem2: milk"
I am trying to sanitize each cell by removing the whole line beginning with the word Customer or, if that line is not there, the blank lines that take its place.
I would want to end up with sanitized text data like this:
"***ORDER LIST***\nitem1: apples\nitem2: oranges"
"***ORDER LIST***\nitem1: wine\nitem2: milk"
"***ORDER LIST***\nitem1: wine\nitem2: milk"
Using gsub, is there a way to get rid of both the blank lines and the whole line containing Customer?
Thanks
CodePudding user response:
Try something like this (note that the replacement string " " leaves a single space where the removed text was):
text <- c("***ORDER LIST***\nCustomer: Lucille\nitem1: apples\nitem2: oranges",
          "***ORDER LIST***\nCustomer: Frank and Sally\nitem1: wine\nitem2: milk",
          "***ORDER LIST***\n\n\nitem1: wine\nitem2: milk")
gsub("Customer: .*?\\n|\\n\\n", " ", text)
[1] "***ORDER LIST***\n item1: apples\nitem2: oranges" "***ORDER LIST***\n item1: wine\nitem2: milk"
[3] "***ORDER LIST*** \nitem1: wine\nitem2: milk"
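The output above still contains a stray space in each element, so it does not quite match the desired result from the question. A minimal variation, offered as a sketch rather than the answerer's own code, is to replace with an empty string and to match the Customer line with [^\n]* so the match cannot run past the end of that line. The name cleaned is only illustrative.

```r
text <- c("***ORDER LIST***\nCustomer: Lucille\nitem1: apples\nitem2: oranges",
          "***ORDER LIST***\nCustomer: Frank and Sally\nitem1: wine\nitem2: milk",
          "***ORDER LIST***\n\n\nitem1: wine\nitem2: milk")

# Remove either a whole "Customer: ..." line (including its trailing newline)
# or a pair of consecutive newlines, replacing the match with nothing.
cleaned <- gsub("Customer: [^\n]*\n|\n\n", "", text)

# writeLines shows the real line breaks instead of the escaped "\n".
writeLines(cleaned[1])
```

This assumes that pairs of blank lines only occur where the Customer line is missing; a double newline anywhere else in a cell would be removed as well.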
CodePudding user response:
Does this work for you?
gsub("(.*\\*).*?(\nitem.*)", "\\1\\2", text)
[1] "***ORDER LIST***\nitem1: apples\nitem2: oranges" "***ORDER LIST***\nitem1: wine\nitem2: milk"
[3] "***ORDER LIST***\nitem1: wine\nitem2: milk"
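Since the question is about a data frame column, the same idea can be applied to the column directly. The sketch below assumes a hypothetical data frame df with a character column orders (neither name comes from the question). It also uses perl = TRUE with the (?s) flag so that . explicitly matches newlines, which is a small change from the call above, not the answerer's exact code.

```r
# Hypothetical data frame and column name, used only for illustration.
df <- data.frame(
  orders = c("***ORDER LIST***\nCustomer: Lucille\nitem1: apples\nitem2: oranges",
             "***ORDER LIST***\nCustomer: Frank and Sally\nitem1: wine\nitem2: milk",
             "***ORDER LIST***\n\n\nitem1: wine\nitem2: milk"),
  stringsAsFactors = FALSE
)

# Group 1: everything up to the last "*" (the header line).
# Lazy middle part: whatever sits between the header and the first "\nitem".
# Group 2: the first "\nitem" and everything after it.
# Keeping only groups 1 and 2 drops the Customer line or the blank lines.
df$orders <- gsub("(?s)(.*\\*).*?(\nitem.*)", "\\1\\2", df$orders, perl = TRUE)

writeLines(df$orders[3])
```

This approach relies on every cell having a header that ends in * and items whose lines start with item; cells with a different layout would need a different pattern.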