I have a dataframe 'df1' with a string column 'Field_notes' of various information that looks like this:
Alt_ID | Field_notes |
---|---|
JMs # 04J0342 | |
JMs # 04J0343 | |
JMs # 04J0344 | |
# broken leg | |
54.2 | |
JMs # 04J0345 |
I would like to extract parts of the strings from the "Field_notes" column for specific rows only to the "Alt_ID" column. In this case, I'd like to subset rows 1,2,3,6 so that the alphanumeric combination after "JMs # " is moved to the "Alt_ID" column, so the result looks like:
Alt_ID | Field_notes |
---|---|
04J0342 | JMs # 04J0342 |
04J0343 | JMs # 04J0343 |
04J0344 | JMs # 04J0344 |
# broken leg | |
54.2 | |
04J0345 | JMs # 04J0345 |
The tricky part is that there are so many combinations of information in Field_notes that I probably can't rely on character patterns and instead have to rely on specifying row names/numbers. In this case, I don't want to extract anything from '# broken leg'.
CodePudding user response:
Here's another solution:
library(dplyr)
library(stringr)
df <- data.frame(
Alt_ID = NA_character_,
Field_notes = c("JMs # 04J0342", "JMs # 04J0343", "JMs # 04J0344",
"# broken leg", "54.2", "JMs # 04J0345")
)
id_pattern <- "(?<=JMs # )\\w "
df %>%
mutate(
Alt_ID = str_extract(Field_notes, id_pattern)
)
# Alt_ID Field_notes
# 1 04J0342 JMs # 04J0342
# 2 04J0343 JMs # 04J0343
# 3 04J0344 JMs # 04J0344
# 4 <NA> # broken leg
# 5 <NA> 54.2
# 6 04J0345 JMs # 04J0345
# Or equivalently:
df$Alt_ID <- str_extract(df$Field_notes, id_pattern)
df
# Alt_ID Field_notes
# 1 04J0342 JMs # 04J0342
# 2 04J0343 JMs # 04J0343
# 3 04J0344 JMs # 04J0344
# 4 <NA> # broken leg
# 5 <NA> 54.2
# 6 04J0345 JMs # 04J0345
The regex "(?<=JMs # )\\w "
is interpreted as find the substring "JMs #
and match the alphanumeric substring that follows.
CodePudding user response:
Update:
Here is another approach: Now you don't rely on #
but now you rely on count of characters:
df %>%
mutate(Alt_ID = ifelse(nchar(Field_notes)==13, str_extract(Field_notes, '\\b\\w $'), ""), .before=1)
Alt_ID Field_notes
1 04J0342 JMs # 04J0342
2 04J0343 JMs # 04J0343
3 04J0344 JMs # 04J0344
4 broken leg
5 54.2
6 04J0345 JMs # 04J0345
First answer:
A combination of regex '\\b\\w $'
and stringr
functions:
library(dplyr)
library(stringr)
df %>%
mutate(Alt_ID= str_extract(Field_notes, '\\b\\w $')) %>%
mutate(Alt_ID = ifelse(str_detect(Field_notes, '\\#'), Alt_ID, ""))
Field_notes Alt_ID
1 JMs # 04J0342 04J0342
2 JMs # 04J0343 04J0343
3 JMs # 04J0344 04J0344
4 broken leg
5 54.2
6 JMs # 04J0345 04J0345
CodePudding user response:
You could try this, using the regex JMs\\s#\\s(\\w )$
and picking out the capturing group:
library(stringr)
library(dplyr)
mutate(df, Alt_ID = stringr::str_match(field_notes, "JMs\\s#\\s(\\w )$")[, 2])
Output:
Alt_ID field_notes
1 04J0342 JMs # 04J0342
2 <NA> # broken leg
3 04J0334 JMs # 04J0334