R Extract Partial String from Specific Rows-CodePudding

I have a dataframe 'df1' with a string column 'Field_notes' of various information that looks like this:

Alt_ID	Field_notes
	JMs # 04J0342
	JMs # 04J0343
	JMs # 04J0344
	# broken leg
	54.2
	JMs # 04J0345

I would like to extract parts of the strings from the "Field_notes" column for specific rows only to the "Alt_ID" column. In this case, I'd like to subset rows 1,2,3,6 so that the alphanumeric combination after "JMs # " is moved to the "Alt_ID" column, so the result looks like:

Alt_ID	Field_notes
04J0342	JMs # 04J0342
04J0343	JMs # 04J0343
04J0344	JMs # 04J0344
	# broken leg
	54.2
04J0345	JMs # 04J0345

The tricky part is that there are so many combinations of information in Field_notes that I probably can't rely on character patterns and instead have to rely on specifying row names/numbers. In this case, I don't want to extract anything from '# broken leg'.

CodePudding user response：

Here's another solution:

library(dplyr)
library(stringr)

df <- data.frame(
  Alt_ID = NA_character_,
  Field_notes = c("JMs # 04J0342", "JMs # 04J0343", "JMs # 04J0344",
                   "# broken leg", "54.2", "JMs # 04J0345")
)

id_pattern <- "(?<=JMs # )\\w "

df %>%
  mutate(
    Alt_ID = str_extract(Field_notes, id_pattern)
  )
#    Alt_ID   Field_notes
# 1 04J0342 JMs # 04J0342
# 2 04J0343 JMs # 04J0343
# 3 04J0344 JMs # 04J0344
# 4    <NA>  # broken leg
# 5    <NA>          54.2
# 6 04J0345 JMs # 04J0345

# Or equivalently:
df$Alt_ID <- str_extract(df$Field_notes, id_pattern)
df
#    Alt_ID   Field_notes
# 1 04J0342 JMs # 04J0342
# 2 04J0343 JMs # 04J0343
# 3 04J0344 JMs # 04J0344
# 4    <NA>  # broken leg
# 5    <NA>          54.2
# 6 04J0345 JMs # 04J0345

The regex "(?<=JMs # )\\w " is interpreted as find the substring "JMs # and match the alphanumeric substring that follows.

CodePudding user response：

Update:

Here is another approach: Now you don't rely on # but now you rely on count of characters:

df %>% 
  mutate(Alt_ID = ifelse(nchar(Field_notes)==13,  str_extract(Field_notes,  '\\b\\w $'), ""), .before=1)

  Alt_ID   Field_notes
1 04J0342 JMs # 04J0342
2 04J0343 JMs # 04J0343
3 04J0344 JMs # 04J0344
4            broken leg
5                  54.2
6 04J0345 JMs # 04J0345

First answer:

A combination of regex '\\b\\w $' and stringr functions:

library(dplyr)
library(stringr)
df %>% 
  mutate(Alt_ID= str_extract(Field_notes,  '\\b\\w $')) %>% 
  mutate(Alt_ID = ifelse(str_detect(Field_notes, '\\#'), Alt_ID, ""))

    Field_notes  Alt_ID
1 JMs # 04J0342 04J0342
2 JMs # 04J0343 04J0343
3 JMs # 04J0344 04J0344
4    broken leg        
5          54.2        
6 JMs # 04J0345 04J0345

CodePudding user response：

You could try this, using the regex JMs\\s#\\s(\\w )$ and picking out the capturing group:

library(stringr)
library(dplyr)

mutate(df, Alt_ID = stringr::str_match(field_notes, "JMs\\s#\\s(\\w )$")[, 2])

Output:

   Alt_ID   field_notes
1 04J0342 JMs # 04J0342
2    <NA>  # broken leg
3 04J0334 JMs # 04J0334