Issue with piping stringr str_detect into str_extract - extract is only pulling text from 1st row

I'm trying to create a new column which just contains certain numeric data from an expression.

Here's my data: https://pastebin.com/hYg3zqYz

I just need the numbers that come after Bipolar in column 12.

Here's what works

p <- df %>% 
      select(where(~ any(stringr::str_detect(.x, "Bipolar")))) #returns correct column

When I then try to make a new column that pulls just the number, it only ever returns the value from the first row. I'm not sure what I'm doing wrong.

p %>%
      mutate(group = "sr_bipol",
             sr_bipol = as.numeric(stringr::str_extract(., "[0-9].[0-9]+"))) %>% 
       select(group, sr_bipol)

# A tibble: 20 × 2
   group    sr_bipol
   <chr>       <dbl>
 1 sr_bipol     7.83
 2 sr_bipol     7.83
 3 sr_bipol     7.83
 4 sr_bipol     7.83
 5 sr_bipol     7.83
.....................

I also get this warning:

 argument is not an atomic vector; coercing

CodePudding user response：

The . refers to the whole dataset, but str_extract needs a vector as input, not a data.frame. According to ?str_extract:

string - Input vector. Either a character vector, or something coercible to one.

We need to apply str_extract to column 12 itself. Because its name, ...12, is an unusual column name, wrap it in backticks to access the column values:

library(dplyr)
library(stringr)
df %>% 
  transmute(group = 'sr_bipol', 
    sr_bipol = as.numeric(str_extract(`...12`, "(?<=Bipolar\\s)[0-9]\\.[0-9]+")))

-output

# A tibble: 20 × 2
   group    sr_bipol
   <chr>       <dbl>
 1 sr_bipol     7.83
 2 sr_bipol     2.34
 3 sr_bipol     1.97
 4 sr_bipol     1.94
 5 sr_bipol     2.85
 6 sr_bipol     2.92
 7 sr_bipol     3.05
 8 sr_bipol     2.80
 9 sr_bipol     3.43
10 sr_bipol     2.11
11 sr_bipol     2.80
12 sr_bipol     1.81
13 sr_bipol     1.84
14 sr_bipol     3.87
15 sr_bipol     1.68
16 sr_bipol     2.21
17 sr_bipol     2.97
18 sr_bipol     3.09
19 sr_bipol     2.84
20 sr_bipol     3.48
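One caveat about the pattern, shown here as a rough sketch rather than a required change: [0-9]\\.[0-9]+ allows only a single digit before the decimal point. The example below uses base R (regexpr with perl = TRUE) so it runs without extra packages. The second sample string, with a Bipolar value of 12.5, is invented for illustration, and the wider pattern is only an assumption about what the real data might contain.

```r
# Hypothetical sample strings; the 12.5 value is invented to show a reading
# with two digits before the decimal point.
x <- c("Bipolar 7.827 / Unipolar 16.911 / LAT -9.0",
       "Bipolar 12.5 / Unipolar 9.09 / LAT -10.0")

# Pattern from the answer: exactly one digit before the decimal point.
narrow <- "(?<=Bipolar\\s)[0-9]\\.[0-9]+"
# Wider variant (an assumption): one or more digits, optional decimal part.
wide <- "(?<=Bipolar\\s)[0-9]+(?:\\.[0-9]+)?"

# perl = TRUE enables lookbehind in base R; regexpr() returns -1 for no match.
narrow_hit <- regexpr(narrow, x, perl = TRUE)
wide_hit <- regexpr(wide, x, perl = TRUE)

# Which rows did the narrow pattern match?
narrow_found <- as.vector(narrow_hit != -1)
# Numbers extracted by the wider pattern (every row matches here).
wide_values <- as.numeric(regmatches(x, wide_hit))

print(narrow_found)
print(wide_values)
```

The narrow pattern misses the invented 12.5 row, while the wider one picks up both values. The same pattern string should also work as the second argument to str_extract.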

The p data is a single-column tibble/data.frame. When we use ., it passes the whole data.frame rather than the column, i.e.

> str(p)
tibble [20 × 1] (S3: tbl_df/tbl/data.frame)
 $ ...12: chr [1:20] "Bipolar 7.827 / Unipolar 16.911 / LAT -9.0" "Bipolar 2.34 / Unipolar 9.09 / LAT -10.0" "Bipolar 1.974 / Unipolar 9.219 / LAT -11.0" "Bipolar 1.938 / Unipolar 10.572 / LAT -9.0" ...
> str_extract(p, "[0-9].[0-9]+")
[1] "7.827"
Warning message:
In stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing

It extracts only the first match from the coerced input, and that single value is recycled to fill the whole column with 7.83.
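The coercion can be approximated in base R. This is a minimal sketch, assuming the coercion behaves roughly like as.character() on a list; it is not a trace of what stringi does internally. The data frame p_toy and its column name x are hypothetical stand-ins for p.

```r
# Toy one-column data frame standing in for p
# (values taken from the str() output above; the column name x is made up).
p_toy <- data.frame(
  x = c("Bipolar 7.827 / Unipolar 16.911 / LAT -9.0",
        "Bipolar 2.34 / Unipolar 9.09 / LAT -10.0"),
  stringsAsFactors = FALSE
)

# Coercing the whole data frame gives one string per column, not one per row.
coerced <- as.character(p_toy)
n_strings <- length(coerced)

# The first match in that single string is the first row's number.
first_hit <- regmatches(coerced, regexpr("[0-9]\\.[0-9]+", coerced))

# Passing the column itself gives one match per row.
per_row <- regmatches(p_toy$x, regexpr("[0-9]\\.[0-9]+", p_toy$x))

print(n_strings)
print(first_hit)
print(per_row)
```

Because the coerced input has length one, only one value comes back, and mutate() then recycles it down the column.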

If more than one column contains 'Bipolar', we can loop over them with across (change transmute to mutate to keep all the other columns from the original data):

df %>% 
  transmute(across(where(~ any(stringr::str_detect(.x, "Bipolar"))), 
   ~ as.numeric(str_extract(.x, "(?<=Bipolar\\s)[0-9]\\.[0-9]+")), 
     .names = "sr_bipol{str_remove(.col, '[.]+')}"))
# A tibble: 20 × 1
   sr_bipol12
        <dbl>
 1       7.83
 2       2.34
 3       1.97
 4       1.94
 5       2.85
 6       2.92
 7       3.05
 8       2.80
 9       3.43
10       2.11
11       2.80
12       1.81
13       1.84
14       3.87
15       1.68
16       2.21
17       2.97
18       3.09
19       2.84
20       3.48

CodePudding user response：

Here is an alternative approach:

library(tidyverse)

df %>% 
  select(...12) %>% 
  separate(...12, into="group", sep = "\\/") %>%
  mutate(sr_bipol = parse_number(group),
         group= str_extract(group, '[A-Za-z]+'))

   group   sr_bipol
   <chr>      <dbl>
 1 Bipolar     7.83
 2 Bipolar     2.34
 3 Bipolar     1.97
 4 Bipolar     1.94
 5 Bipolar     2.85
 6 Bipolar     2.92
 7 Bipolar     3.05
 8 Bipolar     2.80
 9 Bipolar     3.43
10 Bipolar     2.11
11 Bipolar     2.80
12 Bipolar     1.81
13 Bipolar     1.84
14 Bipolar     3.87
15 Bipolar     1.68
16 Bipolar     2.21
17 Bipolar     2.97
18 Bipolar     3.09
19 Bipolar     2.84
20 Bipolar     3.48