Extract all percentage numbers from a data frame column

I have a data.frame df with a character column text that contains text. From that column, I would like to extract all percentage numbers (say, 1.2% and -2.3%) but not the ordinary numbers (say, 123 and 1.2) into a character vector.

A small example:

df <- data.frame(text = c("this text is 1.3% this is +1.4% and this -1.5%",
                          "this text is 123.3% this 123.3 and this 1234.5"))

Required output:

[1] "1.3%"   "+1.4%"  "-1.5%"  "123.3%"

Is that possible?

CodePudding user response：

Probably not the most robust general-purpose solution, but works for your example:

unlist(stringr::str_extract_all(df$text, "[+\\-]?[0-9\\.]+%"))
#[1] "1.3%"   "+1.4%"  "-1.5%"  "123.3%"

## or using R's native forward pipe operator, since R 4.1.0
stringr::str_extract_all(df$text, "[+\\-]?[0-9\\.]+%") |> unlist()
#[1] "1.3%"   "+1.4%"  "-1.5%"  "123.3%"

This meets your expected output (i.e., a character vector). But in case you are thinking about storing the results to a new data frame column, you don't really want to unlist(). Just do:

df$percentages <- stringr::str_extract_all(df$text, "[+\\-]?[0-9\\.]+%")
df
#                                            text        percentages
#1 this text is 1.3% this is +1.4% and this -1.5% 1.3%, +1.4%, -1.5%
#2 this text is 123.3% this 123.3 and this 1234.5             123.3%

The new column percentages itself is a list:

str(df$percentages)
#List of 2
# $ : chr [1:3] "1.3%" "+1.4%" "-1.5%"
# $ : chr "123.3%"
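
If loading stringr is not an option, the same extraction can be sketched in base R with gregexpr() and regmatches(). The pattern here is an assumption of mine rather than the one from the answer: it is a bit stricter, requiring digits on both sides of an optional decimal point, so stray fragments such as "..%" do not match. The example data mirrors the question's, with an explicit + sign on the second value, and the long data frame is only an illustration of how to keep track of the source row.

```r
# Example data, defined here so the snippet runs on its own
df <- data.frame(text = c("this text is 1.3% this is +1.4% and this -1.5%",
                          "this text is 123.3% this 123.3 and this 1234.5"))

# Optional sign, digits, optional decimal part, then a literal percent sign
pattern <- "[+-]?[0-9]+(\\.[0-9]+)?%"

# gregexpr() locates every match in each string;
# regmatches() pulls the matched substrings out as a list
matches <- regmatches(df$text, gregexpr(pattern, df$text))

# Flatten to a single character vector
percentages <- unlist(matches)
percentages
#[1] "1.3%"   "+1.4%"  "-1.5%"  "123.3%"

# One row per extracted value, remembering which input row it came from
long <- data.frame(row = rep(seq_along(matches), lengths(matches)),
                   percentage = percentages)
```

Plain numbers such as 123.3 and 1234.5 are skipped because the pattern insists on the trailing percent sign.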

CodePudding user response：

Here is an alternative tidyverse way:

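First we extract the number with parse_number() from the readr package, then inside ifelse() we paste the number and a percent sign back together for rows whose text contains "%". Finally, pull() returns the result as a vector. Note that parse_number() reads only the first number in each string, so this assumes one value per element of text.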

library(tidyverse)

df %>% 
  mutate(x = parse_number(text),
         x = ifelse(str_detect(text, "%"), paste0(x,"%"), NA_character_)) %>% 
  pull(x)

[1] "1.3%"   "1.4%"   "-1.5%"  "123.3%" NA       NA
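
Because parse_number() reads only the first number in a string, rows that hold several percentages need the values pulled out before conversion. Below is a minimal base R sketch, assuming the goal is numeric proportions rather than strings. The helper name extract_pct is hypothetical, and the regular expression is my own assumption, not something taken from the answers.

```r
# Hypothetical helper: return every percentage found in each string
# as a numeric proportion (a list with one numeric vector per input string)
extract_pct <- function(x) {
  hits <- regmatches(x, gregexpr("[+-]?[0-9]+(\\.[0-9]+)?%", x))
  # Drop the trailing "%" and rescale, e.g. "-1.5%" becomes -0.015
  lapply(hits, function(h) as.numeric(sub("%$", "", h)) / 100)
}

text <- c("this text is 1.3% this is +1.4% and this -1.5%",
          "this text is 123.3% this 123.3 and this 1234.5")

res <- extract_pct(text)

# Flatten if a single numeric vector is wanted
props <- unlist(res)
```

Keeping the result as a list preserves the link between each input string and its values; unlist() is only needed when that link does not matter.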