What would be the best way to extract certain numbers after a specific phrase from a string in a col

Time:09-30

In the dataframe I'm working with, I have a column called 'weather' with weather data that looks like this:

Sunny Temp: 78F, Humidity: 63%, Wind: SSW 6 mph
Sunny Temp: 103° F, Humidity: 7%, Wind: 16 SW mph
Temp: 88F, Humidity: 43%, Wind: S 12 mph
Cloudy Temp: 81° F, Humidity: 90%, Wind: SW 5 mph

I'd like to use dplyr's mutate function to create new columns for the temperature and wind speed contained in the 'weather' column. For the temperature column, I'm thinking that a function which looks at the first 3 characters after "Temp: " and extracts any numbers should work. For wind, as you can see, the wind direction sometimes comes before the number. So a function similar to the one for temperature, but one that looks at maybe the first 6-7 characters after "Wind: " and extracts any numbers, would work.

I have read up on sub, gsub, substr and str_extract and tried to apply each of them to my problem, but I'm not able to select the specific part of the string described above. For example, I tried:

mutate(temperature = sub('.*Temp: ', '', weather)) %>%
mutate(temperature = substr(temperature, 1, 2))

but this does not work when the temperature is 1 or 3 digits.

Any help is greatly appreciated!

CodePudding user response:

We may use str_extract with a regex lookbehind, i.e. match one or more digits (\\d+) that immediately follow the substring "Temp: "

library(dplyr)
library(stringr)
df1 %>%
  mutate(temperature = str_extract(weather, "(?<=Temp: )\\d+"),
         # the wind direction is optional; {0,3} keeps the lookbehind length bounded
         Wind = str_extract(weather, "(?<=Wind: [A-Z]{0,3} ?)\\d+"))
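
For comparison, the same values can be pulled out with base R's sub() and a capture group, which stays close to the attempt in the question. This is only a sketch: the vector `weather` below is a stand-in built from the sample rows, and the results are converted with as.integer() because the extracted pieces are character strings.

```r
# Stand-in for the 'weather' column, using the sample rows from the question
weather <- c("Sunny Temp: 78F, Humidity: 63%, Wind: SSW 6 mph",
             "Sunny Temp: 103° F, Humidity: 7%, Wind: 16 SW mph",
             "Temp: 88F, Humidity: 43%, Wind: S 12 mph",
             "Cloudy Temp: 81° F, Humidity: 90%, Wind: SW 5 mph")

# Keep only the digits captured right after "Temp: "
temperature <- as.integer(sub(".*Temp: ([0-9]+).*", "\\1", weather))

# (?:[A-Z]+ )? skips an optional wind direction that precedes the speed
wind <- as.integer(sub(".*Wind: (?:[A-Z]+ )?([0-9]+).*", "\\1", weather, perl = TRUE))

print(data.frame(temperature, wind))
```

One caveat of this variant: when a row does not match the pattern at all, sub() returns the input unchanged, so as.integer() would produce NA with a warning for that row.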

CodePudding user response:

library(dplyr)
library(stringr)
df %>%
  mutate(Temperature = str_extract(x, "(?<=Temp:\\s)\\d+"),
         Wind = str_extract(x, "(?<=Wind:\\s[A-Z]{0,5}\\s?)\\d+"))
                                                  x Temperature Wind
1 Sunny Temp: 78F, Humidity: 63%, Wind: SSW 6 mph          78    6
2 Sunny Temp: 103° F, Humidity: 7%, Wind: 16 SW mph         103   16
3        Temp: 88F, Humidity: 43%, Wind: S 12 mph          88   12
4 Cloudy Temp: 81° F, Humidity: 90%, Wind: SW 5 mph          81    5

While the first lookbehind, (?<=Temp:\\s), is relatively straightforward, the second, (?<=Wind:\\s[A-Z]{0,5}\\s?), is more complex. This is because up to five upper-case letters (the wind direction) and a whitespace character can come between "Wind: " and the digits. Both parts are optional, so rows where the speed comes before the direction still match.
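
The optional direction can also be handled with a capture group instead of a lookbehind. The following is a minimal sketch using stringr's str_match(), which returns a character matrix whose first column is the full match and whose remaining columns are the capture groups. The vector name `weather` and the two sample strings are illustrative only.

```r
library(stringr)

# Two sample rows: direction before the speed, and speed before the direction
weather <- c("Sunny Temp: 78F, Humidity: 63%, Wind: SSW 6 mph",
             "Sunny Temp: 103° F, Humidity: 7%, Wind: 16 SW mph")

# Column 2 of the result holds the first capture group, i.e. the digits
temperature <- as.integer(str_match(weather, "Temp:\\s*(\\d+)")[, 2])

# (?:[A-Z]+\\s+)? consumes a leading wind direction when there is one
wind <- as.integer(str_match(weather, "Wind:\\s*(?:[A-Z]+\\s+)?(\\d+)")[, 2])

print(temperature)
print(wind)
```

Because the direction is matched outside the capture group, there is no need to bound its length as in the lookbehind version, and rows without a match yield NA.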

Data:

df <- data.frame(
  x = c("Sunny Temp: 78° F, Humidity: 63%, Wind: SSW 6 mph",
  "Sunny Temp: 103° F, Humidity: 7%, Wind: 16 SW mph",
  "Temp: 88° F, Humidity: 43%, Wind: S 12 mph",
  "Cloudy Temp: 81° F, Humidity: 90%, Wind: SW 5 mph")
)