Removing all rows that do not meet criteria in R?

Time:07-22

I can't find an existing question here that matches my scenario. Here is a column from my example dataset:

How_do_you_feel

Excited, Hopeful, Prepared, good    
Unsure, confused, anxious, curious  
Co operations, Teamwork, communication, critical thinking   
a   
First, team work, nervous, curious  
Interesting. New. Exciting. Develop 
perplexed,anxious,embarrassed,bit excited 
Novel, Unknown, Challenging, Useful 
Worried, excited, self-doubt, motivated 
Excited,curious,nervous,worried

The correct format is 4 words separated by commas, like this: 'Excited, Hopeful, Prepared, good'.

How do I clean my data in a way that it removes all the rows that have the wrong format, such as 'Interesting. New. Exciting. Develop' or 'perplexed,anxious,embarrassed,bit excited'?

So the result would look something like this:

How_do_you_feel

Excited, Hopeful, Prepared, good    
Unsure, confused, anxious, curious  
Co operations, Teamwork, communication, critical thinking 
First, team work, nervous, curious
Novel, Unknown, Challenging, Useful 
Worried, excited, self-doubt, motivated 

Thanks!

CodePudding user response:

You said that this is a column of your dataset, so I am assuming the following data structure:

How_do_you_feel <- c("", "Excited, Hopeful, Prepared, good", "Unsure, confused, anxious, curious", 
"Co operations, Teamwork, communication, critical thinking", 
"a", "First, team work, nervous, curious", "Interesting. New. Exciting. Develop", 
"perplexed,anxious,embarrassed,bit excited", "Novel, Unknown, Challenging, Useful", 
"Worried, excited, self-doubt, motivated", "Excited,curious,nervous,worried"
)

Just keep those with exactly three commas (note that this also keeps entries whose commas are not followed by a space):

How_do_you_feel[stringr::str_count(How_do_you_feel, ",") == 3]
#[1] "Excited, Hopeful, Prepared, good"                         
#[2] "Unsure, confused, anxious, curious"                       
#[3] "Co operations, Teamwork, communication, critical thinking"
#[4] "First, team work, nervous, curious"                       
#[5] "perplexed,anxious,embarrassed,bit excited"                
#[6] "Novel, Unknown, Challenging, Useful"                      
#[7] "Worried, excited, self-doubt, motivated"                  
#[8] "Excited,curious,nervous,worried"                          

You can also trim leading and trailing whitespace with trimws(), if necessary.
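
As a minimal sketch of the trimming step combined with a stricter check, the following uses only base R. The names feelings, trimmed, four_items and kept are made up for illustration, and the pattern is an assumption about what "correct format" means: four comma-free chunks joined by a comma and a single space.

```r
# Sample values copied from the question, trailing spaces left in on purpose.
feelings <- c(
  "Excited, Hopeful, Prepared, good    ",
  "a   ",
  "Interesting. New. Exciting. Develop ",
  "perplexed,anxious,embarrassed,bit excited ",
  "Worried, excited, self-doubt, motivated ",
  "Excited,curious,nervous,worried"
)

# Strip leading and trailing whitespace before testing the format.
trimmed <- trimws(feelings)

# Assumed rule: one comma-free chunk, then exactly three more,
# each introduced by ", ". Anchored so the whole string must fit.
four_items <- "^[^,]+(, [^,]+){3}$"

kept <- trimmed[grepl(four_items, trimmed)]
kept
#> [1] "Excited, Hopeful, Prepared, good"
#> [2] "Worried, excited, self-doubt, motivated"
```

Because the pattern is anchored at both ends, entries with bare commas or with no commas at all are dropped rather than partially matched.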

CodePudding user response:

Here is one potential solution:

library(tidyverse)

lines <- c("Excited, Hopeful, Prepared, good",
"Unsure, confused, anxious, curious",
"Co operations, Teamwork, communication, critical thinking",
"a",
"First, team work, nervous, curious",
"Interesting. New. Exciting. Develop",
"perplexed,anxious,embarrassed,bit excited",
"Novel, Unknown, Challenging, Useful",
"Worried, excited, self-doubt, motivated",
"Excited,curious,nervous,worried")

df <- data.frame(How_do_you_feel = lines)
df
#>                                              How_do_you_feel
#> 1                           Excited, Hopeful, Prepared, good
#> 2                         Unsure, confused, anxious, curious
#> 3  Co operations, Teamwork, communication, critical thinking
#> 4                                                          a
#> 5                         First, team work, nervous, curious
#> 6                        Interesting. New. Exciting. Develop
#> 7                  perplexed,anxious,embarrassed,bit excited
#> 8                        Novel, Unknown, Challenging, Useful
#> 9                    Worried, excited, self-doubt, motivated
#> 10                           Excited,curious,nervous,worried

df %>%
  mutate(How_do_you_feel = str_extract(
    How_do_you_feel,
    "[[:alpha:][:punct:] ]+, [[:alpha:][:punct:] ]+, [[:alpha:][:punct:] ]+, [[:alpha:][:punct:] ]+"
    )) %>%
  filter(!is.na(How_do_you_feel))
#>                                             How_do_you_feel
#> 1                          Excited, Hopeful, Prepared, good
#> 2                        Unsure, confused, anxious, curious
#> 3 Co operations, Teamwork, communication, critical thinking
#> 4                        First, team work, nervous, curious
#> 5                       Novel, Unknown, Challenging, Useful
#> 6                   Worried, excited, self-doubt, motivated

Created on 2022-07-22 by the reprex package (v2.0.1)
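
A possible variation, sketched here under assumptions rather than taken from the answer above: filter directly with str_detect() and an anchored pattern instead of extracting and then dropping NA values. Since [:punct:] also matches commas, an unanchored pattern may let rows with more than four entries through; the sketch below assumes each entry contains only letters, spaces and hyphens. The five-entry row and the names entry, four_entries and cleaned are hypothetical.

```r
library(dplyr)
library(stringr)

# Values from the question, plus one hypothetical five-entry row
# added only to show what the anchors do.
df <- data.frame(How_do_you_feel = c(
  "Excited, Hopeful, Prepared, good",
  "a",
  "Interesting. New. Exciting. Develop",
  "Worried, excited, self-doubt, motivated",
  "Excited,curious,nervous,worried",
  "one, two, three, four, five"  # hypothetical, not in the original data
))

# Assumption: an entry is made of letters, spaces and hyphens only.
entry <- "[[:alpha:] \\-]+"
four_entries <- paste0("^", entry, "(, ", entry, "){3}$")

# Keep only rows whose whole value is four entries joined by ", ".
cleaned <- df %>%
  filter(str_detect(How_do_you_feel, four_entries))

cleaned$How_do_you_feel
#> [1] "Excited, Hopeful, Prepared, good"
#> [2] "Worried, excited, self-doubt, motivated"
```

If real entries can contain digits or apostrophes, the character class in entry would need to be widened accordingly.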

CodePudding user response:

One general rule that appears to fit your situation is that a valid row contains exactly three occurrences of a comma followed by a space (not just a comma, as in the first answer). Try this:

library(tidyverse)

df %>%
  filter(str_count(How_do_you_feel, ", ") == 3)

#   How_do_you_feel                                               
#   <chr>                                                         
# 1 "Excited, Hopeful, Prepared, good    "                        
# 2 "Unsure, confused, anxious, curious  "                        
# 3 "Co operations, Teamwork, communication, critical thinking   "
# 4 "First, team work, nervous, curious  "                        
# 5 "Novel, Unknown, Challenging, Useful "                        
# 6 "Worried, excited, self-doubt, motivated "  
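
To make the difference between the two counting rules concrete, here is a small sketch that compares them on three of the question's values. The names rows, bare_commas and comma_space are made up for illustration.

```r
library(stringr)

rows <- c(
  "Excited, Hopeful, Prepared, good",
  "perplexed,anxious,embarrassed,bit excited",
  "Excited,curious,nervous,worried"
)

# Count every comma, regardless of what follows it.
bare_commas <- str_count(rows, ",")
bare_commas
#> [1] 3 3 3

# Count only commas that are followed by a space.
comma_space <- str_count(rows, ", ")
comma_space
#> [1] 3 0 0
```

All three values contain three commas, so only the comma-plus-space count separates the well-formed row from the other two.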