So, this is a rookie question.
I've got a data frame which includes all responses of an online survey, totaling in 89 columns.
As online surveys are sometimes filled out by people who do not really care and who simply put in response values that are easy to fill in, I would like to filter out implausible rows where someone simple hit an extreme value over and over again.
I would like to filter out rows where these columns ALL have the value "9", or ALL have the value "1":
- 'sociald_ties_strong'
- 'sociald_ties_weak'
- 'sociald_ties_secondorder'
- 'sociald_identity_lifestyle'
- 'sociald_identity_politics'
- 'sociald_vertical_socialcapital'
- 'sociald_vertical_networkcapital'
So this is my code (including the tidyverse and dplyr packages):
data-cleaned <- data_raw %>%
fdaten_bereinigt <- rohdaten_basis %>%
filter(sociald_ties_strong == 9 & sociald_ties_weak == 9 & sociald_ties_secondorder == 9 & sociald_identity_lifestyle == 9 & sociald_identity_politics == 9 & sociald_vertical_socialcapital == 9 & sociald_vertical_networkcapital == 9) %>%
filter(sociald_ties_strong == 1 & sociald_ties_weak == 1 & sociald_ties_secondorder == 1 & sociald_identity_lifestyle == 1 & sociald_identity_politics == 1 & sociald_vertical_socialcapital == 1 & sociald_vertical_networkcapital == 1)
But, I seem to be missing something in my logic and / or syntax, as this filters out way to many rows.
My data cleaning would include more conditional strings like the above to exclude faulty or automated rows, but first I want to learn how to get it right.
Possibly two (or more) filters piped together is not a good idea the way I did it? Any suggestions welcome!
I expected to filter out the few rows that meet all the conditions, possibly 0.5% of the total observations.
CodePudding user response:
There are a few issues with your code. First, you have two assignments in your pipeline:
cleaned <- data_raw %>%
fdaten_bereinigt <- rohdaten_basis %>%
# ...
I’m assuming this is an error; otherwise, it’s not clear what you’re trying to do.
Second, filter()
keeps values meeting the condition and removes everything else. So currently your first filter()
keeps only rows where all values are 9
. Then your second filter()
looks for only rows where all values are 1
- of which there are none, since these were removed by the previous step - so you end up with an empty dataframe.
To get what you want, negate the condition by wrapping in !()
:
library(dplyr)
fdaten_bereinigt <- rohdaten_basis %>%
filter(!(sociald_ties_strong == 9 & sociald_ties_weak == 9 & sociald_ties_secondorder == 9 & sociald_identity_lifestyle == 9 & sociald_identity_politics == 9 & sociald_vertical_socialcapital == 9 & sociald_vertical_networkcapital == 9)) %>%
# ...
Third, you can do this more succinctly using dplyr::if_all()
:
fdaten_bereinigt <- rohdaten_basis %>%
filter(
!if_all(sociald_ties_strong:sociald_vertical_networkcapital, \(x) x == 9),
!if_all(sociald_ties_strong:sociald_vertical_networkcapital, \(x) x == 1)
)
CodePudding user response:
You could create a vector of your variables that you want to check. Then use filter(if_all(all_of(VECTOR), ~.x != 1) & if_all(all_of(VECTOR), ~.x != 9))
qs <- c("A", "B", "C")
dat <- data.frame("A" = 1:9,
"B" = 1:9,
"C" = 1:9)
A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
dat |>
filter(if_all(all_of(qs), ~.x != 1) & if_all(all_of(qs), ~.x != 9))
A B C
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
CodePudding user response:
When it's hard to figure out what to do in dplyr
, often it's easier to reshape the data into a different form.
You can use pivot_longer
to put the data in a long form:
dat <- data.frame("A" = 1:9,
"B" = 1:9,
"C" = 1:9)
dat_long <-
dat %>%
mutate(id = row_number()) %>%
pivot_longer(
cols = c(A, B, C),
names_to = 'column',
values_to = 'value'
)
> dat_long
# A tibble: 27 × 3
id column value
<int> <chr> <int>
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 2
5 2 B 2
6 2 C 2
7 3 A 3
8 3 B 3
9 3 C 3
10 4 A 4
# … with 17 more rows
Now it's reasonably straight forward with a group by and a filter:
dat_filtered <-
dat_long %>%
group_by(id) %>%
filter(!all(value == 1)) %>%
filter(!all(value == 9))
> dat_filtered
# A tibble: 21 × 3
# Groups: id [7]
id column value
<int> <chr> <int>
1 2 A 2
2 2 B 2
3 2 C 2
4 3 A 3
5 3 B 3
6 3 C 3
7 4 A 4
8 4 B 4
9 4 C 4
10 5 A 5
# … with 11 more rows
And you can finish by moving the data back to the original form:
dat_filtered %>%
pivot_wider(
names_from = column,
values_from = value
)
A tibble: 7 × 4
# Groups: id [7]
id A B C
<int> <int> <int> <int>
1 2 2 2 2
2 3 3 3 3
3 4 4 4 4
4 5 5 5 5
5 6 6 6 6
6 7 7 7 7
7 8 8 8 8