I have a dataframe with identifiers and storm categories. Right now, the categories are in one column, but I want to add columns for each category with a 1 or 0 value. I don't think I want to reshape the data as wide, because in the actual dataset there are a number of long format variables I want to keep. I am using a series of ifelse statements currently, but it feels like there is probably a much better way:
library(dplyr)
library(tidyr)
df <- data.frame(
ID = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D"),
cat = c("TS", NA, NA, "TS", "1", "1", NA, NA, "2", NA, NA, NA)
)
df$cat_TS <- ifelse(df$cat == "TS", 1, 0) %>% replace_na(., 0)
df$cat_1 <- ifelse(df$cat == "1", 1, 0) %>% replace_na(., 0)
df$cat_2 <- ifelse(df$cat == "2", 1, 0) %>% replace_na(., 0)
CodePudding user response:
We may use pivot_wider
- create a sequence column 'rn', and then use pivot_wider
to reshape to wide with values_fn
as length
and values_fill
as 0
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(rn = row_number(), cat1 = cat) %>%
pivot_wider(names_from = cat1, values_from = cat1,
values_fn = length, values_fill = 0, names_prefix = "cat_")%>%
select(-cat_NA, -rn)
-output
# A tibble: 12 × 5
ID cat cat_TS cat_1 cat_2
<chr> <chr> <int> <int> <int>
1 A TS 1 0 0
2 B <NA> 0 0 0
3 C <NA> 0 0 0
4 D TS 1 0 0
5 A 1 0 1 0
6 B 1 0 1 0
7 C <NA> 0 0 0
8 D <NA> 0 0 0
9 A 2 0 0 1
10 B <NA> 0 0 0
11 C <NA> 0 0 0
12 D <NA> 0 0 0
or use fastDummies
library(fastDummies)
df %>%
dummy_cols("cat", remove_selected_columns = FALSE, ignore_na = TRUE) %>%
mutate(across(starts_with('cat_'), ~ replace_na(.x, 0)))
-output
ID cat cat_1 cat_2 cat_TS
1 A TS 0 0 1
2 B <NA> 0 0 0
3 C <NA> 0 0 0
4 D TS 0 0 1
5 A 1 1 0 0
6 B 1 1 0 0
7 C <NA> 0 0 0
8 D <NA> 0 0 0
9 A 2 0 1 0
10 B <NA> 0 0 0
11 C <NA> 0 0 0
12 D <NA> 0 0 0
CodePudding user response:
An idea using base R
First, get all unique category names
cats <- unique(df$cat[!is.na(df$cat)])
cats
[1] "TS" "1" "2"
Then look for matches in column cat
for each entry in cats
. PS, I left the cat column in to show the matching is right. Remove it by using df$ID
instead of df
as the first argument in cbind
.
cbind(df, setNames(data.frame(sapply(seq_along(cats), function(x)
df$cat %in% cats[x]) * 1), cats))
ID cat TS 1 2
1 A TS 1 0 0
2 B <NA> 0 0 0
3 C <NA> 0 0 0
4 D TS 1 0 0
5 A 1 0 1 0
6 B 1 0 1 0
7 C <NA> 0 0 0
8 D <NA> 0 0 0
9 A 2 0 0 1
10 B <NA> 0 0 0
11 C <NA> 0 0 0
12 D <NA> 0 0 0