Scenario:
If I have this table, let's call it df
:
survey_answer_1___1 | survey_answer_1___2 | survey_answer_1___3 | survey_answer_2___1 | survey_answer_2___2 |
---|---|---|---|---|
1 | 1 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | 0 |
1 | 1 | 1 | 0 | 0 |
Using R or Python, how do I split and transform df
into survey_answer_1
and survey_answer_2
like this:
survey_answer_1
:
1 | 2 | 3 |
---|---|---|
2 | 3 | 1 |
survey_answer_2
:
1 | 2 |
---|---|
2 | 0 |
Where the column names of the new tables are extracted from df
column names after '___'
. The values in the new cells is the count of 1s in each column in df
. This should be done automatically (tables should not be "hard-coded"), as there are many other columns in my data file that this should be applied on as well.
split()
can be used to extract the numbers after '___'
for column names. I tried implementing the rest using a dictionary, but it is not working.
CodePudding user response:
Here an R example where the new columns can be arbitrary values
df <- as.data.frame(matrix(c(1,0,0,1,1,1,0,1,0,0,0,1,1,0,1,0,0,0,0,0), 4, 5, dim=list(
1:4, paste0("survey_answer_", c(1,1,1,2,2), "__", c(1,2,3,1,5)) )))
df
#> survey_answer_1__1 survey_answer_1__2 survey_answer_1__3 survey_answer_2__1
#> 1 1 1 0 1
#> 2 0 1 0 0
#> 3 0 0 0 1
#> 4 1 1 1 0
#> survey_answer_2__5
#> 1 0
#> 2 0
#> 3 0
#> 4 0
var <- Map(c, names(df), strsplit(names(df), "__"))
result <- tapply(var, sapply(var,"[", 2), \(x)
setNames(colSums(df[sapply(x,"[",1)]) , sapply(x,"[",3)))
#to assign the resuilt list to new datafrae variables:
list2env(result, environment())
survey_answer_1
#> 1 2 3
#> 2 3 1
survey_answer_2
#> 1 5
#> 2 0
CodePudding user response:
Using R / tidyverse, first summarize all columns to sums; then pivot to long; then split()
by survey_answer
; then map()
over the resulting list to drop all-NA
columns :
survey_dfs <- df %>%
summarize(across(everything(), sum)) %>%
pivot_longer(
everything(),
names_to = c("survey_answer", ".value"),
names_sep = "___"
) %>%
split(.$survey_answer, drop = TRUE) %>%
map(\(d) select(d, where(\(col) !all(is.na(col))) & !survey_answer))
survey_dfs
$survey_answer_1
# A tibble: 1 × 3
`1` `2` `3`
<dbl> <dbl> <dbl>
1 2 3 1
$survey_answer_2
# A tibble: 1 × 2
`1` `2`
<dbl> <dbl>
1 2 0
This gives you a named list of dataframes, which is best practice in most cases. If you really want the resulting dataframes loose in the global environment, you can replace the map()
call with an assign()
call within purrr::iwalk()
:
df %>%
summarize(across(everything(), sum)) %>%
pivot_longer(
everything(),
names_to = c("survey_answer", ".value"),
names_sep = "___"
) %>%
split(.$survey_answer, drop = TRUE) %>%
iwalk(\(d, dname) {
d <- select(d, where(\(col) !all(is.na(col))) & !survey_answer)
assign(dname, d, pos = 1)
})
survey_answer_1
# A tibble: 1 × 3
`1` `2` `3`
<dbl> <dbl> <dbl>
1 2 3 1
CodePudding user response:
For answer 1, you could do the following:
# grab correct columns
df_answer_1 = df[[col for col in df.columns if col.startswith('survey_answer_1')]]
# change column names
df_answer_1.columns = [col[-1] for col in df_answer_1.columns]
# sum up columns
answer_1_sums = df_answer_1.sum()
You can do the same for answer 2.
CodePudding user response:
Assuming data is in csv
:
survey_answer_1___1,survey_answer_1___2,survey_answer_1___3,survey_answer_2___1,survey_answer_2___2
1,1,0,1,0
0,1,0,0,0
0,0,0,1,0
1,1,1,0,0
Read data:
import csv
with open('input.csv') as csvfile:
reader = csv.DictReader(csvfile)
df = [row for row in reader]
Process data:
from collections import defaultdict, Counter
dd = defaultdict(Counter)
for row in df:
for k, v in row.items():
key1, key2 = k.split('___')
dd[key1][int(key2)] = int(v)
Print result:
for k in dd:
print(k, sorted(dd[k].items()))
CodePudding user response:
In Python:
# raw data
df = {"survey_answer_1___1":[1,0,0,1], "survey_answer_1___2":[1,1,0,1], "survey_answer_1___3":[0,0,0,1], "survey_answer_2___1":[1,0,1,0], "survey_answer_2___2":[0,0,0,0]}
# sum up the answers
for k in df:
sum_df[k] = sum(df[k])
# extract answer_1
survey_answer_1 = {[k[-1]:sum_df[k] for k in sum_df if k.startswith("survey_answer_1")]}
survey_answer_1
{'1': 2, '2': 3, '3': 1}
# extract answer_2
survey_answer_2 = {k[-1]:sum_df[k] for k in sum_df if k.startswith("survey_answer_2")}
survey_answer_2
{'1': 2, '2': 0}