How to combine columns into a new table

Scenario:

If I have this table, let's call it df:

survey_answer_1___1	survey_answer_1___2	survey_answer_1___3	survey_answer_2___1
1	1	0	1
0	1	0	0
0	0	0	1
1	1	1	0

Using R or Python, how do I split and transform df into survey_answer_1 and survey_answer_2 like this:

survey_answer_1:

1	2	3
2	3	1

survey_answer_2:

1	2
2	0

Where the column names of the new tables are extracted from df column names after '___'. The values in the new cells is the count of 1s in each column in df. This should be done automatically (tables should not be "hard-coded"), as there are many other columns in my data file that this should be applied on as well.

split() can be used to extract the numbers after '___' for column names. I tried implementing the rest using a dictionary, but it is not working.

CodePudding user response：

Here an R example where the new columns can be arbitrary values

df <- as.data.frame(matrix(c(1,0,0,1,1,1,0,1,0,0,0,1,1,0,1,0,0,0,0,0), 4, 5, dim=list(
  1:4, paste0("survey_answer_", c(1,1,1,2,2), "__", c(1,2,3,1,5)) )))

df 
#>   survey_answer_1__1 survey_answer_1__2 survey_answer_1__3 survey_answer_2__1
#> 1                  1                  1                  0                  1
#> 2                  0                  1                  0                  0
#> 3                  0                  0                  0                  1
#> 4                  1                  1                  1                  0
#>   survey_answer_2__5
#> 1                  0
#> 2                  0
#> 3                  0
#> 4                  0

var <- Map(c, names(df), strsplit(names(df), "__"))

result <- tapply(var, sapply(var,"[", 2), \(x) 
       setNames(colSums(df[sapply(x,"[",1)]) , sapply(x,"[",3)))

#to assign the resuilt list to new  datafrae variables:
list2env(result, environment())


survey_answer_1
#> 1 2 3 
#> 2 3 1
survey_answer_2
#> 1 5 
#> 2 0

CodePudding user response：

Using R / tidyverse, first summarize all columns to sums; then pivot to long; then split() by survey_answer; then map() over the resulting list to drop all-NA columns :

survey_dfs <- df %>% 
  summarize(across(everything(), sum)) %>% 
  pivot_longer(
    everything(), 
    names_to = c("survey_answer", ".value"), 
    names_sep = "___"
  ) %>% 
  split(.$survey_answer, drop = TRUE) %>% 
  map(\(d) select(d, where(\(col) !all(is.na(col))) & !survey_answer))

survey_dfs

$survey_answer_1
# A tibble: 1 × 3
    `1`   `2`   `3`
  <dbl> <dbl> <dbl>
1     2     3     1

$survey_answer_2
# A tibble: 1 × 2
    `1`   `2`
  <dbl> <dbl>
1     2     0

This gives you a named list of dataframes, which is best practice in most cases. If you really want the resulting dataframes loose in the global environment, you can replace the map() call with an assign() call within purrr::iwalk():

df %>% 
  summarize(across(everything(), sum)) %>% 
  pivot_longer(
    everything(), 
    names_to = c("survey_answer", ".value"), 
    names_sep = "___"
  ) %>% 
  split(.$survey_answer, drop = TRUE) %>% 
  iwalk(\(d, dname) {
    d <- select(d, where(\(col) !all(is.na(col))) & !survey_answer)
    assign(dname, d, pos = 1)
  })

survey_answer_1

# A tibble: 1 × 3
    `1`   `2`   `3`
  <dbl> <dbl> <dbl>
1     2     3     1

CodePudding user response：

For answer 1, you could do the following:

# grab correct columns
df_answer_1 = df[[col for col in df.columns if col.startswith('survey_answer_1')]] 

# change column names
df_answer_1.columns = [col[-1] for col in df_answer_1.columns]

# sum up columns
answer_1_sums = df_answer_1.sum()

You can do the same for answer 2.

CodePudding user response：

Assuming data is in csv:

survey_answer_1___1,survey_answer_1___2,survey_answer_1___3,survey_answer_2___1,survey_answer_2___2
1,1,0,1,0
0,1,0,0,0
0,0,0,1,0
1,1,1,0,0

Read data:

import csv

with open('input.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    df = [row for row in reader]

Process data:

from collections import defaultdict, Counter

dd = defaultdict(Counter)
for row in df:
    for k, v in row.items():
        key1, key2 = k.split('___')
        dd[key1][int(key2)]  = int(v)

Print result:

for k in dd:
    print(k, sorted(dd[k].items()))

CodePudding user response：

In Python:

# raw data
df = {"survey_answer_1___1":[1,0,0,1], "survey_answer_1___2":[1,1,0,1], "survey_answer_1___3":[0,0,0,1], "survey_answer_2___1":[1,0,1,0], "survey_answer_2___2":[0,0,0,0]}
# sum up the answers
for k in df:
    sum_df[k] = sum(df[k])
# extract answer_1
survey_answer_1 = {[k[-1]:sum_df[k] for k in sum_df if k.startswith("survey_answer_1")]}
survey_answer_1
{'1': 2, '2': 3, '3': 1}
# extract answer_2
survey_answer_2 = {k[-1]:sum_df[k] for k in sum_df if k.startswith("survey_answer_2")}
survey_answer_2
{'1': 2, '2': 0}