Consider you have the following data structure:
ID | Column.2 | Group 1 | Column.4 | Group 2 | Column.6 |
---|---|---|---|---|---|
4 | Text | 500 | 0 | 200 | 0 |
9 | Word | 700 | 0 | 300 | 0 |
122 | Text - 1 | 150 | 0 | 60 | 0 |
123 | Text - 2 | 100 | 0 | 40 | 0 |
124 | Text - 3 | 80 | 0 | 50 | 0 |
125 | Text - 4 | 70 | 0 | 30 | 0 |
126 | Text - 5 | 100 | 0 | 20 | 0 |
254 | Word - 1 | 350 | 0 | 50 | 0 |
255 | Word - 2 | 70 | 0 | 75 | 0 |
256 | Word - 3 | 80 | 0 | 75 | 0 |
257 | Word - 4 | 150 | 0 | 40 | 0 |
258 | Word - 5 | 50 | 0 | 60 | 0 |
- Column.2 contains characters, whereby the upper part of the table (Text, Word) refers to variables and the lower part of the table refers to the values (Text - 1, Text - 2, ...) of these variables.
- Column.3 and Column.5 refer to two subgroups of the variables, whereby the upper part of the table contains the overall group size (e.g. the variable Text occurs 500 times in Group 1 and 200 times in Group 2). The lower part of the table contains the number of the values of the variables (distribution) for each subgroup (e.g. the value Text - 1 occurs 150 times in Group 1 and 60 times in Group 2). Therefore, the sum of the values of a subgroup of the lower part of the table is the same as the number of the upper side of the table (e.g. Text - Group 1 = 500 = 150 100 80 70 100).
In general, I would like to achieve the following:
- Check the upper part of the table for a variable name (Text)
- If this variable name occurs again in the same column (Text - 1, Text - 2, Text - 3, Text - 4, Text - 5), divide the number of the corresponding group by the total number of the group and save the result in the adjecent column (e.g. the value in Column.4 for Text - 1 should be 150/500).
As a result, the final table should look like this:
ID | Column.2 | Group 1 | Column.4 | Group 2 | Column.6 |
---|---|---|---|---|---|
4 | Text | 500 | 0 | 200 | 0 |
9 | Word | 700 | 0 | 300 | 0 |
122 | Text - 1 | 150 | 150/500 | 60 | 60/200 |
123 | Text - 2 | 100 | 100/500 | 40 | 40/200 |
124 | Text - 3 | 80 | 80/500 | 50 | 50/200 |
125 | Text - 4 | 70 | 70/500 | 30 | 30/200 |
126 | Text - 5 | 100 | 100/500 | 20 | 20/200 |
254 | Word - 1 | 350 | 350/700 | 50 | 50/300 |
255 | Word - 2 | 70 | 70/700 | 75 | 75/300 |
256 | Word - 3 | 80 | 80/700 | 75 | 75/300 |
257 | Word - 4 | 150 | 150/700 | 40 | 40/300 |
258 | Word - 5 | 50 | 50/700 | 60 | 60/300 |
I just started programming and my suggestion is to use a for-loop to iterate over Column.2 and compare the variable names with the variable names of the following rows with a regular expression. Every time the condition is met, the calculation should be performed in Column.4 and Column.6. However, I do not only have to save the variable name of Column.2 but also the corresponding number for each subgroup (e.g. for Text the numbers 500 and 200). I am not really sure how to do this, so I am grateful for any advice.
CodePudding user response:
It seems to me that columns Group 1
and Group 2
take on fundamentally different meanings depending on the value of Column.2
. I would break it up into two different tables, connect them with an ID column, merge, and perform necessary operations. You can always reconstruct the original database afterwards if you need to.
#Save for later
orgid <- df$ID
df3 <- df[1:2,]
df2 <- df3
#Prep denominator table
df2 <- subset(df2, select = -c(ID))
colnames(df2) <- revalue(colnames(df2), c('Column.2' = 'ID', 'Group.1' = 'denominator1', 'Group.2' = 'denominator2'))
df2 <- subset(df2, select = -c(Column.4, Column.6))
#Prep main table
df <- df[-c(1:2),]
df$ID <- as.character(do.call("rbind", strsplit(as.character(df$Column.2), "-"))[,1])
df$ID <- str_trim(df$ID, side = c("right"))
#Merge and work
df <- merge(df, df2, all.x = T)
rm(df2)
df$Column.4 <- df$Group.1/df$denominator1
df$Column.6 <- df$Group.2/df$denominator2
#Reconstruct
df <- subset(df, select = -c(ID, denominator1, denominator2))
df <- bind_rows(df3, df)
rm(df3)
df$ID <- orgid
rm(orgid)
print(df)
Note that, in my reconstruction, I renamed your table df
, Group 1
Group.1
, and Group 2
Group.2
.
CodePudding user response:
Not sure if your "Text" and "Word" always represent the totals by group, they are really odd rows in a data table and perhaps not even needed. I assume - as in your example - you just want 2 ratio's based on the values in "Group 1" and "Group 2".
In data.table I would do it like this:
library(data.table)
setDT(df)
df[, c("str", "str_id") := tstrsplit(Column.2, " - ")]
df[!is.na(str_id), `:=` (ratio.1 = `Group 1` / sum(`Group 1`), ratio.2 = `Group 2` / sum(`Group 2`)), by = str]
df
# ID Column.2 Group 1 Column.4 Group 2 Column.6 str str_id ratio.1 ratio.2
# 1: 4 Text 500 0 200 0 Text <NA> NA NA
# 2: 9 Word 700 0 300 0 Word <NA> NA NA
# 3: 122 Text - 1 150 0 60 0 Text 1 0.300000 0.30000
# 4: 123 Text - 2 100 0 40 0 Text 2 0.200000 0.20000
# 5: 124 Text - 3 80 0 50 0 Text 3 0.160000 0.25000
# 6: 125 Text - 4 70 0 30 0 Text 4 0.140000 0.15000
# 7: 126 Text - 5 100 0 20 0 Text 5 0.200000 0.10000
# 8: 254 Word - 1 350 0 50 0 Word 1 0.500000 0.16667
# 9: 255 Word - 2 70 0 75 0 Word 2 0.100000 0.25000
# 10: 256 Word - 3 80 0 75 0 Word 3 0.114286 0.25000
# 11: 257 Word - 4 150 0 40 0 Word 4 0.214286 0.13333
# 12: 258 Word - 5 50 0 60 0 Word 5 0.071429 0.20000