R - Find following rows with (almost) the same content and perform calculation if condition is met-CodePudding

Consider you have the following data structure:

ID	Column.2	Group 1	Group 2
4	Text	500	200
9	Word	700	300
122	Text - 1	150	60
123	Text - 2	100	40
124	Text - 3	80	50
125	Text - 4	70	30
126	Text - 5	100	20
254	Word - 1	350	50
255	Word - 2	70	75
256	Word - 3	80	75
257	Word - 4	150	40
258	Word - 5	50	60

Column.2 contains characters, whereby the upper part of the table (Text, Word) refers to variables and the lower part of the table refers to the values (Text - 1, Text - 2, ...) of these variables.
Column.3 and Column.5 refer to two subgroups of the variables, whereby the upper part of the table contains the overall group size (e.g. the variable Text occurs 500 times in Group 1 and 200 times in Group 2). The lower part of the table contains the number of the values of the variables (distribution) for each subgroup (e.g. the value Text - 1 occurs 150 times in Group 1 and 60 times in Group 2). Therefore, the sum of the values of a subgroup of the lower part of the table is the same as the number of the upper side of the table (e.g. Text - Group 1 = 500 = 150 100 80 70 100).

In general, I would like to achieve the following:

Check the upper part of the table for a variable name (Text)
If this variable name occurs again in the same column (Text - 1, Text - 2, Text - 3, Text - 4, Text - 5), divide the number of the corresponding group by the total number of the group and save the result in the adjecent column (e.g. the value in Column.4 for Text - 1 should be 150/500).

As a result, the final table should look like this:

ID	Column.2	Group 1	Column.4	Group 2	Column.6
4	Text	500	0	200	0
9	Word	700	0	300	0
122	Text - 1	150	150/500	60	60/200
123	Text - 2	100	100/500	40	40/200
124	Text - 3	80	80/500	50	50/200
125	Text - 4	70	70/500	30	30/200
126	Text - 5	100	100/500	20	20/200
254	Word - 1	350	350/700	50	50/300
255	Word - 2	70	70/700	75	75/300
256	Word - 3	80	80/700	75	75/300
257	Word - 4	150	150/700	40	40/300
258	Word - 5	50	50/700	60	60/300

I just started programming and my suggestion is to use a for-loop to iterate over Column.2 and compare the variable names with the variable names of the following rows with a regular expression. Every time the condition is met, the calculation should be performed in Column.4 and Column.6. However, I do not only have to save the variable name of Column.2 but also the corresponding number for each subgroup (e.g. for Text the numbers 500 and 200). I am not really sure how to do this, so I am grateful for any advice.

CodePudding user response：

It seems to me that columns Group 1 and Group 2 take on fundamentally different meanings depending on the value of Column.2. I would break it up into two different tables, connect them with an ID column, merge, and perform necessary operations. You can always reconstruct the original database afterwards if you need to.

#Save for later
orgid <- df$ID
df3 <- df[1:2,]
df2 <- df3
#Prep denominator table
df2 <- subset(df2, select = -c(ID))
colnames(df2) <- revalue(colnames(df2), c('Column.2' = 'ID', 'Group.1' = 'denominator1', 'Group.2' = 'denominator2'))
df2 <- subset(df2, select = -c(Column.4, Column.6))
#Prep main table
df <- df[-c(1:2),]
df$ID <- as.character(do.call("rbind", strsplit(as.character(df$Column.2), "-"))[,1])
df$ID <- str_trim(df$ID, side = c("right"))
#Merge and work
df <- merge(df, df2, all.x = T)
rm(df2)
df$Column.4 <- df$Group.1/df$denominator1
df$Column.6 <- df$Group.2/df$denominator2
#Reconstruct
df <- subset(df, select = -c(ID, denominator1, denominator2))
df <- bind_rows(df3, df)
rm(df3)
df$ID <- orgid
rm(orgid)
print(df)

Note that, in my reconstruction, I renamed your table df, Group 1 Group.1, and Group 2 Group.2.

CodePudding user response：

Not sure if your "Text" and "Word" always represent the totals by group, they are really odd rows in a data table and perhaps not even needed. I assume - as in your example - you just want 2 ratio's based on the values in "Group 1" and "Group 2".

In data.table I would do it like this:

library(data.table)
setDT(df)

df[, c("str", "str_id") := tstrsplit(Column.2, " - ")]
df[!is.na(str_id), `:=` (ratio.1 = `Group 1` / sum(`Group 1`), ratio.2 = `Group 2` / sum(`Group 2`)), by = str]

df

#      ID Column.2 Group 1 Column.4 Group 2 Column.6  str str_id  ratio.1 ratio.2
#  1:   4     Text     500        0     200        0 Text   <NA>       NA      NA
#  2:   9     Word     700        0     300        0 Word   <NA>       NA      NA
#  3: 122 Text - 1     150        0      60        0 Text      1 0.300000 0.30000
#  4: 123 Text - 2     100        0      40        0 Text      2 0.200000 0.20000
#  5: 124 Text - 3      80        0      50        0 Text      3 0.160000 0.25000
#  6: 125 Text - 4      70        0      30        0 Text      4 0.140000 0.15000
#  7: 126 Text - 5     100        0      20        0 Text      5 0.200000 0.10000
#  8: 254 Word - 1     350        0      50        0 Word      1 0.500000 0.16667
#  9: 255 Word - 2      70        0      75        0 Word      2 0.100000 0.25000
# 10: 256 Word - 3      80        0      75        0 Word      3 0.114286 0.25000
# 11: 257 Word - 4     150        0      40        0 Word      4 0.214286 0.13333
# 12: 258 Word - 5      50        0      60        0 Word      5 0.071429 0.20000