Home > Blockchain >  R - Find following rows with (almost) the same content and perform calculation if condition is met
R - Find following rows with (almost) the same content and perform calculation if condition is met

Time:07-26

Consider you have the following data structure:

ID Column.2 Group 1 Column.4 Group 2 Column.6
4 Text 500 0 200 0
9 Word 700 0 300 0
122 Text - 1 150 0 60 0
123 Text - 2 100 0 40 0
124 Text - 3 80 0 50 0
125 Text - 4 70 0 30 0
126 Text - 5 100 0 20 0
254 Word - 1 350 0 50 0
255 Word - 2 70 0 75 0
256 Word - 3 80 0 75 0
257 Word - 4 150 0 40 0
258 Word - 5 50 0 60 0
  • Column.2 contains characters, whereby the upper part of the table (Text, Word) refers to variables and the lower part of the table refers to the values (Text - 1, Text - 2, ...) of these variables.
  • Column.3 and Column.5 refer to two subgroups of the variables, whereby the upper part of the table contains the overall group size (e.g. the variable Text occurs 500 times in Group 1 and 200 times in Group 2). The lower part of the table contains the number of the values of the variables (distribution) for each subgroup (e.g. the value Text - 1 occurs 150 times in Group 1 and 60 times in Group 2). Therefore, the sum of the values of a subgroup of the lower part of the table is the same as the number of the upper side of the table (e.g. Text - Group 1 = 500 = 150 100 80 70 100).

In general, I would like to achieve the following:

  • Check the upper part of the table for a variable name (Text)
  • If this variable name occurs again in the same column (Text - 1, Text - 2, Text - 3, Text - 4, Text - 5), divide the number of the corresponding group by the total number of the group and save the result in the adjecent column (e.g. the value in Column.4 for Text - 1 should be 150/500).

As a result, the final table should look like this:

ID Column.2 Group 1 Column.4 Group 2 Column.6
4 Text 500 0 200 0
9 Word 700 0 300 0
122 Text - 1 150 150/500 60 60/200
123 Text - 2 100 100/500 40 40/200
124 Text - 3 80 80/500 50 50/200
125 Text - 4 70 70/500 30 30/200
126 Text - 5 100 100/500 20 20/200
254 Word - 1 350 350/700 50 50/300
255 Word - 2 70 70/700 75 75/300
256 Word - 3 80 80/700 75 75/300
257 Word - 4 150 150/700 40 40/300
258 Word - 5 50 50/700 60 60/300

I just started programming and my suggestion is to use a for-loop to iterate over Column.2 and compare the variable names with the variable names of the following rows with a regular expression. Every time the condition is met, the calculation should be performed in Column.4 and Column.6. However, I do not only have to save the variable name of Column.2 but also the corresponding number for each subgroup (e.g. for Text the numbers 500 and 200). I am not really sure how to do this, so I am grateful for any advice.

CodePudding user response:

It seems to me that columns Group 1 and Group 2 take on fundamentally different meanings depending on the value of Column.2. I would break it up into two different tables, connect them with an ID column, merge, and perform necessary operations. You can always reconstruct the original database afterwards if you need to.

#Save for later
orgid <- df$ID
df3 <- df[1:2,]
df2 <- df3
#Prep denominator table
df2 <- subset(df2, select = -c(ID))
colnames(df2) <- revalue(colnames(df2), c('Column.2' = 'ID', 'Group.1' = 'denominator1', 'Group.2' = 'denominator2'))
df2 <- subset(df2, select = -c(Column.4, Column.6))
#Prep main table
df <- df[-c(1:2),]
df$ID <- as.character(do.call("rbind", strsplit(as.character(df$Column.2), "-"))[,1])
df$ID <- str_trim(df$ID, side = c("right"))
#Merge and work
df <- merge(df, df2, all.x = T)
rm(df2)
df$Column.4 <- df$Group.1/df$denominator1
df$Column.6 <- df$Group.2/df$denominator2
#Reconstruct
df <- subset(df, select = -c(ID, denominator1, denominator2))
df <- bind_rows(df3, df)
rm(df3)
df$ID <- orgid
rm(orgid)
print(df)

Note that, in my reconstruction, I renamed your table df, Group 1 Group.1, and Group 2 Group.2.

CodePudding user response:

Not sure if your "Text" and "Word" always represent the totals by group, they are really odd rows in a data table and perhaps not even needed. I assume - as in your example - you just want 2 ratio's based on the values in "Group 1" and "Group 2".

In data.table I would do it like this:

library(data.table)
setDT(df)

df[, c("str", "str_id") := tstrsplit(Column.2, " - ")]
df[!is.na(str_id), `:=` (ratio.1 = `Group 1` / sum(`Group 1`), ratio.2 = `Group 2` / sum(`Group 2`)), by = str]

df

#      ID Column.2 Group 1 Column.4 Group 2 Column.6  str str_id  ratio.1 ratio.2
#  1:   4     Text     500        0     200        0 Text   <NA>       NA      NA
#  2:   9     Word     700        0     300        0 Word   <NA>       NA      NA
#  3: 122 Text - 1     150        0      60        0 Text      1 0.300000 0.30000
#  4: 123 Text - 2     100        0      40        0 Text      2 0.200000 0.20000
#  5: 124 Text - 3      80        0      50        0 Text      3 0.160000 0.25000
#  6: 125 Text - 4      70        0      30        0 Text      4 0.140000 0.15000
#  7: 126 Text - 5     100        0      20        0 Text      5 0.200000 0.10000
#  8: 254 Word - 1     350        0      50        0 Word      1 0.500000 0.16667
#  9: 255 Word - 2      70        0      75        0 Word      2 0.100000 0.25000
# 10: 256 Word - 3      80        0      75        0 Word      3 0.114286 0.25000
# 11: 257 Word - 4     150        0      40        0 Word      4 0.214286 0.13333
# 12: 258 Word - 5      50        0      60        0 Word      5 0.071429 0.20000
  • Related