Duplicated dataframe rows returned by for loop in R programming

Time:02-17

I'm adding new columns to a data frame based on a calculation. This is the sample data:

REC = c(237, 1781, NA, 3710, 2099)
S = c(2509, 25616, NaN, 19224, 6569)
Industry = c("ABC", "ABC", "ABC",  "CDE", "CDE")
data = data.frame(REC, S, Industry)

I want to apply unit length scaling and store the results in newly added columns. To do that, I wrote this code:

  library(foreach)
  library(dplyr)

  fnames = c("REC", "S") # Columns to scale
  data2 = data.frame()
  
  foreach(i = unique(data$Industry)) %do% {
    foreach(j = fnames) %do% {
      dataOrg = data
      
      # Calculate unit length per feature
      dataFin = dataOrg[dataOrg[,"Industry"] == i & is.finite(dataOrg[,j]), ] #Filtering only finite data
      data1 = dplyr::filter(dataOrg[!is.finite(dataOrg[,j]), ]) # Filtering the non finite data
      
      dataFin[ , sprintf("%s_uLen", j)] = dataFin[, j] / sqrt(sum(dataFin[, j]^2)) # Calculation
      data2 = data2 %>% 
        dplyr::bind_rows(data1, dataFin)
    }
  }
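For reference, the calculation inside the loop divides each finite value by the Euclidean norm of the finite values in the same industry. Here is a minimal sketch of that step on a single vector. The helper name `unit_len` is hypothetical and is not part of the code above.

```r
# Hypothetical helper: scale the finite entries of x to unit Euclidean length.
unit_len <- function(x) {
  ok <- is.finite(x)                      # ignore NA, NaN and Inf
  out <- rep(NA_real_, length(x))         # non-finite inputs stay NA
  out[ok] <- x[ok] / sqrt(sum(x[ok]^2))   # divide by the Euclidean norm
  out
}

rec_abc <- c(237, 1781, NA)               # REC values for Industry "ABC"
scaled <- unit_len(rec_abc)
sum(scaled^2, na.rm = TRUE)               # 1, up to floating-point error
```

The scaled finite values have a sum of squares of 1, which is what "unit length" means here.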

This is the output of the loop after each iteration:

[[1]]
[[1]][[1]]
   REC     S Industry  REC_uLen
1   NA   NaN      ABC        NA
2  237  2509      ABC 0.1319085
3 1781 25616      ABC 0.9912619

[[1]][[2]]
   REC     S Industry  REC_uLen     S_uLen
1   NA   NaN      ABC        NA         NA
2  237  2509      ABC 0.1319085         NA
3 1781 25616      ABC 0.9912619         NA
4   NA   NaN      ABC        NA         NA
5  237  2509      ABC        NA 0.09748012
6 1781 25616      ABC        NA 0.99523747


[[2]]
[[2]][[1]]
   REC     S Industry  REC_uLen     S_uLen
1   NA   NaN      ABC        NA         NA
2  237  2509      ABC 0.1319085         NA
3 1781 25616      ABC 0.9912619         NA
4   NA   NaN      ABC        NA         NA
5  237  2509      ABC        NA 0.09748012
6 1781 25616      ABC        NA 0.99523747
7   NA   NaN      ABC        NA         NA
8 3710 19224      CDE 0.8703574         NA
9 2099  6569      CDE 0.4924205         NA

[[2]][[2]]
    REC     S Industry  REC_uLen     S_uLen
1    NA   NaN      ABC        NA         NA
2   237  2509      ABC 0.1319085         NA
3  1781 25616      ABC 0.9912619         NA
4    NA   NaN      ABC        NA         NA
5   237  2509      ABC        NA 0.09748012
6  1781 25616      ABC        NA 0.99523747
7    NA   NaN      ABC        NA         NA
8  3710 19224      CDE 0.8703574         NA
9  2099  6569      CDE 0.4924205         NA
10   NA   NaN      ABC        NA         NA
11 3710 19224      CDE        NA 0.94627897
12 2099  6569      CDE        NA 0.32335136

At each step, 3 new rows are added. I want my output to contain the same 5 rows of data, but with the newly added columns.

This is the expected output

   REC     S Industry  REC_uLen     S_uLen
1  237  2509      ABC 0.1319085 0.09748012
2 1781 25616      ABC 0.9912619 0.99523747
3   NA   NaN      ABC        NA         NA
4 3710 19224      CDE 0.8703574 0.94627897
5 2099  6569      CDE 0.4924205 0.32335136

CodePudding user response:

Here's what I was thinking about in terms of joins and the like:

library(foreach)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
REC = c(237, 1781, NA, 3710, 2099)
S = c(2509, 25616, NaN, 19224, 6569)
Industry = c("ABC", "ABC", "ABC",  "CDE", "CDE")
data = data.frame(REC, S, Industry)

fnames <- c("REC", "S")
out <- NULL
foreach(i = unique(data$Industry)) %do% {
  dataFin = subset(data, Industry == i) # Keep only the rows for industry i
  foreach(j = fnames) %do% {
    dataFin[[sprintf("%s_uLen", j)]] = dataFin[[j]] / sqrt(sum(dataFin[[j]]^2, na.rm=TRUE)) # Divide by the Euclidean norm, ignoring NA/NaN
  }
  out <- bind_rows(out, dataFin)
}
#> [[1]]
#>    REC     S Industry  REC_uLen     S_uLen
#> 1  237  2509      ABC 0.1319085 0.09748012
#> 2 1781 25616      ABC 0.9912619 0.99523747
#> 3   NA   NaN      ABC        NA        NaN
#> 
#> [[2]]
#>    REC     S Industry  REC_uLen     S_uLen
#> 1  237  2509      ABC 0.1319085 0.09748012
#> 2 1781 25616      ABC 0.9912619 0.99523747
#> 3   NA   NaN      ABC        NA        NaN
#> 4 3710 19224      CDE 0.8703574 0.94627897
#> 5 2099  6569      CDE 0.4924205 0.32335136
out
#>    REC     S Industry  REC_uLen     S_uLen
#> 1  237  2509      ABC 0.1319085 0.09748012
#> 2 1781 25616      ABC 0.9912619 0.99523747
#> 3   NA   NaN      ABC        NA        NaN
#> 4 3710 19224      CDE 0.8703574 0.94627897
#> 5 2099  6569      CDE 0.4924205 0.32335136

Created on 2022-02-16 by the reprex package (v2.0.1)
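The loop in the question calls `bind_rows` once per industry and feature, so every iteration appends 3 more rows to `data2`. Another way to avoid that, sketched here as an alternative rather than as the code from the reprex above, is to skip row binding and use a grouped `mutate` with `across`. The name `out2` is only for illustration, and the result is a tibble rather than a plain data frame.

```r
library(dplyr)

data <- data.frame(
  REC = c(237, 1781, NA, 3710, 2099),
  S = c(2509, 25616, NaN, 19224, 6569),
  Industry = c("ABC", "ABC", "ABC", "CDE", "CDE")
)
fnames <- c("REC", "S")

# Scale each feature within its industry; NA/NaN are ignored in the norm
# and stay NA/NaN in the new columns.
out2 <- data %>%
  group_by(Industry) %>%
  mutate(across(
    all_of(fnames),
    ~ .x / sqrt(sum(.x^2, na.rm = TRUE)),
    .names = "{.col}_uLen"
  )) %>%
  ungroup()

nrow(out2)  # still 5 rows, with REC_uLen and S_uLen added
```

Because `mutate` only adds columns, the row count and row order of `data` are kept, so no join or row binding is needed afterwards.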
