How to get a dataframe output using mapply-CodePudding

I have a for loop im trying to convert to a mapply, as I have read that it is faster than for (for loop takes about 2 minutes).

The loop does this: it creates subsets filtering by the unique names of column "OrdenFab" and then, it keeps the unrepeated values on the "Valor" column. Then it adds this filtered subset to a new data frame, and it keeps adding them all as the loop goes on, getting a filtered dataframe with no repeated values in column "Valor" for each unique value of the column "OrdenFab".

i<-unique(datapesomolde$OrdenFab)
datapesomoldefiltered<-data.frame()
for (j in i){
  datapesomoldetemp<-datapesomolde%>%
    filter(OrdenFab==j)%>%
    filter(!duplicated(Valor))
  datapesomoldefiltered<-rbind(datapesomoldefiltered,datapesomoldetemp)
}

Original dataframe is this one (first 20 rows, it has 20626):

    > datapesomolde
    PartNumber  OrdenFab     Valor      Fecha_Registro LimInf LimSup Nominal
1     11012501 226549204  14.50000 2022-04-25 07:18:00  12.65  14.71   13.68
2     11012501 226549204  14.50000 2022-04-25 07:18:00  12.65  14.71   13.68
3     11013610 226548648  47.30000 2022-04-25 05:52:00  42.38  49.26   45.82
4     11013047 226548234  15.20000 2022-04-23 02:47:00  14.43  16.77   15.60
5     11013047 226548234  15.20000 2022-04-23 02:47:00  14.43  16.77   15.60
6     11013047 226548234  15.20000 2022-04-23 02:48:00  14.43  16.77   15.60
7     11013047 226548234  15.20000 2022-04-23 02:48:00  14.43  16.77   15.60
8     11013052 226548332  16.30000 2022-04-23 01:49:00  15.63  18.17   16.90
9     11013052 226548332  16.30000 2022-04-23 01:49:00  15.63  18.17   16.90
10    11013052 226548332  16.30000 2022-04-23 01:49:00  15.63  18.17   16.90
11    11013052 226548332  16.30000 2022-04-23 01:49:00  15.63  18.17   16.90
12    11012501 226548204  14.70000 2022-04-23 01:44:00  12.65  14.71   13.68
13    11012501 226548204  14.70000 2022-04-23 01:44:00  12.65  14.71   13.68
14    11012501 226548200  14.55000 2022-04-23 01:43:00  12.65  14.71   13.68
15    11012501 226548200  14.55000 2022-04-23 01:43:00  12.65  14.71   13.68
16    11012501 226548201  14.65000 2022-04-23 01:42:00  12.65  14.71   13.68
17    11012501 226548201  14.65000 2022-04-23 01:42:00  12.65  14.71   13.68
18    11013943 226548154 134.00000 2022-04-23 00:07:00 131.76 153.13  142.44
19    11013943 226547066 144.00000 2022-04-22 23:31:00 131.76 153.13  142.44
20    11013050 226547200  15.10000 2022-04-22 23:27:00  14.34  16.66   15.50

Filtered result is this one (first 10 rows):

>datapesomoldefiltered    
PartNumber  OrdenFab     Valor      Fecha_Registro LimInf  LimSup Nominal
    1     11012501 226549204  14.50000 2022-04-25 07:18:00  12.65   14.71   13.68
    2     11013610 226548648  47.30000 2022-04-25 05:52:00  42.38   49.26   45.82
    3     11013047 226548234  15.20000 2022-04-23 02:47:00  14.43   16.77   15.60
    4     11013052 226548332  16.30000 2022-04-23 01:49:00  15.63   18.17   16.90
    5     11012501 226548204  14.70000 2022-04-23 01:44:00  12.65   14.71   13.68
    6     11012501 226548200  14.55000 2022-04-23 01:43:00  12.65   14.71   13.68
    7     11012501 226548201  14.65000 2022-04-23 01:42:00  12.65   14.71   13.68
    8     11013943 226548154 134.00000 2022-04-23 00:07:00 131.76  153.13  142.44
    9     11013943 226547066 144.00000 2022-04-22 23:31:00 131.76  153.13  142.44
    10    11013050 226547200  15.10000 2022-04-22 23:27:00  14.34   16.66   15.50

Im strugling to convert it to mapply, I am getting a Matrix not a dataframe. I have tried this:

i<-unique(datapesomolde$OrdenFab)
datapesomoldefiltered<-data.frame()
limpiarof<-function(i){
  subset<-filter(datapesomolde,OrdenFab==i)
  datapesomoldetemp<-filter(subset,!duplicated(subset$Valor))
 return(datapesomoldefiltered<-rbind(datapesomoldefiltered,datapesomoldetemp))
  
}
datapesomoldefiltered<-mapply(limpiarof,i)

With this try I get a Matrix of 2.2GB, it just has the value of all the colomns for each unique value of the "OrdenFab" column.

result of mapply

Can you help me please? Thanks in advance.

CodePudding user response：

Here are two ways. The difference is that in the first solution the original rows order is kept in the final result. If this doesn't matter, the 2nd solution skips the creation of a temp list sp.

x <- "    PartNumber  OrdenFab     Valor      Fecha_Registro LimInf LimSup Nominal
1     11012501 226549204  14.50000 '2022-04-25 07:18:00'  12.65  14.71   13.68
2     11012501 226549204  14.50000 '2022-04-25 07:18:00'  12.65  14.71   13.68
3     11013610 226548648  47.30000 '2022-04-25 05:52:00'  42.38  49.26   45.82
4     11013047 226548234  15.20000 '2022-04-23 02:47:00'  14.43  16.77   15.60
5     11013047 226548234  15.20000 '2022-04-23 02:47:00'  14.43  16.77   15.60
6     11013047 226548234  15.20000 '2022-04-23 02:48:00'  14.43  16.77   15.60
7     11013047 226548234  15.20000 '2022-04-23 02:48:00'  14.43  16.77   15.60
8     11013052 226548332  16.30000 '2022-04-23 01:49:00'  15.63  18.17   16.90
9     11013052 226548332  16.30000 '2022-04-23 01:49:00'  15.63  18.17   16.90
10    11013052 226548332  16.30000 '2022-04-23 01:49:00'  15.63  18.17   16.90
11    11013052 226548332  16.30000 '2022-04-23 01:49:00'  15.63  18.17   16.90
12    11012501 226548204  14.70000 '2022-04-23 01:44:00'  12.65  14.71   13.68
13    11012501 226548204  14.70000 '2022-04-23 01:44:00'  12.65  14.71   13.68
14    11012501 226548200  14.55000 '2022-04-23 01:43:00'  12.65  14.71   13.68
15    11012501 226548200  14.55000 '2022-04-23 01:43:00'  12.65  14.71   13.68
16    11012501 226548201  14.65000 '2022-04-23 01:42:00'  12.65  14.71   13.68
17    11012501 226548201  14.65000 '2022-04-23 01:42:00'  12.65  14.71   13.68
18    11013943 226548154 134.00000 '2022-04-23 00:07:00' 131.76 153.13  142.44
19    11013943 226547066 144.00000 '2022-04-22 23:31:00' 131.76 153.13  142.44
20    11013050 226547200  15.10000 '2022-04-22 23:27:00'  14.34  16.66   15.50"
datapesomolde <- read.table(textConnection(x), header = TRUE)


suppressPackageStartupMessages({
  library(dplyr)
  library(purrr)
})

datapesomolde$Fecha_Registro <- as.POSIXct(datapesomolde$Fecha_Registro)

sp <- split(datapesomolde, datapesomolde$OrdenFab)
sp %>%
  map_dfr( ~ .x %>% filter(!duplicated(Valor))) %>%
  arrange(as.integer(row.names(.)))
#>    PartNumber  OrdenFab  Valor      Fecha_Registro LimInf LimSup Nominal
#> 1    11012501 226549204  14.50 2022-04-25 07:18:00  12.65  14.71   13.68
#> 3    11013610 226548648  47.30 2022-04-25 05:52:00  42.38  49.26   45.82
#> 4    11013047 226548234  15.20 2022-04-23 02:47:00  14.43  16.77   15.60
#> 8    11013052 226548332  16.30 2022-04-23 01:49:00  15.63  18.17   16.90
#> 12   11012501 226548204  14.70 2022-04-23 01:44:00  12.65  14.71   13.68
#> 14   11012501 226548200  14.55 2022-04-23 01:43:00  12.65  14.71   13.68
#> 16   11012501 226548201  14.65 2022-04-23 01:42:00  12.65  14.71   13.68
#> 18   11013943 226548154 134.00 2022-04-23 00:07:00 131.76 153.13  142.44
#> 19   11013943 226547066 144.00 2022-04-22 23:31:00 131.76 153.13  142.44
#> 20   11013050 226547200  15.10 2022-04-22 23:27:00  14.34  16.66   15.50

rm(sp)  # tidy up

^{Created on 2022-06-01 by the reprex package (v2.0.1)}

datapesomolde %>%
  group_split(OrdenFab) %>%
  map_dfr( ~ .x %>% filter(!duplicated(Valor))) 
#> # A tibble: 10 × 7
#>    PartNumber  OrdenFab Valor Fecha_Registro      LimInf LimSup Nominal
#>         <int>     <int> <dbl> <dttm>               <dbl>  <dbl>   <dbl>
#>  1   11013943 226547066 144   2022-04-22 23:31:00  132.   153.    142. 
#>  2   11013050 226547200  15.1 2022-04-22 23:27:00   14.3   16.7    15.5
#>  3   11013943 226548154 134   2022-04-23 00:07:00  132.   153.    142. 
#>  4   11012501 226548200  14.6 2022-04-23 01:43:00   12.6   14.7    13.7
#>  5   11012501 226548201  14.6 2022-04-23 01:42:00   12.6   14.7    13.7
#>  6   11012501 226548204  14.7 2022-04-23 01:44:00   12.6   14.7    13.7
#>  7   11013047 226548234  15.2 2022-04-23 02:47:00   14.4   16.8    15.6
#>  8   11013052 226548332  16.3 2022-04-23 01:49:00   15.6   18.2    16.9
#>  9   11013610 226548648  47.3 2022-04-25 05:52:00   42.4   49.3    45.8
#> 10   11012501 226549204  14.5 2022-04-25 07:18:00   12.6   14.7    13.7

^{Created on 2022-06-01 by the reprex package (v2.0.1)}

CodePudding user response：

I would suggest solving this problem using a more abstract approach, using e.g. tidyverse:

This should be much faster and clearer:

library(tidyverse)

datapesomoldefiltered <- 
  datapesomolde |>
  group_by(PartNumber) |>
    distinct(Valor, .keep_all = TRUE) |>
  ungroup()

datapesomoldefiltered