I have data as follows:
dat <- list(nr1 = list(list_of_account_numbers = " 0000000000",
" NL11BANKO0111111111", " NL11BANKO0111111111", " NL11BANKO0111111111",
" NL11BANKO0111111111", " NL11BANKO0111111111", " NL11BANKO0111111113",
" NL11BANKO0111111111", " NL11BANKO0111111112", " NL11BANKO0111111113",
" NL11BANKO0111111111", " NL11BANKO0111111112", " NL11BANKO0111111113",
" NL11BANKO0111111111", " NL11BANKO0111111111", " 0000000000",
" 0000000000"), nr2 = list(list_of_account_numbers = " NL30ABNA0111111111",
" NL31RABO0111111111", " NL30ABNA0111111111", " NL30ABNA0111111111",
" NL30ABNA0111111111", " NL31RABO0111111111", " NL31RABO0111111111",
" NL52RABO0111111111", " NL74INGB0111111111", " NL74INGB0111111111",
" NL30ABNA0111111111", " NL30ABNA0111111111", " NL30ABNA0111111111",
" NL74INGB0111111111", " NL74INGB0111111111", " NL74INGB0111111111",
" NL74INGB0111111111", " NL74INGB0111111111", " NL74INGB0111111111",
" NL16DEUT0111111111"), nr3 = list(
list_of_account_numbers = " NL11BANKO0111111111", " NL11BANKO0111111111",
" NL11BANKO0111111111", " NL11BANKO0111111111", " NL11BANKO0111111113",
" NL11BANKO0111111111", " NL11BANKO0111111111", " NL11BANKO0111111113",
" NL11BANKO0111111111", " NL11BANKO0111111111", " NL11BANKO0111111113",
" NL11BANKO0111111111", " NL11BANKO0111111111"))
I am trying to write code that, for each list item (nr1, nr2, nr3), gets the top 3 most frequently occurring values. There are two additional issues:
- Some list items have the value 0000000000, which should be excluded.
- Some list items do not have 3 values, but only one or two.
I thought the first thing to do is to unlist the items and remove the occurrences of 0000000000:
IBAN_numbers <- list()
y <- " 0000000000"
for (i in seq_along(dat)) {
  # flatten the i-th list item into a character vector
  IBAN_numbers[[i]] <- unlist(dat[i])
  # drop the placeholder value
  IBAN_numbers[[i]] <- IBAN_numbers[[i]][!IBAN_numbers[[i]] %in% y]
}
But I am not sure how to achieve the last point.
table(IBAN_numbers[[1]])
# NL11BANKO0111111111 NL11BANKO0111111112 NL11BANKO0111111113
#                   9                   2                   3
table(IBAN_numbers[[2]])
# NL16DEUT0111111111 NL30ABNA0111111111 NL31RABO0111111111 NL52RABO0111111111 NL74INGB0111111111
#                  1                  7                  3                  1                  8
table(IBAN_numbers[[3]])
# NL11BANKO0111111111 NL11BANKO0111111113
#                  10                   3
I could do something like:
IBAN_numbers <- list()
y <- " 0000000000"
for (i in seq_along(dat)) {
  IBAN_numbers[[i]] <- unlist(dat[i])
  IBAN_numbers[[i]] <- IBAN_numbers[[i]][!IBAN_numbers[[i]] %in% y]
  # count how often each remaining value occurs
  IBAN_numbers[[i]] <- table(IBAN_numbers[[i]])
}
So for the middle table, I would want only three entries (I do not care which option with one occurrence it takes, as long as it does not crash).
Could anyone help me with the last step?
CodePudding user response:
You may do this with lapply:
y <- " 0000000000"
lapply(dat, function(x) {
  x <- unlist(x)
  # drop the placeholder, count, sort by frequency, keep at most 3
  head(sort(table(x[x != y]), decreasing = TRUE), 3)
})
#$nr1
#NL11BANKO0111111111 NL11BANKO0111111113 NL11BANKO0111111112
#                  9                   3                   2
#$nr2
# NL74INGB0111111111 NL30ABNA0111111111 NL31RABO0111111111
#                  8                  7                  3
#$nr3
# NL11BANKO0111111111 NL11BANKO0111111113
#                  10                   3
You may use names(head(sort(table(x[x != y]), decreasing = TRUE), 3)) if you are interested only in the names.
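The same idea can be wrapped in a small helper. This is only a sketch: the function name top_n_values and its arguments are made up for illustration, and it additionally trims the leading space that the values in the question carry (an assumption about what is wanted, not something the question asks for).

```r
# Hypothetical helper: names of the n most frequent values, placeholder excluded
top_n_values <- function(x, n = 3, exclude = "0000000000") {
  # flatten to a character vector and strip surrounding whitespace (assumption)
  x <- trimws(unlist(x, use.names = FALSE))
  # drop the placeholder value
  x <- x[x != exclude]
  # count, sort by frequency, keep at most n entries
  counts <- sort(table(x), decreasing = TRUE)
  names(head(counts, n))
}

# Toy data (not from the question): one item with two distinct values, one with a single value
demo <- list(a = list(" 0000000000", " X1", " X1", " X2"),
             b = list(" Y1"))
res <- lapply(demo, top_n_values)
```

Because head() returns fewer elements when fewer exist, items with only one or two distinct values do not need special handling; lapply(dat, top_n_values) should work the same way on the data from the question.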
CodePudding user response:
Using tidyverse
library(dplyr)
library(purrr)
y <- " 0000000000"  # placeholder value to exclude, as in the question
map(dat, ~ tibble(col1 = flatten_chr(.x)) %>%
    filter(col1 != y) %>%
    count(col1) %>%
    slice_max(n = 3, order_by = n))
-output
$nr1
# A tibble: 3 × 2
  col1                       n
  <chr>                  <int>
1 " NL11BANKO0111111111"     9
2 " NL11BANKO0111111113"     3
3 " NL11BANKO0111111112"     2
$nr2
# A tibble: 3 × 2
  col1                      n
  <chr>                 <int>
1 " NL74INGB0111111111"     8
2 " NL30ABNA0111111111"     7
3 " NL31RABO0111111111"     3
$nr3
# A tibble: 2 × 2
  col1                       n
  <chr>                  <int>
1 " NL11BANKO0111111111"    10
2 " NL11BANKO0111111113"     3
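One caveat, since the question asks for at most three entries even when counts are tied: by default slice_max() keeps all rows tied at the cutoff, so it can return more than n rows. A minimal sketch with with_ties = FALSE follows; the toy data is invented for illustration, and unlist() is used instead of flatten_chr() purely as a matter of choice.

```r
library(dplyr)
library(purrr)

# Toy data (hypothetical): " A" occurs 3 times, " B", " C" and " D" are tied at 1
toy <- list(nr = list(" A", " A", " A", " B", " C", " D", " 0000000000"))
y <- " 0000000000"

top3 <- map(toy, ~ tibble(col1 = unlist(.x)) %>%
  filter(col1 != y) %>%
  count(col1) %>%
  # with_ties = FALSE caps the result at exactly n rows, picking among ties
  slice_max(order_by = n, n = 3, with_ties = FALSE))

nrow(top3$nr)
```

Without with_ties = FALSE, the same pipeline would return four rows for this toy input, because the three values with one occurrence all tie for the last two places.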