How to count the number of times a value appears in a 160Million by 2 dataframe

I have a data frame that has 160M rows and 2 columns(material name and price). I want to determine how many the frequency at which prices occur.

For example,

the price $10 was given 100 different times. I'd like to sort the values by largest occurrence to smallest occurs (example, $100 was given 1000 times)

There are 2,484,557 unique prices, so a "table" is not the most useful solution.

my issue is I'm dealing with memory issues.

Any suggestions how I can accomplish this?

CodePudding user response：

Here's a 2 GB data frame with 160M rows and about 3M unique prices:

set.seed(42)
n = 160E6
fake_data <- data.frame(material = sample(LETTERS, n, replace = TRUE),
                        price = sample(1:3E6, n, replace = TRUE))

I like dplyr syntax, but for large data with many groups, data.table and collapse offer much better performance.

We could use dtplyr to translate dplyr code to data.table. This takes 22 seconds on my machine, with the result showing how many times each price appears in the data.

library(dplyr)
library(dtplyr)
fake_data %>% 
  lazy_dt() %>%
  count(price, sort = TRUE)

Result

Source: local data table [3,000,000 x 2]
Call:   `_DT2`[, .(n = .N), keyby = .(price)][order(desc(n))]

    price     n
    <int> <int>
1 2586972    97
2 2843789    95
3  753207    92
4  809482    92
5 1735845    92
6  809659    90
# … with 2,999,994 more rows

If you need higher performance and don't mind a heuristic, you could also sample your data to make it 10% or 1% as big; if any placeholder values occur frequently in the whole data, they are also likely to be frequent in a random sample.

CodePudding user response：

I'd probably create price intervals, e.g. $0-50, $51-100, $101-150 etc.

EDIT: more comprehensive solutution

library(tidyverse)

df <- letters %>% 
  expand_grid(., .) %>% 
  rename(v1 = `....1`,
         v2 = `....2`) %>% 
  mutate(name = paste0(v1, v2)) %>% 
  select(name) %>% 
  bind_rows(., ., ., .)

df

n <- nrow(df)

df <- df %>% 
  mutate(price = rnorm(n = n, mean = 1000, sd = 200))

df %>% 
  ggplot(aes(x = price))  
  geom_histogram()

df <- df %>% 
  mutate(price_grp = case_when(price < 500 ~ "$0-500",
                               price > 500 & price <= 1000 ~ "$501-1000",
                               price > 1000 & price <= 1500 ~ "$1001-1500",
                               price > 1500 ~ "  $1500"))

df %>% 
  group_by(price_grp) %>% 
  summarize(occurences = n()) %>% 
  arrange(desc(occurences))