I have a data frame that has 160M rows and 2 columns(material name and price). I want to determine how many the frequency at which prices occur.
For example,
the price $10 was given 100 different times. I'd like to sort the values by largest occurrence to smallest occurs (example, $100 was given 1000 times)
There are 2,484,557 unique prices, so a "table" is not the most useful solution.
my issue is I'm dealing with memory issues.
Any suggestions how I can accomplish this?
CodePudding user response:
Here's a 2 GB data frame with 160M rows and about 3M unique prices:
set.seed(42)
n = 160E6
fake_data <- data.frame(material = sample(LETTERS, n, replace = TRUE),
price = sample(1:3E6, n, replace = TRUE))
I like dplyr
syntax, but for large data with many groups, data.table
and collapse
offer much better performance.
We could use dtplyr
to translate dplyr code to data.table. This takes 22 seconds on my machine, with the result showing how many times each price appears in the data.
library(dplyr)
library(dtplyr)
fake_data %>%
lazy_dt() %>%
count(price, sort = TRUE)
Result
Source: local data table [3,000,000 x 2]
Call: `_DT2`[, .(n = .N), keyby = .(price)][order(desc(n))]
price n
<int> <int>
1 2586972 97
2 2843789 95
3 753207 92
4 809482 92
5 1735845 92
6 809659 90
# … with 2,999,994 more rows
If you need higher performance and don't mind a heuristic, you could also sample your data to make it 10% or 1% as big; if any placeholder values occur frequently in the whole data, they are also likely to be frequent in a random sample.
CodePudding user response:
I'd probably create price intervals, e.g. $0-50, $51-100, $101-150 etc.
EDIT: more comprehensive solutution
library(tidyverse)
df <- letters %>%
expand_grid(., .) %>%
rename(v1 = `....1`,
v2 = `....2`) %>%
mutate(name = paste0(v1, v2)) %>%
select(name) %>%
bind_rows(., ., ., .)
df
n <- nrow(df)
df <- df %>%
mutate(price = rnorm(n = n, mean = 1000, sd = 200))
df %>%
ggplot(aes(x = price))
geom_histogram()
df <- df %>%
mutate(price_grp = case_when(price < 500 ~ "$0-500",
price > 500 & price <= 1000 ~ "$501-1000",
price > 1000 & price <= 1500 ~ "$1001-1500",
price > 1500 ~ " $1500"))
df %>%
group_by(price_grp) %>%
summarize(occurences = n()) %>%
arrange(desc(occurences))