Computing power in Azure Databricks for a 20 GB database


I'm new to Azure Databricks and would like to use it for some Big Data work. My dataset has around 70 million rows and 32 columns, which corresponds to roughly 20 GB in memory. I'm currently on the Azure Databricks free trial and only work with a portion of the data (12 million rows) for now. My cluster has the following characteristics:

  • 1 Worker and 1 Driver: 8 GB Memory, 4 Cores.
  • Runtime: 10.4.x-scala2.12

I chose to use R programming language. However, when I run the following code:

library(dplyr)    # filter/select/collect and %>%
library(ggplot2)

df %>%
  filter(!is.na(estrato), val_fact_cu >= 0) %>%
  select(estrato, val_fact_cu, are_esp_nombre) %>%
  collect() %>%   # pulls every matching row into the driver's R session
  ggplot(aes(x = val_fact_cu, fill = are_esp_nombre)) +
  geom_density(alpha = 0.5) +
  theme_bw() +
  facet_wrap(~estrato, nrow = 3)

I get the following error:

Error : java.lang.OutOfMemoryError: GC overhead limit exceeded


or

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.

What computing power should I use to work with the full dataset? I would also like to use Power BI to build some dashboards. Should I also use the premium version of Power BI? (Keep in mind that my entire dataset occupies almost 20 GB of memory.)

Thanks.

CodePudding user response:

Should I also use the premium version of Power BI? (keep in mind that my entire dataset occupies almost 20 GB of memory).

Power BI datasets are highly compressed, and you can minimize the required memory by splitting DateTime into separate Date and Time columns, eliminating unneeded high-cardinality columns like long strings or GUIDs, and possibly by aggregating the dataset.
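
As a rough sketch of that pre-shaping step in R/dplyr (the column names fecha_hora and id_cliente are hypothetical placeholders for a DateTime and a GUID column, and df_local stands for data already brought into a local R data frame; adapt the grouping keys to whatever grain your dashboards actually need), the idea is to split, drop, and aggregate before anything is loaded into Power BI:

library(dplyr)

df_local %>%
  mutate(
    fecha = as.Date(fecha_hora),                 # date part only
    hora  = format(fecha_hora, "%H:%M:%S")       # time part, far lower cardinality
  ) %>%
  select(-fecha_hora, -id_cliente) %>%           # drop the DateTime and GUID columns
  group_by(estrato, are_esp_nombre, fecha) %>%   # aggregate to the reporting grain
  summarise(val_fact_cu_total = sum(val_fact_cu), .groups = "drop")

Doing this in Databricks before the export means Power BI only ever sees the slimmed-down table, which is usually a fraction of the raw 20 GB.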

After designing and loading the data into a proper Power BI model you can evaluate whether it's "too big". But 70M rows and 32 columns isn't really "big data" and may be loaded into a Power BI dataset of only a few GB.
