How to access unaggregated results when aggregation is needed due to dataset size in R


My task is to get the total inbound leads for a group of customers, leads by month for the same group, and the conversion rate of those leads.

The dataset I'm pulling from is 20 million records, so I can't query the whole thing. I have successfully done the first step (getting the total lead count for each org) with this:

inbound_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                                auto_limit = TRUE,
                                query = "select org_id,
                                          COUNT(*)
                                        from table
                                        GROUP BY org_id
                                        ORDER BY org_id")

Domo is the BI tool I'm pulling from, and domo_get_query is an internal function from a custom library my company built. It takes a query argument, which is a MySQL query, and various others which aren't important right now.

Sample data looks like this:

    org_id, inserted_at, lead_converted_at
    1       10/17/2021    2021-01-27T03:39:03
    2       10/18/2021    2021-01-28T03:39:03
    1       10/17/2021    2021-01-28T03:39:03
    3       10/19/2021    2021-01-29T03:39:03
    2       10/18/2021    2021-01-29T03:39:03

I have looked through many aggregation tutorials online, but none of them seem to cover how to get data that's needed pre-aggregation (such as the number of leads per month per org) from a dataset that has to be aggregated before it can be pulled at all. Once the aggregation has happened it's too late: in the sample above, grouping by org_id collapses the two rows for org_id 1 into one, so the per-month detail is gone. Maybe I just don't understand this well enough to know the right questions to ask. Any direction appreciated.

CodePudding user response:

If you're unable to fit your data in memory, you have a few options. You could process the data in batches (e.g. one year at a time) so that each batch fits in memory, aggregate each batch, and then combine the small aggregated results. A package like chunked can also help.
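For instance, here's a minimal, untested sketch of the batching idea using your own domo_get_query call. The year range is illustrative, and I'm assuming the function accepts an arbitrary WHERE clause, that inserted_at is a date/datetime column on the Domo side, and that it comes back to R as mm/dd/yyyy text like your sample:

    library(dplyr)
    library(lubridate)

    # Illustrative year range; adjust to the span of your data.
    years <- 2019:2021

    monthly_by_year <- lapply(years, function(y) {
      # Pull only one year at a time so each batch fits in memory.
      batch <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                              auto_limit = TRUE,
                              query = sprintf(
                                "select org_id, inserted_at, lead_converted_at
                                 from table
                                 where year(inserted_at) = %d", y))
      # Aggregate the batch in R before moving on, so only the small
      # per-org monthly summary is kept around.
      batch %>%
        mutate(month = floor_date(mdy(inserted_at), "month")) %>%
        group_by(org_id, month) %>%
        summarise(leads = n(),
                  converted = sum(!is.na(lead_converted_at)),
                  .groups = "drop")
    })

    # The combined summary is tiny compared to the 20M raw rows.
    monthly_leads <- bind_rows(monthly_by_year)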

But in this case I would bet the easiest way to handle your problem is to solve it entirely in your SQL query, so that only aggregated results ever come back to R. To get leads by month, truncate your date column to the month and group by org_id and month.
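Something like this sketch against your existing call; date_format is the MySQL way to truncate to month, and it assumes inserted_at is a date/datetime column (if it's stored as mm/dd/yyyy text you'd need str_to_date first):

    leads_by_month <- domo_get_query(
      '6d969e8b-fe3e-46ca-9ba2-21106452eee2',
      auto_limit = TRUE,
      query = "select org_id,
                      date_format(inserted_at, '%Y-%m-01') as lead_month,
                      count(*) as lead_count
               from table
               group by org_id, lead_month
               order by org_id, lead_month")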

To get the conversion rate for leads in those months, you could add a column (in addition to your count column) that is something like the following, which you can then divide by the count to get the rate:

sum(case when lead_converted_at is not null then 1 else 0 end) as convert_count
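Putting the two together (again a sketch, using the column names from your sample data), the whole thing stays one aggregated query and never pulls the 20 million raw rows into R:

    monthly_conversions <- domo_get_query(
      '6d969e8b-fe3e-46ca-9ba2-21106452eee2',
      auto_limit = TRUE,
      query = "select org_id,
                      date_format(inserted_at, '%Y-%m-01') as lead_month,
                      count(*) as lead_count,
                      sum(case when lead_converted_at is not null
                               then 1 else 0 end) as convert_count,
                      sum(case when lead_converted_at is not null
                               then 1 else 0 end) / count(*) as conversion_rate
               from table
               group by org_id, lead_month")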