I have a dataset that looks like this:
value      themes   country
-1.975767  Weather  Brazil
-0.540979  Fruits   China
-2.359127  Fruits   China
-2.815604  Corona   China
-0.712323  Weather  UK
-0.929755  Weather  Brazil
I want to read the table from Redshift, calculate mean/std/count after grouping by two columns, and then write the results into a new table in Redshift.
My table has around 2 million rows, so reading it all at once isn't feasible: the Jupyter cell just runs forever. It does work, however, if I LIMIT the query:
sql = "SELECT * FROM tablename LIMIT 50"
df = sqlio.read_sql_query(sql, engine)
This is how I was calculating the mean, count, etc.:
# mean per (themes, country) group, renamed so the column reads "mean"
df_calculations = df.groupby(["themes", "country"], as_index=False)["value"].mean().rename(columns={"value": "mean"})
# count and std per group; .tolist() relies on the groups coming back in the same order
df_calculations["count"] = df.groupby(["themes", "country"])["value"].count().tolist()
df_calculations["std"] = df.groupby(["themes", "country"])["value"].std().tolist()
df_calculations.head()
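(The same result can be produced in a single pass with named aggregation; a compact equivalent, assuming pandas >= 0.25:)
df_calculations = (
    df.groupby(["themes", "country"], as_index=False)["value"]
      .agg(mean="mean", count="count", std="std")
)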
This creates a table like this:
country  themes   mean  count  std
Brazil   Weather         2
Brazil   Corona
China    Corona          1
China    Fruits          2
I imagine I could read 50 rows at a time in a loop, but then how can I calculate the mean and count so that they update every 50 rows without losing the earlier data?
How can I increment the count each time a combination occurs, for example Weather (theme) with Brazil (country)?
CodePudding user response:
I do not believe that pandas can do this out of the box; if you are processing the data in chunks, e.g. streaming it, you'll need to step outside pandas.
To do this in Python you need the recursive formulas for updating a running mean and variance; this answer on the Math Stack Exchange shows the method: https://math.stackexchange.com/questions/374881/recursive-formula-for-variance
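As a minimal sketch of that approach (assumptions: the table is streamed with pandas' chunksize argument; tablename and engine are taken from the question; the merge step is Chan et al.'s parallel variance update, not code from the original post):

import pandas as pd

stats = {}  # {(themes, country): [n, mean, M2]} running totals per group

def merge_chunk(stats, chunk):
    # Fold one chunk's per-group statistics into the running totals.
    for key, s in chunk.groupby(["themes", "country"])["value"]:
        n_b = int(s.count())
        mean_b = float(s.mean())
        m2_b = float(((s - mean_b) ** 2).sum())  # sum of squared deviations
        if key not in stats:
            stats[key] = [n_b, mean_b, m2_b]
            continue
        n_a, mean_a, m2_a = stats[key]
        n = n_a + n_b
        delta = mean_b - mean_a
        # Chan et al.: combine two partial (n, mean, M2) summaries
        stats[key] = [n,
                      mean_a + delta * n_b / n,
                      m2_a + m2_b + delta ** 2 * n_a * n_b / n]

# Stream the table instead of LIMIT-ing by hand; nothing is lost between chunks.
for chunk in pd.read_sql_query("SELECT * FROM tablename", engine, chunksize=50000):
    merge_chunk(stats, chunk)

df_calculations = pd.DataFrame(
    [(t, c, mean, n, (m2 / (n - 1)) ** 0.5 if n > 1 else None)
     for (t, c), (n, mean, m2) in stats.items()],
    columns=["themes", "country", "mean", "count", "std"])

With chunksize set, read_sql_query yields one DataFrame of that many rows at a time, so only a single chunk is ever held in memory.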
However, you are much better off doing this as part of your SQL query in the first place; see "sql query to summarize a table with mean and variance" for an example.
The performance will then depend on your database engine and you can look for solutions there to address that.
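For example, a sketch of pushing the aggregation into Redshift and writing the summary back (table and column names are taken from the question; tablename_stats is a hypothetical target name):

import pandas as pd

sql = """
    SELECT themes, country,
           AVG(value)         AS mean,
           COUNT(value)       AS "count",
           STDDEV_SAMP(value) AS std
    FROM tablename
    GROUP BY themes, country
"""
df_calculations = pd.read_sql_query(sql, engine)
# write the per-group summary back to Redshift as a new table
df_calculations.to_sql("tablename_stats", engine, index=False, if_exists="replace")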
CodePudding user response:
I recommend that you have Redshift calculate these values as you "read the table". This is what Redshift is designed to do, and 2 million rows is nothing for this database. If running the SQL inline with the data read is not possible, have the DBAs set up a view that adds the computed columns. You can then query the view as you would a table.
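A sketch of such a view, assuming the table and columns from the question (the view name is illustrative), executed through the same SQLAlchemy engine:

from sqlalchemy import text

view_sql = """
    CREATE OR REPLACE VIEW tablename_stats_v AS
    SELECT themes, country,
           AVG(value)         AS mean,
           COUNT(value)       AS n,
           STDDEV_SAMP(value) AS std
    FROM tablename
    GROUP BY themes, country
"""
with engine.begin() as conn:
    conn.execute(text(view_sql))

Querying tablename_stats_v then returns the summary directly, and the 2 million source rows never leave the database.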