I have a dataset that looks like this:
value      themes   country
-1.975767  Weather  Brazil
-0.540979  Fruits   China
-2.359127  Fruits   China
-2.815604  Corona   China
-0.712323  Weather  UK
-0.929755  Weather  Brazil
I want to read the table from Redshift, calculate mean/std/count after grouping by two columns, and then write the results into a new table in Redshift.
My table has around 2 million rows, so reading it all at once isn't feasible: the Jupyter cell just runs forever. It does work, however, if I LIMIT the query:
sql = "SELECT * FROM tablename LIMIT 50"
df = sqlio.read_sql_query(sql, engine)
This is how I was calculating the mean, count, etc.:
# mean per (themes, country) group, renamed so the column reads "mean"
df_calculations = df.groupby(["themes", "country"], as_index=False)["value"].mean().rename(columns={"value": "mean"})
# count and std per group; .tolist() relies on the groups coming back in the same order
df_calculations["count"] = df.groupby(["themes", "country"])["value"].count().tolist()
df_calculations["std"] = df.groupby(["themes", "country"])["value"].std().tolist()
df_calculations.head()
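(The same result can be produced in a single pass with named aggregation; a compact equivalent, assuming pandas >= 0.25:)
df_calculations = (
    df.groupby(["themes", "country"], as_index=False)["value"]
      .agg(mean="mean", count="count", std="std")
)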
This creates a table like this:
country  themes   mean  count  std
Brazil   Weather         2
Brazil   Corona
China    Corona          1
China    Fruits          2
I imagine I could read 50 rows at a time in a loop, but then how can I calculate the mean and count so that they update every 50 rows without losing the earlier data?
How can I increment the count each time a combination occurs, for example Weather (theme) with Brazil (country)?
CodePudding user response:
I do not believe that pandas can do this out of the box; if you are processing the data in chunks, e.g. streaming it, you'll need to step outside pandas.
To do this in Python you need the recursive formulas for updating a running mean and variance; this answer on the Math Stack Exchange shows the method: https://math.stackexchange.com/questions/374881/recursive-formula-for-variance
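As a minimal sketch of that approach (assumptions: the table is streamed with pandas' chunksize argument; tablename and engine are taken from the question; the merge step is Chan et al.'s parallel variance update, not code from the original post):

import pandas as pd

stats = {}  # {(themes, country): [n, mean, M2]} running totals per group

def merge_chunk(stats, chunk):
    # Fold one chunk's per-group statistics into the running totals.
    for key, s in chunk.groupby(["themes", "country"])["value"]:
        n_b = int(s.count())
        mean_b = float(s.mean())
        m2_b = float(((s - mean_b) ** 2).sum())  # sum of squared deviations
        if key not in stats:
            stats[key] = [n_b, mean_b, m2_b]
            continue
        n_a, mean_a, m2_a = stats[key]
        n = n_a + n_b
        delta = mean_b - mean_a
        # Chan et al.: combine two partial (n, mean, M2) summaries
        stats[key] = [n,
                      mean_a + delta * n_b / n,
                      m2_a + m2_b + delta ** 2 * n_a * n_b / n]

# Stream the table instead of LIMIT-ing by hand; nothing is lost between chunks.
for chunk in pd.read_sql_query("SELECT * FROM tablename", engine, chunksize=50000):
    merge_chunk(stats, chunk)

df_calculations = pd.DataFrame(
    [(t, c, mean, n, (m2 / (n - 1)) ** 0.5 if n > 1 else None)
     for (t, c), (n, mean, m2) in stats.items()],
    columns=["themes", "country", "mean", "count", "std"])

With chunksize set, read_sql_query yields one DataFrame of that many rows at a time, so only a single chunk is ever held in memory.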
However, you are much better off doing this as part of your SQL query in the first place; see "sql query to summarize a table with mean and variance" for an example.
The performance will then depend on your database engine and you can look for solutions there to address that.
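For example, a sketch of pushing the aggregation into Redshift and writing the summary back (table and column names are taken from the question; tablename_stats is a hypothetical target name):

import pandas as pd

sql = """
    SELECT themes, country,
           AVG(value)         AS mean,
           COUNT(value)       AS "count",
           STDDEV_SAMP(value) AS std
    FROM tablename
    GROUP BY themes, country
"""
df_calculations = pd.read_sql_query(sql, engine)
# write the per-group summary back to Redshift as a new table
df_calculations.to_sql("tablename_stats", engine, index=False, if_exists="replace")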
CodePudding user response:
I recommend that you have Redshift calculate these values as you "read the table". This is what Redshift is designed to do, and 2 million rows is nothing for this database. If running the SQL inline with the data read is not possible, have the DBAs set up a view that adds the computed columns. You can then query the view as you would a table.
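A sketch of such a view, assuming the table and columns from the question (the view name is illustrative), executed through the same SQLAlchemy engine:

from sqlalchemy import text

view_sql = """
    CREATE OR REPLACE VIEW tablename_stats_v AS
    SELECT themes, country,
           AVG(value)         AS mean,
           COUNT(value)       AS n,
           STDDEV_SAMP(value) AS std
    FROM tablename
    GROUP BY themes, country
"""
with engine.begin() as conn:
    conn.execute(text(view_sql))

Querying tablename_stats_v then returns the summary directly, and the 2 million source rows never leave the database.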