I'm relatively new to python, so I'm not that great with for/while loops or functions.
Basically, I have a dataframe that looks like this:
temp | dewpoint | wind | precip_rate_hr | total_snow
-------------------------------------------------
31 20 3 0.2 2.1
29 25 12 0.01 0.7
30 30 17 0.5 4.1
... ... ... ... ...
I've been creating heatmaps using seaborne that compare the first four columns, with total_snow mean between the two variables being shown. Apologies if that doesn't sound right. Here's my code:
snow_data_percentile_10_temp = np.percentile(snow_data['temp'], 10)
snow_data_percentile_50_temp = np.percentile(snow_data['temp'], 50)
snow_data_percentile_75_temp = np.percentile(snow_data['temp'], 75)
snow_data_percentile_10_dewpt = np.percentile(snow_data['dewpoint'], 10)
snow_data_percentile_50_dewpt = np.percentile(snow_data['dewpoint'], 50)
snow_data_percentile_75_dewpt = np.percentile(snow_data['dewpoint'], 75)
snow_data['temp_bin'] = pd.cut(snow_data['temp'], [0, 10.4, 23.5, 28.75, 37], labels=['<10.4', '10.4-23.5', '23.5-28.75', '>28.75'])
snow_data['dewpt_bin'] = pd.cut(snow_data['dewpoint'], [0, 4.1, 15, 19.75, 33], labels=['<4.1', '4.1-15', '15-19.75', '>19.75'])
avg_snow = snow_data.groupby(['temp_bin','dewpt_bin'], as_index=False)['total_snow'].mean()
data_fp = avg_snow.pivot_table(index='temp_bin', columns='dewpt_bin', values='total_snow')
sns.set(font_scale=1.2)
f, ax = plt.subplots(figsize=(25,25))
sns.set(font_scale=2.0)
sns.heatmap(data_fp, annot=True, fmt='g', linewidth=0.5)
ax.set_title('Average Snow Total on Days that Met Specific Temperature and Dewpoint Criteria', fontsize=20)
Here's a capture of the heatmap. The values displayed on the bins are the average snow totals for those bins. Is there a way to streamline this code? I need to create heatmaps for temp vs wind, temp vs precip_rate_hr, dewpoint vs wind, dewpoint vs precip_rate_hr, wind vs precip_rate_hr. I also have another, larger dataset I'll need to work through. Right now, I'm just copying and pasting the code into new files and changing it around some to get the rest of the heatmaps. That doesn't take too long, but I would like to automate it more and prevent myself from having numerous code files. Any help would be appreciated!
CodePudding user response:
Here is one way to create all the charts in a loop. It isn't the cleanest, as there will be repeated charts (but with the axes swapped).
My data:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
snow_data = pd.DataFrame(data={"temp": np.random.randint(20, 40, 50), "dewpoint": np.random.randint(15, 40, 50),
"wind": np.random.randint(0, 20, 50), "precip_rate_hr": np.random.random(50),
"total_snow": np.random.random(50)*10})
Creating the categories (this is not in a loop as the bins are all different):
snow_data['temp_bin'] = pd.cut(snow_data['temp'], [0, 10.4, 23.5, 28.75, 37], labels=['<10.4', '10.4-23.5', '23.5-28.75', '>28.75'])
snow_data['dewpt_bin'] = pd.cut(snow_data['dewpoint'], [0, 4.1, 15, 19.75, 33], labels=['<4.1', '4.1-15', '15-19.75', '>19.75'])
snow_data['wind_bin'] = pd.cut(snow_data['wind'], [0, 5, 10, 15, 20], labels=['<5', '5-10', '10-15', '15-20'])
snow_data['precip_rate_hr_bin'] = pd.cut(snow_data['precip_rate_hr'], [0, 0.25, 0.5, 0.75, 1], labels=['<0.25', '0.25-0.5', '0.5-0.75', '>0.75'])
The loop:
# List of all _bin columns to loop through
bin_cols = ['temp_bin', 'dewpt_bin', 'wind_bin', 'precip_rate_hr_bin']
# First factor
for i in bin_cols:
# Second factor
for j in bin_cols:
# Need to ensure you aren't grouping the data by the same column twice!
if j != i:
# Average now mean for bin groups
avg_snow = snow_data.groupby([i, j], as_index=False)['total_snow'].mean()
# Title for plot
title = 'Average Snow Total on Days that Met Specific ' i[: -4] ' and ' j[: -4] ' Criteria'
# Pivot table
data_fp = avg_snow.pivot_table(index=i, columns=j, values='total_snow')
# Plot
sns.set(font_scale=1.2)
f, ax = plt.subplots(figsize=(25, 25))
sns.set(font_scale=2.0)
sns.heatmap(data_fp, annot=True, fmt='g', linewidth=0.5)
ax.set_title(title, fontsize=20)
plt.show()
I agree with Parfait's comment about not needing the np.percentile
, unless you are using these to find appropriate categories for the _bin columns.