I want to convert a data frame to the format I want by scanning each latitude and longitude in the for loop, but this process takes too long. Is there a way to make the following script faster, such as using multi threads or processing? Can you show me how?
p=0
for i in tqdm(df_wind_monthly["lat"]):
for j in df_wind_monthly["lon"]:
print("lat: " str(i) " lon: " str(j))
for k in range(1948,2017):
rslt_df_wind = df_wind_monthly.loc[(df_wind_monthly['lat'] == i) \
& (df_wind_monthly['lon'] == j) \
& (df_wind_monthly['year'] == k)]; rslt_df_wind = rslt_df_wind.reset_index()
month_columns.loc[p,"lat"]=rslt_df_wind.loc[0,"lat"]
month_columns.loc[p,"lon"]=rslt_df_wind.loc[0,"lon"]
month_columns.loc[p,"years"]=rslt_df_wind.loc[0,"year"]
month_columns.loc[p,"wind_January"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="January"].index.tolist()[0],"wind"]
month_columns.loc[p,"wind_February"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="February"].index.tolist()[0],"wind"]
month_columns.loc[p,"wind_March"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="March"].index.tolist()[0],"wind"]
month_columns.loc[p,"wind_April"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="April"].index.tolist()[0],"wind"]
month_columns.loc[p,"wind_May"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="May"].index.tolist()[0],"wind"]
month_columns.loc[p,"wind_June"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="June"].index.tolist()[0],"wind"]
month_columns.loc[p,"wind_July"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="July"].index.tolist()[0],"wind"]
month_columns.loc[p,"wind_August"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="August"].index.tolist()[0],"wind"]
month_columns.loc[p,"wind_September"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="September"].index.tolist()[0],"wind"]
month_columns.loc[p,"wind_October"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="October"].index.tolist()[0],"wind"]
month_columns.loc[p,"wind_November"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="November"].index.tolist()[0],"wind"]
month_columns.loc[p,"wind_December"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="December"].index.tolist()[0],"wind"]
p =1
These are the inputs and expected output samples:
#df_wind_monthly (INPUT):
time lat lon wind
0 1948-01-16 15.125 15.125 6.509021
1 1948-01-16 15.125 15.375 6.485108
2 1948-01-16 15.125 15.625 6.472615
3 1948-01-16 15.125 15.875 6.472596
4 1948-01-16 15.125 16.125 6.486597
#Expected dataframe columns (OUTPUT):
month_columns=pd.DataFrame(columns=['lat',"lon","years","wind_January","wind_February","wind_March","wind_April","wind_May","wind_June","wind_July","wind_August","wind_September","wind_October","wind_November","wind_December"])
CodePudding user response:
Looping is definetly NOT the way to go. If you type for ... in
while using pandas DataFrame, you're almost always doing it wrong.
What you want is to switch your data from long format (1 row = 1 observation) to wide format (1 row = 12 observations). It is a fairly common usecase, so pandas provides a method to do just that: DataFrame.pivot
.
Starting from your input dataframe, you need to:
- Add a year and month column from the time column
- Select only the years you want (1948-2016)
- Optionaly, if there is more than one measurement by month, compute the average, using
groupby
andmean
- Switch from wide to long format with
pivot
. - Rename the columns
Input
>>> df_wind_monthly
time lat lon wind
0 1948-01-31 15.125 15.125 5.963
1 1948-01-31 15.125 15.375 6.404
2 1948-01-31 15.125 15.625 6.207
3 1948-01-31 15.125 15.875 6.972
4 1948-01-31 15.125 16.125 6.299
... ... ... ... ...
86395 2019-12-31 17.375 16.375 6.514
86396 2019-12-31 17.375 16.625 6.593
86397 2019-12-31 17.375 16.875 6.438
86398 2019-12-31 17.375 17.125 6.394
86399 2019-12-31 17.375 17.375 6.232
Processing
out_df = df_wind_monthly
# Create a DateTime index from the date column to easily extract year / month
date_index = pd.DatetimeIndex(out_df["time"])
# Create a year and month columns to select the years and perform the groupby
out_df = out_df.assign(year=date_index.year, month=date_index.month)
# Select the years you want
out_df = out_df[(out_df["year"] >= 1948) & (out_df["year"] < 2017)]
# If there are multiple measurement for a given year / month, compute the average
# Skip this if you know there is only one measurement per month
out_df = out_df.groupby(["lat", "lon", "year", "month"]).mean().reset_index()
# Switch from long to wide format
out_df = out_df.pivot(index=["lat", "lon", "year"], columns="month", values="wind")
# Rename the columns
out_df = out_df.rename(
columns={
1: "wind_January",
2: "wind_February",
3: "wind_March",
4: "wind_April",
5: "wind_May",
6: "wind_June",
7: "wind_July",
8: "wind_August",
9: "wind_September",
10: "wind_October",
11: "wind_November",
12: "wind_December",
}
)
# Reset Index if you prefer to have the data in columns
out_df = out_df.reset_index()
Output:
>>> out_df
month lat lon year wind_January ... wind_September wind_October wind_November wind_December
0 15.125 15.125 1948 5.963 ... 6.885 6.814 6.131 6.063
1 15.125 15.125 1949 6.304 ... 6.178 6.536 6.426 6.090
2 15.125 15.125 1950 6.207 ... 6.890 6.719 6.875 5.925
3 15.125 15.125 1951 6.100 ... 6.153 6.905 6.034 6.470
4 15.125 15.125 1952 5.951 ... 6.638 6.294 6.434 5.936
... ... ... ... ... ... ... ... ... ...
6895 17.375 17.375 2012 6.674 ... 6.841 6.383 6.685 6.616
6896 17.375 17.375 2013 6.674 ... 6.469 6.940 6.842 6.154
6897 17.375 17.375 2014 6.794 ... 6.251 6.267 6.258 5.942
6898 17.375 17.375 2015 6.760 ... 6.933 6.848 6.765 6.446
6899 17.375 17.375 2016 6.253 ... 6.986 6.490 6.421 6.338