I have several dataframes that all need to be reduced to the same time span. So that I don't have to repeat the code block over and over again, I would like to write a function.
Currently everything is done without a function, using the following code:
timerange = (df_a['Date'].max() - pd.DateOffset(months=11))
df_a_12m = df_a.loc[df_a['Date'] >= timerange]
My approach:
def Time_range(Data_1, x,name, column, name):
    t = Data_1[column].max() - pd.DateOffset(months=x)
    'df'_ name = Data_1.loc[Data_1[column] >= t]
Unfortunately, this does not work.
CodePudding user response:
There are a few mistakes in your approach. Firstly, when you create a new variable you need to specify exactly what it will be called. It is not possible to "dynamically" name a variable like you're trying with 'df_' name = something.
Second, variable scope dictates that any variable created in a function is only accessible inside that function, and ceases to exist once it finishes executing (unless you play special tricks with global variables). So, even if you did df_name = Data_1.loc[Data_1[column] >= t], once Time_range() finishes running, that variable will be deleted.
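If the goal is to look results up by name, a common alternative to dynamic variable names is a dictionary keyed by name. The following is only a sketch: the helper name filter_recent, the sample frames, and the key pattern are made up for illustration and are not part of the question.

```python
import pandas as pd

def filter_recent(df, months, column="Date"):
    # Hypothetical helper: keep rows within `months` months of the latest date.
    cutoff = df[column].max() - pd.DateOffset(months=months)
    return df.loc[df[column] >= cutoff].copy()

# Made-up sample data: one row per month start.
df_a = pd.DataFrame({"Date": pd.date_range("2022-01-01", periods=24, freq="MS")})
df_b = pd.DataFrame({"Date": pd.date_range("2021-06-01", periods=18, freq="MS")})

# Store each result under a string key instead of inventing variable names.
frames = {"a": df_a, "b": df_b}
recent = {f"df_{name}_12m": filter_recent(df, 11) for name, df in frames.items()}

print(list(recent))
print(len(recent["df_a_12m"]))
```

Each filtered frame is then available as recent["df_a_12m"], recent["df_b_12m"], and so on.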
What you can do is have the function return the finished DataFrame and assign the result as a new variable from the outside:
def Time_range(Data_1, x, column):
    t = Data_1[column].max() - pd.DateOffset(months=x)
    return Data_1.loc[Data_1[column] >= t].copy()

df_any_name_you_want = Time_range(df_a, 11, 'Date')
Generally, this is what you want functions to do. Do some operations and return a finished value that you can then use from the outside.
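To see the returned value in use, here is a small self-contained run of the function above. The sample dates and the Value column are invented for illustration.

```python
import pandas as pd

def Time_range(Data_1, x, column):
    # The cutoff is x months before the most recent date in the column.
    t = Data_1[column].max() - pd.DateOffset(months=x)
    return Data_1.loc[Data_1[column] >= t].copy()

# Invented sample data with irregular dates.
df_a = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2022-03-10", "2022-11-30", "2023-01-05", "2023-06-20", "2023-10-10"]
    ),
    "Value": [1, 2, 3, 4, 5],
})

# The latest date is 2023-10-10, so the cutoff is 2022-11-10.
df_a_12m = Time_range(df_a, 11, "Date")

print(df_a_12m["Value"].tolist())  # rows on or after the cutoff
print(len(df_a))                   # the original frame keeps all its rows
```

The caller chooses the variable name on the left-hand side of the assignment, which is what the question was trying to achieve with the dynamic name.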
CodePudding user response:
My approach would be:
Store your dataframes in a list e.g.
dfs=[df_a,df_b]
Build a function from your approach. Input: (df, DeltaT=1, colName='Date'), Output: modified DataFrame
def Time_range(df, DeltaT=1, colName='Date'):
    # Default values for DeltaT and colName are helpful if they are constant in most cases.
    t = df[colName].max() - pd.DateOffset(months=DeltaT)
    # Use copy() to make sure you do not work on your original data by mistake.
    # Especially with the inplace=True argument, the risk of unexpected behaviour increases.
    df = df.loc[df[colName] >= t].copy()
    # Important: you have to return the result of your function.
    return df
Call your function with your list
results = []  # list for the modified dfs
for df in dfs:
    results.append(Time_range(df, DeltaT=2))
Important: the code was not tested and might contain typos.
Edit: Formatting.
Edit 2: Following the discussion on my comment about the copy() command, here is a small example with proper formatting:
import pandas as pd
def EmptyDataFrameInplace(df):
    df.drop('A', axis=1, inplace=True)

def EmptyDataFrame(df):
    df=df.drop('A', axis=1)

dfA=pd.DataFrame({'A':[1,2,3], 'B':[4,5,6]})
dfB=dfA.copy()
print(dfA.head())
EmptyDataFrameInplace(dfA)
EmptyDataFrame(dfB)
print(dfA.head())
print(dfB.head())
The result looks like this:
   A  B
0  1  4
1  2  5
2  3  6
   B
0  4
1  5
2  6
   A  B
0  1  4
1  2  5
2  3  6
Also see here
Thus, I always try to use copy() to ensure that I don't modify a dataframe without noticing.
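The same idea applies to the filtered result from the question: with copy(), later changes to the result do not touch the source frame. A minimal sketch with invented data; the names source and subset are placeholders.

```python
import pandas as pd

source = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Boolean filtering followed by .copy() yields an independent DataFrame.
subset = source.loc[source["A"] >= 2].copy()
subset["B"] = 0  # modifies only the copy

print(source["B"].tolist())
print(subset["B"].tolist())
```

Without the copy() call, the assignment to subset["B"] may trigger a SettingWithCopyWarning, depending on the pandas version.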