How to concat a dataframe in a function which gets updated for each file?

I have a folder 'Data' which contains 5 files. Each file goes through the function 'filter_seq' one by one. This function contains some filters to reduce the data in the file.

def filter_seq(df2,count):

    print('Filter 1.' + str(count))
    T1_df = df2[((df2['Angle'] > (df2['Sim_Angle'] - 1)) & (df2['Angle'] < (df2['Sim_Angle'] + 1)))]
    T1_df = T1_df[((T1_df['Velocity'] > (T1_df['Sim_Velocity'] - 2)) & (T1_df['Velocity'] < (T1_df['Sim_Velocity'] + 2)))]

After this filtering, I want another dataframe that contains the filtered dataframes for all the files.

For example: assume the shape of T1_df is 100 x 15 after filtering file 1 and 89 x 15 after filtering file 2. I want a final dataframe with shape 189 x 15.

How can I get this final dataframe? And how can I improve the filtering function?

CodePudding user response：

The simplest solution might be to append all the filtered dataframes to a list, and then use the pd.concat function. For example:

import numpy as np
import pandas as pd
def filter_and_append(df, l):
    """
    df is the dataframe to be filtered, and appended to the list.
    l is the list the filtered dataframe will be appended to
    """
    
    df_filtered = df  # put your filter logic here
    l.append(df_filtered)
    return l

l = []
for file in range(3):
    # here you could load the data, but just create a toy df for illustration
    df_tmp = pd.DataFrame(np.random.randn(3,3))
    l = filter_and_append(df_tmp, l)

full_df = pd.concat(l, axis=0)
full_df

If you need to keep track of which file the data is coming from (e.g. to ensure the index is unique, which is not the case in my toy example), then you could handle that inside your filter-and-append function, for example:

def filter_and_append(df, file, l):
    """
    df is the dataframe to be filtered, and appended to the list.
    file is the file (the data was loaded from)
    l is the list the filtered dataframe will be appended to
    """

    df_filtered = df.copy()  # put your filter logic here; copy so df itself is not modified
    df_filtered['file_name'] = file
    # set_index (not reset_index) accepts append=True; this adds file_name as an extra index level
    df_filtered.set_index('file_name', append=True, inplace=True)
    l.append(df_filtered)
    return l
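
As an alternative sketch, pd.concat can label each piece itself through its keys argument, which avoids adding a column by hand. The file names below are hypothetical placeholders, and the random toy frames stand in for your filtered data:

```python
import numpy as np
import pandas as pd

# Hypothetical file names; the random frames stand in for filtered data.
file_names = ["file_1.csv", "file_2.csv", "file_3.csv"]
filtered = [pd.DataFrame(np.random.randn(3, 3)) for _ in file_names]

# keys= adds an outer index level holding the file name for every row,
# so the combined index is unique even though each frame is indexed 0..2.
full_df = pd.concat(filtered, keys=file_names, names=["file_name", "row"])

# All rows that came from one file can then be selected by its key.
rows_from_file_2 = full_df.loc["file_2.csv"]
```

The names argument only labels the two index levels; it can be left out if you do not need them named.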

Regarding how to improve your filter function, it is hard to say without knowing what you mean by improve. For example, does it not achieve your desired output? Or is it too slow? Does it throw an error?

As far as general readability goes, it might be worth splitting up some of your logic across multiple lines, but if you are the only one reading the code then it is really just a matter of taste.
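
As one possible illustration of splitting the logic up, each condition can be given a name before the masks are combined. This is only a sketch: the parameter names angle_tol and velocity_tol are assumptions introduced here, with defaults taken from the 1 and 2 in the question's filter, and the demo data is made up:

```python
import pandas as pd

def filter_seq(df, angle_tol=1, velocity_tol=2):
    """Keep rows whose measured values lie strictly within the given
    tolerances of the simulated values (parameter names are hypothetical)."""
    angle_ok = (
        (df["Angle"] > df["Sim_Angle"] - angle_tol)
        & (df["Angle"] < df["Sim_Angle"] + angle_tol)
    )
    velocity_ok = (
        (df["Velocity"] > df["Sim_Velocity"] - velocity_tol)
        & (df["Velocity"] < df["Sim_Velocity"] + velocity_tol)
    )
    return df[angle_ok & velocity_ok]

# Tiny hand-made example: only the first row passes both conditions.
demo = pd.DataFrame({
    "Angle": [10.0, 10.0, 10.0],
    "Sim_Angle": [10.5, 12.0, 10.2],
    "Velocity": [5.0, 5.0, 9.0],
    "Sim_Velocity": [6.0, 5.0, 5.0],
})
print(filter_seq(demo))
```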

CodePudding user response：

If we take a simplified sample of your dataframes here:

import pandas as pd
import numpy as np
df = pd.DataFrame({"Angle": np.random.randint(0, 20, 100),
                   "Sim_Angle": np.random.randint(0, 20, 100),
                   "Velocity": np.random.randint(0, 20, 100),
                   "Sim_Velocity": np.random.randint(0, 20, 100)})
df_2 = pd.DataFrame({"Angle": np.random.randint(0, 20, 100),
                     "Sim_Angle": np.random.randint(0, 20, 100),
                     "Velocity": np.random.randint(0, 20, 100),
                     "Sim_Velocity": np.random.randint(0, 20, 100)})
df_3 = pd.DataFrame({"Angle": np.random.randint(0, 20, 100),
                     "Sim_Angle": np.random.randint(0, 20, 100),
                     "Velocity": np.random.randint(0, 20, 100),
                     "Sim_Velocity": np.random.randint(0, 20, 100)})

Then we can create a list of your dataframes:

files = [df, df_2, df_3]

And a simplified version of your function:

def filter_seq(df2, count):

    print('Filter 1.' + str(count))

    T1_df = df2[(df2["Angle"].between(df2["Sim_Angle"]-1, df2["Sim_Angle"]+1)) &
                (df2["Velocity"].between(df2["Sim_Velocity"]-2, df2["Sim_Velocity"]+2))]
    
    return T1_df

Here I used .between() so that df2["Angle"] didn't need to be repeated, and I used & as you did, but to combine the two filters into a single expression.
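
One detail worth checking: .between() includes both endpoints by default, whereas the comparisons in the question are strict (> and <). With integer data like the sample above, that changes which rows are kept. A small sketch of the difference, assuming pandas 1.3 or newer, where inclusive accepts the string "neither" (the two series are made-up values):

```python
import pandas as pd

angle = pd.Series([9, 10, 11, 12])
sim_angle = pd.Series([10, 10, 10, 10])

# Default: the bounds sim_angle - 1 and sim_angle + 1 themselves count as matches.
inclusive_mask = angle.between(sim_angle - 1, sim_angle + 1)

# Strict version, matching the > and < comparisons in the question.
strict_mask = angle.between(sim_angle - 1, sim_angle + 1, inclusive="neither")

print(inclusive_mask.tolist())  # [True, True, True, False]
print(strict_mask.tolist())     # [False, True, False, False]
```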

Then you can use pd.concat() with a list comprehension of the files passed through your function:

df_all = pd.concat([filter_seq(f, i) for i, f in enumerate(files)], ignore_index=True)
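
To tie this back to the 'Data' folder from the question, the same pattern could be driven by the files on disk. A minimal sketch, assuming the files are CSVs readable by pd.read_csv (the question does not say what format they are in); the helper name load_and_filter and the *.csv pattern are assumptions made for illustration:

```python
from pathlib import Path

import pandas as pd

def filter_seq(df):
    """Strict filter from the question: Angle within 1 and Velocity within 2."""
    angle_ok = (df["Angle"] > df["Sim_Angle"] - 1) & (df["Angle"] < df["Sim_Angle"] + 1)
    velocity_ok = (
        (df["Velocity"] > df["Sim_Velocity"] - 2)
        & (df["Velocity"] < df["Sim_Velocity"] + 2)
    )
    return df[angle_ok & velocity_ok]

def load_and_filter(folder="Data", pattern="*.csv"):
    """Read every matching file in `folder`, filter it, and stack the results."""
    paths = sorted(Path(folder).glob(pattern))
    if not paths:
        # pd.concat raises on an empty list, so fail with a clearer message.
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder!r}")
    filtered = [filter_seq(pd.read_csv(path)) for path in paths]
    return pd.concat(filtered, ignore_index=True)
```

Calling df_all = load_and_filter("Data") would then give one dataframe whose row count is the sum of the filtered row counts, i.e. 100 + 89 = 189 rows in the two-file example from the question.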