I am having a large dataframe(sample:https://docs.google.com/spreadsheets/d/17Xjc81jkjS-64B4FGZ06SzYDRnc6J27m/edit?usp=sharing&ouid=106137353367530025738&rtpof=true&sd=true).I was filtering the data.The code is below
A = [f"A{i}" for i in range(50)]
B = [f"B{i}" for i in range(50)]
C = [f"C{i}" for i in range(50)]
for i in A:
cond_A = (df[i]>= -0.0423) & (df[i]<=3)
filt_df = df[cond_A]
for i in B:
cond_B = (filt_df[i]>= 15) & (filt_df[i]<=20)
filt_df2 = filt_df[cond_B]
for i in C:
cond_C = (filt_df2[i]>= 15) & (filt_df2[i]<=20)
filt_df3 = filt_df2[cond_B]
1.When i print filt_df3,i am getting only an empty dataframe,why?
2.How can i improve the above the code,other approaches like some advanced techniques?
the code is written in python.
Thank you very much in advance.
CodePudding user response:
Since the column A1 contains only -0.057 which is outside [-0.0423, 3] everything gets filtered out.
Nevertheless, you seem not to take over the filter in every loop as filt_df{1|2|3} is reset.
This should work:
import pandas as pd
A = [f"A{i}" for i in range(50)]
B = [f"B{i}" for i in range(50)]
C = [f"C{i}" for i in range(50)]
filt_df = df.copy()
for i in A:
cond_A = (df[i] >= -0.0423) & (df[i]<=3)
filt_df = filt_df[cond_A]
filt_df2 = filt_df.copy()
for i in B:
cond_B = (filt_df[i]>= 15) & (filt_df[i]<=20)
filt_df2 = filt_df2[cond_B]
filt_df3 = filt_df2.copy()
for i in C:
cond_C = (filt_df2[i]>= 15) & (filt_df2[i]<=20)
filt_df3 = filt_df3[cond_B]
print(filt_df3)
Of course you will find a lot of filter tools in the pandas library that can be applied to multiple columns
For example this: https://stackoverflow.com/a/39820329/6139079
CodePudding user response:
It seems to me that there is an issue with your codes when you are using the iteration to do the filtering. For example, filt_df is being overwritten in every iteration of the first loop. When the loop ends, filt_df only contains the data filtered with the conditions set in the last iteration. Is this what you intend to do?
And if you want to do the filtering efficient, you can try to use pandas.DataFrame.query (see documentation here). For example, if you want to filter out all rows with column B0 to B49 containing values between 0 and 200 inclusive, you can try to use the Python codes below (assuming that you have imported the raw data in the variable df below).
condition_list = [f'B{i} >= 0 & B{i} <= 200' for i in range(50)]
filter_str = ' & '.join(condition_list)
subset_df = df.query(filter_str)
print(subset_df)