How can I filter multiple dataframes with multiple conditions (pandas)?

I have this dataframe:

Company	Version	Disp Version	complement	Value
1	1	0	1	100
1	1	0	2	200
1	2	1	1	300
1	2	1	2	400
2	1	1	1	500
2	1	1	2	600
2	2	1	1	700
2	2	1	2	800
3	1	1	1	900
3	1	1	2	1000
4	1	0	1	1100
4	1	0	2	1200
4	2	0	1	1300
4	2	0	2	1400
4	3	0	1	1500
4	3	0	2	1600
5	1	0	1	1700
5	1	0	2	1800
5	2	0	1	1900
5	2	0	2	2000
5	3	0	1	2100
5	3	0	2	2200
5	4	1	1	2300
5	4	1	2	2400
6	1	0	1	2500
6	1	0	2	2600
6	2	0	1	2700
6	2	0	2	2800
7	1	1	1	400
7	1	1	2	400

I want my dataframe to be filtered with some conditions:

If a 'Company' value has both 1 and 0 in 'Disp Version', keep the rows where 'Disp Version' equals 1 and that have the maximum 'Version';
If a 'Company' value has only 1 in 'Disp Version', keep the rows with the maximum 'Version';
If a 'Company' value has only 0 in 'Disp Version', keep the rows with the maximum 'Version'.

Furthermore, each 'Company' value in the result must have both values 1 and 2 in the 'complement' column.
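
To make the three cases concrete, here is a minimal sketch that sorts the companies into "both 1 and 0", "only 1" and "only 0". It assumes 'Disp Version' only ever holds 0 or 1; the names pairs, disp, both, only_one and only_zero are illustrative and not part of the original data.

```python
import pandas as pd

# (Company, Disp Version) pairs taken from the question's table, one per version
pairs = [(1, 0), (1, 1), (2, 1), (2, 1), (3, 1), (4, 0), (4, 0), (4, 0),
         (5, 0), (5, 0), (5, 0), (5, 1), (6, 0), (6, 0), (7, 1)]
df = pd.DataFrame(pairs, columns=['Company', 'Disp Version'])

# Smallest and largest 'Disp Version' seen for each company
disp = df.groupby('Company')['Disp Version'].agg(['min', 'max'])

both = disp.index[(disp['min'] == 0) & (disp['max'] == 1)].tolist()
only_one = disp.index[disp['min'] == 1].tolist()
only_zero = disp.index[disp['max'] == 0].tolist()

print(both)       # [1, 5]
print(only_one)   # [2, 3, 7]
print(only_zero)  # [4, 6]
```

These groups match the companies that appear in the three example dataframes.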

Examples:

For the first condition I need a dataframe like this:

Company	Version	Disp Version	complement	Value
1	2	1	1	300
1	2	1	2	400
5	4	1	1	2300
5	4	1	2	2400

For the second condition I need a dataframe like this:

Company	Version	Disp Version	complement	Value
2	2	1	1	700
2	2	1	2	800
3	1	1	1	900
3	1	1	2	1000
7	1	1	1	400
7	1	1	2	400

For the third condition I need a dataframe like this:

Company	Version	complement	Value
4	3	1	1500
4	3	2	1600
6	2	1	2700
6	2	2	2800

I need this output (which is the three dataframes combined):

Company	Version	Disp Version	complement	Value
1	2	1	1	300
1	2	1	2	400
2	2	1	1	700
2	2	1	2	800
3	1	1	1	900
3	1	1	2	1000
4	3	0	1	1500
4	3	0	2	1600
5	4	1	1	2300
5	4	1	2	2400
6	2	0	1	2700
6	2	0	2	2800
7	1	1	1	400
7	1	1	2	400

CodePudding user response：

To filter a dataframe on several conditions, you can use boolean indexing: build a boolean mask for each condition and combine the masks with & and |. Here is how to select the rows for the first condition you mentioned:

has_both = df.groupby('Company')['Disp Version'].transform('nunique') == 2
df1 = df[has_both & (df['Disp Version'] == 1)]
df1 = df1[df1['Version'] == df1.groupby('Company')['Version'].transform('max')]

This creates a new dataframe df1 containing the rows that meet the first condition. The first line counts the distinct 'Disp Version' values per company and flags the companies that have both (this assumes the column only holds 0 and 1). The second line keeps, for those companies, only the rows with 'Disp Version' equal to 1. The third line keeps the rows whose 'Version' equals the maximum 'Version' of their company within df1.

For the second and third conditions, first select the companies whose 'Disp Version' is only 1 (or only 0), then keep the rows with the maximum 'Version' for each company:

only_one = df.groupby('Company')['Disp Version'].transform('min') == 1
df2 = df[only_one]
df2 = df2[df2['Version'] == df2.groupby('Company')['Version'].transform('max')]

only_zero = df.groupby('Company')['Disp Version'].transform('max') == 0
df3 = df[only_zero]
df3 = df3[df3['Version'] == df3.groupby('Company')['Version'].transform('max')]

To get the final dataframe with all the rows that meet the conditions, concatenate the three dataframes with the pandas.concat function. Sorting by index afterwards orders the rows by company, as in the expected output:

result = pd.concat([df1, df2, df3]).sort_index()
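
As a possible shortcut, the three conditions can be folded into one rule: if a company has any row with 'Disp Version' equal to 1, look only at those rows, then keep the maximum 'Version' per company. The sketch below rebuilds the sample data from the question and applies that rule. It assumes 'Disp Version' only holds 0 or 1, and the names versions, has_one and candidates are illustrative.

```python
import pandas as pd

# Sample data from the question: (Company, Version, Disp Version),
# each combination appearing once for complement 1 and once for complement 2
versions = [(1, 1, 0), (1, 2, 1), (2, 1, 1), (2, 2, 1), (3, 1, 1),
            (4, 1, 0), (4, 2, 0), (4, 3, 0), (5, 1, 0), (5, 2, 0),
            (5, 3, 0), (5, 4, 1), (6, 1, 0), (6, 2, 0), (7, 1, 1)]
rows = [(c, v, d, k) for (c, v, d) in versions for k in (1, 2)]
df = pd.DataFrame(rows, columns=['Company', 'Version', 'Disp Version', 'complement'])
df['Value'] = list(range(100, 2900, 100)) + [400, 400]

# True for every row of a company that has at least one 'Disp Version' == 1
has_one = df.groupby('Company')['Disp Version'].transform('max') == 1

# Companies with a 1 keep only their 'Disp Version' == 1 rows;
# companies with only 0 keep all their rows
candidates = df[~has_one | (df['Disp Version'] == 1)]

# Keep the rows holding the maximum 'Version' of each company
max_version = candidates.groupby('Company')['Version'].transform('max')
result = candidates[candidates['Version'] == max_version]

print(result.to_string(index=False))
```

On the sample data this yields the 14 rows of the expected output, already ordered by company because boolean indexing preserves the original row order.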

CodePudding user response：

    # 1. Get the maximum version number for each company
    maxv = df.groupby('Company')['Version'].max().rename('Max Version').reset_index()

    # 2. Join with the original dataframe
    merged_df = pd.merge(df, maxv, on='Company')

    # 3. Get the companies that satisfy the first condition (have both 1 and 0)
    idx = df.groupby('Company')['Disp Version'].apply(lambda s: 0 < s.sum() < s.count())
    valids = idx[idx]

    # 4. Final result: the 'Disp Version' == 1 rows of those companies
    df.loc[df['Company'].isin(valids.index) & (df['Disp Version'] == 1)]
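
The steps above only return the rows for the first condition. One way to cover all three conditions with the same groupby-and-merge style is sketched below. It is an illustration rather than the answerer's own method: it assumes 'Disp Version' only holds 0 or 1, and the names pref, cand and maxv are hypothetical.

```python
import pandas as pd

# Sample data from the question: (Company, Version, Disp Version),
# each combination appearing once for complement 1 and once for complement 2
versions = [(1, 1, 0), (1, 2, 1), (2, 1, 1), (2, 2, 1), (3, 1, 1),
            (4, 1, 0), (4, 2, 0), (4, 3, 0), (5, 1, 0), (5, 2, 0),
            (5, 3, 0), (5, 4, 1), (6, 1, 0), (6, 2, 0), (7, 1, 1)]
rows = [(c, v, d, k) for (c, v, d) in versions for k in (1, 2)]
df = pd.DataFrame(rows, columns=['Company', 'Version', 'Disp Version', 'complement'])
df['Value'] = list(range(100, 2900, 100)) + [400, 400]

# 1. Preferred 'Disp Version' per company: 1 if the company has any 1, else 0
pref = df.groupby('Company', as_index=False)['Disp Version'].max()

# 2. Inner merge keeps only the rows carrying the preferred 'Disp Version'
cand = df.merge(pref, on=['Company', 'Disp Version'])

# 3. Maximum 'Version' per company among the remaining rows
maxv = cand.groupby('Company', as_index=False)['Version'].max()

# 4. Inner merge keeps only the rows at that maximum 'Version'
result = cand.merge(maxv, on=['Company', 'Version'])

print(result.to_string(index=False))
```

Merging on the aggregated key columns acts as a filter here, so no explicit boolean masks are needed; the trade-off is that the result gets a fresh 0-based index instead of the original row labels.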