I have this dataframe:
Company | Version | Disp Version | complement | Value |
---|---|---|---|---|
1 | 1 | 0 | 1 | 100 |
1 | 1 | 0 | 2 | 200 |
1 | 2 | 1 | 1 | 300 |
1 | 2 | 1 | 2 | 400 |
2 | 1 | 1 | 1 | 500 |
2 | 1 | 1 | 2 | 600 |
2 | 2 | 1 | 1 | 700 |
2 | 2 | 1 | 2 | 800 |
3 | 1 | 1 | 1 | 900 |
3 | 1 | 1 | 2 | 1000 |
4 | 1 | 0 | 1 | 1100 |
4 | 1 | 0 | 2 | 1200 |
4 | 2 | 0 | 1 | 1300 |
4 | 2 | 0 | 2 | 1400 |
4 | 3 | 0 | 1 | 1500 |
4 | 3 | 0 | 2 | 1600 |
5 | 1 | 0 | 1 | 1700 |
5 | 1 | 0 | 2 | 1800 |
5 | 2 | 0 | 1 | 1900 |
5 | 2 | 0 | 2 | 2000 |
5 | 3 | 0 | 1 | 2100 |
5 | 3 | 0 | 2 | 2200 |
5 | 4 | 1 | 1 | 2300 |
5 | 4 | 1 | 2 | 2400 |
6 | 1 | 0 | 1 | 2500 |
6 | 1 | 0 | 2 | 2600 |
6 | 2 | 0 | 1 | 2700 |
6 | 2 | 0 | 2 | 2800 |
7 | 1 | 1 | 1 | 400 |
7 | 1 | 1 | 2 | 400 |
I want to filter my dataframe with these conditions:
- If a 'Company' value has 'Disp Version' equal to both 1 and 0, I need the rows that have 'Disp Version' equal to 1 and the max of the column 'Version';
- If a 'Company' value has 'Disp Version' equal only to 1, I need the rows with the max of the column 'Version';
- If a 'Company' value has 'Disp Version' equal only to 0, I need the rows with the max of the column 'Version'.
Furthermore, each 'Company' value must keep both values 1 and 2 of the column 'complement'.
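One way to read these conditions in code is to label each company by which 'Disp Version' values it contains. This is only a sketch: the names `versions`, `stats` and `case` are made up for illustration, the 'complement' and 'Value' columns are left out because they do not affect the labelling, and 'Disp Version' is assumed to hold only 0 and 1.

```python
import pandas as pd

# One row per (Company, Version, Disp Version) combination from the table above.
versions = [(1, 1, 0), (1, 2, 1), (2, 1, 1), (2, 2, 1), (3, 1, 1),
            (4, 1, 0), (4, 2, 0), (4, 3, 0), (5, 1, 0), (5, 2, 0),
            (5, 3, 0), (5, 4, 1), (6, 1, 0), (6, 2, 0), (7, 1, 1)]
df = pd.DataFrame(versions, columns=['Company', 'Version', 'Disp Version'])

# Smallest and largest 'Disp Version' seen for each company.
stats = df.groupby('Company')['Disp Version'].agg(['min', 'max'])

# Default label is "both"; overwrite it where only one value occurs.
case = pd.Series('both', index=stats.index)
case[stats['min'] == 1] = 'only 1'   # every row has Disp Version 1
case[stats['max'] == 0] = 'only 0'   # every row has Disp Version 0

print(case)
```

Companies 1 and 5 come out as "both", 2, 3 and 7 as "only 1", and 4 and 6 as "only 0", which matches the three example tables.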
Examples:
For the first condition I need a dataframe like this:
Company | Version | Disp Version | complement | Value |
---|---|---|---|---|
1 | 2 | 1 | 1 | 300 |
1 | 2 | 1 | 2 | 400 |
5 | 4 | 1 | 1 | 2300 |
5 | 4 | 1 | 2 | 2400 |
For the second condition I need a dataframe like this:
Company | Version | Disp Version | complement | Value |
---|---|---|---|---|
2 | 2 | 1 | 1 | 700 |
2 | 2 | 1 | 2 | 800 |
3 | 1 | 1 | 1 | 900 |
3 | 1 | 1 | 2 | 1000 |
7 | 1 | 1 | 1 | 400 |
7 | 1 | 1 | 2 | 400 |
For the third condition I need a dataframe like this:
Company | Version | Disp Version | complement | Value |
---|---|---|---|---|
4 | 3 | 0 | 1 | 1500 |
4 | 3 | 0 | 2 | 1600 |
6 | 2 | 0 | 1 | 2700 |
6 | 2 | 0 | 2 | 2800 |
I need this output (which is the 3 dataframes together):
Company | Version | Disp Version | complement | Value |
---|---|---|---|---|
1 | 2 | 1 | 1 | 300 |
1 | 2 | 1 | 2 | 400 |
2 | 2 | 1 | 1 | 700 |
2 | 2 | 1 | 2 | 800 |
3 | 1 | 1 | 1 | 900 |
3 | 1 | 1 | 2 | 1000 |
4 | 3 | 0 | 1 | 1500 |
4 | 3 | 0 | 2 | 1600 |
5 | 4 | 1 | 1 | 2300 |
5 | 4 | 1 | 2 | 2400 |
6 | 2 | 0 | 1 | 2700 |
6 | 2 | 0 | 2 | 2800 |
7 | 1 | 1 | 1 | 400 |
7 | 1 | 1 | 2 | 400 |
CodePudding user response:
To filter a dataframe on several conditions, you can use boolean indexing together with groupby(...).transform, which returns one value per row so it can be compared directly against the original columns. Here's an example of how to select the rows for the first condition you mentioned:
n_disp = df.groupby('Company')['Disp Version'].transform('nunique')
df1 = df[(n_disp == 2) & (df['Disp Version'] == 1)]
df1 = df1[df1['Version'] == df1.groupby('Company')['Version'].transform('max')]
This creates a new dataframe df1 that contains the rows that meet the first condition. The first line counts the distinct Disp Version values of each company, so a count of 2 marks the companies that have both 0 and 1 (this assumes the column only holds 0 and 1). The second line keeps the rows of those companies that have Disp Version equal to 1. The third line keeps only the rows whose Version equals the maximum Version remaining in each company.
To filter the data based on the second and third conditions, you can use similar code. A company whose minimum Disp Version is 1 has only 1s, and a company whose maximum Disp Version is 0 has only 0s (again assuming the column only holds 0 and 1):
df2 = df[df.groupby('Company')['Disp Version'].transform('min') == 1]
df2 = df2[df2['Version'] == df2.groupby('Company')['Version'].transform('max')]
df3 = df[df.groupby('Company')['Disp Version'].transform('max') == 0]
df3 = df3[df3['Version'] == df3.groupby('Company')['Version'].transform('max')]
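To get the final dataframe that includes all the rows that meet the conditions, you can concatenate the three dataframes using the pandas.concat function (with pandas imported as pd) and sort the index so the rows come back in their original order:
result = pd.concat([df1, df2, df3]).sort_index()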
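The three cases can also be collapsed into a single rule: for each company keep the rows with the highest 'Disp Version', and among those the rows with the highest 'Version'. Below is a minimal sketch of that idea, not a definitive implementation. It assumes 'Disp Version' only takes the values 0 and 1, and the names `versions`, `best_disp`, `kept` and `best_version` are made up for illustration.

```python
import pandas as pd

# Rebuild the sample data: each (Company, Version, Disp Version) appears
# once per complement value (1 and 2).
versions = [(1, 1, 0), (1, 2, 1), (2, 1, 1), (2, 2, 1), (3, 1, 1),
            (4, 1, 0), (4, 2, 0), (4, 3, 0), (5, 1, 0), (5, 2, 0),
            (5, 3, 0), (5, 4, 1), (6, 1, 0), (6, 2, 0), (7, 1, 1)]
rows = [(c, v, d, k) for c, v, d in versions for k in (1, 2)]
df = pd.DataFrame(rows, columns=['Company', 'Version', 'Disp Version', 'complement'])
df['Value'] = list(range(100, 2900, 100)) + [400, 400]

# Highest 'Disp Version' per company, broadcast back to every row:
# 1 if the company has any row with 1, otherwise 0.
best_disp = df.groupby('Company')['Disp Version'].transform('max')
kept = df[df['Disp Version'] == best_disp]

# Among the remaining rows, keep the highest 'Version' of each company.
best_version = kept.groupby('Company')['Version'].transform('max')
result = kept[kept['Version'] == best_version].reset_index(drop=True)

print(result)
```

Because both complement rows of a version share the same 'Version' and 'Disp Version', they are kept or dropped together, so every company ends up with complements 1 and 2.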
CodePudding user response:
1. Get the maximum version number for each company
maxv = df.groupby('Company')['Version'].max().rename('Max Version').reset_index()
2. Join with the original dataframe
merged_df = pd.merge(df, maxv, on='Company')
3. Get the companies which satisfy the first condition (have both 1 and 0)
idx = df.groupby('Company')['Disp Version'].nunique().eq(2)
valids = idx[idx]
4. Final result for the first condition (the parentheses are needed because & binds more tightly than ==)
df.loc[df['Company'].isin(valids.index) & (df['Disp Version'] == 1)]
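The same join idea can be stretched to cover all three conditions at once: pick one ('Disp Version', 'Version') pair per company, then merge it back onto the original rows. This is a rough sketch under the assumption that 'Disp Version' only holds 0 and 1. The names `versions` and `keys` are hypothetical and not part of the steps above.

```python
import pandas as pd

# Rebuild the sample data from the question.
versions = [(1, 1, 0), (1, 2, 1), (2, 1, 1), (2, 2, 1), (3, 1, 1),
            (4, 1, 0), (4, 2, 0), (4, 3, 0), (5, 1, 0), (5, 2, 0),
            (5, 3, 0), (5, 4, 1), (6, 1, 0), (6, 2, 0), (7, 1, 1)]
rows = [(c, v, d, k) for c, v, d in versions for k in (1, 2)]
df = pd.DataFrame(rows, columns=['Company', 'Version', 'Disp Version', 'complement'])
df['Value'] = list(range(100, 2900, 100)) + [400, 400]

# Sort so that, within each company, the row with the highest
# 'Disp Version' and then the highest 'Version' comes last,
# and keep only that last row per company.
keys = (df[['Company', 'Disp Version', 'Version']]
        .sort_values(['Company', 'Disp Version', 'Version'])
        .drop_duplicates('Company', keep='last'))

# An inner merge on all three key columns keeps every row of df
# (both complement values) that matches the chosen pair.
result = df.merge(keys, on=['Company', 'Disp Version', 'Version'])

print(result)
```

Sorting on 'Disp Version' before 'Version' is what makes a row with 'Disp Version' 1 win over a higher 'Version' that has 'Disp Version' 0.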