How to use pandas to count rows in which two columns must have one specific string from a specified

Time:03-28

I have a dataset that includes, among other things, a column for level of education and yearly salary (represented for some godforsaken reason as >50k, >=50k, <50k, etc). I need to figure out how many people with higher education AKA bachelors, masters, and doctorate make more than 50k. That means that I need to select the rows in which there is either a doctorate, bachelors, or masters in the education column, AND the first character of the salary column is '>'. What is the proper syntax for that? Will give more information if needed. Please help.

CodePudding user response:

You can use the statement below to filter the DataFrame on a condition:

newdf = df[(df.val > 0.5) & (df.val2 == 1)]

OR

you can iterate through the rows and update the column. Refer to the code below:

for index, row in df.iterrows():
    ....
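
As a rough sketch of how both approaches might look for the data in the question: the column names Education and Year_Salary and the sample rows below are assumptions for illustration, since the question does not show the actual dataset. The boolean mask is usually the better fit for counting; the loop is shown only for comparison.

```python
import pandas as pd

# Hypothetical sample data; column names and values are assumed, not from the question.
df = pd.DataFrame({
    'Education': ['doctorate', 'bachelors', 'masters', 'undergraduate', 'masters'],
    'Year_Salary': ['>50k', '<50k', '>50k', '>50k', '>=50k'],
})

higher_ed = ['doctorate', 'bachelors', 'masters']

# Approach 1: boolean mask. True counts as 1, so sum() gives the number of matching rows.
mask = df['Education'].isin(higher_ed) & (df['Year_Salary'] == '>50k')
mask_count = int(mask.sum())

# Approach 2: explicit loop over rows. Slower on large frames, but easy to follow.
loop_count = 0
for index, row in df.iterrows():
    if row['Education'] in higher_ed and row['Year_Salary'] == '>50k':
        loop_count += 1

print(mask_count, loop_count)
```

Both counts should agree; on a large DataFrame the mask version avoids the per-row overhead of iterrows.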

CodePudding user response:

To select only people with higher education, you can use isin and pass it the list of degrees. For the yearly salary, if you test only for a leading > (e.g. str.startswith('>')), you will also include the rows where Year_Salary is >=50k, that is, salaries that may be exactly 50k.

import pandas as pd
import numpy as np

#setup
np.random.seed(42)
d = {
    'Year_Salary': np.random.choice(['>50k','>=50k','<50k'], size=(50,)),
    'Education': np.random.choice(['doctorate','bachelors','masters','undergraduate'], size=(50,))
}
df = pd.DataFrame(d)

#code
filtered_df = df[df['Education'].isin(['doctorate','bachelors','masters']) \
                & df['Year_Salary'].str.startswith('>')]

print(filtered_df)
print(filtered_df.shape[0]) # 20 (number of matches)

Output from filtered_df

   Year_Salary  Education
1         >50k  doctorate
4         >50k  bachelors
7        >=50k    masters
14       >=50k    masters
...

To get only the rows where Year_Salary is strictly greater than 50k, you could use str.match with the regex ^>\d+, which matches a string that starts with a literal > followed by one or more digits. This excludes >=50k, where the character after > is = rather than a digit.

df[df['Education'].isin(['doctorate','bachelors','masters']) & (df['Year_Salary'].str.match(r'^>\d+'))]
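
To see how the two salary tests differ, here is a small sketch on a hand-written DataFrame (the rows are made up for illustration, not taken from the asker's data). It counts matches by summing the boolean mask, using the pattern ^>\d+ (a literal > followed by one or more digits).

```python
import pandas as pd

# Hypothetical rows chosen so that the two tests give different counts.
df = pd.DataFrame({
    'Education': ['doctorate', 'bachelors', 'masters', 'undergraduate', 'masters', 'bachelors'],
    'Year_Salary': ['>50k', '>=50k', '>50k', '>50k', '<50k', '>=50k'],
})

higher_ed = df['Education'].isin(['doctorate', 'bachelors', 'masters'])

# Loose test: any value starting with '>' passes, including '>=50k'.
loose = df['Year_Salary'].str.startswith('>')

# Strict test: '>' must be followed directly by a digit, so '>=50k' fails.
strict = df['Year_Salary'].str.match(r'^>\d+')

loose_count = int((higher_ed & loose).sum())
strict_count = int((higher_ed & strict).sum())

print(loose_count, strict_count)
```

Summing the mask gives the count directly, without building the filtered DataFrame first.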