I have a dataset that includes, among other things, a column for level of education and yearly salary (represented for some godforsaken reason as >50k, >=50k, <50k, etc). I need to figure out how many people with higher education AKA bachelors, masters, and doctorate make more than 50k. That means that I need to select the rows in which there is either a doctorate, bachelors, or masters in the education column, AND the first character of the salary column is '>'. What is the proper syntax for that? Will give more information if needed. Please help.
CodePudding user response:
You can use below statement to filter the dataframe based on condition:
newdf = df[(df.val > 0.5) & (df.val2 == 1)]
OR
you can iter through rows and update the column. Refer the below code:
for index, row in df.iterrows():
....
CodePudding user response:
To select only people with higher education you can use isin
passing the list of education degree. For the yearly salary, if you test only against the >
(e.g. str.startswith
('>')
) you could end up including the rows where Year_Salary
are also equal
to 50k.
import pandas as pd
import numpy as np
#setup
np.random.seed(42)
d = {
'Year_Salary': np.random.choice(['>50k','>=50k','<50k'], size=(50,)),
'Education': np.random.choice(['doctorate','bachelors','masters','undergraduate'], size=(50,))
}
df = pd.DataFrame(d)
#code
filtered_df = df[df['Education'].isin(['doctorate','bachelors','masters']) \
& df['Year_Salary'].str.startswith('>')]
print(filtered_df)
print(filtered_df.shape[0]) # 20 (number of matches)
Output from filtered_df
Year_Salary Education
1 >50k doctorate
4 >50k bachelors
7 >=50k masters
14 >=50k masters
...
To get only the rows where Year_Salary
is greater than 50k you could use str.match
with the regex ^>\d
, a string that starts with a literal >
follow by one or more digits.
df[df['Education'].isin(['doctorate','bachelors','masters']) & (df['Year_Salary'].str.match(r'^>\d '))]