I would like to create a cat_month column in my expeditions dataframe. This column would contain the mountain category (small, medium or large), assigned according to the height in the highpoint_metres column using quartiles (small = height lower than the first quartile), but I can't manage to do it.
Data:
import pandas as pd
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")
What I've tried:
peaks[cat_monts] =
for peak_id in expeditions :
    if "highpoint_metres" < 6226.5 : # 1st quartile
        return "petite montagne"
    elif 6226.5<"highpoint_metres" <7031.25:
        return "moyenne montagne"
    else :
        return "grande montagne"
CodePudding user response:
Use np.select, which accepts a list of conditions, a list of their corresponding values, and a default ("else") value.
When several conditions are true for a row, the first one in the list wins, so put the narrowest condition first:
import numpy as np

conditions = {
    'petite montagne': expeditions['highpoint_metres'] < 6226.5,
    'moyenne montagne': expeditions['highpoint_metres'] < 7031.25,
}
# rows matching no condition (including missing heights) get the default
expeditions['cat_month'] = np.select(list(conditions.values()), list(conditions.keys()), default='grande montagne')
Output:
       expedition_id  ...  highpoint_metres  ...         cat_month
0          ANN260101  ...            7937.0  ...   grande montagne
1          ANN269301  ...            7937.0  ...   grande montagne
2          ANN273101  ...            7937.0  ...   grande montagne
3          ANN278301  ...            7000.0  ...  moyenne montagne
4          ANN279301  ...            7160.0  ...   grande montagne
...              ...  ...               ...  ...               ...
10359      PUMO19101  ...            7138.0  ...   grande montagne
10360      PUMO19102  ...            7138.0  ...   grande montagne
10361      PUTH19101  ...            6350.0  ...  moyenne montagne
10362      RATC19101  ...            6600.0  ...  moyenne montagne
10363      SANK19101  ...            6452.0  ...  moyenne montagne
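Because np.select keeps the first condition that is true for each element, the order of the conditions decides the result. Here is a minimal sketch on made-up heights (the toy values and the names heights, choices, categories and reversed_categories are only for illustration, they are not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy heights (made up), including the two boundary values.
heights = pd.Series([5800.0, 6226.5, 6900.0, 7031.25, 8100.0])

# Narrowest condition first: the first True condition wins.
conditions = [heights < 6226.5, heights < 7031.25]
choices = ["petite montagne", "moyenne montagne"]

categories = np.select(conditions, choices, default="grande montagne")
print(categories.tolist())
# ['petite montagne', 'moyenne montagne', 'moyenne montagne',
#  'grande montagne', 'grande montagne']

# Same conditions in reverse order: every height below 7031.25 already
# matches the first condition, so "petite montagne" is never assigned.
reversed_categories = np.select(conditions[::-1], choices[::-1],
                                default="grande montagne")
print(reversed_categories.tolist())
```

With the broad condition listed first, the 5800 m entry would come out as 'moyenne montagne', which is probably not what you want.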
CodePudding user response:
I think the np.select() approach above is probably better, but I was already working on this so I figured I'd share.
You can write a function and then use df.apply() to build your new column from that function.
def func(row):
    height = row['highpoint_metres']
    if height < 6226.5:
        return 'petite montagne'
    elif height < 7031.25:
        return 'moyenne montagne'
    else:
        return 'grande montagne'

# axis=1 passes each row of the dataframe to func
expeditions['cat_monts'] = expeditions.apply(func, axis=1)
Also, notice that the new column is written expeditions['cat_monts'], with quotes around the column name. You want the column to be named by that string, not by the value of a variable called cat_monts.
CodePudding user response:
def peaks(x):
    if x < 6226.5:
        return "petite montagne"
    elif x < 7031.25:  # 6226.5 <= x < 7031.25
        return "moyenne montagne"
    else:
        return "grande montagne"

expeditions['cat_month'] = expeditions['highpoint_metres'].apply(peaks)
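One thing to watch with a function like this: any comparison with NaN is False, so a missing height falls through to the else branch and gets labelled "grande montagne". If the column may contain missing values, a guard can keep them missing. A minimal sketch on made-up data (categorize and toy are hypothetical names, and returning None for unknown heights is an assumption about what you want):

```python
import numpy as np
import pandas as pd

def categorize(height):
    """Map a height in metres to a category; missing heights stay missing."""
    if pd.isna(height):
        return None  # assumption: leave unknown heights uncategorised
    if height < 6226.5:
        return "petite montagne"
    elif height < 7031.25:
        return "moyenne montagne"
    return "grande montagne"

# Toy data (made up), with a missing value and the exact boundary 6226.5.
toy = pd.DataFrame({"highpoint_metres": [5800.0, 6226.5, np.nan, 7500.0]})
toy["cat_month"] = toy["highpoint_metres"].apply(categorize)
print(toy)
```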
CodePudding user response:
One option is the case_when function from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor

expeditions = expeditions.case_when(
    # condition, value if True (the first matching condition wins)
    expeditions.highpoint_metres < 6226.5, 'petite montagne',
    expeditions.highpoint_metres < 7031.25, 'moyenne montagne',
    'grande montagne',  # default if no condition is True
    column_name='cat_month')
A faster option for this kind of scenario than case_when or np.select, I believe, would be a binning approach with pd.cut:
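import numpy as np

binned_data = pd.cut(expeditions.highpoint_metres,
                     bins=[0, 6226.5, 7031.25, np.inf],
                     right=False,  # left-closed bins: [0, 6226.5), [6226.5, 7031.25), ...
                     labels=["petite montagne", "moyenne montagne", "grande montagne"])
# assign returns a new dataframe, so keep the result
expeditions = expeditions.assign(cat_month=binned_data)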
Note that the cat_month column produced by the binned approach is a categorical column.
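Since the question asks for quartile-based categories, the bin edges can also be computed from the data instead of being typed in by hand. A minimal sketch on made-up heights, assuming "small" means below the first quartile and "large" means at or above the third quartile (that second cut-off is my assumption, and heights, q1, q3 and categories are illustrative names). With the real data, heights would be expeditions["highpoint_metres"]:

```python
import numpy as np
import pandas as pd

# Toy heights (made up), with one missing value.
heights = pd.Series([5500.0, 6000.0, 6500.0, 7000.0,
                     7500.0, 8000.0, 8500.0, np.nan])

# quantile skips missing values by default.
q1, q3 = heights.quantile([0.25, 0.75])

categories = pd.cut(
    heights,
    bins=[-np.inf, q1, q3, np.inf],
    right=False,  # left-closed bins, so a height equal to q1 is "moyenne"
    labels=["petite montagne", "moyenne montagne", "grande montagne"],
)
print(q1, q3)
print(categories)
```

Rows with a missing height stay missing here, unlike with a default "else" value.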