How to make a new column with conditions on another column?

I would like to create a cat_month column in my expeditions dataframe. It should hold the mountain category (small, medium or large), assigned from the height in the highpoint_metres column using quartiles (for example, small = height below the first quartile), but I can't manage to do it.

Data:

import pandas as pd
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")

What I've tried :

peaks[cat_monts] = 
for peak_id in expeditions : 
 if "highpoint_metres" < 6226.5 : # 1st quartile
  return "petite montagne"
elif 6226.5<"highpoint_metres" <7031.25:
  return "moyenne montagne"
else : 
 return "grande montagne"

CodePudding user response：

Use np.select which accepts a list of conditions, list of their corresponding values, and a default ("else") value.

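When several conditions are true for a row, the first one in the list wins, so put the narrowest condition first:

import numpy as np

conditions = {
    'petite montagne': expeditions['highpoint_metres'] < 6226.5,    # below the first quartile
    'moyenne montagne': expeditions['highpoint_metres'] < 7031.25,  # only reached when the first test is False
}
# pass plain lists rather than dict views
expeditions['cat_month'] = np.select(list(conditions.values()), list(conditions.keys()), default='grande montagne')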

Output:

      expedition_id  ...  highpoint_metres  ...         cat_month
0         ANN260101  ...            7937.0  ...   grande montagne
1         ANN269301  ...            7937.0  ...   grande montagne
2         ANN273101  ...            7937.0  ...   grande montagne
3         ANN278301  ...            7000.0  ...  moyenne montagne
4         ANN279301  ...            7160.0  ...   grande montagne
...             ...  ...               ...  ...               ...
10359     PUMO19101  ...            7138.0  ...   grande montagne
10360     PUMO19102  ...            7138.0  ...   grande montagne
10361     PUTH19101  ...            6350.0  ...  moyenne montagne
10362     RATC19101  ...            6600.0  ...  moyenne montagne
10363     SANK19101  ...            6452.0  ...  moyenne montagne
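
A quick way to check the ordering rule is to run np.select on a handful of made-up heights that sit on and around the two thresholds. This is only a sketch: the `demo` frame and its values are assumptions for illustration, not rows from the expeditions data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample heights, not taken from the expeditions dataset
demo = pd.DataFrame({"highpoint_metres": [5800.0, 6226.5, 7000.0, 7031.25, 8100.0]})

# Narrowest condition first: np.select uses the first condition that is True
conditions = [
    demo["highpoint_metres"] < 6226.5,
    demo["highpoint_metres"] < 7031.25,
]
choices = ["petite montagne", "moyenne montagne"]

demo["cat_month"] = np.select(conditions, choices, default="grande montagne")
print(demo)
```

If the two conditions were swapped, the 5800.0 row would already match `< 7031.25` and would never reach the "petite montagne" test.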

CodePudding user response：

I think the np.select() method above is probably better, but I was already working on this so I figured I'd share.

You can make a function and then use df.apply() to make your new column using that function.

def func(row):
    height = row['highpoint_metres']  # the height column in the expeditions dataframe
    if height < 6226.5:  # below the first quartile
        return 'petite montagne'
    elif height < 7031.25:
        return 'moyenne montagne'
    else:
        return 'grande montagne'

# axis=1 passes each row to func
expeditions['cat_month'] = expeditions.apply(func, axis=1)

Also, notice that the new column is written expeditions['cat_month'], with quotes around the column name. You want the column to be named by that string; without the quotes, Python looks for a variable called cat_month instead.

CodePudding user response：

def peaks(x):
    if x < 6226.5:
        return "petite montagne"
    elif x < 7031.25:  # 6226.5 itself counts as medium
        return "moyenne montagne"
    else:
        return "grande montagne"


# Series.apply calls peaks on each height, so no lambda wrapper is needed
expeditions['cat_month'] = expeditions['highpoint_metres'].apply(peaks)
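
One case the comparison chain does not cover is a missing height: every comparison with NaN is False, so such a row falls through to the else branch and is labelled "grande montagne". A minimal sketch that leaves missing heights unlabelled instead, using made-up values (the function name `categorise` and the sample Series are assumptions, not part of the answer above):

```python
import numpy as np
import pandas as pd


def categorise(height):
    """Return the mountain category for one height, or NaN if the height is missing."""
    if pd.isna(height):
        return np.nan
    if height < 6226.5:
        return "petite montagne"
    if height < 7031.25:
        return "moyenne montagne"
    return "grande montagne"


# Hypothetical heights, including one missing value
heights = pd.Series([5800.0, 6600.0, np.nan, 7937.0], name="highpoint_metres")
categories = heights.apply(categorise)
print(categories)
```

Whether a missing height should stay empty or get its own label is a judgment call that depends on how the column is used afterwards.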

CodePudding user response：

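One option is the case_when function from pyjanitor. As with np.select, the first condition that is true wins, so the narrowest test goes first:

# pip install pyjanitor
import pandas as pd
import janitor  # registers case_when as a DataFrame method

expeditions = expeditions.case_when(
    # condition, value if True
    expeditions.highpoint_metres < 6226.5, 'petite montagne',
    expeditions.highpoint_metres < 7031.25, 'moyenne montagne',
    'grande montagne', # default if no condition is True (recent pyjanitor releases prefer default='grande montagne')
    column_name = 'cat_month')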

For this kind of threshold lookup, a binning approach with pd.cut is probably faster than case_when or np.select:

import numpy as np

binned_data = pd.cut(expeditions.highpoint_metres, 
                     bins=[0, 6226.5, 7031.25, np.inf], 
                     right=False,  # bins are closed on the left: [0, 6226.5), [6226.5, 7031.25), [7031.25, inf)
                     labels=["petite montagne", "moyenne montagne", "grande montagne"])

# assign returns a new dataframe, so keep the result
expeditions = expeditions.assign(cat_month=binned_data)

Note that with the binned approach cat_month is a categorical column, and rows with a missing height stay NaN instead of receiving a label.
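
Since the cut points in the question are described as quartiles, they can also be computed from the column rather than typed in. A hedged sketch on made-up heights, assuming the two thresholds are the first and third quartiles (the question only says so for the first one); `demo`, `q1` and `q3` are illustrative names:

```python
import numpy as np
import pandas as pd

# Hypothetical heights, including one missing value
demo = pd.DataFrame(
    {"highpoint_metres": [5500.0, 6000.0, 6500.0, 7000.0, 7500.0, 8000.0, np.nan]}
)

# Series.quantile ignores NaN; unpacking yields the 25% and 75% values in order
q1, q3 = demo["highpoint_metres"].quantile([0.25, 0.75])

demo["cat_month"] = pd.cut(
    demo["highpoint_metres"],
    bins=[0, q1, q3, np.inf],
    right=False,  # left-closed bins, matching the "<" comparisons used above
    labels=["petite montagne", "moyenne montagne", "grande montagne"],
)
print(q1, q3)
print(demo)
```

On the real data the same two lines would use expeditions["highpoint_metres"] in place of the demo column.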