How to perform calculations on a subset of a column in a pandas dataframe?

Time:12-04

With a dataset such as this:

    famid  birth  age   ht
0       1      1  one  2.8
1       1      1  two  3.4
2       1      2  one  2.9
3       1      2  two  3.8
4       1      3  one  2.2
5       1      3  two  2.9

...where we have values of a variable ht for different categories of, for example, age, I would like to adjust a subset of the data in df['ht'] where df['age'] == 'one' only, and I would like to do it without creating a new column.

I've tried:

df[df['age']=='one']['ht'] = df[df['age']=='one']['ht']*10**6

But to my mild surprise the numbers don't change. Maybe that is because the warning "A value is trying to be set on a copy of a slice from a DataFrame" is triggered in the same run. I've also tried df.mask() and df.where(), but to no avail. I'm clearly failing at something very basic here, and I'd really like to know how to do this properly. There are similar-sounding questions, such as "Performing calculations on subset of data frame subset in Python", but the solutions suggested there point towards df.groupby(), and I don't think that is necessarily the right approach here.

Thank you for any suggestions!

Here's a fully reproducible dataset:

import pandas as pd

df = pd.DataFrame({
    'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'ht_one': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
    'ht_two': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9]
})
df = pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'], j='age',
                    sep='_', suffix=r'\w+')
df.reset_index(inplace = True)

CodePudding user response:

To adjust a subset of a column in a pandas dataframe, you can use loc. The loc indexer selects a subset of the dataframe by row and column labels, or by a boolean condition on the rows. In your case, you want to adjust the values in the ht column where the age column is equal to one. You can do this with the following code:

df.loc[df['age'] == 'one', 'ht'] = df[df['age'] == 'one']['ht'] * 10**6

In df.loc[rows, columns], the first entry selects the rows. Here it is the condition df['age'] == 'one', which picks every row whose value in the age column is one. The second entry names the column or columns to adjust, here 'ht'. The right-hand side of the assignment supplies the new values: df[df['age'] == 'one']['ht'] * 10**6 multiplies the ht values of those same rows by 10^6. Chained brackets are harmless on the right-hand side because they only read from the dataframe; the problem in the question comes from assigning through them.

After running this code, the values in the ht column where the age column is one will be adjusted as desired. Here's an example of how you could use this code in your case:

import pandas as pd

df = pd.DataFrame({
    'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'ht_one': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
    'ht_two': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9]
})
df = pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'], j='age',
                    sep='_', suffix=r'\w+')
df.reset_index(inplace = True)

# Adjust the values in the ht column where the age column is 'one'
df.loc[df['age'] == 'one', 'ht'] = df[df['age'] == 'one']['ht'] * 10**6

# Print the updated dataframe
print(df)

After running this code, the dataframe will have the adjusted values in the ht column where the age column is one.

CodePudding user response:

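Assigning through back-to-back square brackets (][) is rarely a good idea in pandas: the first pair of brackets can return a copy, so the assignment may never reach the original dataframe. That is what the SettingWithCopyWarning is telling you.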
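To make the point about chained brackets concrete, here is a minimal sketch that runs both forms on a small made-up frame (the four rows are an assumption, not the asker's full dataset). The warning is silenced only so the comparison prints cleanly.

```python
import warnings

import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    'age': ['one', 'two', 'one', 'two'],
    'ht': [2.8, 3.4, 2.9, 3.8],
})
before = df['ht'].copy()

# Chained form: df[mask] builds a new object, and the second ['ht'] = ...
# assigns into that temporary object, not into df.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the chained-assignment warning
    df[df['age'] == 'one']['ht'] = df[df['age'] == 'one']['ht'] * 10**6
chained_changed_df = not df['ht'].equals(before)

# Single .loc call: rows and column are selected in one indexing operation,
# so the assignment is applied to df itself.
df.loc[df['age'] == 'one', 'ht'] = df.loc[df['age'] == 'one', 'ht'] * 10**6
loc_changed_df = not df['ht'].equals(before)

print(chained_changed_df, loc_changed_df)
```

The first flag reports whether the chained assignment altered df, the second whether the loc assignment did.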

I suggest using pandas.DataFrame.loc with boolean indexing:

df.loc[df['age'].eq('one'), 'ht'] = df['ht']*10**6

# Output :

print(df.sample(5))

    famid  birth  age         ht
4       1      3  one  2200000.0
3       1      2  two        3.8
7       2      1  two        3.2
16      3      3  one  2100000.0
6       2      1  one  2000000.0
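The right-hand side in this answer is the whole ht column, not just the selected rows. That works because loc lines the values up by index label before assigning. A minimal sketch on a small made-up frame (an assumption for illustration):

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    'age': ['one', 'two', 'one', 'two'],
    'ht': [2.8, 3.4, 2.9, 3.8],
})
mask = df['age'].eq('one')

# The right-hand side covers every row, including the 'two' rows.
scaled = df['ht'] * 10**6

# .loc matches the values in `scaled` to the selected rows by index label,
# so only rows 0 and 2 receive new values; rows 1 and 3 are left alone.
df.loc[mask, 'ht'] = scaled

print(df)
```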

CodePudding user response:

To perform calculations on a subset of a column in a pandas dataframe, you can use the .loc method to select the subset of the dataframe and then apply the calculation to that subset. For example, to multiply the ht values for rows where age is equal to one by 10^6, you could use the following code:

df.loc[df['age']=='one', 'ht'] = df.loc[df['age']=='one', 'ht'] * 10**6

This code selects the subset of rows where age is one using df.loc[df['age']=='one'], and then applies the multiplication operation to the ht column in that subset using df.loc[df['age']=='one', 'ht'] * 10**6.

You can also use the .loc method to perform calculations on multiple columns in the subset. For example, if you wanted to multiply the ht values by 10^6 and then add 1000 to the birth values for rows where age is one, you could use the following code:

df.loc[df['age']=='one', ['ht', 'birth']] = df.loc[df['age']=='one', ['ht', 'birth']].mul([10**6, 1]).add([0, 1000])
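The two adjustments described above (ht times 10^6, birth plus 1000) can also be written one column at a time, which may be easier to read. A minimal sketch on a small made-up frame, with the mask stored in a variable so it is computed once:

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    'birth': [1, 1, 2, 2],
    'age': ['one', 'two', 'one', 'two'],
    'ht': [2.8, 3.4, 2.9, 3.8],
})
mask = df['age'] == 'one'

# Each column gets its own update; both reuse the same row selection.
df.loc[mask, 'ht'] *= 10**6
df.loc[mask, 'birth'] += 1000

print(df)
```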

CodePudding user response:

To adjust values in a subset of a DataFrame, you can use the loc method and select the rows you want to adjust using a boolean mask. The syntax for modifying values using loc is as follows:

df.loc[mask, column] = new_value

where mask is a boolean array that specifies which rows to adjust, column is the name of the column to adjust, and new_value is the value to assign to the selected rows.

In your case, you want to adjust the ht column for rows where the age column is equal to "one", so your mask would be df['age'] == 'one'. To modify the values in the ht column, you would use the following code:

df.loc[df['age'] == 'one', 'ht'] = df[df['age'] == 'one']['ht'] * 10**6

This code will select the rows where age is equal to "one", and then set the values in the ht column to their current value times 10^6.
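In the general pattern, new_value can be a single scalar or a Series of per-row values. The sketch below shows both on a small made-up frame; the names zeroed and scaled are only for illustration.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    'age': ['one', 'two', 'one', 'two'],
    'ht': [2.8, 3.4, 2.9, 3.8],
})
mask = df['age'] == 'one'   # boolean Series, True for the rows to change

# Scalar new_value: every selected row gets the same number.
zeroed = df.copy()
zeroed.loc[mask, 'ht'] = 0.0

# Series new_value: each selected row gets its own computed number.
scaled = df.copy()
scaled.loc[mask, 'ht'] = scaled.loc[mask, 'ht'] * 10**6

print(zeroed)
print(scaled)
```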

Alternatively, you can use the isin method to create your boolean mask, which can make the code more readable. The isin method returns a boolean array where each element is True if the corresponding value in the column is in the given list of values, and False otherwise. So, to create a mask for rows where the age column is equal to "one", you can use the following code:

mask = df['age'].isin(['one'])

Then you can use this mask with the loc method to modify the values in the ht column:

df.loc[mask, 'ht'] = df[mask]['ht'] * 10**6
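The isin form is most useful when several categories should be adjusted together. A minimal sketch, assuming a hypothetical third category 'three' that does not exist in the original data:

```python
import pandas as pd

# 'three' is a made-up category, added only to show a multi-value mask.
df = pd.DataFrame({
    'age': ['one', 'two', 'three', 'one'],
    'ht': [2.8, 3.4, 2.9, 2.2],
})

# True wherever age is any of the listed values.
mask = df['age'].isin(['one', 'three'])
df.loc[mask, 'ht'] = df.loc[mask, 'ht'] * 10**6

print(df)
```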

I hope this helps! Let me know if you have any other questions.

CodePudding user response:

Let's try this:

df.loc[df['age'] == 'one', 'ht'] *= 10**6

Output:

    famid  birth  age         ht
0       1      1  one  2800000.0
1       1      1  two        3.4
2       1      2  one  2900000.0
3       1      2  two        3.8
4       1      3  one  2200000.0
5       1      3  two        2.9
6       2      1  one  2000000.0
7       2      1  two        3.2
8       2      2  one  1800000.0
9       2      2  two        2.8
10      2      3  one  1900000.0
11      2      3  two        2.4
12      3      1  one  2200000.0
13      3      1  two        3.3
14      3      2  one  2300000.0
15      3      2  two        3.4
16      3      3  one  2100000.0
17      3      3  two        2.9

CodePudding user response:

Here is a way:

df.assign(ht = df['ht'].mask(df['age'].isin(['one']),df['ht'].mul(10**6)))

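By using isin(), more values from the age column can be added to the list. Note that assign returns a new dataframe, so assign the result back to df if you want to keep the change.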

Output:

    famid  birth  age         ht
0       1      1  one  2800000.0
1       1      1  two        3.4
2       1      2  one  2900000.0
3       1      2  two        3.8
4       1      3  one  2200000.0
5       1      3  two        2.9
6       2      1  one  2000000.0
7       2      1  two        3.2
8       2      2  one  1800000.0
9       2      2  two        2.8
10      2      3  one  1900000.0
11      2      3  two        2.4
12      3      1  one  2200000.0
13      3      1  two        3.3
14      3      2  one  2300000.0
15      3      2  two        3.4
16      3      3  one  2100000.0
17      3      3  two        2.9
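Since the question mentions trying both df.mask() and df.where() without success, here is a minimal sketch of how the two relate on a small made-up frame. Neither changes df by itself; the result has to be stored.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    'age': ['one', 'two', 'one', 'two'],
    'ht': [2.8, 3.4, 2.9, 3.8],
})
is_one = df['age'].isin(['one'])

# mask replaces values where the condition is True.
via_mask = df['ht'].mask(is_one, df['ht'].mul(10**6))
# where keeps values where the condition is True, so the condition is negated.
via_where = df['ht'].where(~is_one, df['ht'].mul(10**6))

# assign returns a new DataFrame; df is untouched until the result is stored.
result = df.assign(ht=via_mask)
unchanged = df['ht'].tolist()

df = result
print(df)
```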