Proper way to do this in pandas without using for loop


I would like to avoid iterrows here.

From my dataframe I want to create a new column "unique": the first time a particular pair of "a" and "b" column values appears, I give it a value "uniqueN", and then every later occurrence of the exact same "a" and "b" pair gets that same "uniqueN".

In this case

  • "1", "3" (the first row) from "a" and "b" is the first unique pair, so I give it the value "unique1", and the seventh row will also have the same value, "unique1", as it is also "1", "3".

  • "2", "2" (the second row) is the next unique "a", "b" pair, so I give it "unique2", and the eighth row also has "2", "2", so that will also have "unique2".

  • "3", "1" (third row) is the next unique pair, so "unique3"; no other row in the df is "3", "1", so that value won't repeat.

  • and so on

I have working code that uses loops, but this is not the pandas way. Can anyone suggest how I can do this using pandas functions?

Expected Output (My code works, but it's not using pandas methods)

   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2

Code

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

c = 1
seen = {}
for i, j in df.iterrows():
    j = tuple(j)
    if j not in seen:
        seen[j] = 'unique' + str(c)
        c += 1

for key, value in seen.items():
    df.loc[(df.a == key[0]) & (df.b == key[1]), 'unique'] = value

CodePudding user response:

Let's use groupby's ngroup with sort=False to ensure group numbers are assigned in order of first appearance, add 1 so the numbers start at one, then convert to string with astype so we can prepend the prefix unique:

df['unique'] = 'unique' + \
               df.groupby(['a', 'b'], sort=False).ngroup().add(1).astype(str)

Or with map and format instead of converting and concatenating:

df['unique'] = (
    df.groupby(['a', 'b'], sort=False).ngroup()
        .add(1)
        .map('unique{}'.format)
)
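To see what the intermediate step produces before the prefix is added, here is a minimal sketch (using the same data as the question's setup): ngroup numbers each distinct ("a", "b") pair, and sort=False keeps those numbers in order of first appearance.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2],
                   'b': [3, 2, 1, 2, 3, 2, 3, 2]})

# Each distinct (a, b) pair gets the next integer, in order of appearance
codes = df.groupby(['a', 'b'], sort=False).ngroup()
print(codes.tolist())  # [0, 1, 2, 3, 4, 3, 0, 1]
```

Repeated pairs like (4, 2) and (1, 3) reuse the code from their first occurrence, which is exactly the behaviour the question asks for.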

df:

   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2

Setup:

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]
})

CodePudding user response:

I came up with a slightly different solution. I'll add this for posterity, but the groupby answer is superior.

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

print(df)
# Keep only the first occurrence of each (a, b) pair
df1 = df[~df.duplicated()].copy()
print(df1)
# Use the row index of each first occurrence as the group label
df1['unique'] = df1.index
print(df1)
# Merge the labels back onto every matching (a, b) row
df2 = df.merge(df1, how='left')
print(df2)
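Also for posterity, a similar order-of-appearance numbering can be sketched with pd.factorize over the row tuples; this is an assumption-level alternative, not part of either answer above, but it yields the same "uniqueN" labels as the groupby approach.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2],
                   'b': [3, 2, 1, 2, 3, 2, 3, 2]})

# factorize assigns integer codes in order of first appearance,
# much like ngroup(sort=False) on a groupby of both columns
codes, _ = pd.factorize(list(zip(df['a'], df['b'])))
df['unique'] = ['unique' + str(c + 1) for c in codes]
print(df['unique'].tolist())
```

This avoids both the loop and the groupby, at the cost of materialising the list of row tuples.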