Home > Software design >  Pandas compare one column values to another column to get new column
Pandas compare one column values to another column to get new column

Time:03-10

I have a pandas dataframe like so:

id1  id2  name
11   23   abc xyz 2
22   24   def pqr 5
33   22   qaz wsx 55
44   55   qwe rty 43
55   29   asd zxc 7

Now I want to create a new column which will contain name[0] of ith row, where id1[i] == id2[j]. So in the end I will have:

id1  id2  name        identifier
11   23   abc xyz 2   
22   24   def pqr 5   def
33   22   qaz wsx 55  def
44   55   qwe rty 43  asd
55   29   asd zxc 7   asd

So far, I tried this to get the required id:

df['identifier'] = ''
for index, row in df.iterrows():
    j = df.index[df['id1'] == row['id2']]

Not sure how to proceed further.

CodePudding user response:

IIUC, you could do:

# extract first word and set id1 as name
s = df.set_index('id1')['name'].str.extract('^(\S )', expand=False)
# compute the identifiers
s2 = df['id2'].map(s).fillna(df['id1'].map(s))
# mask the unique identifiers
mask = s2.groupby(s2).transform('size').gt(1)

df2 = df.assign(identifier=s2[mask])

output:

   id1  id2        name identifier
0   11   23   abc xyz 2        NaN
1   22   24   def pqr 5        def
2   33   22  qaz wsx 55        def
3   44   55  qwe rty 43        asd
4   55   29   asd zxc 7        asd

CodePudding user response:

I was able to solve it like so:

df['identifier'] = ''
for index, row in df.iterrows():
    try:
        j = df.index[df['id1'] == row['id2']].to_list()[0]
        df.at[j, 'identifier'] = df.iloc[j]['FamilyName'].split(" ")[0]
    except:
        pass

CodePudding user response:

I create a new column by using loc() and use this conditional statement df['id1'] == df['id2'] on "name" column, and create a new called 'identifier' and invoke pandas.Series.str.split method to separate strings (by each whitespace):

df['identifier']=df.loc[(df['id1']==df['id2']),'name'].str.split()

Now, is probably that will appear some NaN values on this new column, then I replace to ' ', this allows to indexing first value on the string list

df['identifier'].fillna(' ', inplace=True)
df['identifier']=df['identifier'].apply(lambda x: x[0])
  • Related