Home > Software design >  a left join using pandas is populating the data twice for the same row
a left join using pandas is populating the data twice for the same row

Time:11-11

I am trying to use pandas for a data analysis, used merge for performing a vlookup, the two data sets are as below,

data1 =
acc_name    tier   content   group  gcode     acc_ID
abc          3       55        b     111      R-DDD
def          4       45        c     222      X-TTT
xyz          4       60        a     333      S-UUU
abc          4       4         b     112      R-DDD
xyz          4       6         a     331      X-TTT
def          4       10        c     221      S-UUU
data2=
Accountn   type        status
xyz        internal     Active         
def        external     Active
abc        internal     Inactive

the code I used is

data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
data1.rename(columns={'acc_name':'Accountn'},inplace = True)
final = pd.merge(data1,data2[['Accountn','status']],on=['Accountn'],how='left')

The Output I am getting is:

final =
Accountn    tier   content   group  gcode     acc_ID    status
abc          3       55        b     111      R-DDD     Inactive
abc          3       55        b     111      R-DDD     Inactive
abc          3       55        b     111      R-DDD     Inactive
def          4       45        c     222      X-TTT     Active
def          4       45        c     222      X-TTT     Active
def          4       45        c     222      X-TTT     Active
xyz          4       60        a     333      S-UUU     Active
xyz          4       60        a     333      S-UUU     Active
xyz          4       60        a     333      S-UUU     Active
abc          4       4         b     112      R-DDD     Inactive
abc          4       4         b     112      R-DDD     Inactive
abc          4       4         b     112      R-DDD     Inactive
xyz          4       6         a     331      X-TTT     Active
xyz          4       6         a     331      X-TTT     Active
xyz          4       6         a     331      X-TTT     Active
def          4       10        c     221      S-UUU     Active
def          4       10        c     221      S-UUU     Active
def          4       10        c     221      S-UUU     Active

The O/P I want is,

Accountn    tier   content   group  gcode     acc_ID    status
abc          3       55        b     111      R-DDD     Inactive
def          4       45        c     222      X-TTT     Active
xyz          4       60        a     333      S-UUU     Active
abc          4       4         b     112      R-DDD     Inactive
xyz          4       6         a     331      X-TTT     Active
def          4       10        c     221      S-UUU     Active

I dont know what is wrong in my code

CodePudding user response:

Remove duplicates in df2 by column for join, here Accountn, so no duplicates in output:

final = pd.merge(data1,data2[['Accountn','status']].drop_duplicates('Accountn'),
                 on=['Accountn'],
                 how='left')
  • Related