Compare two dataframes to guess gender (Python)

Time:02-22

I've been stuck on a dataframe problem for days now, and I hope you will be able to help me!

I have two dataframes:

  • One named dfCombin, which contains first names (and a lot of other information that is not relevant here);
  • The other, named dfPrenomsFR, which contains first names and their gender, and which I got from the French government website.

Here is a sample of dfPrenomsFR:

   Prenom   Genre
0   Aaliyah F
1   Aapeli  M
2   Aapo    M
3   Aaren   M,f
4   Aarne   M
... ... ...
11622   Zvi M
11623   Zvonimir    M
11624   Zvonimira   F
11625   Zvonko  M
11626   Zygmunt M

And here is a sample of dfCombin:

     TITLE              NAME AUTHOR 1    FIRST NAME AUTHOR 1    
0   Accident majeur     Julliard         Jean-François  
1   J'accuse...         Dytar            Jean   
2   Les Mémés farouches Frécon           Sylvain    

My goal is to see if the first names in dfCombin (in the column 'FIRST NAME AUTHOR 1') are present in the column 'Prenom' of dfPrenomsFR.

If so, I would like to create a new column labelled 'GenreAuteur' in dfCombin that takes the gender of this first name from the column 'Genre' of dfPrenomsFR (which can be 'M', 'F' or 'M,f').

I would also like to fill the other rows of 'GenreAuteur' (those with no gender information) with "NA".

Thank you for your help!

CodePudding user response:

Use:

import pandas as pd

df = pd.DataFrame({'name':['Aaliyah', 'Aapeli', 'Aapo'], 'g':['F','M','M']})
dfCombin = pd.DataFrame({'fna':['Sylvain','Aaliyah']})
# Left merge keeps every row of dfCombin; names missing from df get NaN in 'g'
dfCombin.merge(df, left_on='fna', right_on='name', how='left')[['fna','g']]

output:

       fna    g
0  Sylvain  NaN
1  Aaliyah    F
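
Applied to the column names from the question, the same left merge could look like the sketch below. The sample rows are made up for illustration (the question does not show whether 'Jean' or 'Sylvain' are in dfPrenomsFR), and the helper name add_author_gender is hypothetical. It assumes one gender value per first name, so duplicate first names are dropped before merging.

```python
import pandas as pd

def add_author_gender(combin, prenoms):
    """Return a copy of `combin` with a 'GenreAuteur' column looked up in `prenoms`."""
    # Keep one row per first name so the left merge cannot duplicate authors
    lookup = prenoms.drop_duplicates("Prenom").rename(
        columns={"Prenom": "FIRST NAME AUTHOR 1", "Genre": "GenreAuteur"}
    )
    merged = combin.merge(
        lookup[["FIRST NAME AUTHOR 1", "GenreAuteur"]],
        on="FIRST NAME AUTHOR 1",
        how="left",
    )
    # Names without a match come back as NaN; replace them with the string "NA"
    merged["GenreAuteur"] = merged["GenreAuteur"].fillna("NA")
    return merged

# Illustrative data only, not the real government list
dfPrenomsFR = pd.DataFrame({"Prenom": ["Jean", "Sylvain", "Aaren"],
                            "Genre": ["M", "M", "M,f"]})
dfCombin = pd.DataFrame({
    "TITLE": ["Accident majeur", "J'accuse...", "Les Mémés farouches"],
    "NAME AUTHOR 1": ["Julliard", "Dytar", "Frécon"],
    "FIRST NAME AUTHOR 1": ["Jean-François", "Jean", "Sylvain"],
})

result = add_author_gender(dfCombin, dfPrenomsFR)
print(result[["FIRST NAME AUTHOR 1", "GenreAuteur"]])
```

Because the match is exact, a compound name such as 'Jean-François' only gets a gender if that exact spelling is in dfPrenomsFR.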

CodePudding user response:

You could do the following to find the names in dfPrenomsFR that are also in dfCombin:

import numpy as np
intersection = np.intersect1d(dfPrenomsFR.Prenom, dfCombin['FIRST NAME AUTHOR 1'])

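Then select the gender of these names and assign it to a new column. The lookup is done by first name, because assigning a slice of dfPrenomsFR directly would align on the row index rather than on the name:

matched = dfPrenomsFR[dfPrenomsFR.Prenom.isin(intersection)].drop_duplicates('Prenom').set_index('Prenom')['Genre']
dfCombin['GenreAuteur'] = dfCombin['FIRST NAME AUTHOR 1'].map(matched).fillna('NA')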
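
For reference, here is a small self-contained sketch of a name-based lookup with Series.map, which fills in the gender per first name and writes "NA" where no match exists. The data is invented for illustration, and stripping whitespace around the names is an assumption (the sample in the question appears to have trailing spaces, which may just be display padding).

```python
import pandas as pd

# Illustrative data only; the real frames come from the question
dfPrenomsFR = pd.DataFrame({
    "Prenom": ["Aaliyah", "Aaren", "Jean", "Sylvain"],
    "Genre": ["F", "M,f", "M", "M"],
})
dfCombin = pd.DataFrame({
    "FIRST NAME AUTHOR 1": ["Jean-François", "Jean ", "Sylvain", "Aaren"],
})

# Build a first-name -> gender lookup; drop duplicates so that
# each name maps to exactly one value
genre_by_name = dfPrenomsFR.drop_duplicates("Prenom").set_index("Prenom")["Genre"]

# Assumption: stray whitespace around names should be ignored
first_names = dfCombin["FIRST NAME AUTHOR 1"].str.strip()

# Exact-match lookup; names absent from the list get the string "NA"
dfCombin["GenreAuteur"] = first_names.map(genre_by_name).fillna("NA")
print(dfCombin)
```

Unlike a merge, this keeps dfCombin's rows and index untouched and only adds the new column.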