Go through every row in a dataframe, search for this values in a second dataframe, if it matches, ge-CodePudding

I have two dataframes:

Researchers: a list of all researcher and their id_number
Samples: a list of samples and all researchers related to it, there may be several researchers in the same cell.

I want to go through every row in the researcher table and check if they occur in each row of the Table Samples. If they do I want to get: a) their id from the researcher table and the sample number from the Samples table.

Table researcher

   id_researcher             full_name
0               1         Jack Sparrow
1               2           Demi moore
2               3              Bickman
3               4       Charles Darwin
4               5            H. Haffer

Table samples

     sample_number                            collector
230  INPA A 231                                  Haffer
231  INPA A 232                          Charles Darwin
232  INPA A 233                                     NaN
233  INPA A 234                                     NaN
234  INPA A 235      Jack Sparrow; Demi Moore ; Bickman

Output I want:

            id_researcher     num_samples
0               5             INPA A 231
1               4             INPA A 232
2               1             INPA A 235
3               2             INPA A 235
4               3             INPA A 235

I was able to it with a loop in regular python with the following code, but it is extremely low and quite long. Does anyone know a faster and simpler way? perhaps with pandas apply?

id_researcher = []
id_num_sample = []
for c in range(len(data_researcher)):
    for a in range(len(data_samples)):
        if pd.isna(data_samples['collector'].iloc[a]) == False and data_researcher['full_name'].iloc[c] in data_samples['collector'].iloc[a]:
                    id_researcher.append(data_researcher['id_researcher'].iloc[c])
                    id_num_sample.append(data_samples['No TEC'].iloc[a])
    
data_researcher_sample = pd.DataFrame.from_dict({'id_pesq': id_researcher, 'num_sample': id_num_sample}).sort_values(by='num_amostra')

CodePudding user response：

You have a few data cleaning job to do such as 'Moore' in lowercase, 'Haffer' with first name initials in one case and none in the other, etc. After normalizing your two dataframes, you can split and explode collections and use merge:

samples['collector'] = samples['collector'].str.split(';')
samples = samples.explode('collector')
samples['collector'] = samples['collector'].str.strip()
out = researchers.merge(samples, right_on='collector', left_on='full_name', how='left')[['id_researcher','sample_number']].sort_values(by='sample_number').reset_index(drop=True)

Output:

   id_researcher sample_number
0              5    INPA A 231
1              4    INPA A 232
2              1    INPA A 235
3              2    INPA A 235
4              3    INPA A 235