I have two dataframes:
Researchers: a list of all researcher and their id_number
Samples: a list of samples and all researchers related to it, there may be several researchers in the same cell.
I want to go through every row in the researcher table and check if they occur in each row of the Table Samples. If they do I want to get: a) their id from the researcher table and the sample number from the Samples table.
Table researcher
id_researcher full_name
0 1 Jack Sparrow
1 2 Demi moore
2 3 Bickman
3 4 Charles Darwin
4 5 H. Haffer
Table samples
sample_number collector
230 INPA A 231 Haffer
231 INPA A 232 Charles Darwin
232 INPA A 233 NaN
233 INPA A 234 NaN
234 INPA A 235 Jack Sparrow; Demi Moore ; Bickman
Output I want:
id_researcher num_samples
0 5 INPA A 231
1 4 INPA A 232
2 1 INPA A 235
3 2 INPA A 235
4 3 INPA A 235
I was able to it with a loop in regular python with the following code, but it is extremely low and quite long. Does anyone know a faster and simpler way? perhaps with pandas apply?
id_researcher = []
id_num_sample = []
for c in range(len(data_researcher)):
for a in range(len(data_samples)):
if pd.isna(data_samples['collector'].iloc[a]) == False and data_researcher['full_name'].iloc[c] in data_samples['collector'].iloc[a]:
id_researcher.append(data_researcher['id_researcher'].iloc[c])
id_num_sample.append(data_samples['No TEC'].iloc[a])
data_researcher_sample = pd.DataFrame.from_dict({'id_pesq': id_researcher, 'num_sample': id_num_sample}).sort_values(by='num_amostra')
CodePudding user response:
You have a few data cleaning job to do such as 'Moore' in lowercase, 'Haffer' with first name initials in one case and none in the other, etc. After normalizing your two dataframes, you can split
and explode
collections
and use merge
:
samples['collector'] = samples['collector'].str.split(';')
samples = samples.explode('collector')
samples['collector'] = samples['collector'].str.strip()
out = researchers.merge(samples, right_on='collector', left_on='full_name', how='left')[['id_researcher','sample_number']].sort_values(by='sample_number').reset_index(drop=True)
Output:
id_researcher sample_number
0 5 INPA A 231
1 4 INPA A 232
2 1 INPA A 235
3 2 INPA A 235
4 3 INPA A 235