I have 2 CSV files in file1 I have list of research groups names. in file2 I have list of the Research full name with location as wall. I want to join these 2 csv file if the have the words matches in them.
Pandas ValueError: "Columns must be same length as key" I am using Jupyter Labs for this.
"df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)"
cvs row size for file1.csv 5000 data, and for file2.csv I have about 15,000
file1.csv
research_groups_names_f1 |
---|
Chinese Academy of Sciences (CAS) |
CAS |
U-M |
UQ |
University of California, Los Angeles |
Harvard University |
file2.csv
research_groups_names_f2 | Locatio_f2 |
---|---|
Chinese Academy of Sciences (CAS) | China |
University of Michigan (U-M) | USA |
The University of Queensland (UQ) | USA |
University of California | USA |
file_output.csv
research_groups_names_f1 | research_groups_names_f2 | Locatio_f2 |
---|---|---|
Chinese Academy of Sciences | Chinese Academy of Sciences(CAS) | China |
CAS | Chinese Academy of Sciences (CAS) | China |
U-M | University of Michigan (U-M) | USA |
UQ | The University of Queensland (UQ) | Australia |
Harvard University | Not found | USA |
University of California, Los Angeles | University of California | USA |
import pandas as pd
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file1.csv')
df1 = df1.add_prefix('f1_')
df2 = df2.add_prefix('f2_')
def fn(row):
for _, n in df2.iterrows():
if (
n["research_groups_names_f1"] == row["research_groups_names_f2"]
or row["research_groups_names_f1"] in n["research_groups_names_f2"]
):
return n
df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
df1 = df1.rename(columns={"research_groups_names": "f1_research_groups_names"})
print(df1)
CodePudding user response:
The issue here is that you're trying to merge on some very different values. Fuzzy matching may not help because the distance between CAS and Chinese Academy of Sciences (CAS) is quite large. The two have very little in common. You'll have to develop some custom approach based on your understanding of what the possible group names could be. Here is on approach that gets you most of the way there.
The idea here is to match on the university name OR the abbreviation. So in df2 we can split off the abbreviation and explode into a new row, remove the parenthesis, and in df remove any abbreviation surrounded by parentehsis.
The only leftover value is UCLA, which is the only sample that doesn't follow the same structure as the others. In this case fuzzy matching like I mentioned in my first comment probably would help.
import pandas as pd
df = pd.DataFrame({'research_groups_names_f1':[
'Chinese Academy of Sciences (CAS)',
'CAS',
'U-M',
'UQ',
'University of California, Los Angeles',
'Harvard University']})
df2 = pd.DataFrame({'research_groups_names_f2': ['Chinese Academy of Sciences (CAS)',
'University of Michigan (U-M)',
'The University of Queensland (UQ)',
'University of California'],
'Locatio_f2': ['China', 'USA', 'USA', 'USA']})
df2['key'] = df2['research_groups_names_f2'].str.split('\(')
df2 = df2.explode('key')
df2['key'] = df2['key'].str.replace('\(|\)','', regex=True)
df['key'] = df['research_groups_names_f1'].str.replace('\(.*\)','',regex=True)
df.merge(df2, on='key', how='left').drop(columns='key')
Output
research_groups_names_f1 research_groups_names_f2 Locatio_f2
0 Chinese Academy of Sciences (CAS) Chinese Academy of Sciences (CAS) China
1 CAS Chinese Academy of Sciences (CAS) China
2 U-M University of Michigan (U-M) USA
3 UQ The University of Queensland (UQ) USA
4 University of California, Los Angeles NaN NaN
5 Harvard University NaN NaN
CodePudding user response:
You should consider joining tables using pandas' .join
method.
The documentation and examples can be found here
EDIT:
I didn't see that your columns don't have exact matches;
Maybe you should try str.contains()
which can be found here
If str.contains()
does not help, you may try making Levenstein distance-based string comparison.