Columns must be same length as key Error python-CodePudding

I have 2 CSV files in file1 I have list of research groups names. in file2 I have list of the Research full name with location as wall. I want to join these 2 csv file if the have the words matches in them.

Pandas ValueError: "Columns must be same length as key" I am using Jupyter Labs for this.

"df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)"

cvs row size for file1.csv 5000 data, and for file2.csv I have about 15,000

file1.csv

research_groups_names_f1
Chinese Academy of Sciences (CAS)
CAS
U-M
UQ
University of California, Los Angeles
Harvard University

file2.csv

research_groups_names_f2	Locatio_f2
Chinese Academy of Sciences (CAS)	China
University of Michigan (U-M)	USA
The University of Queensland (UQ)	USA
University of California	USA

file_output.csv

research_groups_names_f1	research_groups_names_f2	Locatio_f2
Chinese Academy of Sciences	Chinese Academy of Sciences(CAS)	China
CAS	Chinese Academy of Sciences (CAS)	China
U-M	University of Michigan (U-M)	USA
UQ	The University of Queensland (UQ)	Australia
Harvard University	Not found	USA
University of California, Los Angeles	University of California	USA

import pandas as pd
    df1 = pd.read_csv('file1.csv')
    
    df2 = pd.read_csv('file1.csv')

    df1 = df1.add_prefix('f1_')
    df2 = df2.add_prefix('f2_')

def fn(row):
    for _, n in df2.iterrows():
        if (
            n["research_groups_names_f1"] == row["research_groups_names_f2"]
            or row["research_groups_names_f1"] in n["research_groups_names_f2"]
        ):
            return n


df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
df1 = df1.rename(columns={"research_groups_names": "f1_research_groups_names"})
print(df1)

CodePudding user response：

The issue here is that you're trying to merge on some very different values. Fuzzy matching may not help because the distance between CAS and Chinese Academy of Sciences (CAS) is quite large. The two have very little in common. You'll have to develop some custom approach based on your understanding of what the possible group names could be. Here is on approach that gets you most of the way there.

The idea here is to match on the university name OR the abbreviation. So in df2 we can split off the abbreviation and explode into a new row, remove the parenthesis, and in df remove any abbreviation surrounded by parentehsis.

The only leftover value is UCLA, which is the only sample that doesn't follow the same structure as the others. In this case fuzzy matching like I mentioned in my first comment probably would help.

import pandas as pd
df = pd.DataFrame({'research_groups_names_f1':[
'Chinese Academy of Sciences (CAS)',
'CAS',
'U-M',
'UQ',
'University of California, Los Angeles',
'Harvard University']})

df2 = pd.DataFrame({'research_groups_names_f2': ['Chinese Academy of Sciences (CAS)',
  'University of Michigan (U-M)',
  'The University of Queensland (UQ)',
  'University of California'],
 'Locatio_f2': ['China', 'USA', 'USA', 'USA']})



df2['key'] = df2['research_groups_names_f2'].str.split('\(')
df2 = df2.explode('key')
df2['key'] = df2['key'].str.replace('\(|\)','', regex=True)

df['key'] = df['research_groups_names_f1'].str.replace('\(.*\)','',regex=True)

df.merge(df2, on='key', how='left').drop(columns='key')

Output

               research_groups_names_f1           research_groups_names_f2 Locatio_f2
0      Chinese Academy of Sciences (CAS)  Chinese Academy of Sciences (CAS)      China
1                                    CAS  Chinese Academy of Sciences (CAS)      China
2                                    U-M       University of Michigan (U-M)        USA
3                                     UQ  The University of Queensland (UQ)        USA
4  University of California, Los Angeles                                NaN        NaN
5                     Harvard University                                NaN        NaN

CodePudding user response：

You should consider joining tables using pandas' .join method.

The documentation and examples can be found here

EDIT:
I didn't see that your columns don't have exact matches; Maybe you should try str.contains() which can be found here

If str.contains() does not help, you may try making Levenstein distance-based string comparison.