Python & Pandas - merging csv's based on string search-CodePudding

With Python I'm seeking to create a script that compares data in two different csvs. The first csv, filedata.csv, contains strings of filepaths containing information on user names and user ids. The second csv, roster.csv, contains those same fields broken up into different columns. I would like to search through the filepath string in filedata.csv for matches in roster.csv, and then write the columns from roster.csv into filedata.csv. Below are the csv structures, and the desired output.

filedata.csv

filename
C:\johndoe_0001_paper1.doc
C:\janedoe_0002_paper2.doc
C:\johnsmith_0003_paper3.pdf

roster.csv

first_name, last_name, user_id
john, doe, 0001
jane, doe, 0002
john, smith, 0003

Desired output for filedata.csv:

filename, first_name, last_name, user_id
C:\johndoe_0001_paper1.doc, john, doe, 0001
C:\janedoe_0002_paper2.doc, jane, doe, 0002
C:\johnsmith_0003_paper3.pdf, john, smith, 0003

I attempted the following code with Pandas to see if I can search through the strings in filenames.csv for matches from roster.csv:

import pandas as pd

df = pd.read_csv('filenames.csv')
filenames = str(df['filename'])

roster = pd.read_csv('roster.csv')
roster_last_name = str(roster['last_name'])
roster_first_name = str(roster['first_name'])
roster_user_id = str(roster['user_id'])

print(df.loc([filenames]).str.contains([roster_last_name]))

But get the following error:

TypeError: unhashable type: 'list'

Likewise I've tried something simpler, but with no success, as "False" is always returned:

if roster_last_name in filenames:
    print("True")
else:
    print("False")

I'm sure I'm missing something simple, but unsure how to proceed. All suggestions are greatly appreciated.

CodePudding user response：

filename['user_id'] = filename['filename'].str.extract(r'(\d{4})')
new_df = filename.merge(roster, on='user_id')

This solution adds a column to filename that is the four-digit ID (as a string) extracted from the filename, and then merges rows from the two dataframes where the user id is the same.

Your solution does not work because expressions like str(roster['last_name']) take a series and returns one string.

Please let me know if this is not what was wanted.

CodePudding user response：

With the following (df1 from filenames.csv and df2 from roster.csv):

for i in df1.index:
    for c in df2.columns:
        df1.loc[df1.filename.str.contains(df2.loc[i, "last_name"] and df2.loc[i, "user_id"].astype(str)), c] = df2.loc[i, c]

This checks for last_name and user_id, because jane and john both have doe as last_name. This gives you the following:

|    | filename                     | first_name   | last_name   |   user_id |
|---:|:-----------------------------|:-------------|:------------|----------:|
|  0 | C:\johndoe_0001_paper1.doc   | john         | doe         |         1 |
|  1 | C:\janedoe_0002_paper2.doc   | jane         | doe         |         2 |
|  2 | C:\johnsmith_0003_paper3.pdf | john         | smith       |         3 |