With Python I'm seeking to create a script that compares data in two different csvs. The first csv, filedata.csv
, contains strings of filepaths containing information on user names and user ids. The second csv, roster.csv
, contains those same fields broken up into different columns. I would like to search through the filepath string in filedata.csv
for matches in roster.csv
, and then write the columns from roster.csv
into filedata.csv
. Below are the csv structures, and the desired output.
filedata.csv
filename
C:\johndoe_0001_paper1.doc
C:\janedoe_0002_paper2.doc
C:\johnsmith_0003_paper3.pdf
roster.csv
first_name, last_name, user_id
john, doe, 0001
jane, doe, 0002
john, smith, 0003
Desired output for filedata.csv
:
filename, first_name, last_name, user_id
C:\johndoe_0001_paper1.doc, john, doe, 0001
C:\janedoe_0002_paper2.doc, jane, doe, 0002
C:\johnsmith_0003_paper3.pdf, john, smith, 0003
I attempted the following code with Pandas to see if I can search through the strings in filenames.csv
for matches from roster.csv
:
import pandas as pd
df = pd.read_csv('filenames.csv')
filenames = str(df['filename'])
roster = pd.read_csv('roster.csv')
roster_last_name = str(roster['last_name'])
roster_first_name = str(roster['first_name'])
roster_user_id = str(roster['user_id'])
print(df.loc([filenames]).str.contains([roster_last_name]))
But get the following error:
TypeError: unhashable type: 'list'
Likewise I've tried something simpler, but with no success, as "False" is always returned:
if roster_last_name in filenames:
print("True")
else:
print("False")
I'm sure I'm missing something simple, but unsure how to proceed. All suggestions are greatly appreciated.
CodePudding user response:
filename['user_id'] = filename['filename'].str.extract(r'(\d{4})')
new_df = filename.merge(roster, on='user_id')
This solution adds a column to filename
that is the four-digit ID (as a string) extracted from the filename, and then merges rows from the two dataframes where the user id is the same.
Your solution does not work because expressions like str(roster['last_name'])
take a series and returns one string.
Please let me know if this is not what was wanted.
CodePudding user response:
With the following (df1
from filenames.csv
and df2
from roster.csv
):
for i in df1.index:
for c in df2.columns:
df1.loc[df1.filename.str.contains(df2.loc[i, "last_name"] and df2.loc[i, "user_id"].astype(str)), c] = df2.loc[i, c]
This checks for last_name
and user_id
, because jane
and john
both have doe
as last_name
.
This gives you the following:
| | filename | first_name | last_name | user_id |
|---:|:-----------------------------|:-------------|:------------|----------:|
| 0 | C:\johndoe_0001_paper1.doc | john | doe | 1 |
| 1 | C:\janedoe_0002_paper2.doc | jane | doe | 2 |
| 2 | C:\johnsmith_0003_paper3.pdf | john | smith | 3 |