Python noob here. Big respect to you all. Hope you can help.
I have two pandas dataframes, df1
and df2
:
df1
:
style_gid | style_name | |
---|---|---|
0 | 1202357372219055 | DIA |
1 | 1202357372219056 | WVU |
2 | 1202357372219057 | GWE |
df2
:
project_gid | project_name | |
---|---|---|
0 | 1202318051699659 | Project Master |
1 | 1202318100865244 | TEST-DIA-LON-1 |
2 | 1202320766191762 | TEST-WVU-LONMUN-1 |
3 | 1202381389399069 | TEST-DIA-MUN-2 |
4 | 1202381389399070 | Custom Section |
And I have tried so many things, so many. I need to work out how to check project_name
in each row of df2
to see if it matches with one of the style_name
entries from df1
. If that's true, to then somehow bring the style_name
and style_id
into df2
, like in the below table.
The rules are that if the text in style_name
is found anywhere in the text in 'project_name', that's a match and so that project should be assigned the relevant style.
For example, the project names above contain two examples of matches with 'DIA', and one example of a match with 'WVU':
TEST-DIA-LON-1
TEST-WVU-LONMUN-1
TEST-DIA-MUN-2
The dataframe of styles is a moving target, the list will change values and number of entries for style_name
(i.e. we may end up using others 'ABC', 'XYX', whatever as well), but the search will always be for 3-letter strings, in capitals.
Anything that doesn't match a style gets a null or none, or 'Potato', I don't mind, I can handle that later.
I cannot work out how to perform this lookup/insert, despite finding several solutions to test the name. Please can someone help?
So the resulting dataframe df2
ends up looking like this:
project_gid | project_name | style_gid | style_name | |
---|---|---|---|---|
0 | 1202318051699659 | Project Master | None/Null | None/Null |
1 | 1202318100865244 | TEST-DIA-LON-1 | 1202357372219055 | DIA |
2 | 1202320766191762 | TEST-WVU-LONMUN-1 | 1202357372219056 | WVU |
3 | 1202381389399069 | TEST-DIA-MUN-2 | 1202357372219057 | DIA |
4 | 1202381389399070 | Custom Section | None/Null | None/Null |
Or to a new dataframe, just so long as each project gets the style it deserves.
CodePudding user response:
Provided that we are always looking for the first match of the pattern ABC
preceded by one -
and followed by another -
(So, never "LON" in "TEST-DIA-LON-1"), we could implement the following steps:
- add column to
df2
with RegEx match (or not) extracted fromdf2.project_name
; - check if these matches exist in
df1
(redundant, if you can be certain of this); df2.merge
to adddf1.style_gid
(and rearrange the cols to match your
import pandas as pd
import numpy as np
# we have 2 dfs: df1 (style), df2 (project)
# regex to capture first match of pattern like 'ABC' preceded by '-' and followed by '-'
pattern = r"(?<=-)([A-Z]{3})(?=-)"
# use str.extract() to populate new col with all found matches
df2['style_name'] = df2['project_name'].str.extract(pattern)
# lambda to ensure that all your matches actually exist as 'ABC's in df1, else 'NaN'
df2['style_name'] = df2['style_name'].apply(lambda i: i if i in list(df1.style_name) else np.nan)
# merge on df1 to get matches for df1.style_gid
df2 = df2.merge(df1, how='left')
# rearrange last two cols
df2 = df2.iloc[:, [0,1,3,2]]
output print(df2)
:
project_gid project_name style_gid style_name
0 1.202320e 15 Project Master NaN NaN
1 1.202320e 15 TEST-DIA-LON-1 1.202360e 15 DIA
2 1.202320e 15 TEST-WVU-LONMUN-1 1.202360e 15 WVU
3 1.202380e 15 TEST-DIA-MUN-2 1.202360e 15 DIA
4 1.202380e 15 Custom Section NaN NaN