Adding external datraframe data, via substring filter, to another dataframe-CodePudding

Python noob here. Big respect to you all. Hope you can help.

I have two pandas dataframes, df1 and df2:

df1:

	style_gid	style_name
0	1202357372219055	DIA
1	1202357372219056	WVU
2	1202357372219057	GWE

df2:

	project_gid	project_name
0	1202318051699659	Project Master
1	1202318100865244	TEST-DIA-LON-1
2	1202320766191762	TEST-WVU-LONMUN-1
3	1202381389399069	TEST-DIA-MUN-2
4	1202381389399070	Custom Section

And I have tried so many things, so many. I need to work out how to check project_name in each row of df2 to see if it matches with one of the style_name entries from df1. If that's true, to then somehow bring the style_name and style_id into df2, like in the below table.

The rules are that if the text in style_name is found anywhere in the text in 'project_name', that's a match and so that project should be assigned the relevant style.

For example, the project names above contain two examples of matches with 'DIA', and one example of a match with 'WVU':
TEST-DIA-LON-1
TEST-WVU-LONMUN-1
TEST-DIA-MUN-2

The dataframe of styles is a moving target, the list will change values and number of entries for style_name (i.e. we may end up using others 'ABC', 'XYX', whatever as well), but the search will always be for 3-letter strings, in capitals.

Anything that doesn't match a style gets a null or none, or 'Potato', I don't mind, I can handle that later.

I cannot work out how to perform this lookup/insert, despite finding several solutions to test the name. Please can someone help?

So the resulting dataframe df2 ends up looking like this:

	project_gid	project_name	style_gid	style_name
0	1202318051699659	Project Master	None/Null	None/Null
1	1202318100865244	TEST-DIA-LON-1	1202357372219055	DIA
2	1202320766191762	TEST-WVU-LONMUN-1	1202357372219056	WVU
3	1202381389399069	TEST-DIA-MUN-2	1202357372219057	DIA
4	1202381389399070	Custom Section	None/Null	None/Null

Or to a new dataframe, just so long as each project gets the style it deserves.

CodePudding user response：

Provided that we are always looking for the first match of the pattern ABC preceded by one - and followed by another - (So, never "LON" in "TEST-DIA-LON-1"), we could implement the following steps:

add column to df2 with RegEx match (or not) extracted from df2.project_name;
check if these matches exist in df1 (redundant, if you can be certain of this);
df2.merge to add df1.style_gid (and rearrange the cols to match your

import pandas as pd
import numpy as np

# we have 2 dfs: df1 (style), df2 (project)

# regex to capture first match of pattern like 'ABC' preceded by '-' and followed by '-'
pattern = r"(?<=-)([A-Z]{3})(?=-)"

# use str.extract() to populate new col with all found matches
df2['style_name'] = df2['project_name'].str.extract(pattern)

# lambda to ensure that all your matches actually exist as 'ABC's in df1, else 'NaN'
df2['style_name'] = df2['style_name'].apply(lambda i: i if i in list(df1.style_name) else np.nan)

# merge on df1 to get matches for df1.style_gid
df2 = df2.merge(df1, how='left')

# rearrange last two cols
df2 = df2.iloc[:, [0,1,3,2]]

output print(df2):

    project_gid       project_name     style_gid style_name
0  1.202320e 15     Project Master           NaN        NaN
1  1.202320e 15     TEST-DIA-LON-1  1.202360e 15        DIA
2  1.202320e 15  TEST-WVU-LONMUN-1  1.202360e 15        WVU
3  1.202380e 15     TEST-DIA-MUN-2  1.202360e 15        DIA
4  1.202380e 15     Custom Section           NaN        NaN