Home > OS >  Adding external datraframe data, via substring filter, to another dataframe
Adding external datraframe data, via substring filter, to another dataframe

Time:06-03

Python noob here. Big respect to you all. Hope you can help.

I have two pandas dataframes, df1 and df2:

df1:

style_gid style_name
0 1202357372219055 DIA
1 1202357372219056 WVU
2 1202357372219057 GWE

df2:

project_gid project_name
0 1202318051699659 Project Master
1 1202318100865244 TEST-DIA-LON-1
2 1202320766191762 TEST-WVU-LONMUN-1
3 1202381389399069 TEST-DIA-MUN-2
4 1202381389399070 Custom Section

And I have tried so many things, so many. I need to work out how to check project_name in each row of df2 to see if it matches with one of the style_name entries from df1. If that's true, to then somehow bring the style_name and style_id into df2, like in the below table.

The rules are that if the text in style_name is found anywhere in the text in 'project_name', that's a match and so that project should be assigned the relevant style.

For example, the project names above contain two examples of matches with 'DIA', and one example of a match with 'WVU':
TEST-DIA-LON-1
TEST-WVU-LONMUN-1
TEST-DIA-MUN-2

The dataframe of styles is a moving target, the list will change values and number of entries for style_name (i.e. we may end up using others 'ABC', 'XYX', whatever as well), but the search will always be for 3-letter strings, in capitals.

Anything that doesn't match a style gets a null or none, or 'Potato', I don't mind, I can handle that later.

I cannot work out how to perform this lookup/insert, despite finding several solutions to test the name. Please can someone help?

So the resulting dataframe df2 ends up looking like this:

project_gid project_name style_gid style_name
0 1202318051699659 Project Master None/Null None/Null
1 1202318100865244 TEST-DIA-LON-1 1202357372219055 DIA
2 1202320766191762 TEST-WVU-LONMUN-1 1202357372219056 WVU
3 1202381389399069 TEST-DIA-MUN-2 1202357372219057 DIA
4 1202381389399070 Custom Section None/Null None/Null

Or to a new dataframe, just so long as each project gets the style it deserves.

CodePudding user response:

Provided that we are always looking for the first match of the pattern ABC preceded by one - and followed by another - (So, never "LON" in "TEST-DIA-LON-1"), we could implement the following steps:

  1. add column to df2 with RegEx match (or not) extracted from df2.project_name;
  2. check if these matches exist in df1 (redundant, if you can be certain of this);
  3. df2.merge to add df1.style_gid (and rearrange the cols to match your
import pandas as pd
import numpy as np

# we have 2 dfs: df1 (style), df2 (project)

# regex to capture first match of pattern like 'ABC' preceded by '-' and followed by '-'
pattern = r"(?<=-)([A-Z]{3})(?=-)"

# use str.extract() to populate new col with all found matches
df2['style_name'] = df2['project_name'].str.extract(pattern)

# lambda to ensure that all your matches actually exist as 'ABC's in df1, else 'NaN'
df2['style_name'] = df2['style_name'].apply(lambda i: i if i in list(df1.style_name) else np.nan)

# merge on df1 to get matches for df1.style_gid
df2 = df2.merge(df1, how='left')

# rearrange last two cols
df2 = df2.iloc[:, [0,1,3,2]]

output print(df2):

    project_gid       project_name     style_gid style_name
0  1.202320e 15     Project Master           NaN        NaN
1  1.202320e 15     TEST-DIA-LON-1  1.202360e 15        DIA
2  1.202320e 15  TEST-WVU-LONMUN-1  1.202360e 15        WVU
3  1.202380e 15     TEST-DIA-MUN-2  1.202360e 15        DIA
4  1.202380e 15     Custom Section           NaN        NaN

  • Related