Home > Blockchain >  Apply for loop while replacing NaN values in a dataframe with values from another column
Apply for loop while replacing NaN values in a dataframe with values from another column

Time:11-18

Lets say that i have a dataframe like this:

    Col1                       Col2
0  AAA_BBB_123_DD              123
1  AAA_123_BBB_DD              123
2  123_AAA_BBB_DD              123
3  123_AAA_BB_DDD              NaN
4  456_AAA_BBB_DD              456
5  AAA_BBB_456_DD              456
6  AAA_789_BBB_DD              NaN
7  AAA_BBB_789_DD              789
8  AAA_000_BBB_DD              NaN

What I want is, for NaN values in Col2, I want to check the string in Col1, split it by "_" and if it contains something put it to Col2.

In a normal scenario without dataframe if i have a string like this 123_AAA_BB_DDD i would do this:

str = 123_AAA_BB_DDD
values = ['123','456','789']
split_str = str.split("_")
for i in split_str:
    if any(value in i for value in values):
        col2_value = i
    else:
        col2_value = 'Not Found'

My desirable output would look like this:

    Col1                       Col2
0  AAA_BBB_123_DD              123
1  AAA_123_BBB_DD              123
2  123_AAA_BBB_DD              123
3  123_AAA_BB_DDD              123
4  456_AAA_BBB_DD              456
5  AAA_BBB_456_DD              456
6  AAA_789_BBB_DD              789
7  AAA_BBB_789_DD              789
8  AAA_000_BBB_DD           Not Found

EDITED:

The solution worked good for the cases where the values from the list are matched exactly to the string in Col1, eg (123 in list and 123 in Col1 string). But if i have something like this: AAA_PORT123_BBB_DD the soultion will put like 'Not Found' in the Col2, so lets say i have a df like this:

    Col1                       Col2
0  AAA_BBB_PORT123_DD        PORT123
1  AAA_123_BBB_DD              123
2  STD123_AAA_BBB_DD          STD123
3  123_AAA_BB_DDD              NaN
4  456_AAA_BBB_DD              456
5  AAA_BBB_456_DD              456
6  AAA_MAN789_BBB_DD          NaN
7  AAA_BBB_789_DD              789
8  AAA_000_BBB_DD              NaN

My desirable output would be:

    Col1                       Col2
0  AAA_BBB_PORT123_DD        PORT123
1  AAA_123_BBB_DD              123
2  STD123_AAA_BBB_DD          STD123
3  123_AAA_BB_DDD              123
4  456_AAA_BBB_DD              456
5  AAA_BBB_456_DD              456
6  AAA_MAN789_BBB_DD          MAN789
7  AAA_BBB_789_DD              789
8  AAA_000_BBB_DD            Not Found

CodePudding user response:

For rows with missing values in Col2 call custom function for matched first value from list value, for run function only for matched rows use DataFrame.loc with mask in both sides:

values = ['123','456','789']

m = df['Col2'].isna()

f = lambda x: next((y for y in x.split('_') if y in values), 'Not Found')
df.loc[m, 'Col2'] = df.loc[m, 'Col1'].apply(f)
print (df)
             Col1       Col2
0  AAA_BBB_123_DD      123.0
1  AAA_123_BBB_DD      123.0
2  123_AAA_BBB_DD      123.0
3  123_AAA_BB_DDD        123
4  456_AAA_BBB_DD      456.0
5  AAA_BBB_456_DD      456.0
6  AAA_789_BBB_DD        789
7  AAA_BBB_789_DD      789.0
8  AAA_000_BBB_DD  Not Found

CodePudding user response:

df = pd.DataFrame([
    {
        "col1": "456_AAA_BBB_DD",
        "col2": "123",
    },
    {
        "col1": "456_AAA_BBB_DD",
        "col2": np.NaN,
    },
    {
        "col1": "000_AAA_BBB_DD",
        "col2": np.NaN,
    }
])
df["col2"] = df["col2"]
values = ['123','456','789']
df.loc[df['col2'].isnull(), 'col2'] = df['col1'].str.split("_").apply(lambda row: next((x for x in row if x in values), "Not Found"))

initial Dataframe

             col1 col2
0  456_AAA_BBB_DD  123
1  456_AAA_BBB_DD  NaN
2  000_AAA_BBB_DD  NaN

output :

             col1       col2
0  456_AAA_BBB_DD        123
1  456_AAA_BBB_DD        456
2  000_AAA_BBB_DD  Not Found

df.loc[df['col2'].isnull(), 'col2'] will update only the column col2 if col2 is null

first we will split col1 with df['col1'].str.split("_")

Then we search through the list if an element is in values x for x in a row if x in values will return a generator object.

next allows us to take only the first value of the generator. The second parameter of the function is the default value

  • Related