Lets say that i have a dataframe like this:
Col1 Col2
0 AAA_BBB_123_DD 123
1 AAA_123_BBB_DD 123
2 123_AAA_BBB_DD 123
3 123_AAA_BB_DDD NaN
4 456_AAA_BBB_DD 456
5 AAA_BBB_456_DD 456
6 AAA_789_BBB_DD NaN
7 AAA_BBB_789_DD 789
8 AAA_000_BBB_DD NaN
What I want is, for NaN values in Col2, I want to check the string in Col1, split it by "_" and if it contains something put it to Col2.
In a normal scenario without dataframe if i have a string like this 123_AAA_BB_DDD
i would do this:
str = 123_AAA_BB_DDD
values = ['123','456','789']
split_str = str.split("_")
for i in split_str:
if any(value in i for value in values):
col2_value = i
else:
col2_value = 'Not Found'
My desirable output would look like this:
Col1 Col2
0 AAA_BBB_123_DD 123
1 AAA_123_BBB_DD 123
2 123_AAA_BBB_DD 123
3 123_AAA_BB_DDD 123
4 456_AAA_BBB_DD 456
5 AAA_BBB_456_DD 456
6 AAA_789_BBB_DD 789
7 AAA_BBB_789_DD 789
8 AAA_000_BBB_DD Not Found
EDITED:
The solution worked good for the cases where the values from the list are matched exactly to the string in Col1, eg (123 in list and 123 in Col1 string). But if i have something like this: AAA_PORT123_BBB_DD the soultion will put like 'Not Found' in the Col2, so lets say i have a df like this:
Col1 Col2
0 AAA_BBB_PORT123_DD PORT123
1 AAA_123_BBB_DD 123
2 STD123_AAA_BBB_DD STD123
3 123_AAA_BB_DDD NaN
4 456_AAA_BBB_DD 456
5 AAA_BBB_456_DD 456
6 AAA_MAN789_BBB_DD NaN
7 AAA_BBB_789_DD 789
8 AAA_000_BBB_DD NaN
My desirable output would be:
Col1 Col2
0 AAA_BBB_PORT123_DD PORT123
1 AAA_123_BBB_DD 123
2 STD123_AAA_BBB_DD STD123
3 123_AAA_BB_DDD 123
4 456_AAA_BBB_DD 456
5 AAA_BBB_456_DD 456
6 AAA_MAN789_BBB_DD MAN789
7 AAA_BBB_789_DD 789
8 AAA_000_BBB_DD Not Found
CodePudding user response:
For rows with missing values in Col2
call custom function for matched first value from list value
, for run function only for matched rows use DataFrame.loc
with mask in both sides:
values = ['123','456','789']
m = df['Col2'].isna()
f = lambda x: next((y for y in x.split('_') if y in values), 'Not Found')
df.loc[m, 'Col2'] = df.loc[m, 'Col1'].apply(f)
print (df)
Col1 Col2
0 AAA_BBB_123_DD 123.0
1 AAA_123_BBB_DD 123.0
2 123_AAA_BBB_DD 123.0
3 123_AAA_BB_DDD 123
4 456_AAA_BBB_DD 456.0
5 AAA_BBB_456_DD 456.0
6 AAA_789_BBB_DD 789
7 AAA_BBB_789_DD 789.0
8 AAA_000_BBB_DD Not Found
CodePudding user response:
df = pd.DataFrame([
{
"col1": "456_AAA_BBB_DD",
"col2": "123",
},
{
"col1": "456_AAA_BBB_DD",
"col2": np.NaN,
},
{
"col1": "000_AAA_BBB_DD",
"col2": np.NaN,
}
])
df["col2"] = df["col2"]
values = ['123','456','789']
df.loc[df['col2'].isnull(), 'col2'] = df['col1'].str.split("_").apply(lambda row: next((x for x in row if x in values), "Not Found"))
initial Dataframe
col1 col2
0 456_AAA_BBB_DD 123
1 456_AAA_BBB_DD NaN
2 000_AAA_BBB_DD NaN
output :
col1 col2
0 456_AAA_BBB_DD 123
1 456_AAA_BBB_DD 456
2 000_AAA_BBB_DD Not Found
df.loc[df['col2'].isnull(), 'col2']
will update only the column col2
if col2
is null
first we will split col1
with df['col1'].str.split("_")
Then we search through the list if an element is in values
x for x in a row if x in values
will return a generator object.
next
allows us to take only the first value of the generator. The second parameter of the function is the default value