Need help in repeating rows if found in list.
- If found value in list increment count
- If more than one instance found then repeat the row by incrementing count
DF
data = {'First_Name':['Tom', 'Nick', 'Daniel', 'Jack'],
'oter_col':['other_value1', 'other_value2', 'other_value3', 'other_value4']}
df = pd.DataFrame(data)
df["repeation"] = 0
DF looks like
First_Name oter_col count
0 Tom other_value1 0
1 Nick other_value2 0
2 Daniel other_value3 0
3 Jack other_value4 0
Input list:
Full_Name = ["Tom Cruise", "Tom Holland", "Tom Hardy", "Jack black", "Chris Hemsworth"]
In need output like:
First_Name Full_Name oter_col count
0 Tom Tom Cruise other_value1 1
1 Tom Tom Holland other_value1 2
2 Tom Tom Hardy other_value1 3
3 Jack Jack black other_value4 1
4 Nick other_value2 1
5 Daniel other_value3 0
Tried something like this to get first matching index so I can repeat the row but not sure how to do it.
for name in Full_Name:
m = df.First_Name.str.contains(name.split()[0])
first_index_found = m.idxmax() if m.any() else None
if type(first_index_found) == int:
print(first_index_found)
CodePudding user response:
You can craft a DataFrame to merge
and count the non-NA names as cumsum
per group:
df2 = (pd.DataFrame({'Full_Name': Full_Name})
.assign(First_Name=lambda d: d['Full_Name'].str.extract('(\w )'))
)
(df
.merge(df2, on='First_Name', how='left')
.assign(repeation=lambda d: d['Full_Name'].notna().groupby(d['First_Name']).cumsum())
)
output:
First_Name oter_col repeation Full_Name
0 Tom other_value1 1 Tom Cruise
1 Tom other_value1 2 Tom Holland
2 Tom other_value1 3 Tom Hardy
3 Nick other_value2 0 NaN
4 Daniel other_value3 0 NaN
5 Jack other_value4 1 Jack black
CodePudding user response:
You can use:
# Create a regex pattern to extract First_Name from Full_Name
pat = fr"\b({'|'.join(df['First_Name'])})\b"
# Create a dataframe from Full_Name
df1 = pd.DataFrame({'Full_Name': Full_Name})
df1['First_Name'] = df1['Full_Name'].str.extract(pat)
# Merge them on First_Name column
out = df.merge(df1, on='First_Name', how='left')
# Count (choose one)
out['repeation'] = out.groupby('First_Name').cumcount().add(1)
# OR
out['repeation2'] = (out.dropna().groupby('First_Name').cumcount().add(1)
.reindex(out.index, fill_value=0))
Output:
>>> out
First_Name oter_col repeation Full_Name repeation2
0 Tom other_value1 1 Tom Cruise 1
1 Tom other_value1 2 Tom Holland 2
2 Tom other_value1 3 Tom Hardy 3
3 Nick other_value2 1 NaN 0
4 Daniel other_value3 1 NaN 0
5 Jack other_value4 1 Jack black 1
CodePudding user response:
You can create a new dataframe from Full_Name
to merge with your original data, then group by First_Name
and count rows:
data = {'First_Name':['Tom', 'Nick', 'Daniel', 'Jack'],
'oter_col':['other_value1', 'other_value2', 'other_value3', 'other_value4']}
df = pd.DataFrame(data)
Full_Name = ["Tom Cruise", "Tom Holland", "Tom Hardy", "Jack black", "Chris Hemsworth"]
df2 = pd.DataFrame({ 'First_Name' : [n.split()[0] for n in Full_Name], 'Full_Name' : Full_Name })
df = df.merge(df2, on='First_Name', how='left').fillna('')
df['count'] = df.groupby('First_Name').cumcount().add(1)
Output:
First_Name oter_col Full_Name count
0 Tom other_value1 Tom Cruise 1
1 Tom other_value1 Tom Holland 2
2 Tom other_value1 Tom Hardy 3
3 Nick other_value2 1
4 Daniel other_value3 1
5 Jack other_value4 Jack black 1