I'm trying to get social media usernames from a post. This is my code so far.
social_media=['kik']
df['socials'] = df['Request'].str.extract('({})'.format('|'.join(social_media)),
flags=re.IGNORECASE, expand=False).str.lower().fillna('')
print(df['socials'])
This code prints kik for the rows that have a kik listed.
How can I get it to output the username that's listed after kik?
This is a sample of my dataframe:
Request
-------
0 my kik abcd
1 ig bby
2 check out my kik:1234
3 Kik hehehaha
My code outputs this:
socials
-------
0 kik
1
2 kik
3 kik
I'd like it to output this:
socials
-------
0 abcd
1
2 1234
3 hehehaha
CodePudding user response:
Create sample data:
import pandas as pd
import re
df = pd.DataFrame()
df['request'] = ['my kik abcd', 'ig bby', 'check out my kik:1234', 'Kik hehehaha']
Filter for only the rows that contain 'kik', case-insensitively:
filtered_df = df[df['request'].str.contains('kik', flags=re.IGNORECASE)]
Look through rows that have 'kik' and get everything after it:
df.loc[filtered_df.index, 'test'] = [re.split('kik', row, flags=re.IGNORECASE, maxsplit=1)[-1] for row in filtered_df['request']]
A few things to note:
- The value for the row at index 2 still contains the colon (':1234'), and the other values keep the leading space; you can simply do a str.replace (or str.strip) to get rid of these.
- I limit the split to a maximum of once (maxsplit=1) and take the last piece ([-1]), so it will always take everything after the first kik. Meaning a string of 'kik junk junk junk' will result in ' junk junk junk' (with the leading space).
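As an alternative to splitting and cleaning up afterwards, the str.extract call from the question can be adapted so that the capture group wraps the username instead of the platform name. This is only a sketch under some assumptions that are not in the original post: usernames are treated as runs of word characters (\w+), and the only separators expected between the platform name and the username are whitespace and colons.

```python
import re
import pandas as pd

df = pd.DataFrame({'request': ['my kik abcd', 'ig bby',
                               'check out my kik:1234', 'Kik hehehaha']})

social_media = ['kik']  # same list as in the question

# (?:...) matches the platform name without capturing it,
# [\s:]* skips any spaces or colons (assumed separators),
# (\w+) captures the username (assumed to be word characters only).
pattern = r'\b(?:{})\b[\s:]*(\w+)'.format('|'.join(map(re.escape, social_media)))

# With a single capture group and expand=False, str.extract returns a Series.
df['socials'] = (df['request']
                 .str.extract(pattern, flags=re.IGNORECASE, expand=False)
                 .fillna(''))

print(df['socials'].tolist())
# ['abcd', '', '1234', 'hehehaha']
```

Because the platform names are joined with |, adding more entries to social_media should extend the same pattern, though usernames containing characters such as '.' or '-' would need a wider character class than \w.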