Extracting the first mention from retweets in a python dataframe

Time:01-08

My df looks like this:

import pandas as pd

record = {
    'text_con_RT_t': ['RT @Blanc_: ramdon text #hashtag quiere ', '@GonM ramdon text', 'RT @IvEc: @GonzM ramdon text', 'hOLA ramdon text '],
    'rt': ['RT', '', 'RT', '']
}

# create a dataframe
dataframe2 = pd.DataFrame(record, columns=['text_con_RT_t', 'rt'])

I would like to get something like this:

text_con_RT_t                             rt  usr_rt
RT @Blanc_: ramdon text #hashtag quiere   RT  blanc
@GonM ramdon text
RT @IvEc: @GonzM ramdon text              RT  ivec
hOLA ramdon text

But I haven't succeeded. When the text starts with a mention but is not a retweet, my result looks like this:

text_con_RT_t                             rt  usr_rt
RT @Blanc_: ramdon text #hashtag quiere   RT  blanc
@GonM ramdon text                             gonm ramdon text
RT @IvEc: @GonzM ramdon text              RT  ivec
hOLA ramdon text                              NaN

I have tried with this:

try:
   dataframe2["usr_rt"] = dataframe2.text_con_RT_t.str.lower().str.split(':').str[0].str.split('@').str[1]
except dataframe2["rt"]==None: # complicated failed
   dataframe2["usr_rt"] = ""

I have also tried this:

if dataframe2["rt"] == "RT":
  return (dataframe2["usr_rt"] == dataframe2.text_con_RT_t.str.split(':').str[0].str.split('@').str[1])

What am I missing? Thanks.

CodePudding user response:

You can use numpy.where to keep the extracted mention only in rows where rt is 'RT', and an empty string everywhere else:

import numpy as np

# '@(\w+)' captures the word characters right after the first '@'
dataframe2['usr_rt'] = np.where(
  dataframe2.rt == 'RT',
  dataframe2.text_con_RT_t.str.extract(r'@(\w+)', expand=False).str.lower(),
  ''
)

dataframe2
                              text_con_RT_t  rt  usr_rt
0  RT @Blanc_: ramdon text #hashtag quiere   RT  blanc_
1                         @GonM ramdon text            
2              RT @IvEc: @GonzM ramdon text  RT    ivec
3                         hOLA ramdon text    
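
The same conditional step can also be written with pandas alone. The following is a minimal sketch, not part of the original answer: it rebuilds the sample frame under the hypothetical name df and uses Series.where in place of numpy.where.

```python
import pandas as pd

# Sample data copied from the question; `df` is a hypothetical name for this sketch.
df = pd.DataFrame({
    'text_con_RT_t': ['RT @Blanc_: ramdon text #hashtag quiere ', '@GonM ramdon text',
                      'RT @IvEc: @GonzM ramdon text', 'hOLA ramdon text '],
    'rt': ['RT', '', 'RT', ''],
})

# First mention in each row, lower-cased (NaN when the text has no mention).
first_mention = df['text_con_RT_t'].str.extract(r'@(\w+)', expand=False).str.lower()

# Keep the mention only where the row is flagged as a retweet; use '' otherwise.
df['usr_rt'] = first_mention.where(df['rt'].eq('RT'), '')

print(df)
```

Series.where keeps values where the condition is True and substitutes the second argument elsewhere, so the result stays a pandas Series rather than a NumPy array.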

Or, if retweets always start with RT, you can use the regex RT.*?@(\w+), which only matches a mention that comes after RT:

dataframe2['usr_rt'] = dataframe2.text_con_RT_t.str.extract(r'RT.*?@(\w+)', expand=False).str.lower()

dataframe2
                              text_con_RT_t  rt  usr_rt
0  RT @Blanc_: ramdon text #hashtag quiere   RT  blanc_
1                         @GonM ramdon text         NaN
2              RT @IvEc: @GonzM ramdon text  RT    ivec
3                         hOLA ramdon text          NaN
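
One caveat with the unanchored pattern is that it matches "RT" anywhere in the text. A possible tightening, sketched below under the assumption that a retweet always begins with "RT " followed directly by the mention, anchors the pattern to the start of the string. The fifth row is a made-up example added only to show the difference, and fillna('') is an optional step to get empty strings instead of NaN.

```python
import pandas as pd

texts = pd.Series([
    'RT @Blanc_: ramdon text #hashtag quiere ',
    '@GonM ramdon text',
    'RT @IvEc: @GonzM ramdon text',
    'hOLA ramdon text ',
    'START @late mention',  # hypothetical row: contains "RT" but is not a retweet
])

# Unanchored pattern from the answer: also matches the "RT" inside "START".
loose = texts.str.extract(r'RT.*?@(\w+)', expand=False).str.lower()

# Anchored variant (assumption: retweets start with "RT" + whitespace + "@user").
strict = texts.str.extract(r'^RT\s+@(\w+)', expand=False).str.lower().fillna('')

print(pd.DataFrame({'text': texts, 'loose': loose, 'strict': strict}))
```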

CodePudding user response:

I would [personally] find it easier to create the values for the new column from record. If you add the column to record, you don't need to change the DataFrame afterwards (which I prefer since I'm not great with numpy, so I would just end up extracting the column as a list and doing what I've done below anyway).

# allowedChars = '' # ' -.' # add allowed characters
record['usr_rt'] = ['' if not rt == 'RT' else ''.join(
    c for c in txt.split('@', 1)[-1].split(':')[0].lower() 
    if c.isalpha() or c.isdigit() # or c in allowedChars
) for txt, rt in zip(record['text_con_RT_t'], record['rt'])]

The c.isalpha() check keeps only alphabetic characters, and or c.isdigit() additionally keeps digits; remove the latter if you want to strip numeric characters from the username as well. To allow some special characters, uncomment allowedChars and the or c in allowedChars clause (this can include spaces, though I don't think usernames contain any).

Anyways, now pd.DataFrame(record) would return a DataFrame that looks like

text_con_RT_t                             rt  usr_rt
RT @Blanc_: ramdon text #hashtag quiere   RT  blanc
@GonM ramdon text
RT @IvEc: @GonzM ramdon text              RT  ivec
hOLA ramdon text
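
As an illustration of the allowed-characters option, here is a rough sketch that wraps the same filtering logic in a function. The names first_rt_user and allowed_chars are hypothetical, and allowing the underscore is an assumption made only to show how the trailing "_" in "Blanc_" can be kept.

```python
record = {
    'text_con_RT_t': ['RT @Blanc_: ramdon text #hashtag quiere ', '@GonM ramdon text',
                      'RT @IvEc: @GonzM ramdon text', 'hOLA ramdon text '],
    'rt': ['RT', '', 'RT', ''],
}

allowed_chars = '_'  # assumption: underscores should survive the filter


def first_rt_user(txt, rt, allowed=allowed_chars):
    """Return the lower-cased first mention of a retweet, or '' for other rows."""
    if rt != 'RT':
        return ''
    # Text between the first '@' and the following ':' is treated as the handle.
    handle = txt.split('@', 1)[-1].split(':')[0].lower()
    # Drop every character that is not a letter, a digit, or explicitly allowed.
    return ''.join(c for c in handle if c.isalpha() or c.isdigit() or c in allowed)


usr_rt = [first_rt_user(t, r) for t, r in zip(record['text_con_RT_t'], record['rt'])]
print(usr_rt)
```

Passing allowed='' to the function reproduces the stricter behaviour, where "Blanc_" becomes "blanc".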