My df looks like this:
import pandas as pd

record = {
    'text_con_RT_t': ['RT @Blanc_: ramdon text #hashtag quiere ', '@GonM ramdon text', 'RT @IvEc: @GonzM ramdon text', 'hOLA ramdon text '],
    'rt': ['RT', '', 'RT', '']
}
# create a dataframe
dataframe2 = pd.DataFrame(record, columns=['text_con_RT_t', 'rt'])
I would like to get something like this:
text_con_RT_t | rt | usr_rt |
---|---|---|
RT @Blanc_: ramdon text #hashtag quiere | RT | blanc |
@GonM ramdon text | | |
RT @IvEc: @GonzM ramdon text | RT | ivec |
hOLA ramdon text | | |
But I haven't succeeded. In cases where the text starts with a mention but is not a retweet, my result looks like this:
text_con_RT_t | rt | usr_rt |
---|---|---|
RT @Blanc_: ramdon text #hashtag quiere | RT | blanc |
@GonM ramdon text | | gonm ramdon text |
RT @IvEc: @GonzM ramdon text | RT | ivec |
hOLA ramdon text | | NaN |
I have tried with this:
try:
    dataframe2["usr_rt"] = dataframe2.text_con_RT_t.str.lower().str.split(':').str[0].str.split('@').str[1]
except dataframe2["rt"]==None: # complicated failed
    dataframe2["usr_rt"] = ""
I also tried this:
if dataframe2["rt"] == "RT":
    return (dataframe2["usr_rt"] == dataframe2.text_con_RT_t.str.split(':').str[0].str.split('@').str[1])
What am I missing? Thanks.
CodePudding user response:
You can use numpy.where to keep the extracted value only in the rows where rt is 'RT':
import numpy as np

dataframe2['usr_rt'] = np.where(
    dataframe2.rt == 'RT',
    # first @mention in the text, lower-cased
    dataframe2.text_con_RT_t.str.extract(r'@(\w+)', expand=False).str.lower(),
    ''
)
dataframe2
text_con_RT_t rt usr_rt
0 RT @Blanc_: ramdon text #hashtag quiere RT blanc_
1 @GonM ramdon text
2 RT @IvEc: @GonzM ramdon text RT ivec
3 hOLA ramdon text
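To make it easier to see what numpy.where is doing here, the same idea can be split into steps. This is only a sketch: the intermediate names first_mention and is_rt are introduced for illustration and are not part of the question's code.

```python
import numpy as np
import pandas as pd

dataframe2 = pd.DataFrame({
    'text_con_RT_t': ['RT @Blanc_: ramdon text #hashtag quiere ',
                      '@GonM ramdon text',
                      'RT @IvEc: @GonzM ramdon text',
                      'hOLA ramdon text '],
    'rt': ['RT', '', 'RT', ''],
})

# Step 1: first @mention of every row, lower-cased (NaN when there is no mention).
# Note that row 1 still yields 'gonm' at this point.
first_mention = (
    dataframe2['text_con_RT_t'].str.extract(r'@(\w+)', expand=False).str.lower()
)

# Step 2: boolean mask marking which rows are retweets.
is_rt = dataframe2['rt'] == 'RT'

# Step 3: keep the mention where the mask is True, use '' everywhere else.
dataframe2['usr_rt'] = np.where(is_rt, first_mention, '')
```

The extraction runs on every row; the mask is what discards the mention in rows that are not retweets.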
Or, if retweets always start with RT, you can use the regex RT.*?@(\w+):
dataframe2['usr_rt'] = dataframe2.text_con_RT_t.str.extract(r'RT.*?@(\w+)', expand=False).str.lower()
dataframe2
text_con_RT_t rt usr_rt
0 RT @Blanc_: ramdon text #hashtag quiere RT blanc_
1 @GonM ramdon text NaN
2 RT @IvEc: @GonzM ramdon text RT ivec
3 hOLA ramdon text NaN
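If you want empty strings instead of NaN, and blanc rather than blanc_ as in the desired output of the question, a possible variation is sketched below. It assumes that a retweet always begins with "RT @user" (hence the ^ anchor, which is my addition) and that dropping underscores from the username is acceptable.

```python
import pandas as pd

dataframe2 = pd.DataFrame({
    'text_con_RT_t': ['RT @Blanc_: ramdon text #hashtag quiere ',
                      '@GonM ramdon text',
                      'RT @IvEc: @GonzM ramdon text',
                      'hOLA ramdon text '],
    'rt': ['RT', '', 'RT', ''],
})

# Assumption: only texts that start with "RT @user" count as retweets.
usr = dataframe2['text_con_RT_t'].str.extract(r'^RT\s+@(\w+)', expand=False)

dataframe2['usr_rt'] = (
    usr.str.lower()
       .str.replace('_', '', regex=False)  # optional: 'blanc_' becomes 'blanc'
       .fillna('')                         # rows without a match become ''
)
```

Remove the str.replace step if underscores are a meaningful part of the username.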
CodePudding user response:
I would [personally] find it easier to create the values for the new column from record. If you add it to record, you won't need to change the DataFrame afterwards (which I prefer since I'm not great with numpy, so I would just end up extracting the column as a list and doing what I've done below anyway).
# allowedChars = '' # ' -.' # add allowed characters
record['usr_rt'] = [
    '' if rt != 'RT' else ''.join(
        c for c in txt.split('@', 1)[-1].split(':')[0].lower()
        if c.isalpha() or c.isdigit() # or c in allowedChars
    )
    for txt, rt in zip(record['text_con_RT_t'], record['rt'])
]
The condition if c.isalpha() keeps only alphabetic characters; remove or c.isdigit() if you also want to drop digits from the username, and uncomment allowedChars and or c in allowedChars if you want to allow some special characters (that includes spaces, by the way, though I don't think usernames have any).
Anyway, pd.DataFrame(record) would now return a DataFrame that looks like this:
text_con_RT_t | rt | usr_rt |
---|---|---|
RT @Blanc_: ramdon text #hashtag quiere | RT | blanc |
@GonM ramdon text | | |
RT @IvEc: @GonzM ramdon text | RT | ivec |
hOLA ramdon text | | |
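If the one-liner is hard to read, the same logic can be wrapped in a small function. This is just a sketch: retweeted_user and its allowed_chars parameter are hypothetical names I'm introducing here, standing in for the commented-out allowedChars above.

```python
import pandas as pd

def retweeted_user(text, rt, allowed_chars=''):
    """Hypothetical helper: lower-cased username of the retweeted account, or ''."""
    if rt != 'RT':
        return ''
    # Text between the first '@' and the following ':' is taken as the handle.
    handle = text.split('@', 1)[-1].split(':')[0].lower()
    # Keep letters and digits, plus any explicitly allowed extra characters.
    return ''.join(
        c for c in handle
        if c.isalpha() or c.isdigit() or c in allowed_chars
    )

record = {
    'text_con_RT_t': ['RT @Blanc_: ramdon text #hashtag quiere ',
                      '@GonM ramdon text',
                      'RT @IvEc: @GonzM ramdon text',
                      'hOLA ramdon text '],
    'rt': ['RT', '', 'RT', ''],
}
record['usr_rt'] = [
    retweeted_user(txt, rt)
    for txt, rt in zip(record['text_con_RT_t'], record['rt'])
]
dataframe2 = pd.DataFrame(record)
```

Passing allowed_chars='_' would keep the underscore, so the first row would give blanc_ instead of blanc.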