Can someone help me extract the tweet text of each ID as a list, using regex in Python?

string = '''ID::ID123
PUBLISHED_TWEET::ABC
DEF
GHI
EMPLOYEE_ID::ID234
TWEET::ABC
DEF
GHI
ID::ID345
TWEET::#@ABC
DEF
GHI@.[]
USER_IDD::ID456
TWEET::google.com
123456789'''

Required output

id = ['ID123', 'ID234', 'ID345', 'ID456'] - I got this output

I am struggling with the tweet text. I need to extract the tweet text using regex in Python.

tweet_output = ['ABC\nDEF\nGHI',
 'ABC\nDEF\nGHI',
 '#@ABC\nDEF\nGHI@.[]',
 'google.com\n123456789']

I tried the following regular expressions:

# pattern_01 = r'PUBLISHED_TWEET::(.*)EMPLOYEE_ID::'
# pattern_02 = r'PUBLISHED_TWEET::(.*)ID::'
# pattern_03 = r'PUBLISHED_TWEET::(.*)USER_IDD::'

# pattern_04 = r'TWEET::(.*)EMPLOYEE_ID::'
# pattern_05 = r'TWEET::(.*)ID::'
# pattern_06 = r'TWEET::(.*)USER_IDD::'

# if(re.findall(pattern = pattern_01, string = str(data))):
#   new_tweet_list.append(re.findall(pattern = pattern_01, string = str(data)))

# elif(re.findall(pattern = pattern_02, string = str(data))):
#   new_tweet_list.append(re.findall(pattern = pattern_02, string = str(data)))

# elif(re.findall(pattern = pattern_03, string = str(data))):
#   new_tweet_list.append(re.findall(pattern = pattern_03, string = str(data)))

# elif(re.findall(pattern = pattern_04, string = str(data))):
#   new_tweet_list.append(re.findall(pattern = pattern_04, string = str(data)))

# elif(re.findall(pattern = pattern_05, string = str(data))):
#   new_tweet_list.append(re.findall(pattern = pattern_05, string = str(data)))

# elif(re.findall(pattern = pattern_06, string = str(data))):
#   new_tweet_list.append(re.findall(pattern = pattern_06, string = str(data)))

But I am not getting the expected output: the result is either empty or the entire string that was passed.

CodePudding user response：

It looks like your data consists of <id, tweet> pairs, in which each value is separated from its identifying keyword by a double colon (::).

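You can use the following regex to retrieve any text that is preceded by a double colon and followed by a newline or the end of the string, without lazy matching. The character class stops at the next colon, and the lookahead then makes the match end at the last newline before the next keyword.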

(?<=::)[^:]+(?=\n|$)

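You can use it in Python with the following code:

import re

# Raw string, so that the regex engine (not Python) interprets \n
pattern = r'(?<=::)[^:]+(?=\n|$)'

# Flat list of values: id, tweet, id, tweet, ...
pairs = re.findall(pattern, string)

# Map each id to the tweet that follows it
tweets = {pairs[i]: pairs[i + 1] for i in range(0, len(pairs), 2)}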

Check the regex demo here and the python demo here.
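To get the two lists asked for in the question, the flat list of matches can be sliced by position. The following is a minimal, self-contained sketch; the sample string is rebuilt from the question, and the names ids and tweet_output are only illustrative. It assumes that ids and tweets strictly alternate in the input.

```python
import re

# Sample input rebuilt from the question.
string = (
    "ID::ID123\n"
    "PUBLISHED_TWEET::ABC\nDEF\nGHI\n"
    "EMPLOYEE_ID::ID234\n"
    "TWEET::ABC\nDEF\nGHI\n"
    "ID::ID345\n"
    "TWEET::#@ABC\nDEF\nGHI@.[]\n"
    "USER_IDD::ID456\n"
    "TWEET::google.com\n123456789"
)

pattern = r"(?<=::)[^:]+(?=\n|$)"
pairs = re.findall(pattern, string)

# Assumption: values strictly alternate id, tweet, id, tweet, ...
ids = pairs[0::2]           # even positions hold the ids
tweet_output = pairs[1::2]  # odd positions hold the tweet texts

print(ids)
print(tweet_output)
```

Note that [^:] cannot match a colon, so a tweet that itself contains a colon (for example a URL starting with http://) would not be captured correctly by this pattern.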

CodePudding user response：

Your question is not clear. It's hard to understand what you really want, but I'll try to guess.

I believe that in this case you should split the string instead of trying to search for patterns. It is much easier.

Use

import re

pattern = r'([A-Z_]+)::'
split_string = re.split(pattern, string)

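Basically, it splits the string at every label, that is, at every run of uppercase letters (A to Z) and underscores followed by a double colon. Because the label is inside a capturing group, re.split keeps the labels in the resulting list.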

In this case, the output will be

['',
 'ID',
 'ID123\n',
 'PUBLISHED_TWEET',
 'ABC\nDEF\nGHI\n',
 'EMPLOYEE_ID',
 'ID234\n',
 'TWEET',
 'ABC\nDEF\nGHI\n',
 'ID',
 'ID345\n',
 'TWEET',
 '#@ABC\nDEF\nGHI@.[]\n',
 'USER_IDD',
 'ID456\n',
 'TWEET',
 'google.com\n123456789']

Here all the labels and values are split. The first item of the list is an empty string because the string starts with a label, so there is nothing before the first split point; we can simply ignore it. We can also see that every value except the last one ends with '\n'. We will fix both of these things.

Let's just pair each label with its respective value. This way, I believe you will be able to use these values as you want.

I will separate the labels from the values and zip them together.

labels = split_string[1::2]
values = map(lambda x: x.rstrip('\n'), split_string[2::2])
output = zip(labels, values)

zip returns a lazy iterator that can be consumed only once. You can either iterate over the output directly or materialize it as a tuple by doing something like

output = tuple(output)

In this case the output will be

(('ID', 'ID123'),
 ('PUBLISHED_TWEET', 'ABC\nDEF\nGHI'),
 ('EMPLOYEE_ID', 'ID234'),
 ('TWEET', 'ABC\nDEF\nGHI'),
 ('ID', 'ID345'),
 ('TWEET', '#@ABC\nDEF\nGHI@.[]'),
 ('USER_IDD', 'ID456'),
 ('TWEET', 'google.com\n123456789'))

You can check the python demo here.
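If only the two lists from the question are needed, the label/value pairs can be filtered by label. This is a rough sketch rather than a definitive solution: it assumes that every tweet label ends with TWEET (as TWEET and PUBLISHED_TWEET do in the sample) and that every other label marks an id. The names ids and tweet_output are illustrative.

```python
import re

# Sample input rebuilt from the question.
string = (
    "ID::ID123\n"
    "PUBLISHED_TWEET::ABC\nDEF\nGHI\n"
    "EMPLOYEE_ID::ID234\n"
    "TWEET::ABC\nDEF\nGHI\n"
    "ID::ID345\n"
    "TWEET::#@ABC\nDEF\nGHI@.[]\n"
    "USER_IDD::ID456\n"
    "TWEET::google.com\n123456789"
)

split_string = re.split(r"([A-Z_]+)::", string)

# Odd positions are labels, even positions (from 2) are their values.
labels = split_string[1::2]
values = [value.rstrip("\n") for value in split_string[2::2]]

# Assumption: a label ending in "TWEET" marks tweet text; anything else is an id.
ids = [v for label, v in zip(labels, values) if not label.endswith("TWEET")]
tweet_output = [v for label, v in zip(labels, values) if label.endswith("TWEET")]

print(ids)
print(tweet_output)
```

Unlike a purely positional approach, filtering by label keeps working if an id or a tweet is missing from one of the records, as long as the labels follow the assumed naming.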