Extract all strings between each pair of the same two delimiters using regex or Python functions

I am trying to use regex or Python functions to extract all the bolded text, i.e. the text between ' and <=.

"[Text(447.1153846153846, 471.625, 'the <= 0.5\nentropy = 0.97\nsamples = 100.0%\nvalue = [0.399, 0.601]\nclass = True News'), Text(238.46153846153845, 336.875, 'donald <= 0.5\nentropy = 0.921\nsamples = 83.7%\nvalue = [0.336, 0.664]\nclass = True News'), Text(119.23076923076923, 202.125, 'hillary <= 0.5\nentropy = 0.981\nsamples = 55.6%\nvalue = [0.42, 0.58]\nclass = True News'), Text(59.61538461538461, 67.375, '\n (...) \n'), Text(178.84615384615384, 67.375, '\n (...) \n'), Text(357.6923076923077, 202.125, 'hillary <= 0.5\nentropy = 0.663\nsamples = 28.2%\nvalue = [0.172, 0.828]\nclass = True News'), Text(298.0769230769231, 67.375, '\n (...) \n'), Text(417.30769230769226, 67.375, '\n (...) \n'), Text(655.7692307692307, 336.875, 'trumps <= 0.5\nentropy = 0.859\nsamples = 16.3%\nvalue = [0.718, 0.282]\nclass = Fake News'), Text(596.1538461538462, 202.125, 'hillary <= 0.5\nentropy = 0.821\nsamples = 15.7%\nvalue = [0.744, 0.256]\nclass = Fake News'), Text(536.5384615384615, 67.375, '\n (...) \n'), Text(655.7692307692307, 67.375, '\n (...) \n'), Text(715.3846153846154, 202.125, 'entropy = 0.0\nsamples = 0.6%\nvalue = [0.0, 1.0]\nclass = True News')]"

The closest I have gotten so far is (?=')(.*)(?= <=), but it does not work.

Could anyone tell me how to extract the text between the single quote and <=?

The solution does not necessarily need to use regex.

Thanks!

CodePudding user response：

Use a lookbehind for the single quote ' and a lookahead for <=;
the non-quote characters in between are then matched as the content.

r"(?<=')[^']*?(?=\s*<=)"

https://regex101.com/r/KlYLQ2/1
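
A minimal sketch of running that pattern from Python with re.findall. The sample string here is a shortened, made-up stand-in in the same shape as the question's data, not the full input:

```python
import re

# Shortened stand-in for the question's string (assumed for brevity).
text = ("[Text(447.1, 471.6, 'the <= 0.5\nentropy = 0.97'), "
        "Text(59.6, 67.3, '\n (...) \n'), "
        "Text(238.4, 336.8, 'donald <= 0.5\nentropy = 0.921')]")

# Lookbehind for the quote, lazy run of non-quote characters,
# lookahead for optional whitespace followed by "<=".
pattern = re.compile(r"(?<=')[^']*?(?=\s*<=)")

words = pattern.findall(text)
print(words)  # ['the', 'donald']
```

Because both delimiters sit inside lookarounds, findall returns only the text in between, and the '\n (...) \n' placeholder entries are skipped since they contain no <=.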

CodePudding user response：

One approach:

import re

text = "[Text(447.1153846153846, 471.625, 'the <= 0.5\nentropy = 0.97\nsamples = 100.0%\nvalue = [0.399, " \
       "0.601]\nclass = True News'), Text(238.46153846153845, 336.875, 'donald <= 0.5\nentropy = 0.921\nsamples = " \
       "83.7%\nvalue = [0.336, 0.664]\nclass = True News'), Text(119.23076923076923, 202.125, 'hillary <= " \
       "0.5\nentropy = 0.981\nsamples = 55.6%\nvalue = [0.42, 0.58]\nclass = True News'), Text(59.61538461538461, " \
       "67.375, '\n (...) \n'), Text(178.84615384615384, 67.375, '\n (...) \n'), Text(357.6923076923077, 202.125, " \
       "'hillary <= 0.5\nentropy = 0.663\nsamples = 28.2%\nvalue = [0.172, 0.828]\nclass = True News'), " \
       "Text(298.0769230769231, 67.375, '\n (...) \n'), Text(417.30769230769226, 67.375, '\n (...) \n'), " \
       "Text(655.7692307692307, 336.875, 'trumps <= 0.5\nentropy = 0.859\nsamples = 16.3%\nvalue = [0.718, " \
       "0.282]\nclass = Fake News'), Text(596.1538461538462, 202.125, 'hillary <= 0.5\nentropy = 0.821\nsamples = " \
       "15.7%\nvalue = [0.744, 0.256]\nclass = Fake News'), Text(536.5384615384615, 67.375, '\n (...) \n'), " \
       "Text(655.7692307692307, 67.375, '\n (...) \n'), Text(715.3846153846154, 202.125, 'entropy = 0.0\nsamples = " \
       "0.6%\nvalue = [0.0, 1.0]\nclass = True News')] "

for match in re.finditer(", '(.*)?<=", text):
    print(match.group(1))

Output

the 
donald 
hillary 
hillary 
trumps 
hillary
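
Note that the names captured this way keep the space that precedes <=. A small sketch, on a shortened made-up sample rather than the full string, comparing the pattern above with a lazy variant that leaves the whitespace outside the group:

```python
import re

# Shortened stand-in for the question's string (assumed for brevity).
sample = ("[Text(1.0, 2.0, 'the <= 0.5\nentropy = 0.97'), "
          "Text(3.0, 4.0, 'donald <= 0.5\nentropy = 0.921')]")

# Pattern from the answer: the group runs right up to "<=",
# so the trailing space is part of the capture.
with_space = [m.group(1) for m in re.finditer(", '(.*)?<=", sample)]

# Lazy group with \s* outside it: the whitespace is matched but not captured.
trimmed = [m.group(1) for m in re.finditer(r", '(.*?)\s*<=", sample)]

print(with_space)  # ['the ', 'donald ']
print(trimmed)     # ['the', 'donald']
```

Both versions rely on . not matching a newline by default, which keeps each match on the line where the quote opens. Calling .strip() on each captured group would be another way to drop the space.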

CodePudding user response：

This regex works. It uses a named group so that the exact data you want is easy to refer to. It is set up to find a run of word characters (letters, digits, underscore) followed by " <=". We then use finditer to get all of the matches.

import re

data = "[Text(447.1153846153846, 471.625, 'the <= 0.5\nentropy = 0.97\nsamples = 100.0%\nvalue = [0.399, 0.601]\nclass = True News'), Text(238.46153846153845, 336.875, 'donald <= 0.5\nentropy = 0.921\nsamples = 83.7%\nvalue = [0.336, 0.664]\nclass = True News'), Text(119.23076923076923, 202.125, 'hillary <= 0.5\nentropy = 0.981\nsamples = 55.6%\nvalue = [0.42, 0.58]\nclass = True News'), Text(59.61538461538461, 67.375, '\n (...) \n'), Text(178.84615384615384, 67.375, '\n (...) \n'), Text(357.6923076923077, 202.125, 'hillary <= 0.5\nentropy = 0.663\nsamples = 28.2%\nvalue = [0.172, 0.828]\nclass = True News'), Text(298.0769230769231, 67.375, '\n (...) \n'), Text(417.30769230769226, 67.375, '\n (...) \n'), Text(655.7692307692307, 336.875, 'trumps <= 0.5\nentropy = 0.859\nsamples = 16.3%\nvalue = [0.718, 0.282]\nclass = Fake News'), Text(596.1538461538462, 202.125, 'hillary <= 0.5\nentropy = 0.821\nsamples = 15.7%\nvalue = [0.744, 0.256]\nclass = Fake News'), Text(536.5384615384615, 67.375, '\n (...) \n'), Text(655.7692307692307, 67.375, '\n (...) \n'), Text(715.3846153846154, 202.125, 'entropy = 0.0\nsamples = 0.6%\nvalue = [0.0, 1.0]\nclass = True News')]"

fmt = re.compile(r'(?P<info>[\w\d]+) <=', re.I)
for m in fmt.finditer(data):
    print(m.group('info'))

If you want to go the whole nine yards, the code below parses the entire string into named tuples that mostly mirror the format of the text. I didn't know what the first two values represent, so I just called them x and y. I went this far because the names alone don't seem very useful, and I assume this question is just a precursor to eventually extracting more data. This extracts all of the data. Any entry that does not match the data pattern, such as the \n (...) \n placeholders, is printed as being "empty" and is not stored in the entries list.

import re
from collections import namedtuple

data    = "[Text(447.1153846153846, 471.625, 'the <= 0.5\nentropy = 0.97\nsamples = 100.0%\nvalue = [0.399, 0.601]\nclass = True News'), Text(238.46153846153845, 336.875, 'donald <= 0.5\nentropy = 0.921\nsamples = 83.7%\nvalue = [0.336, 0.664]\nclass = True News'), Text(119.23076923076923, 202.125, 'hillary <= 0.5\nentropy = 0.981\nsamples = 55.6%\nvalue = [0.42, 0.58]\nclass = True News'), Text(59.61538461538461, 67.375, '\n (...) \n'), Text(178.84615384615384, 67.375, '\n (...) \n'), Text(357.6923076923077, 202.125, 'hillary <= 0.5\nentropy = 0.663\nsamples = 28.2%\nvalue = [0.172, 0.828]\nclass = True News'), Text(298.0769230769231, 67.375, '\n (...) \n'), Text(417.30769230769226, 67.375, '\n (...) \n'), Text(655.7692307692307, 336.875, 'trumps <= 0.5\nentropy = 0.859\nsamples = 16.3%\nvalue = [0.718, 0.282]\nclass = Fake News'), Text(596.1538461538462, 202.125, 'hillary <= 0.5\nentropy = 0.821\nsamples = 15.7%\nvalue = [0.744, 0.256]\nclass = Fake News'), Text(536.5384615384615, 67.375, '\n (...) \n'), Text(655.7692307692307, 67.375, '\n (...) \n'), Text(715.3846153846154, 202.125, 'entropy = 0.0\nsamples = 0.6%\nvalue = [0.0, 1.0]\nclass = True News')]"

#regex to describe the overall entry
entfmt  = re.compile(r'Text\((?P<x>([\d\.]+)), (?P<y>([\d\.]+)), \'(?P<data>([^\']+))\'\)', re.I|re.S)

#format all of the float groups ~ 
#  flt is a repeatable chunk so we create this part of the expression in a loop
#  all this really does is make the final datfmt regex seem shorter
flt     = r'{}(?P<{}>([\d\.]+))'
args    = ('_fval', r'\nentropy = _ent', r'\nsamples = _samp', r'%\nvalue = \[_lval', ', _rval')
fltreg  = ''.join([flt.format(a, b) for (a, b) in [arg.split('_') for arg in args]])

#regex to describe the data portion of an entry
datfmt  = re.compile(r'(?P<focus>([\w\d]+)) <= {}\]\nclass = (?P<class>(.+))'.format(fltreg), re.I|re.S)

#container for individual entries
entries = []

#entry descriptor
Entry   = namedtuple('Entry', 'x y focus fvalue entropy samples value cls')

#for storing entry index
c = 0

#find all entries
for m in entfmt.finditer(data):
    #consistent entry data
    x, y = float(m.group('x')), float(m.group('y'))
    #get all data for this entry
    m2 = datfmt.match(m.group('data'))
    #make sure this was not an empty entry
    if m2:
        #append entry
        entries.append(Entry(x, y,
                             m2.group('focus'), 
                             float(m2.group('fval')), 
                             float(m2.group('ent')), 
                             float(m2.group('samp')), 
                             [float(m2.group('lval')), float(m2.group('rval'))], 
                             m2.group('class')))
    else:
        #entry has empty data
        print('Data[{}] with [x:{}, y:{}] is empty'.format(c, x, y))
        
    #increment entry index
    c += 1
        
#print all entries
print(*entries, sep='\n')

#Entry(x=447.1153846153846 , y=471.625, focus='the'    , fvalue=0.5, entropy=0.97 , samples=100.0, value=[0.399, 0.601], cls='True News')
#Entry(x=238.46153846153845, y=336.875, focus='donald' , fvalue=0.5, entropy=0.921, samples=83.7 , value=[0.336, 0.664], cls='True News')
#Entry(x=119.23076923076923, y=202.125, focus='hillary', fvalue=0.5, entropy=0.981, samples=55.6 , value=[0.42 , 0.58 ], cls='True News')
#Entry(x=357.6923076923077 , y=202.125, focus='hillary', fvalue=0.5, entropy=0.663, samples=28.2 , value=[0.172, 0.828], cls='True News')
#Entry(x=655.7692307692307 , y=336.875, focus='trumps' , fvalue=0.5, entropy=0.859, samples=16.3 , value=[0.718, 0.282], cls='Fake News')
#Entry(x=596.1538461538462 , y=202.125, focus='hillary', fvalue=0.5, entropy=0.821, samples=15.7 , value=[0.744, 0.256], cls='Fake News')

CodePudding user response：

No lookarounds, short and working:

re.findall(r"'(\w+)\s*<=", s)


EXPLANATION

--------------------------------------------------------------------------------
  '                        '\''
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  <=                       '<='
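
A short sketch of what the \w+ choice implies in practice. The sample string is made up for illustration, and the 'new york' entry is a hypothetical feature name containing a space, which does not occur in the question's data:

```python
import re

# Made-up sample; 'new york' is a hypothetical name with a space in it.
s = ("[Text(1.0, 2.0, 'the <= 0.5\nentropy = 0.97'), "
     "Text(3.0, 4.0, '\n (...) \n'), "
     "Text(5.0, 6.0, 'new york <= 0.5\nentropy = 0.9')]")

# \w+ only accepts a single run of word characters after the quote.
strict = re.findall(r"'(\w+)\s*<=", s)

# A lazy run of non-quote characters also accepts spaces, hyphens, etc.
loose = re.findall(r"'([^']+?)\s*<=", s)

print(strict)  # ['the']
print(loose)   # ['the', 'new york']
```

For the single-word names in the question both patterns give the same result; the looser class only matters if the names can contain characters outside \w.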