I am trying to use regex or Python functions to extract all the bolded text, i.e. the text between ' and <=.
"[Text(447.1153846153846, 471.625, 'the <= 0.5\nentropy = 0.97\nsamples = 100.0%\nvalue = [0.399, 0.601]\nclass = True News'), Text(238.46153846153845, 336.875, 'donald <= 0.5\nentropy = 0.921\nsamples = 83.7%\nvalue = [0.336, 0.664]\nclass = True News'), Text(119.23076923076923, 202.125, 'hillary <= 0.5\nentropy = 0.981\nsamples = 55.6%\nvalue = [0.42, 0.58]\nclass = True News'), Text(59.61538461538461, 67.375, '\n (...) \n'), Text(178.84615384615384, 67.375, '\n (...) \n'), Text(357.6923076923077, 202.125, 'hillary <= 0.5\nentropy = 0.663\nsamples = 28.2%\nvalue = [0.172, 0.828]\nclass = True News'), Text(298.0769230769231, 67.375, '\n (...) \n'), Text(417.30769230769226, 67.375, '\n (...) \n'), Text(655.7692307692307, 336.875, 'trumps <= 0.5\nentropy = 0.859\nsamples = 16.3%\nvalue = [0.718, 0.282]\nclass = Fake News'), Text(596.1538461538462, 202.125, 'hillary <= 0.5\nentropy = 0.821\nsamples = 15.7%\nvalue = [0.744, 0.256]\nclass = Fake News'), Text(536.5384615384615, 67.375, '\n (...) \n'), Text(655.7692307692307, 67.375, '\n (...) \n'), Text(715.3846153846154, 202.125, 'entropy = 0.0\nsamples = 0.6%\nvalue = [0.0, 1.0]\nclass = True News')]"
So far the closest I got was (?=')(.*)(?= <=), but it did not work.
Could anyone tell me how to extract the bolded text between the single quote and <=?
It does not necessarily need to use regex.
Thanks!
CodePudding user response:
Use a lookbehind for the single quote ' and a lookahead for <=. The non-quote characters in between are then matched as the content.
r"(?<=')[^']*?(?=\s*<=)"
https://regex101.com/r/KlYLQ2/1
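As a minimal sketch of how that pattern could be used from Python: the `sample` string below is a shortened, made-up stand-in in the same shape as the question's data, and the names `sample`, `pattern` and `words` are only illustrative.

```python
import re

# Shortened stand-in for the question's string (assumed shape, not the full data).
sample = (
    "[Text(447.1, 471.6, 'the <= 0.5\nentropy = 0.97\nclass = True News'), "
    "Text(59.6, 67.3, '\n  (...)  \n'), "
    "Text(655.7, 336.8, 'trumps <= 0.5\nentropy = 0.859\nclass = Fake News')]"
)

# (?<=')     : the match must start right after a single quote
# [^']*?     : as few non-quote characters as possible
# (?=\s*<=)  : the match must be followed by optional whitespace and "<="
pattern = re.compile(r"(?<=')[^']*?(?=\s*<=)")

words = pattern.findall(sample)
print(words)
```

Because `[^']` cannot cross a quote, an entry such as `'\n (...) \n'` that contains no `<=` simply produces no match.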
CodePudding user response:
One approach:
import re
text = "[Text(447.1153846153846, 471.625, 'the <= 0.5\nentropy = 0.97\nsamples = 100.0%\nvalue = [0.399, " \
"0.601]\nclass = True News'), Text(238.46153846153845, 336.875, 'donald <= 0.5\nentropy = 0.921\nsamples = " \
"83.7%\nvalue = [0.336, 0.664]\nclass = True News'), Text(119.23076923076923, 202.125, 'hillary <= " \
"0.5\nentropy = 0.981\nsamples = 55.6%\nvalue = [0.42, 0.58]\nclass = True News'), Text(59.61538461538461, " \
"67.375, '\n (...) \n'), Text(178.84615384615384, 67.375, '\n (...) \n'), Text(357.6923076923077, 202.125, " \
"'hillary <= 0.5\nentropy = 0.663\nsamples = 28.2%\nvalue = [0.172, 0.828]\nclass = True News'), " \
"Text(298.0769230769231, 67.375, '\n (...) \n'), Text(417.30769230769226, 67.375, '\n (...) \n'), " \
"Text(655.7692307692307, 336.875, 'trumps <= 0.5\nentropy = 0.859\nsamples = 16.3%\nvalue = [0.718, " \
"0.282]\nclass = Fake News'), Text(596.1538461538462, 202.125, 'hillary <= 0.5\nentropy = 0.821\nsamples = " \
"15.7%\nvalue = [0.744, 0.256]\nclass = Fake News'), Text(536.5384615384615, 67.375, '\n (...) \n'), " \
"Text(655.7692307692307, 67.375, '\n (...) \n'), Text(715.3846153846154, 202.125, 'entropy = 0.0\nsamples = " \
"0.6%\nvalue = [0.0, 1.0]\nclass = True News')] "
# lazy match up to the optional whitespace before "<=", so no trailing space is captured
for match in re.finditer(r", '(.*?)\s*<=", text):
    print(match.group(1))
Output
the
donald
hillary
hillary
trumps
hillary
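Since the question allows non-regex solutions, the same result could also be sketched with plain string methods. This is only an illustration: `split_features` is a hypothetical helper, the `sample` string is a shortened stand-in for the real data, and the approach assumes that the quoted labels never contain an apostrophe themselves.

```python
def split_features(text):
    """Return the word before '<=' in each quoted label, using only str methods."""
    features = []
    # After splitting on single quotes, the odd-indexed pieces are the label contents.
    for label in text.split("'")[1::2]:
        head, sep, _ = label.partition("<=")
        if sep:  # skip labels without a "<=" test, e.g. '\n (...) \n'
            features.append(head.strip())
    return features


# Shortened stand-in for the question's string (assumed shape, not the full data).
sample = (
    "[Text(1.0, 2.0, 'the <= 0.5\nentropy = 0.97'), "
    "Text(3.0, 4.0, '\n  (...)  \n'), "
    "Text(5.0, 6.0, 'trumps <= 0.5\nentropy = 0.859'), "
    "Text(7.0, 8.0, 'entropy = 0.0\nsamples = 0.6%')]"
)
print(split_features(sample))
```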
CodePudding user response:
This regex works. We use a named group so it is easy to refer to the exact data you want. It is set up to find a run of word characters (letters, digits, or underscores) followed by " <=". We then use finditer to get all of the matches.
import re
data = "[Text(447.1153846153846, 471.625, 'the <= 0.5\nentropy = 0.97\nsamples = 100.0%\nvalue = [0.399, 0.601]\nclass = True News'), Text(238.46153846153845, 336.875, 'donald <= 0.5\nentropy = 0.921\nsamples = 83.7%\nvalue = [0.336, 0.664]\nclass = True News'), Text(119.23076923076923, 202.125, 'hillary <= 0.5\nentropy = 0.981\nsamples = 55.6%\nvalue = [0.42, 0.58]\nclass = True News'), Text(59.61538461538461, 67.375, '\n (...) \n'), Text(178.84615384615384, 67.375, '\n (...) \n'), Text(357.6923076923077, 202.125, 'hillary <= 0.5\nentropy = 0.663\nsamples = 28.2%\nvalue = [0.172, 0.828]\nclass = True News'), Text(298.0769230769231, 67.375, '\n (...) \n'), Text(417.30769230769226, 67.375, '\n (...) \n'), Text(655.7692307692307, 336.875, 'trumps <= 0.5\nentropy = 0.859\nsamples = 16.3%\nvalue = [0.718, 0.282]\nclass = Fake News'), Text(596.1538461538462, 202.125, 'hillary <= 0.5\nentropy = 0.821\nsamples = 15.7%\nvalue = [0.744, 0.256]\nclass = Fake News'), Text(536.5384615384615, 67.375, '\n (...) \n'), Text(655.7692307692307, 67.375, '\n (...) \n'), Text(715.3846153846154, 202.125, 'entropy = 0.0\nsamples = 0.6%\nvalue = [0.0, 1.0]\nclass = True News')]"
fmt = re.compile(r'(?P<info>[\w\d]+) <=', re.I)
for m in fmt.finditer(data):
    print(m.group('info'))
If you want to go the whole nine yards, the code below parses the entire thing into a named tuple that mostly mirrors the format of the text. I didn't know what the first two values represent, so I just called them x and y. I went this far because what you want doesn't seem very useful on its own, and I assume this question is just a precursor to eventually pinpointing more data. This pinpoints all of the data. Any entry whose data is \n (...) \n is printed as being "empty" and is not stored in the entries list. The same happens to the final entry, which has no <= test.
import re
from collections import namedtuple
data = "[Text(447.1153846153846, 471.625, 'the <= 0.5\nentropy = 0.97\nsamples = 100.0%\nvalue = [0.399, 0.601]\nclass = True News'), Text(238.46153846153845, 336.875, 'donald <= 0.5\nentropy = 0.921\nsamples = 83.7%\nvalue = [0.336, 0.664]\nclass = True News'), Text(119.23076923076923, 202.125, 'hillary <= 0.5\nentropy = 0.981\nsamples = 55.6%\nvalue = [0.42, 0.58]\nclass = True News'), Text(59.61538461538461, 67.375, '\n (...) \n'), Text(178.84615384615384, 67.375, '\n (...) \n'), Text(357.6923076923077, 202.125, 'hillary <= 0.5\nentropy = 0.663\nsamples = 28.2%\nvalue = [0.172, 0.828]\nclass = True News'), Text(298.0769230769231, 67.375, '\n (...) \n'), Text(417.30769230769226, 67.375, '\n (...) \n'), Text(655.7692307692307, 336.875, 'trumps <= 0.5\nentropy = 0.859\nsamples = 16.3%\nvalue = [0.718, 0.282]\nclass = Fake News'), Text(596.1538461538462, 202.125, 'hillary <= 0.5\nentropy = 0.821\nsamples = 15.7%\nvalue = [0.744, 0.256]\nclass = Fake News'), Text(536.5384615384615, 67.375, '\n (...) \n'), Text(655.7692307692307, 67.375, '\n (...) \n'), Text(715.3846153846154, 202.125, 'entropy = 0.0\nsamples = 0.6%\nvalue = [0.0, 1.0]\nclass = True News')]"
#regex to describe the overall entry
entfmt = re.compile(r'Text\((?P<x>([\d\.]+)), (?P<y>([\d\.]+)), \'(?P<data>([^\']+))\'\)', re.I|re.S)
#format all of the float groups ~
# flt is a repeatable chunk so we create this part of the expression in a loop
# all this really does is make the final datfmt regex seem shorter
flt = r'{}(?P<{}>([\d\.]+))'
#each arg is '<literal prefix>_<group name>'; raw strings keep \n and \[ as regex escapes
args = ('_fval', r'\nentropy = _ent', r'\nsamples = _samp', r'%\nvalue = \[_lval', ', _rval')
fltreg = ''.join([flt.format(a, b) for (a, b) in [arg.split('_') for arg in args]])
#regex to describe the data portion of an entry
datfmt = re.compile(r'(?P<focus>([\w\d]+)) <= {}\]\nclass = (?P<class>(.+))'.format(fltreg), re.I|re.S)
#container for individual entries
entries = []
#entry descriptor
Entry = namedtuple('Entry', 'x y focus fvalue entropy samples value cls')
#for storing entry index
c = 0
#find all entries
for m in entfmt.finditer(data):
    #consistent entry data
    x, y = float(m.group('x')), float(m.group('y'))
    #get all data for this entry
    m2 = datfmt.match(m.group('data'))
    #make sure this was not an empty entry
    if m2:
        #append entry
        entries.append(Entry(x, y,
                             m2.group('focus'),
                             float(m2.group('fval')),
                             float(m2.group('ent')),
                             float(m2.group('samp')),
                             [float(m2.group('lval')), float(m2.group('rval'))],
                             m2.group('class')))
    else:
        #entry has empty data
        print('Data[{}] with [x:{}, y:{}] is empty'.format(c, x, y))
    #increment entry index
    c += 1
#print all entries
print(*entries, sep='\n')
#Entry(x=447.1153846153846 , y=471.625, focus='the' , fvalue=0.5, entropy=0.97 , samples=100.0, value=[0.399, 0.601], cls='True News')
#Entry(x=238.46153846153845, y=336.875, focus='donald' , fvalue=0.5, entropy=0.921, samples=83.7 , value=[0.336, 0.664], cls='True News')
#Entry(x=119.23076923076923, y=202.125, focus='hillary', fvalue=0.5, entropy=0.981, samples=55.6 , value=[0.42 , 0.58 ], cls='True News')
#Entry(x=357.6923076923077 , y=202.125, focus='hillary', fvalue=0.5, entropy=0.663, samples=28.2 , value=[0.172, 0.828], cls='True News')
#Entry(x=655.7692307692307 , y=336.875, focus='trumps' , fvalue=0.5, entropy=0.859, samples=16.3 , value=[0.718, 0.282], cls='Fake News')
#Entry(x=596.1538461538462 , y=202.125, focus='hillary', fvalue=0.5, entropy=0.821, samples=15.7 , value=[0.744, 0.256], cls='Fake News')
CodePudding user response:
No lookarounds, short and working:
re.findall(r"'(\w+)\s*<=", s)
See regex proof | Python proof.
EXPLANATION
--------------------------------------------------------------------------------
  '                        '\''
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  <=                       '<='
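To see the pieces of that explanation working together, here is a small sketch. The string `s` is a shortened, made-up sample in the same shape as the question's data, and `features` is just an illustrative name.

```python
import re

# Shortened stand-in for the question's string (assumed shape, not the full data).
s = (
    "Text(1.0, 2.0, 'donald <= 0.5\nentropy = 0.921'), "
    "Text(3.0, 4.0, '\n  (...)  \n'), "
    "Text(5.0, 6.0, 'hillary <= 0.5\nentropy = 0.663')"
)

# With exactly one capture group, re.findall returns only the captured text,
# so the quote, the whitespace and the "<=" are matched but not returned.
features = re.findall(r"'(\w+)\s*<=", s)
print(features)
```

The quote must be followed directly by word characters, so closing quotes and the `'\n (...) \n'` entries never start a match.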