I am new to regular expressions and have the text below. How can I use a regex to extract all words that contain at least one digit? I would really appreciate any help.
text = '''The start of the Civil War in 1861 followed by Tennessee’s secession from the Union and the lodging of
wounded Confederate soldiers on campus did not close East Tennessee University. By spring 1862 when the
trustees finally suspended operations, the majority of students had joined the military, President Joseph
Ridley had resigned, and two professors had left the university. Wounded Confederate soldiers were lodged
at university buildings after the January 1862 Battle of Mill Springs in Kentucky, known as the Battle of
Fishing Creek to the Confederacy. In the fall of 1863, Union troops forced the Confederates out of
Knoxville. On the Hill, the Union Army enclosed the three university buildings with an earthen
fortification they named Fort Byington in honor of an officer from Michigan who was killed in the defense
of Knoxville. They used the buildings for their headquarters, barracks, and a hospital for Black troops.
Despite a Confederate attempt to retake the city by siege—climaxed by a bloody, abortive attack on Fort
Sanders on November 29, 1863—the Union held and occupied Knoxville for the rest of the war. During the
battle, the Hill was hit with artillery fire from Confederate guns located in a trench at the site of
UT’s present-day Sorority Village. Campus also sustained a great deal of damage caused by the Union Army.
Troops denuded the grounds of trees, ruined the steward’s house, and destroyed the gymnasium with
misdirected cannon fire aimed at Confederate troops across the river. After the Civil War ended in 1865
and the Union Army left campus, Thomas Humes was elected university president. The university reopened in
1866 and operated for six months downtown in the Deaf and Dumb Asylum while repairs began at the damaged
campus. A petition to the federal war department for monetary compensation for campus damage done by the
Union Army undoubtedly received more favorable consideration because of Humes’s known Union loyalty
throughout the war. A Senate committee which considered the bill for damages also noted that East
Tennessee University was “particularly deserving of the favorable consideration of Congress” because it
was “the only educational institution of known loyalty…in any of the seceding states.” However in 1873,
President Ulysses S. Grant vetoed the bill that would have provided $18,500 to the university because he
felt it would set a bad precedent. The bill was redrafted specifying that the payment was compensation
for aid East Tennessee University gave to the Union during the war. On June 22, 1874, President Grant
signed the new bill and the trustees accepted the funds the same day with an agreement to release the
government from all claims. (More than a century and a half later, a buried Union trench was located in
2019 on the north side of the present-day McClung Museum with the use of ground-penetrating radar.)
'''
CodePudding user response:
You could use this pattern:
r'\w*\d+\w*'
How it works:
\w*
matches zero or more word characters (letters, digits, or underscore; not spaces or punctuation)
\d+
matches one or more digits
\w*
matches zero or more word characters again
Using re.findall:
import re
re.findall(r'\w*\d+\w*', text)
we get:
['1861',
'1862',
'1862',
'1863',
'29',
'1863',
'1865',
'1866',
'1873',
'18',
'500',
'22',
'1874',
'2019']
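As a quick check of the pattern, here is a minimal runnable sketch. The `sample` string is a shortened excerpt made up from the question's text for brevity (an assumption, not the full input); the full `text` should behave the same way.

```python
import re

# Shortened excerpt of the question's text (assumed here for brevity).
sample = (
    "However in 1873, President Grant vetoed the bill that would have "
    "provided $18,500. On June 22, 1874, he signed the new bill."
)

# \w*  zero or more word characters before the digits
# \d+  at least one digit
# \w*  zero or more word characters after the digits
pattern = re.compile(r"\w*\d+\w*")

matches = pattern.findall(sample)
print(matches)  # ['1873', '18', '500', '22', '1874']
```

Note that `$18,500` comes back as two separate matches, `'18'` and `'500'`, because neither `$` nor `,` is a word character.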
CodePudding user response:
Is this what you mean?
re.findall(r"\S*\d+\S*", text)
\S any character except whitespace,
\d any digit,
+ one or more occurrences,
* zero or more occurrences
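A short sketch of how the `\S` version behaves, since it keeps any punctuation attached to the token. The `sample` string is an assumed excerpt of the question's text, and the `strip` step is only one possible way to trim trailing punctuation, not part of the original answer.

```python
import re

# Assumed short excerpt of the question's text.
sample = "provided $18,500 to the university. On June 22, 1874, President Grant signed"

# \S* grabs every non-whitespace character around the digits,
# so currency symbols and trailing commas stay in the match.
raw = re.findall(r"\S*\d+\S*", sample)
print(raw)  # ['$18,500', '22,', '1874,']

# Optional cleanup (an illustrative choice): trim common punctuation
# from both ends of each token while leaving inner characters alone.
cleaned = [token.strip(".,;:!?()") for token in raw]
print(cleaned)  # ['$18,500', '22', '1874']
```

Compared with the `\w` pattern, this keeps `$18,500` together as one token, at the cost of having to trim punctuation afterwards.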