How can I use regular expressions to extract all words with at least one digit in text with Python

Time:02-22

I am new to regular expressions and have the text below. How can I use a regex to extract all words that contain at least one digit? I would really appreciate any help.

text = '''The start of the Civil War in 1861 followed by Tennessee’s secession from the Union and the lodging of 
wounded Confederate soldiers on campus did not close East Tennessee University. By spring 1862 when the 
trustees finally suspended operations, the majority of students had joined the military, President Joseph 
Ridley had resigned, and two professors had left the university. Wounded Confederate soldiers were lodged 
at university buildings after the January 1862 Battle of Mill Springs in Kentucky, known as the Battle of 
Fishing Creek to the Confederacy. In the fall of 1863, Union troops forced the Confederates out of 
Knoxville. On the Hill, the Union Army enclosed the three university buildings with an earthen 
fortification they named Fort Byington in honor of an officer from Michigan who was killed in the defense 
of Knoxville. They used the buildings for their headquarters, barracks, and a hospital for Black troops. 
Despite a Confederate attempt to retake the city by siege—climaxed by a bloody, abortive attack on Fort 
Sanders on November 29, 1863—the Union held and occupied Knoxville for the rest of the war. During the 
battle, the Hill was hit with artillery fire from Confederate guns located in a trench at the site of 
UT’s present-day Sorority Village. Campus also sustained a great deal of damage caused by the Union Army. 
Troops denuded the grounds of trees, ruined the steward’s house, and destroyed the gymnasium with 
misdirected cannon fire aimed at Confederate troops across the river. After the Civil War ended in 1865 
and the Union Army left campus, Thomas Humes was elected university president. The university reopened in 
1866 and operated for six months downtown in the Deaf and Dumb Asylum while repairs began at the damaged 
campus. A petition to the federal war department for monetary compensation for campus damage done by the 
Union Army undoubtedly received more favorable consideration because of Humes’s known Union loyalty 
throughout the war. A Senate committee which considered the bill for damages also noted that East 
Tennessee University was “particularly deserving of the favorable consideration of Congress” because it 
was “the only educational institution of known loyalty…in any of the seceding states.” However in 1873, 
President Ulysses S. Grant vetoed the bill that would have provided $18,500 to the university because he 
felt it would set a bad precedent. The bill was redrafted specifying that the payment was compensation 
for aid East Tennessee University gave to the Union during the war. On June 22, 1874, President Grant 
signed the new bill and the trustees accepted the funds the same day with an agreement to release the 
government from all claims. (More than a century and a half later, a buried Union trench was located in 
2019 on the north side of the present-day McClung Museum with the use of ground-penetrating radar.)
'''

CodePudding user response:

You could use this pattern:

'\w*\d+\w*'

How it works:
\w* matches zero or more word characters (letters, digits, or underscore; not spaces or punctuation)
\d+ matches one or more digits
\w* matches zero or more word characters again

Using re.findall on the text:

import re
re.findall(r'\w*\d+\w*', text)

we get:

['1861',
 '1862',
 '1862',
 '1863',
 '29',
 '1863',
 '1865',
 '1866',
 '1873',
 '18',
 '500',
 '22',
 '1874',
 '2019']
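
To try this without pasting the whole passage, here is a minimal sketch on a shortened string. The sample sentence and the variable names are assumptions made for illustration only; the pattern is the one described in this answer.

```python
import re

# Shortened sample (an assumption for illustration, not the full passage above)
sample = "By spring 1862 the bill would have provided $18,500 on June 22, 1874."

# \w*  optional word characters before the digits
# \d+  at least one digit
# \w*  optional word characters after the digits
pattern = re.compile(r"\w*\d+\w*")

words_with_digits = pattern.findall(sample)
print(words_with_digits)
```

Note that "$18,500" comes back as two separate matches, '18' and '500', because the comma is not a word character and so ends the first match.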

CodePudding user response:

Is this what you mean?

re.findall(r"\S*\d+\S*", text)

\S matches any character except whitespace, \d matches any digit,
+ means one or more occurrences, and * means zero or more occurrences.
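
The two patterns in this thread behave differently around punctuation, which a small comparison may make clearer. This is only a sketch: the sample sentence is made up from phrases in the question, and the trailing-punctuation cleanup is one possible choice, not something either answer proposed.

```python
import re

# Short sample sentence (an assumption for illustration, not the full passage)
sample = "Grant vetoed $18,500 in 1873, then signed on June 22, 1874."

# \w only spans letters, digits and underscore, so "$18,500" splits at the comma
word_matches = re.findall(r"\w*\d+\w*", sample)

# \S spans any non-whitespace run, so "$", "," and "." stay attached
token_matches = re.findall(r"\S*\d+\S*", sample)

# Optional cleanup: drop trailing sentence punctuation from the \S matches
cleaned = [token.rstrip(".,;:!?") for token in token_matches]

print(word_matches)
print(token_matches)
print(cleaned)
```

On this sample the \w version gives ['18', '500', '1873', '22', '1874'], the \S version gives ['$18,500', '1873,', '22,', '1874.'], and the cleaned list is ['$18,500', '1873', '22', '1874']. Which one is preferable depends on whether "$18,500" should count as a single word.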
