I am using Selenium with Python to scrape some file information. I would like to extract only the file type and the version number, if available, e.g. GML 3.1.1. I thought the split function might do this. My current response is a list that looks like this:
ESRI Shapefile, (50.7 kB)
GML 3.1.1, (124.9 kB)
Google Earth KML 2.1, (126.5 kB)
MapInfo MIF, (53.5 kB)
The script section is as follows:
for file in files:
    file_format = file.text
    print(file_format)
I'm looking for something like the strip() function that checks whether the word before the comma is uppercase, or uppercase followed by a float. The following is the output I'm looking for:
ESRI
GML 3.1.1
KML 2.1
MIF
CodePudding user response:
Using a regex that finds words of all uppercase letters followed optionally by a space and digits / dots would work here:
s='''ESRI Shapefile, (50.7 kB)
GML 3.1.1, (124.9 kB)
Google Earth KML 2.1, (126.5 kB)
MapInfo MIF, (53.5 kB)'''
import re
re.findall(r'\b[A-Z]+\b(?:\s[\d.]+)?', s)
['ESRI', 'GML 3.1.1', 'KML 2.1', 'MIF']
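To use this inside the original loop, the same pattern can be applied to one string at a time. The following is only a sketch: `extract_format` is a hypothetical helper name, and the `texts` list stands in for the `file.text` values that Selenium would return.

```python
import re

# An all-uppercase word, optionally followed by a space and a dotted version number.
FORMAT_RE = re.compile(r'\b[A-Z]+\b(?:\s[\d.]+)?')

def extract_format(text):
    """Return the uppercase format name (plus version, if any), or None."""
    name = text.split(',')[0]  # drop the ", (50.7 kB)" size suffix
    match = FORMAT_RE.search(name)
    return match.group(0) if match else None

# Stand-in for the `file.text` values from the Selenium loop (assumption).
texts = [
    'ESRI Shapefile, (50.7 kB)',
    'GML 3.1.1, (124.9 kB)',
    'Google Earth KML 2.1, (126.5 kB)',
    'MapInfo MIF, (53.5 kB)',
]

for text in texts:
    print(extract_format(text))
```

In the scraper itself, the loop body would become `print(extract_format(file.text))`. Splitting on the comma first keeps the size suffix out of the search, and returning None lets the caller skip entries that contain no uppercase format name.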