I am importing dynamic text scraped from a webpage into pandas. Several of the columns are specific length, but one has a descriptor which varies in length and has spaces, thus a split by spaces doesn't work well.
The text looks like:
SN Ver. Rank Mod. Reg. Engine Operator STD
07245 AAA01 00022 RX7-2A4 THX-1149 Correlian Lucas, G. ST1
07247 BBB05 00005 BR4-23B NC-1701 Federation Federation of Planets ST7
07251 CCC13 00007 7G3-K14 YT-1300 Correlian Solo, Han (WANTED) STA
07253 DDD13 00021 431-2DF 404-E-132-4FE274A Std-Radiation Reynolds, M. (WANTED) STX
All columns are space separated, but the Operator column has spaces. The last column is always three characters in length. As a note, I added extra spaces for readability in the above. All multiple spaces are single in the actual text.
I have tried the following:
df = pd.DataFrame([row.split(" ") for row in strtbl.split('\n')])
header = ['SN', 'Ver.', 'Rank', 'Mod.', 'Reg.', 'Engine',
'Operator', 'STD']
df = df[1:]
df.columns = header
Of course, the dataframe comes in jagged. Also, this does not lend well to a fixed field width import, and the nature of the file has only worked with a list comprehension split. Since Reg. and Engine are not all the same length, pandas.read_fwf() isn't working either. It will sort out the first columns then put the remainder in the last field. Using colspecs, I can get the last field working, but the middle colums are still one field.
How can one handle this data structure in pandas?
CodePudding user response:
When all else fails, throw some regex at it is my motto: https://regexr.com/6qvb0
Given:
text = '''SN Ver. Rank Mod. Reg. Engine Operator STD
07245 AAA01 00022 RX7-2A4 THX-1149 Correlian Lucas, G. ST1
07247 BBB05 00005 BR4-23B NC-1701 Federation Federation of Planets ST7
07251 CCC13 00007 7G3-K14 YT-1300 Correlian Solo, Han (WANTED) STA
07253 DDD13 00021 431-2DF 404-E-132-4FE274A Std-Radiation Reynolds, M. (WANTED) STX'''
Doing:
# Read it into a single column:
df = pd.DataFrame(text.splitlines())
# Regex extract by characterizing each piece:
df = df[0].str.extract('(\S ) (\S ) (\S ) (\S ) (\S ) (\S ) (. ) (\w{3})')
# Fix the header:
df.columns = df.loc[0]
df.columns.name = None
df = df.drop(0).reset_index(drop=True)
print(df)
Output:
SN Ver. Rank Mod. Reg. Engine Operator STD
0 07245 AAA01 00022 RX7-2A4 THX-1149 Correlian Lucas, G. ST1
1 07247 BBB05 00005 BR4-23B NC-1701 Federation Federation of Planets ST7
2 07251 CCC13 00007 7G3-K14 YT-1300 Correlian Solo, Han (WANTED) STA
3 07253 DDD13 00021 431-2DF 404-E-132-4FE274A Std-Radiation Reynolds, M. (WANTED) STX