Home > Enterprise >  Importing table with random length text column into pandas
Importing table with random length text column into pandas

Time:08-02

I am importing dynamic text scraped from a webpage into pandas. Several of the columns are specific length, but one has a descriptor which varies in length and has spaces, thus a split by spaces doesn't work well.

The text looks like:

SN    Ver.  Rank  Mod.    Reg.              Engine        Operator              STD
07245 AAA01 00022 RX7-2A4 THX-1149          Correlian     Lucas, G.             ST1
07247 BBB05 00005 BR4-23B NC-1701           Federation    Federation of Planets ST7
07251 CCC13 00007 7G3-K14 YT-1300           Correlian     Solo, Han (WANTED)    STA
07253 DDD13 00021 431-2DF 404-E-132-4FE274A Std-Radiation Reynolds, M. (WANTED) STX

All columns are space separated, but the Operator column has spaces. The last column is always three characters in length. As a note, I added extra spaces for readability in the above. All multiple spaces are single in the actual text.

I have tried the following:

df = pd.DataFrame([row.split(" ") for row in strtbl.split('\n')])
header = ['SN', 'Ver.', 'Rank', 'Mod.', 'Reg.', 'Engine',
          'Operator', 'STD']
df = df[1:]
df.columns = header

Of course, the dataframe comes in jagged. Also, this does not lend well to a fixed field width import, and the nature of the file has only worked with a list comprehension split. Since Reg. and Engine are not all the same length, pandas.read_fwf() isn't working either. It will sort out the first columns then put the remainder in the last field. Using colspecs, I can get the last field working, but the middle colums are still one field.

How can one handle this data structure in pandas?

CodePudding user response:

When all else fails, throw some regex at it is my motto: https://regexr.com/6qvb0

Given:

text = '''SN Ver. Rank Mod. Reg. Engine Operator STD
07245 AAA01 00022 RX7-2A4 THX-1149 Correlian Lucas, G. ST1
07247 BBB05 00005 BR4-23B NC-1701 Federation Federation of Planets ST7
07251 CCC13 00007 7G3-K14 YT-1300 Correlian Solo, Han (WANTED) STA
07253 DDD13 00021 431-2DF 404-E-132-4FE274A Std-Radiation Reynolds, M. (WANTED) STX'''

Doing:

# Read it into a single column:
df = pd.DataFrame(text.splitlines())

# Regex extract by characterizing each piece:
df = df[0].str.extract('(\S ) (\S ) (\S ) (\S ) (\S ) (\S ) (. ) (\w{3})')

# Fix the header:
df.columns = df.loc[0]
df.columns.name = None
df = df.drop(0).reset_index(drop=True)

print(df)

Output:

      SN   Ver.   Rank     Mod.               Reg.         Engine               Operator  STD
0  07245  AAA01  00022  RX7-2A4           THX-1149      Correlian              Lucas, G.  ST1
1  07247  BBB05  00005  BR4-23B            NC-1701     Federation  Federation of Planets  ST7
2  07251  CCC13  00007  7G3-K14            YT-1300      Correlian     Solo, Han (WANTED)  STA
3  07253  DDD13  00021  431-2DF  404-E-132-4FE274A  Std-Radiation  Reynolds, M. (WANTED)  STX
  • Related