Dealing with non-ASCII characters when parsing fixed-width txt file


So I have a series of huge files (several GB) of tabular data. They are plain-text files in which each column has a fixed width, indicated by a run of dashes directly below the headers. So far, so good: I have a script that reads those files line by line and outputs them to XML.

One challenge is that most, but NOT all, of the content is encoded in UTF-8. Trying to decode the content while processing will throw an error somewhere down the line, so my script only reads and processes byte strings. This causes readability issues in the output, but that's tolerable and not my concern.

My problem: the widths were calculated with the decoded content in mind. Non-ASCII characters that are represented by several bytes in UTF-8 are not accounted for.

Example: the string `Zürich, Albisgütli` has a length of 18 characters and is found in a column with a fixed width of 19. In its UTF-8 representation, however, the string is `Z\xc3\xbcrich, Albisg\xc3\xbctli`, which is 20 bytes long and thus throws off the parsing of the rest of the data row.
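A quick check in Python shows the mismatch between character count and byte count:

```
s = 'Zürich, Albisgütli'
print(len(s))                  # 18 characters
print(len(s.encode('utf-8')))  # 20 bytes: each 'ü' encodes as b'\xc3\xbc'
```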

Solution attempts so far:

  • Tried decoding the data first so that the lengths are correct, but as mentioned, a few data entries aren't actually UTF-8 and I'd prefer to avoid the whole encoding thing.
  • Identify all non-ASCII characters that could appear so that I can adjust the parsing. This is an issue because the data are huge and I'm not confident that I can come up with an exhaustive list. Also, I don't know yet how to efficiently correct the parsing in these cases.

One issue is also that I'm using copied code for the parsing, so I don't know how I could change its behavior to count non-ASCII characters differently.

Thankful for any pointers what a possible approach could be!

Code as it is now:

```

import re
import struct
import time

# get_widths, xml_escape, table_w_num, infostep and start_time are defined elsewhere

def convert(infile, outfile):
    secondline = infile.readline() # the dashes are in this line
    maxlen = len(secondline)
    fieldwidths = get_widths(secondline) # counts the dashes to get the widths

    # code taken from: https://stackoverflow.com/a/4915359/9021715
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    fieldstruct = struct.Struct(fmtstring)
    parse = fieldstruct.unpack_from

    c = 0

    outfile.write(b"<?xml version='1.0' encoding='UTF-8'?>\n")

    namespace = f'xmlns="http://www.bar.admin.ch/xmlns/siard/2/table.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.bar.admin.ch/xmlns/siard/2/table.xsd {table_w_num}.xsd" version="2.1"'.encode()

    outfile.write(b'<table ' + namespace + b'>\n')
    for line in infile:
        # pad short lines with spaces so unpack_from always has enough bytes
        diff = maxlen - len(line)
        padded_line = bytearray()
        padded_line += line
        for _ in range(diff):
            padded_line += b' '

        data = [elem.strip() for elem in parse(padded_line)]

        if b"Albis" in line:  # debugging output
            print(line)
            print(data)
        row = b''
        for elem, n in zip(data, range(1, len(data) + 1)):
            # Timestamp-Fix
            elem = re.sub(rb"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}(\.\d+)?)\S*?", rb"\g<1>T\g<2>Z", elem)
            if elem == b'' or elem == b'NULL':
                pass
            else:
                row = b'%s<c%s>%s</c%s>' % (row, str(n).encode(), xml_escape(elem), str(n).encode())
        row = b"<row>%s</row>" % (row)
        outfile.write(b''.join([row, b'\n']))

        c += 1
        if c % infostep == 0:
            timestamp = int(time.time() - start_time)
            print(f"Quarter done, time needed: {timestamp} seconds")

    outfile.write(b'</table>')

```

CodePudding user response:

The first snippet in the related answer only works for single-byte codepages because it counts bytes, not characters. It doesn't even work for UTF-16, which usually uses 2 bytes per character, and certainly not for UTF-8, which uses a variable number of bytes per character. That was pointed out in the comments.
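A quick demonstration of that byte-counting behaviour (the field widths here are hypothetical):

```
import struct

# a 19-byte field followed by a 4-byte field, as the first snippet would build it
parse = struct.Struct('19s 4s').unpack_from

raw = b'Z\xc3\xbcrich, Albisg\xc3\xbctli 1234'
print(parse(raw))  # (b'Z\xc3\xbcrich, Albisg\xc3\xbctl', b'i 12') -- the boundary shifts into the next field
```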

The same answer shows how to handle UTF-8 in its third snippet:

```
from itertools import accumulate, zip_longest

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool flags for padding fields
    flds = tuple(zip_longest(pads, (0,) + cuts, cuts))[:-1]  # ignore final one
    slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
    parse = eval('lambda line: ({})\n'.format(slcs))  # create and compile source code
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse
```
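Because this parser slices decoded strings by character index instead of unpacking bytes, multi-byte characters no longer shift the field boundaries. A minimal usage sketch (the widths and the tolerant errors='replace' decode are assumptions for illustration):

```
parse = make_parser([19, 4])  # hypothetical widths: a 19-char column, then a 4-char column

raw = b'Z\xc3\xbcrich, Albisg\xc3\xbctli 1234'
# errors='replace' keeps the decode from failing on the few non-UTF-8 entries
line = raw.decode('utf-8', errors='replace')
print(parse(line))  # ('Zürich, Albisgütli ', '1234')
```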

The question's code should be split into separate functions, one to read the data and another to generate the XML output. Both operations are already available through existing modules, though: several modules can read fixed-width files, several libraries can parse and serialize XML, and some, like Pandas, can read multiple data formats, process the data and export it as XML.

For example, with Pandas, this code could be replaced with just two function calls:

```
import pandas as pd

namespaces = {
    "xmlns": "http://www.bar.admin.ch/xmlns/siard/2/table.xsd",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
    ...
}

df = pd.read_fwf('data.csv')
df.to_xml('data.xml', root_name='table', namespaces=namespaces)
```

This will be faster and use less memory than the explicit string manipulation in the question's code, which creates new temporary strings at each step, costing both CPU and RAM.
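For the question's specific layout, read_fwf can also be pointed past the row of dashes and told to tolerate the non-UTF-8 entries. A sketch, assuming pandas >= 1.3 (for encoding_errors) and a hypothetical file name:

```
import pandas as pd

# skiprows=[1] drops the row of dashes under the headers (an assumption about the layout);
# encoding_errors='replace' substitutes the bytes that aren't valid UTF-8
df = pd.read_fwf('data.txt', skiprows=[1],
                 encoding='utf-8', encoding_errors='replace')
```

If the inferred column boundaries come out wrong, read_fwf also accepts an explicit list of column widths via its widths parameter.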

Attribute names can be specified through the attr_cols parameter, e.g.:

```
df.to_xml(attr_cols=['index', 'shape', 'degrees', 'sides'])
```

You can also rename DataFrame columns and change their types, e.g. to parse string fields into dates or numbers:

```
df['Timestamp'] = pd.to_datetime(df['Col3'], format='%Y-%m-%d')
df = df.rename(columns={'Col1': 'Bananas', ...})
```

CodePudding user response:

Have you tried a different but similar encoding, such as UTF-8 with a BOM (also called a UTF-8 signature)? See What's the difference between UTF-8 and UTF-8 without BOM?.

If that doesn't work, you can strip out the non-ASCII characters with a regex, as in Remove non-ascii characters from CSV using pandas:

```
import re

df.to_csv(csvFile, index=False)

with open(csvFile) as f:
    new_text = re.sub(r'[^\x00-\x7F]+', '', f.read())

with open(csvFile, 'w') as f:
    f.write(new_text)
```
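Note that deleting the characters outright changes the field lengths again. If the goal is to keep the byte columns aligned with the character-based widths, one variation (a sketch, assuming the non-ASCII content is valid UTF-8) is to replace each multi-byte sequence with a single placeholder byte:

```
import re

def flatten_utf8(line: bytes) -> bytes:
    # replace each UTF-8 lead byte plus its continuation bytes with one '?',
    # so every character occupies exactly one byte and the widths line up
    return re.sub(rb'[\xc2-\xf4][\x80-\xbf]+', b'?', line)

print(flatten_utf8(b'Z\xc3\xbcrich, Albisg\xc3\xbctli'))  # b'Z?rich, Albisg?tli'
```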