How to load text file into pandas-CodePudding

i have a text file
sample:

----------NEW TRANSACTON-----------
TRANSACTION DATE : 08:42 AM, 30 Nov 2021
CLIENT ID : C00001
CLIENT NAME : SALISU BISI
AMOUNT : 16,450.00
TRANSACTION REF. : 00001

----------NEW TRANSACTON-----------
TRANSACTION DATE : 08:46 AM, 30 Nov 2021
CLIENT ID : C00002
CLIENT NAME : SULE YAYA
AMOUNT : 80,940.00
TRANSACTION REF. : 00002

----------NEW TRANSACTON-----------
TRANSACTION DATE : 08:51 AM, 30 Nov 2021
CLIENT ID : C00001
CLIENT NAME : SALISU BISI
AMOUNT : 44,900.00
TRANSACTION REF. : 00003

how do i load it into pandas?
i tried
df=pd.read_csv("2021_11_30.txt", delimiter="\t") df
but it loaded it in one column

CodePudding user response：

it will load as one column, because that is what a text file is.

what you are in fact asking is how to identify and split the lines correctly into multiple columns.

you could do this with pandas in a roundabout way, but with unstructured text data like this, it would be better to either:

read a properly formatted csv with pandas
read a file line-by-line into a list

reading into a list is covered here: How to read a file line-by-line into a list?

CodePudding user response：

from collections import defaultdict

d = defaultdict(list)
with open(file, 'r') as rf:
     str_file = rf.read()
first_replace = str_file.replace('----------NEW TRANSACTON-----------','')
spliting_str = first_replace.split('\n\n\n')
clean_list = [tuple(r.split(':',1)) for row in spliting_str for r in row.split('\n') if r != '']
for k,v in clean_list:
    d[k].append(v)

Now you have a dict of lists that you can load into a dataframe:

df =pd.DataFrame.from_dict(d)

CodePudding user response：

without extra import, replace, with less for and if.

with open("/path/to/file.txt", "r") as f:
    l = f.readlines()
d = {}
for idx, el in enumerate(l):
    if "----------NEW TRANSACTON-----------" in el:
        for i in range(1,5):
            k, v = l[idx i].strip().split(" : ")
            if k not in d.keys():
                d[k] = [v]
            else:
                d[k].append(v)
df = pd.DataFrame.from_dict(d)