How do I parse this kind of text file with a special separator?

I need to parse the following text file into a DataFrame. Any suggestions on how to do it?

Input:

('name:   ', u'Jacky')
('male:   ', True)
('hobby:   ', u'play football and bascket')
('age:   ', 24.0)
----------------
('name:   ', u'Belly')
('male:   ', True)
('hobby:   ', u'dancer')
('age:   ', 74.0)
----------------
('name:   ', u'Chow')
('male:   ', True)
('hobby:   ', u'artist')
('age:   ', 46.0)

Desired output:

name  male  hobby     age
jacky True  football  24
...

CodePudding user response：

I used a regex to parse your text file:

import re
import pandas as pd
text_path = 'text.txt'
my_dict = {}
pattern = r"\('(\w+):\s+',\s+u*'*([a-zA-Z0-9\s.]*)'*\)"
with open(text_path, 'r') as txt:
    for block in re.split(r"-+\n", txt.read()):
        for line in filter(None, block.split('\n')):
            col_name, value = re.search(pattern, line).group(1,2)
            try:
                value = int(float(value))
            except ValueError:
                value = True if value=='True' else False if value=='False' else value
            if col_name in my_dict:    
                my_dict[col_name].append(value)
            else:
                my_dict[col_name] = [value]
df = pd.DataFrame(my_dict)
print(df)

Output:

    name  male                      hobby  age
0  Jacky  True  play football and bascket   24
1  Belly  True                     dancer   74
2   Chow  True                     artist   46

Boolean values are real bool objects (True or False) rather than strings, and numerical values such as age are stored as int rather than as strings (you could keep them as float instead).
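
The conversion step can also be looked at in isolation. Below is a minimal sketch of the same try/except logic pulled out into a helper; the name `convert_value` is hypothetical and not part of the code above.

```python
def convert_value(raw):
    """Convert a captured string to int, bool, or leave it as str."""
    try:
        # "24.0" -> 24.0 -> 24; non-numeric text raises ValueError
        return int(float(raw))
    except ValueError:
        if raw == "True":
            return True
        if raw == "False":
            return False
        return raw


for raw in ["24.0", "True", "Jacky"]:
    value = convert_value(raw)
    print(repr(value), type(value).__name__)
```

This should print an int, a bool and a str respectively, which is what ends up in the DataFrame columns.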

Ask me if you don't understand something.

CodePudding user response：

I don't know of an existing parser that handles this format, so I suggest writing your own small parsers. I would call readlines() on the open file, which lets me go over the lines of data and apply the matching parser to each one. Finally, I would combine the parsed values into a DataFrame. Note that the code assumes every record takes exactly five lines: four fields followed by a separator line. Example code is below:

import pandas as pd

def parse_from_weird_file_to_pandas_df(file):
    with open(file, 'r') as f:
        content = f.readlines()

    name_vals = [_parse_text(content[line]) for line in range(0, len(content), 5)]
    male_vals = [_parse_bool(content[line]) for line in range(1, len(content), 5)]
    hobby_vals = [_parse_text(content[line]) for line in range(2, len(content), 5)]
    age_vals = [_parse_int(content[line]) for line in range(3, len(content), 5)]

    df_rows = zip(name_vals, male_vals, hobby_vals, age_vals)
    df = pd.DataFrame(data=df_rows, columns=["name", "male", "hobby", "age"])
    return df


def _parse_text(text_line):
    text = text_line[text_line.find("u'") + 2: text_line.find("')")]
    return text

def _parse_bool(bool_line):
    val_bool = bool_line[bool_line.find("', ") + 3: bool_line.find(")")]
    return val_bool == "True"

def _parse_int(int_line):
    val_int = int_line[int_line.find("', ") + 3: int_line.find(")")]
    return int(float(val_int))

If you want to shorten 'play football and bascket' to just 'football', you can, for example, create a list of all known hobbies, check each one against the parsed hobby, and return the one that matches.
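
A minimal sketch of that idea, assuming a hand-written list of known hobbies. Both the list `KNOWN_HOBBIES` and the function `shorten_hobby` are made up for illustration:

```python
# Hypothetical list of hobbies; extend it to cover your own data.
KNOWN_HOBBIES = ["football", "dancer", "artist"]


def shorten_hobby(parsed_hobby, known_hobbies=KNOWN_HOBBIES):
    """Return the first known hobby found inside the parsed text."""
    for hobby in known_hobbies:
        if hobby in parsed_hobby.lower():
            return hobby
    # No match: keep the original text rather than losing information
    return parsed_hobby


print(shorten_hobby("play football and bascket"))  # football
print(shorten_hobby("dancer"))                      # dancer
```

Once the DataFrame exists, this could be applied to the whole column with something like df["hobby"].map(shorten_hobby).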

CodePudding user response：

Here is some quick code I wrote just before lunch. It is not optimised, but it seems to work. I did not remove the 'u' from the strings or convert the numbers to int, but you should be able to manage that. If not, let me know and I will work on it afterwards!

The join removes the unnecessary characters, and I assume every record has exactly 4 fields.

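    import pandas as pd

    df = pd.DataFrame(columns=['name', 'male', 'hobby', 'age'])
    list_to_append = []
    with open("yourfile.txt", 'r') as file:
        for line in file:
            # Skip the '----' separator lines and any blank lines
            if '---' in line or not line.strip():
                continue
            # Keep only the value part, i.e. everything after the first comma
            value = line.split(',', 1)[1].strip()
            # Drop parentheses and quotes; the leading 'u' is left in place
            processed_line = ''.join(c for c in value if c not in "()'")
            list_to_append.append(processed_line)
            # Four values make one complete record (one DataFrame row)
            if len(list_to_append) == 4:
                df.loc[len(df)] = list_to_append
                list_to_append = []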
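
The two steps left open above (dropping the leading 'u' and converting the types) could be handled on each record before it is added to the DataFrame. This is only a sketch: `clean_record` is a hypothetical helper, and it assumes the four values arrive as strings in the order name, male, hobby, age, with the text values still carrying the leftover 'u' prefix.

```python
def clean_record(values):
    """Turn the four raw strings of one record into typed values."""
    name, male, hobby, age = values

    def drop_u(text):
        # Assumes a leading "u" is the leftover unicode prefix
        return text[1:] if text.startswith("u") else text

    return [drop_u(name), male == "True", drop_u(hobby), int(float(age))]


print(clean_record(["uBelly", "True", "udancer", "74.0"]))
# ['Belly', True, 'dancer', 74]
```

In the loop, that would mean assigning clean_record(list_to_append) to df.loc[len(df)] instead of the raw list.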