I need to parse the following text file into a DataFrame. Any suggestions on how to do this?
Input:
('name: ', u'Jacky')
('male: ', True)
('hobby: ', u'play football and bascket')
('age: ', 24.0)
----------------
('name: ', u'Belly')
('male: ', True)
('hobby: ', u'dancer')
('age: ', 74.0)
----------------
('name: ', u'Chow')
('male: ', True)
('hobby: ', u'artist')
('age: ', 46.0)
Desired output:
name male hobby age
jacky True football 24
...
CodePudding user response:
I used a regex to parse your text file:
import re
import pandas as pd

text_path = 'text.txt'
my_dict = {}
# Group 1 is the key (without the colon), group 2 the raw value, with or without the u'...' wrapper
pattern = r"\('(\w+):\s+',\s+u*'*([a-zA-Z0-9\s.]*)'*\)"

with open(text_path, 'r') as txt:
    # Records are separated by lines of dashes
    for block in re.split(r"-+\n", txt.read()):
        for line in filter(None, block.split('\n')):
            col_name, value = re.search(pattern, line).group(1, 2)
            try:
                # Numbers such as 24.0 become int
                value = int(float(value))
            except ValueError:
                # Otherwise convert 'True'/'False' to bool, or keep the string
                value = True if value == 'True' else False if value == 'False' else value
            if col_name in my_dict:
                my_dict[col_name].append(value)
            else:
                my_dict[col_name] = [value]

df = pd.DataFrame(my_dict)
print(df)
Output:
    name  male                      hobby  age
0  Jacky  True  play football and bascket   24
1  Belly  True                     dancer   74
2   Chow  True                     artist   46
Boolean values are real bool values (True or False), and numerical values (like age) are int (you could keep them as float); neither is stored as a string.
Ask me if you don't understand something.
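As a possible alternative to the regex (an assumption on my part, not something tested on the real file): every data line in the sample looks like a valid Python tuple literal, so the standard library's `ast.literal_eval` may be able to parse it directly. The helper name `parse_records` and the inline `sample` string are made up for illustration.

```python
import ast

def parse_records(text):
    """Parse blocks of ('key: ', value) lines separated by dashed lines."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if set(line) == {"-"}:
            # A dashed line closes the current record
            if current:
                records.append(current)
                current = {}
            continue
        # Assumes the line is a two-element tuple literal, e.g. ('age: ', 24.0)
        key, value = ast.literal_eval(line)
        current[key.strip().rstrip(":")] = value
    if current:
        records.append(current)
    return records

sample = """('name: ', u'Jacky')
('male: ', True)
('hobby: ', u'play football and bascket')
('age: ', 24.0)
----------------
('name: ', u'Belly')
('male: ', True)
('hobby: ', u'dancer')
('age: ', 74.0)
"""
records = parse_records(sample)
```

Passing `records` to `pd.DataFrame(records)` should then give one row per block. With this approach the ages stay float unless you cast them afterwards.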
CodePudding user response:
I don't know of any existing parser that handles this data convention, so I suggest building your own. I would call the readlines() method on the open file so that I can iterate over the lines of data and apply the correct parser to each one. Finally, I would combine the parsed values and create the DataFrame. Example code is below:
import pandas as pd

def parse_from_weird_file_to_pandas_df(file):
    with open(file, 'r') as f:
        content = f.readlines()

    # Each record spans 5 lines: name, male, hobby, age, then a dashed separator
    name_vals = [_parse_text(content[line]) for line in range(0, len(content), 5)]
    male_vals = [_parse_bool(content[line]) for line in range(1, len(content), 5)]
    hobby_vals = [_parse_text(content[line]) for line in range(2, len(content), 5)]
    age_vals = [_parse_int(content[line]) for line in range(3, len(content), 5)]

    df_rows = list(zip(name_vals, male_vals, hobby_vals, age_vals))
    df = pd.DataFrame(data=df_rows, columns=["name", "male", "hobby", "age"])
    return df

def _parse_text(text_line):
    # Take the text between u' and the closing ')
    text = text_line[text_line.find("u'") + 2: text_line.find("')")]
    return text

def _parse_bool(bool_line):
    # Take the value between "', " and the closing parenthesis
    val_bool = bool_line[bool_line.find("', ") + 3: bool_line.find(")")]
    return True if val_bool == "True" else False

def _parse_int(int_line):
    val_int = int_line[int_line.find("', ") + 3: int_line.find(")")]
    return int(float(val_int))
If you wish to shorten 'play football and bascket' to just 'football', you can do so, for example, by keeping a list of all known hobbies, checking each one against the parsed hobby text, and returning the one that matches.
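A minimal sketch of that idea, assuming a hand-written list of hobbies. The names `KNOWN_HOBBIES` and `shorten_hobby` are hypothetical, and the list contents are only guessed from the sample data:

```python
# Hypothetical list of known hobbies; extend it to match your data
KNOWN_HOBBIES = ["football", "dancer", "artist"]

def shorten_hobby(parsed_hobby, known_hobbies=KNOWN_HOBBIES):
    """Return the first known hobby found in the parsed text, else the text unchanged."""
    for hobby in known_hobbies:
        if hobby in parsed_hobby:
            return hobby
    return parsed_hobby
```

It could be applied to each value returned by _parse_text for the hobby lines. Note that it returns only the first match, so a line mentioning several known hobbies would keep just one of them.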
CodePudding user response:
Here is some quick code I wrote just before lunch. It is not optimised, but it seems to work. I did not remove the 'u' from the strings or convert the ages to int, but you should be able to manage that. If not, let me know and I will work on it afterwards!
The .join removes unnecessary characters, and I assume there are always exactly 4 fields per record.
import pandas as pd

df = pd.DataFrame(columns=['name', 'male', 'hobby', 'age'])
list_to_append = []

with open("yourfile.txt", 'r') as file:
    for line in file:
        # Skip the dashed separator lines
        if '---' not in line:
            line = line.split(',')[1]
            # Drop spaces, parentheses, quotes and newlines
            # (note: this also removes the spaces inside values)
            processed_line = ''.join(c for c in line if c not in " ()'\n")
            list_to_append.append(processed_line)
            # Four fields make one complete row
            if len(list_to_append) == 4:
                df.loc[len(df)] = list_to_append
                list_to_append = []
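For the two steps left out above (removing the 'u' and converting the types), something like the following might work. It is only a sketch: `clean_row` is a hypothetical helper, and it assumes every name and hobby value still carries the leading u left over from the u'...' literal.

```python
def clean_row(row):
    """Convert one row of raw strings [name, male, hobby, age] to typed values."""
    name, male, hobby, age = row

    def strip_u(text):
        # Drop the leftover u prefix; assumes the prefix is always present
        return text[1:] if text.startswith("u") else text

    return [strip_u(name), male == "True", strip_u(hobby), int(float(age))]
```

Calling `clean_row(list_to_append)` just before assigning the row to `df.loc[len(df)]` would store typed values instead of raw strings.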