Home > Software engineering >  How to read a file which in CSV format, but different extension?
How to read a file which in CSV format, but different extension?

Time:11-24

I have a dataset which has a good dataframe structure starting from row 3. For the first rows, unfortunately separators are diverse, and there is a few information to be included in my dataframe. The files are in CSV strcture mostly, but they have extensions like WOC, WOL, WPL, and so on.

The WOC file first rows look like:

Person:?,?;F dob. ?  MT: ? Z:C NewYork Mon.:S St.?

144 cm/35 Kg/5 YearsOld






45,34,22,26,0
78,74,82,11,0

Header of the ollowing values should be like:

A, B, C, D, E
45,34,22,26,0
78,74,82,11,0

Here is my attempt:

df44 = pd.DataFrame() # creates empty dataframe

for f in glob.glob('file_path_to_single_file'):

    with open(f, 'rb') as file:
        encodings = chardet.detect(file.read())["encoding"]
    a = pd.read_csv(f,sep='\s |;|,', engine='python', encoding=encodings,header=None,names=['A','B', 'C', 'D', 'E'], skiprows=2)
    df44 = df44.append(a)

What would be the best way to read such a file so that I can also extract height, weight, age and city?

My expected output is:

A, B, C, D, E, City, Height, Weight, Age
45,34,22,26,0,NewYork, 144,    35,   5
78,74,82,11,0,NewYork, 144,    35,   5

CodePudding user response:

Base on additional info from your comments above I think you can start build your solution with following:

`# I created a file 'data.woc' with data as stream from your question:`
import pandas as pd
from io import StringIO
import re
stack_data = '''Person:?,?;F dob. ?  MT: ? Z:C NewYork Mon.:S St.?

144 cm/35 Kg/5 YearsOld






45,34,22,26,0
78,74,82,11,0'''

# read heading rows, I arbitrally chose 5 rows to read

with open('data.woc', 'r') as f:
    heading_rows = [next(f) for _ in range(5)]

city = re.findall(pattern = ' \w  ', string = heading_rows[0])[0].strip()

numbers_list = [re.findall(pattern='\d ', string=row) for row in heading_rows if 'cm' and 'kg' in row.lower()][0]

height, weight, age = [int(numbers_lst[i]) for i in range(3)]
    
df = pd.read_csv('data.woc', sep='\s |;|,', skiprows=2,comment='cm', index_col=None, names=list('ABCDE'))
    
df.dropna(inplace=True)
  • Related