Home > Enterprise >  Read inconsistent text data and write into csv
Read inconsistent text data and write into csv

Time:10-09

I'm new in this, appreciate your patience. I have this txt file, 'student.txt', that I want to write into a csv file, 'student.csv'.

student.txt

    name-> Alice
    name-> Sam
    sibling-> Kate,
    sibling-> Luke,
    hobby_1-> football
    hobby_2-> games
    name-> Alex
    name-> Ramsay
    hobby_1-> dance
    hobby_2-> swimming
    hobby_3-> jogging

Expected output in csv:

Name            Sibling               Hobbies
name-> Alice     N/A                  N/A
name-> Sam       sibling-> Kate       hobby_1-> football
                 sibling-> Luke       hobby_2-> games
name-> Alex      N/A                  N/A
name-> Ramsay    N/A                  hobby_1-> dance
                                      hobby_2-> swimming
                                      hobby_3-> jogging

Code I've done so far:

file = open('student.txt' , 'r')
with open('student.csv' , 'w') as writer:
    writer.write ('Name,Sibling,Hobbies\n')


    for eachline in file:
            if 'name' in eachline:
                writer.write (eachline)

            if 'sibling' in eachline:
                writer.write (eachline)

            if 'hobby' in eachline:
                writer.write (eachline)

Basically, any data after a name is captured before the next name. But I'm not sure how to put it orderly in csv, especially when some names don't have siblings / hobbies, to put it as N/A.

CodePudding user response:

You can approach the problem as follows:

  • parse the text into a dictionary
  • generate a list for each desired column in the csv
  • write the lists as columns in a csv file

Sample implementation:

file = open('student.txt' , 'r')

### parse the text into a dictionary

# remove whitespaces and commas from each line
lines = [x.strip().replace(',', '') for x in file]
# initialize dictionary
data = dict()

index = 0
while index < len(lines):
    line = lines[index]
    # for each name, generate a dictionary containing two lists
    if 'name' in line:
        # one list for siblings, one list for hobbies
        data[line] = {'siblings': [], 'hobbies': []}
        name_index = index   1
        # populate the lists with the values listed under the name
        while (name_index < len(lines)) and ('name' not in lines[name_index]):
            if 'sibling' in lines[name_index]:
                data[line]['siblings'].append(lines[name_index])
            elif 'hobby' in lines[name_index]:
                data[line]['hobbies'].append(lines[name_index])
            name_index  = 1
            index  = 1
    index  = 1
    
# data looks like:
'''
{'name-> Alice': {'siblings': [], 'hobbies': []},
 'name-> Sam': {'siblings': ['sibling-> Kate', 'sibling-> Luke'],
  'hobbies': ['hobby_1-> football', 'hobby_2-> games']},
 'name-> Ramsay': {'siblings': [],
  'hobbies': ['hobby_1-> dance', 'hobby_2-> swimming', 'hobby_3-> jogging']}}
'''

### generate a list for each column in the csv

names = []
siblings = []
hobbies = []

null_str = 'N/A'

for key in data:
    # add name to names
    names.append(key)
    
    # get rows for this name
    len_sibs = len(data[key]['siblings'])
    len_hobs = len(data[key]['hobbies'])
    num_rows = max([len_sibs, len_hobs])
    
    if num_rows == 0:
        siblings.append(null_str)
        hobbies.append(null_str)
    else:
        # add (num_rows - 1) empty strings to names
        names.extend([''] * (num_rows - 1))
        siblings_na_added = False
        hobbies_na_added = False
        for i in range(num_rows):
            # add siblings values with conditions for N/A and ''
            if i > (len(data[key]['siblings']) - 1):
                if siblings_na_added == False:
                    siblings.append(null_str)
                    siblings_na_added = True
                else:
                    siblings.append('')
            else:
                siblings.append(data[key]['siblings'][i])
            # add hobbies values with conditions for N/A and ''
            if i > (len(data[key]['hobbies']) - 1):
                if hobbies_na_added == False:
                    hobbies.append(null_str)
                    hobbies_na_added = True
                else:
                    hobbies.append('')
            else:
                hobbies.append(data[key]['hobbies'][i])
    
### write the lists as columns in a csv file

with open('student.csv' , 'w') as writer:
    writer.write ('Name,Sibling,Hobbies\n')
    for i in range(len(names)):
        row = names[i]   ','   siblings[i]   ','   hobbies[i]   '\n'
        writer.write(row)

Output read as text:

Name,Sibling,Hobbies
name-> Alice,N/A,N/A
name-> Sam,sibling-> Kate,hobby_1-> football
,sibling-> Luke,hobby_2-> games
name-> Ramsay,N/A,hobby_1-> dance
,,hobby_2-> swimming
,,hobby_3-> jogging

Output read as csv in excel:

enter image description here

P.S. the problem was more complicated than I imagined :))

  • Related