Count items in lists that have data with a nested structure

Time:12-06

I have TSV file data that looks as follows (mock sample; the real data is somewhat different and much larger):

Group_one James,jaime,jim,jimmy Robert,Rob,bob Samuel,sam
Group_two Richard,rick,dick Rodney,Rod

So the first level in the data is tab-separated and the second level is comma-separated. I want to count the items in each cell.

For example:

Group_one 4 3 2
Group_two 3 2

(Note: the counts are for the different versions of each name.)

I thought to do it as follows:

Step 1: Read each line in.
Step 2: Use split('\t') to parse the first level.
Step 3: Use split(',') to parse the second level.
Step 4: Use len() to count the second batch of lists and use end=''.

import sys

def main():

    name_of_table_file = 'file name here'  

    with open(name_of_table_file,'rt') as file_name:
        file_name_lines = file_name.readlines()

    for lines in file_name_lines:
        lines=lines.rstrip()
        lines = lines.rsplit('\t')
        for comma_separated_items in lines:
            comma_separated_items = comma_separated_items.rsplit(',')
            print(len(comma_separated_items),end='\t')
           
main()

That is the code I came up with. The issue is that the data is instead being printed as:

Group_one 
Group_two
43232 

Instead of:

Group_one 4 3 2
Group_two 3 2

(The line breaks from the first level of the data are not being maintained. I was expecting the for loop to move to the next output line after each input line ends.)

I also tried loading the file into a pandas DataFrame and counting the comma-separated items in each cell, but I have not had much luck finding how to do this on Google or here.

How would I solve this issue?

CodePudding user response:

I don't get what you get when I run it, so I'm not sure if some detail is missing from the question. But something like this should work:

with open(name_of_table_file, "rt") as file_name:
    file_name_lines = file_name.readlines()

for line in file_name_lines:
    groups = line.split("\t")
    
    output = groups[0]
    for group in groups[1:]:
        output += f" {len(group.split(','))}"
    print(output)

Outputs:

Group_one 4 3 2
Group_two 3 2
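
For anyone who wants to try this without a file on disk, here is a minimal self-contained sketch of the same idea. The helper name `count_cells` and the in-memory sample are my own assumptions for illustration, not something from the question; replace the `io.StringIO` object with your real open file.

```python
import io


def count_cells(line):
    """Return the row label followed by the item count of each tab-separated cell."""
    # Drop the trailing newline, then split the first level on tabs.
    label, *cells = line.rstrip("\n").split("\t")
    # Split each remaining cell on commas and count the pieces.
    counts = [str(len(cell.split(","))) for cell in cells]
    return " ".join([label, *counts])


# Hypothetical sample standing in for the real TSV file.
sample = io.StringIO(
    "Group_one\tJames,jaime,jim,jimmy\tRobert,Rob,bob\tSamuel,sam\n"
    "Group_two\tRichard,rick,dick\tRodney,Rod\n"
)

for line in sample:
    print(count_cells(line))
```

One thing to keep in mind: splitting an empty string on ',' still returns a list with one element, so an empty cell would be counted as 1 with this approach.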

CodePudding user response:

This is the code that worked for me.

for lines in file_name_lines:
    # I removed lines = lines.rstrip() and kept the end-of-line '\n',
    # which I use to jump to a new line when needed further down.
    lines = lines.rsplit('\t')
    # print(lines)
    for comma_separated_items in lines:
        comma_separated_items = comma_separated_items.rsplit(',')
        if comma_separated_items[-1].endswith('\n'):  # testing for newline
            print(len(comma_separated_items))
        else:
            print(len(comma_separated_items), end='\t')
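
A possible variation, in case the last line of the file has no trailing newline: strip the newline and end each row with a bare print() after the inner loop, instead of testing for '\n'. This is only a sketch; the function name `print_counts` and its `out` parameter are my own assumptions, and unlike the loop above it prints the group label itself rather than a count for the first cell.

```python
import io
import sys


def print_counts(lines, out=sys.stdout):
    """Write one row per input line: the label, then a tab-separated count per cell."""
    for line in lines:
        cells = line.rstrip("\n").split("\t")
        # The first cell is the group label; print it unchanged.
        print(cells[0], end="", file=out)
        for cell in cells[1:]:
            # Count the comma-separated items and stay on the same output line.
            print("\t" + str(len(cell.split(","))), end="", file=out)
        # A bare print() ends the row, whether or not the input line had '\n'.
        print(file=out)


# Hypothetical sample; note the last line has no trailing newline.
sample = io.StringIO(
    "Group_one\tJames,jaime,jim,jimmy\tRobert,Rob,bob\tSamuel,sam\n"
    "Group_two\tRichard,rick,dick\tRodney,Rod"
)
print_counts(sample)
```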