I have TSV file data that looks as follows (mock sample; the real data is somewhat different and much larger):
Group_one James,jaime,jim,jimmy Robert,Rob,bob Samuel,sam
Group_two Richard,rick,dick Rodney,Rod
So the first level of the data is tab-separated and the second level is comma-separated. I want to count the items in each cell.
For example:
Group_one 4 3 2
Group_two 3 2
(Note: the count covers the different versions of each name.)
I thought to do it as follows:
Step 1: Read each line in.
Step 2: Use split('\t') to parse the first level.
Step 3: Use split(',') to parse the second level.
Step 4: Use len() to count the second batch of lists and use end=''.
I came up with the following code:
import sys

def main():
    name_of_table_file = 'file name here'
    with open(name_of_table_file, 'rt') as file_name:
        file_name_lines = file_name.readlines()
    for lines in file_name_lines:
        lines = lines.rstrip()
        lines = lines.rsplit('\t')
        for comma_separated_items in lines:
            comma_separated_items = comma_separated_items.rsplit(',')
            print(len(comma_separated_items), end='\t')

main()
The issue is that the data is instead being printed as:
Group_one
Group_two
43232
Instead of:
Group_one 4 3 2
Group_two 3 2
(The line structure of the first level is not being maintained. I expected the for loop to move to the next output line after each input line ended.)
I also tried loading the file into a pandas DataFrame and counting the comma-separated items in each cell, but I did not have much luck finding how to do this on Google.
How would I solve this issue?
CodePudding user response:
I don't get what you get when I run it, so I'm not sure if some detail is missing from the question. But something like this should work:
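with open(name_of_table_file, "rt") as file_name:
    file_name_lines = file_name.readlines()

for line in file_name_lines:
    # First level: tab-separated cells; the first cell is the group label.
    groups = line.split("\t")
    output = groups[0]
    for group in groups[1:]:
        # Second level: append the number of comma-separated items in the cell.
        output += f" {len(group.split(','))}"
    print(output)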
Outputs:
Group_one 4 3 2
Group_two 3 2
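If it helps, the same idea can be packaged as a small function so it is easy to try without a file on disk. This is only a sketch: SAMPLE, count_cells and count_table are names made up for illustration, and the in-memory sample assumes the real file really is tab-separated with comma-separated names in each cell.

```python
import io

# Hypothetical stand-in for the real file: tab-separated cells,
# comma-separated name variants inside each cell.
SAMPLE = (
    "Group_one\tJames,jaime,jim,jimmy\tRobert,Rob,bob\tSamuel,sam\n"
    "Group_two\tRichard,rick,dick\tRodney,Rod\n"
)


def count_cells(line):
    """Return the group label followed by the item count of each cell."""
    # Drop the trailing newline, then split on the first level (tabs).
    label, *cells = line.rstrip("\n").split("\t")
    # Split each cell on the second level (commas) and count the pieces.
    counts = [str(len(cell.split(","))) for cell in cells]
    return " ".join([label, *counts])


def count_table(handle):
    """Build one output row per non-blank input line."""
    return [count_cells(line) for line in handle if line.strip()]


for row in count_table(io.StringIO(SAMPLE)):
    print(row)
```

One thing to watch for: an empty cell still counts as 1, because splitting an empty string on ',' returns a list with one empty string in it.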
CodePudding user response:
This is the code that worked for me.
for lines in file_name_lines:
    # lines = lines.rstrip()
    # I removed the strip and kept the end-of-line '\n', which I use below to decide when to jump to a new line.
    lines = lines.rsplit('\t')
    # print(lines)
    for comma_separated_items in lines:
        comma_separated_items = comma_separated_items.rsplit(',')
        if comma_separated_items[-1].endswith('\n'):  # testing for newline
            print(len(comma_separated_items))
        else:
            print(len(comma_separated_items), end='\t')
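The original version ran everything together because print(..., end='\t') replaces the newline that print would normally add, so nothing ever ended a row. A possible alternative that does not depend on the trailing '\n' is to build each row first and print it once. The sketch below uses the standard csv module for the tab level; SAMPLE and count_rows are hypothetical names, and it assumes no quoting is used in the file (hence csv.QUOTE_NONE).

```python
import csv
import io

# Hypothetical in-memory sample shaped like the data in the question.
SAMPLE = (
    "Group_one\tJames,jaime,jim,jimmy\tRobert,Rob,bob\tSamuel,sam\n"
    "Group_two\tRichard,rick,dick\tRodney,Rod\n"
)


def count_rows(handle):
    """Return tab-joined rows: the group label, then one count per cell."""
    rows = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # csv.reader yields an empty list for blank lines
        label, *cells = row
        counts = [str(len(cell.split(","))) for cell in cells]
        rows.append("\t".join([label, *counts]))
    return rows


# One print call per row, so each row ends with exactly one newline.
for row in count_rows(io.StringIO(SAMPLE)):
    print(row)
```

With a real file, the csv documentation recommends opening it with newline='' before passing the handle to csv.reader.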