How can I combine every n rows into n columns?

I have a .txt file with 683,500 rows; every 7 rows describe a different person and contain:

  1. ID
  2. Name
  3. Work position
  4. Date 1 (year - month)
  5. Date 2 (year - month)
  6. Gross payment
  7. Service time

I would like to read that .txt and output (to JSON, CSV, TXT, or even a database) each person as a single 7-column row, for example:

ID    Name     Work position   Date 1   Date 2    Gross payment     Service time
ID    Name     Work position   Date 1   Date 2    Gross payment     Service time
ID    Name     Work position   Date 1   Date 2    Gross payment     Service time
ID    Name     Work position   Date 1   Date 2    Gross payment     Service time

Example in the txt:

00000000886
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-08
2021-09
30,556.04
15.7
00000000086
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-01
2021-09
30,556.04
15.7
00100000086
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-01
2021-09
30,556.04
15.7

import csv

#opening file

file = open(r"C:\Users\Redford\Documents\Proyecto automatizacion\data1.txt") #open file
total_lines = len(file.readlines()) #count lines
#print('Total lines:', total_lines)

#reading from file

file.seek(0) #rewind; readlines() above already consumed the file
content = file.read()
colist = content.split()
print(colist)


#read data from data1.txt and write in data2.txt

lines = open(r"C:\Users\Redford\Documents\Proyecto automatizacion\data1.txt")
with open('data2.txt', 'w') as f:
    for line in lines:
        f.write(line) #copies every line unchanged; nothing groups 7 lines into one row
lines.close()

I'm new to programming and I don't know how to translate my logic into code.

CodePudding user response:

Your code does not collect multiple lines to write them into one.

Use this approach:

  • read your file line by line
  • collect each line without a \n into a list
  • when the list reaches length 7, write it as a CSV row and clear the list
  • repeat until done

Create data file:

with open("t.txt", "w") as f:
    f.write("""00000000886\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-08\n2021-09\n30,556.04\n15.7
00000000086\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-01\n2021-09\n30,556.04\n15.7
00100000086\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-01\n2021-09\n30,556.04\n15.7""")

Program:

import csv

with open("t.csv","w",newline="") as wr, open("t.txt") as r:
    # create a csv writer
    writer = csv.writer(wr)

    # uncomment if you want a header over your data
    # h =  ["ID","Name","Work position","Date 1","Date 2",
    #       "Gross payment","Service time"]
    # writer.writerow(h)

    person = []
    for line in r: # could use enumerate as well, this works ok
        # collect line data minus the \n into list
        person.append(line.strip())

        # this person is finished, write, clear list
        if len(person) == 7:
            # leveraged the csv module writer, look it up if you need
            # to customize it further regarding quoting etc
            writer.writerow(person)
            person = [] # reset list for next person

    # something went wrong, your file is inconsistent, write remainder
    if person:
        writer.writerow(person)

print(open("t.csv").read())

Output:

00000000886,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-08,2021-09,"30,556.04",15.7
00000000086,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-01,2021-09,"30,556.04",15.7
00100000086,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-01,2021-09,"30,556.04",15.7

Readup: csv module - writer

The "Gross payment" field needs to be quoted because it contains a ',', which is the delimiter for CSV; the module does this automatically.
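To see that quoting behavior in isolation, here is a minimal sketch (written to an io.StringIO buffer so it is self-contained); with the default QUOTE_MINIMAL setting, only the field containing the delimiter gets quoted:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# only the middle field contains the ',' delimiter, so only it is quoted
writer.writerow(["00000000886", "30,556.04", "15.7"])
print(buf.getvalue())  # prints: 00000000886,"30,556.04",15.7
```

Pass quoting=csv.QUOTE_ALL to the writer if you prefer every field quoted.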

CodePudding user response:

On top of the excellent answer from @PatrickArtner, I would like to propose an itertools-based solution:

import csv
import itertools


def file_grouper_itertools(
        in_filepath="t.txt",
        out_filepath="t.csv",
        size=7):
    with open(in_filepath, 'r') as in_file,\
            open(out_filepath, 'w') as out_file:
        writer = csv.writer(out_file)
        args = [iter(in_file)] * size
        for block in itertools.zip_longest(*args, fillvalue=' '):
            # equivalent to: [x.strip() for x in block]
            block = ''.join(block).strip().split('\n')
            writer.writerow(block)

The idea here is to loop in blocks of the required size. For larger group sizes this gets faster simply because the main loop runs fewer iterations.
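The grouping trick is that `args` holds seven references to the *same* iterator, so each `zip_longest` row consumes seven consecutive items. A minimal sketch of the idiom on a plain list (group size 3 here, just for illustration):

```python
import itertools

items = list(range(10))
# three references to the SAME iterator: each output tuple consumes three items
args = [iter(items)] * 3
groups = list(itertools.zip_longest(*args, fillvalue=None))
print(groups)  # [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, None, None)]
```

The last, partial group is padded with the fill value, which is why the function above strips the padding before writing.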

Running some micro-benchmarking shows that your use case should benefit from this approach compared to the manual looping (adapted into a function):

import csv


def file_grouper_manual(
        in_filepath="t.txt",
        out_filepath="t.csv",
        size=7):
    with open(in_filepath, 'r') as in_file,\
            open(out_filepath, 'w') as out_file:
        writer = csv.writer(out_file)
        block = []
        for line in in_file:
            block.append(line.strip())
            if len(block) == size:
                writer.writerow(block)
                block = []
        if block:
            writer.writerow(block)

Benchmarking:

n = 100_000
k = 7
with open("t.txt", "w") as f:
    for i in range(n):
        f.write("\n".join(["0123456"] * k) + "\n")  # trailing \n so blocks don't run together


%timeit file_grouper_manual()
# 1 loop, best of 5: 325 ms per loop
%timeit file_grouper_itertools()
# 1 loop, best of 5: 230 ms per loop

Alternatively, you could use Pandas, which is very convenient, but requires that all the input fit into available memory (which should not be a problem in your case, but can be for larger inputs):

import numpy as np
import pandas as pd


def file_grouper_pandas(in_filepath="t.txt", out_filepath="t.csv", size=7):
    with open(in_filepath) as in_file:
        data = [x.strip() for x in in_file.readlines()]
    df = pd.DataFrame(np.array(data).reshape((-1, size)), columns=list(range(size)))
    df.to_csv(out_filepath)


%timeit file_grouper_pandas()
# 1 loop, best of 5: 666 ms per loop
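If the input ever does outgrow memory, one possible workaround (a sketch, not benchmarked; it assumes the raw file contains no tab characters and that its total line count is a multiple of the group size) is to stream it through pandas.read_csv with chunksize, reshaping one chunk at a time:

```python
import pandas as pd


def file_grouper_pandas_chunked(in_filepath="t.txt", out_filepath="t.csv", size=7):
    # sep="\t" keeps each raw line as a single field even when it contains
    # commas (assumes no tabs in the data); dtype=str preserves leading
    # zeros in the IDs; chunksize is a multiple of size so every chunk
    # reshapes cleanly into whole rows
    reader = pd.read_csv(in_filepath, header=None, sep="\t",
                         dtype=str, chunksize=size * 10_000)
    with open(out_filepath, "w", newline="") as out_file:
        for chunk in reader:
            rows = chunk[0].to_numpy().reshape((-1, size))
            pd.DataFrame(rows).to_csv(out_file, header=False, index=False)
```

Each chunk is written as it is processed, so memory use stays bounded by the chunk size rather than the file size.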

CodePudding user response:

If you do a lot of work with tables and data, NumPy and Pandas are really useful libraries to get comfortable with.

import numpy as np
import pandas as pd

columns = ['ID', 'Name' , 'Work position', 'Date 1 (year - month)', 'Date 2 (year - month)',
           'Gross payment', 'Service time']

with open('oldfile.txt', 'r') as stream:
    # read file into a list of lines
    lines = stream.readlines()
    # remove newline character from each element of the list.
    lines = [line.strip('\n') for line in lines]
    # Figure out how many rows there will be in the table
    number_of_people = len(lines) // 7  # integer division: each person spans 7 lines
    # Split data into rows
    data = np.array_split(lines, number_of_people)

# Convert data to pandas dataframe
df = pd.DataFrame(data, columns = columns)

Once you have converted the data to a pandas DataFrame, you can easily output it to any of the formats you listed. For example, to output to CSV (passing index=False so the row index is not written as an extra column):

df.to_csv('newfile.csv', index=False)

Or for JSON:

df.to_json('newfile.json')