Regular expression and Python to split lines but keep the first identifier for all new lines

I have an input CSV that looks like this (two columns, separated by "|"):

group_1|{""id"":1,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]},{""id"":2,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}
group_2|{""id"":5,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}

And I want to get output like this:

group_1|{""id"":1,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}
group_1|{""id"":2,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}
group_2|{""id"":5,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}

My usual solution and why it wouldn't work now:

Before, I would have used Notepad's regex search, searching for ({""id"") and replacing it with \|\r\n\1. After that, I would have imported the file into Excel and filled down the "group_x" value into each empty cell in that column. But my problem is that I have a huge file (several gigabytes), and this method would just freeze my PC. I'm sure Excel cannot even handle that many rows (several million). So I was hoping that someone could point me in the right direction. Perhaps a Python script with regular expressions? This would be especially useful because I have a basic understanding of these tools and will need to do several regex transformations after this first big one, and could then incorporate them into the same script. But I will be GRATEFUL for any kind of help. Thank you in advance.

CodePudding user response：

Try this

import re

# Open the input and output files; both are closed automatically on exit
with open("in.csv", "r") as f, open("out.csv", "w") as outfile:
    # Read line by line, for large file processing
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first "|" only: the id, then all records of that group
        _id, data = line.split("|", 1)

        # Put each record on its own line, then split into a list of records
        data = [r.strip(",") for r in re.sub(r'\{""id""', '\n{""id""', data).split("\n") if r]
        # Write one output line per record, each with the same id.
        # Plain writes are used because csv.writer would quote the field
        # and double every quote in it.
        for d in data:
            outfile.write(f"{_id}|{d}\n")

I hope the comments are enough to understand the code.
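
Since the question mentions running several more regex transformations in the same script, here is a minimal sketch of how the steps could be chained with generators so the file is still processed one line at a time. The function names (`split_records`, `apply_substitutions`, `process`) and the placeholder substitution are assumptions for illustration, not part of the original post; it also assumes every record starts with `{""id""`.

```python
import re

# Boundary between two records: the comma right before the next {""id""
RECORD_BOUNDARY = re.compile(r',(?=\{""id"")')

# Hypothetical follow-up transformations as (compiled pattern, replacement) pairs.
# The single entry here (collapse runs of spaces) is only a placeholder.
SUBSTITUTIONS = [
    (re.compile(r" {2,}"), " "),
]


def split_records(lines):
    """Yield one 'group|record' string per record, repeating the group name."""
    for line in lines:
        line = line.rstrip("\r\n")
        if "|" not in line:
            continue  # skip blank or malformed lines
        group, data = line.split("|", 1)
        for record in RECORD_BOUNDARY.split(data):
            yield f"{group}|{record}"


def apply_substitutions(lines, substitutions=SUBSTITUTIONS):
    """Apply each regex substitution, in order, to every line."""
    for line in lines:
        for pattern, replacement in substitutions:
            line = pattern.sub(replacement, line)
        yield line


def process(in_path, out_path):
    """Stream in_path through both steps and write the result to out_path."""
    with open(in_path, "r") as src, open(out_path, "w") as dst:
        for line in apply_substitutions(split_records(src)):
            dst.write(line + "\n")
```

Calling `process("in.csv", "out.csv")` would run the whole chain. Because each step is a generator, only one line is held in memory at a time, and further transformations can be added by appending to `SUBSTITUTIONS`.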

CodePudding user response：

Reading the file row by row may sound like an unsophisticated way to solve this problem, but I think it is an intuitive way to handle it.

First, read the file one row at a time. Split each row on "|" to separate the group name ("group_x") from the records. Next, split the records at the commas that separate them to get the individual records you want.

This approach can raise an error such as "list index out of range" (for example, on a line without "|"), so wrap it in a try/except for IndexError.

Finally, you can get your data with the code below. It may not be exactly the result you want, but I hope it gives you an idea. Thanks.

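with open("file.txt") as f:
    for line in f:
        parts = line.rstrip("\n").split("|", 1)
        try:
            # Mark each "]},{" record boundary with a newline, then split on it,
            # so every record keeps its closing "]}"
            records = parts[1].replace("]},{", "]}\n{").split("\n")
        except IndexError:
            continue  # no "|" separator on this line
        for record in records:
            print(parts[0] + "|" + record)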
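
As an alternative to catching IndexError, `str.partition` can be used: it never raises, and it returns an empty separator when "|" is missing from the line. A minimal sketch; `split_group` is a hypothetical helper name.

```python
def split_group(line):
    """Return (group, records), or None when the line has no "|" separator."""
    group, sep, records = line.rstrip("\r\n").partition("|")
    if not sep:
        return None  # blank or malformed line
    return group, records
```

Only the first "|" is used as the split point, so any "|" that appears later in the data stays in the records part.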

CodePudding user response：

I suggest using str.split if you know the exact structure of the string.

import csv

# marker that starts every record; used to find the boundaries between records
RECORD_START = '{""id""'

# read your csv with delimiter="|"; QUOTE_NONE keeps the doubled quotes exactly as they are
with open('your_data.csv', 'r', newline='') as csvfile, \
        open('output.csv', 'w', newline='') as f:
    reader = csv.reader(csvfile, delimiter='|', quoting=csv.QUOTE_NONE)

    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        col_1 = row[0]

        # split at the comma before each following "id", then restore the marker that split() removed
        parts = row[1].split(',' + RECORD_START)
        records = [parts[0]] + [RECORD_START + p for p in parts[1:]]

        # write one line per record, repeating the group name. Additionally you can make
        # some modifications to the extracted columns here. Plain writes are used because
        # csv.writer would quote the field and double every quote in it.
        for col_2 in records:
            f.write(col_1 + '|' + col_2 + '\n')

If your dataset is large, take a look at pandas.

PS: output.csv would be:

group_1|{""id"":1,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}
group_1|{""id"":2,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}
group_2|{""id"":5,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}
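
A note on quoting: with its default settings, `csv.writer` wraps any field that contains a double quote in quotes and doubles every quote inside it, so a `""` in the input comes out as `""""`. The sketch below contrasts that with writing the line as plain text. The one-record sample row and the use of `io.StringIO` in place of a real file are assumptions for illustration.

```python
import csv
import io

row = ['group_1', '{""id"":1}']

# Default csv.writer settings: the second field contains quote characters,
# so it is wrapped in quotes and every inner quote is doubled
quoted = io.StringIO()
csv.writer(quoted, delimiter='|').writerow(row)

# Plain string join: the field is written exactly as it was read
plain = io.StringIO()
plain.write('|'.join(row) + '\n')

print(quoted.getvalue(), end='')  # group_1|"{""""id"""":1}"
print(plain.getvalue(), end='')   # group_1|{""id"":1}
```

If the output has to match the input's quoting exactly, writing plain text is the simpler option.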