Basic Python text extraction scenario

I am currently working with a text file that looks like this.

NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER
NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER
NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER

I would like to extract the numbers (just the digits) and save them all to a text file that would read:

6367283940
6367283940
6367283940

How would I go about doing this?

I am brand new.

CodePudding user response：

There are a few ways you might approach this.

Regex

A simple regex pattern should work.

import re
text = """\
NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER
NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER
NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER
"""
pattern = r'^NUMBER = (\d+)'

# re.MULTILINE makes ^ match at the start of every line, not just the first
for number in re.findall(pattern, text, flags=re.MULTILINE):
    print(number)

6367283940
6367283940
6367283940

For an explanation of the regex, see this regex101 link.
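As a rough breakdown of what the pattern does, here is the same idea written with re.VERBOSE so each part can carry a comment. This is only a sketch: the names PATTERN and extract_numbers are mine, and it assumes every line starts with "NUMBER = " followed by digits, as in the sample.

```python
import re

# re.VERBOSE ignores unescaped whitespace in the pattern and allows comments,
# so each literal space has to be written as "\ ".
PATTERN = re.compile(
    r"""
    ^NUMBER     # the literal word NUMBER at the start of a line
    \ =\        # a space, an equals sign, a space
    (\d+)       # capture group: one or more digits
    """,
    re.VERBOSE | re.MULTILINE,  # MULTILINE lets ^ match at every line start
)


def extract_numbers(text):
    """Return every captured number in text, as strings (hypothetical helper)."""
    return PATTERN.findall(text)


sample = (
    "NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER\n"
    "NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER\n"
)
print(extract_numbers(sample))
```

Because the pattern has exactly one capture group, findall returns just the digits rather than the whole match.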

String splitting

A more rudimentary way is to use regular string operations, like .split.

with open('mytext.txt') as f:
    for line in f:
        fields = line.split('|')
        number_field = fields[0]
        # strip() removes the space left between the number and the '|'
        _, number = number_field.strip().split(' = ')
        print(number)

Csv/pandas

Because your file is pipe-delimited, you could also use the csv module or pandas as Nuno Carvalho answered.
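For the csv module route, something along these lines might work. It is a minimal sketch, not a tested solution for your real file: the function name numbers_from_pipe_file is made up for illustration, and it assumes the number is always in the first field in the form "NUMBER = <digits>".

```python
import csv
import io


def numbers_from_pipe_file(lines):
    """Yield the value of the first pipe-delimited field on each line."""
    reader = csv.reader(lines, delimiter="|")
    for row in reader:
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        # row[0] looks like "NUMBER = 6367283940 "
        _, _, value = row[0].partition("=")
        yield value.strip()


# io.StringIO stands in for an open file here; open('mytext.txt', newline='')
# could be passed in the same way.
sample = io.StringIO(
    "NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER\n"
    "NUMBER = 6367283940 |  FOOD = PASTA | NAME = JOHN WALKER\n"
)
print(list(numbers_from_pipe_file(sample)))
```

Using partition("=") instead of split(" = ") keeps this from raising if the spacing around the equals sign varies.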

CodePudding user response：

This script should work if you name your text file input.txt; you can change the file names at the top of the code. I added comments to make each step clear for someone without much experience. I hope it helps.

INPUT_FILE = "./input.txt"
OUTPUT_FILE = "./output.txt"


def main():
    result_numbers = []
    with open(INPUT_FILE) as file:                      # open the text file in read-only mode
        lines = file.readlines()                        # read all lines into a list
        for line in lines:                              # iterate through the lines
            first_field = line.split("|")[0].strip()    # we only need the first field, without the surrounding spaces
            number = first_field.split("=")[1].strip()  # we need the part after the = without the space before it
            result_numbers.append(number)               # add the number to the result list
    with open(OUTPUT_FILE, "w") as file:                # open a new text file in write mode to save the results
        file.write("\n".join(result_numbers))           # join the results with line breaks and write them to that file


if __name__ == '__main__':
    main()

If you have any questions, feel free to ask.

CodePudding user response：

First, open the text file and use the readlines method to read its lines into a list. Then loop through the lines, split each one on spaces, and append the third element (which is the number in every line) to the string numbers, followed by \n. Finally, write that string to a text file.

with open("data.txt") as file:
    data = file.readlines()

numbers = ""
for line in data:
    numbers += line.split(" ")[2]  # third space-separated token is the number
    numbers += "\n"

with open("numbers.txt", mode="w") as file:
    file.write(numbers)

CodePudding user response：

# input.txt is the input file and output.txt is the output file.

import re

with open('input.txt') as file:
    lines = [line.rstrip() for line in file]

start = 'NUMBER = '
end = 'FOOD'

# 'a' appends to output.txt on every run; use 'w' to overwrite instead
with open('output.txt', 'a') as file_out:
    for x in lines:
        # capture the digits that follow start, on a line that also contains end
        match = re.search(r'%s(\d+).*%s' % (start, end), x)
        if match:  # skip blank or malformed lines
            file_out.write(match.group(1) + '\n')

CodePudding user response：

I suggest using pandas.

1 - Install the module.

pip install pandas

2 - Save that text in a file named "text.csv".

3 - Run this script

import pandas as pd

data = pd.read_csv("text.csv", header=None, sep="|")

print(data[0])

# Remove 'NUMBER = ' and the trailing space left before the '|'
numbers = data[0].str.replace("NUMBER = ", "", regex=False).str.strip()


# The output will be here
numbers.to_csv("your-numbers.csv", header=False, index=False)

Result:

your-numbers.csv

6367283940
6367283940
6367283940