How to delete all lines that contain a number lower than given value from a text file?

Time:07-14

I'm trying to remove every line from a file that contains any number below -2000. I'm quite new to Python, so I probably don't understand the re module properly, and I'm not sure the method I'm using is the right one.

Here is the sample file:

{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 }
{ "Position": { "X": -494.0083, "Y": 57.33647, "Z": 56.59263 }
{ "Position": { "X": -1039.662, "Y": -2641.444, "Z": 36.96656 }

And here is what I've got:

with open('file.json','r') as input:
    with open("temp.json", 'w') as output:
        for line in input:
            match = re.search(r'('-'\d+)', line)
            my_number = float(match.group())
            if my_number < -2000:
                output.write(line.strip())

So far, I'm fairly sure that the '-' in re.search(r'('-'\d+)', line) is wrong. I'm also not sure about the proper use of match.group().

If anyone would guide me in the right direction or suggest a different method I would be grateful.

CodePudding user response:

Don't use a regular expression. Parse the line as JSON and test that.

import json

with open('file.json', 'r') as infile, open('temp.json', 'w') as outfile:
    for line in infile:
        d = json.loads(line)
        # Keep the line only if no coordinate is below -2000.
        if all(x >= -2000 for x in d['Position'].values()):
            outfile.write(line)

Also, don't write line.strip(). This will remove the newlines, so the output file will have everything on one long line, instead of a separate line for each item like the input file.
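A quick illustration of the difference, using in-memory io.StringIO objects in place of real files (the three sample lines are made up for the demonstration):

```python
import io

lines = ["first\n", "second\n", "third\n"]

# Writing line.strip() drops the trailing newline of every line.
stripped = io.StringIO()
for line in lines:
    stripped.write(line.strip())

# Writing the line unchanged keeps one item per line.
kept = io.StringIO()
for line in lines:
    kept.write(line)

print(repr(stripped.getvalue()))
print(repr(kept.getvalue()))
```

The first result is a single run-together string, while the second keeps the original line breaks.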

This answer assumes that you made a mistake when copying the lines into your question. To be valid JSON the lines would have to be like:

{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 } }

with a second } at the end.
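If the lines in the real file really do lack the final brace, one possible workaround is sketched below. It assumes the missing closing } is the only defect, which is a guess based on the sample in the question; keep_line is a hypothetical helper name, not part of any library.

```python
import json


def keep_line(line, threshold=-2000.0):
    """Return True if every value under "Position" is >= threshold."""
    text = line.strip()
    try:
        record = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: the only problem is a missing final "}".
        # If that guess is wrong, this second call raises again.
        record = json.loads(text + "}")
    return all(value >= threshold for value in record["Position"].values())


# Sample lines exactly as given in the question (one brace short).
sample = [
    '{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 }\n',
    '{ "Position": { "X": -494.0083, "Y": 57.33647, "Z": 56.59263 }\n',
    '{ "Position": { "X": -1039.662, "Y": -2641.444, "Z": 36.96656 }\n',
]

kept = [line for line in sample if keep_line(line)]
print("".join(kept), end="")
```

Only the second sample line survives, since the other two each contain a Y value below -2000.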

CodePudding user response:

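You can use a regex like r'-[2-9]\d{3,}', which matches negative numbers at or below -2000.

Explanation: it matches - for the negative sign, then a digit between 2 and 9, followed by three or more digits. Because the first digit must be 2-9, it will not match numbers such as -10000 whose integer part starts with 1.

Why use regex? It's actually faster than an approach with json.loads (see below).

The downside, however, is that it also matches -2000 itself, which is not lower than -2000.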

import re

# please note: not valid JSON here
line = '{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 }'

match = re.search(r'-[2-9]\d{3,}', line)

print(bool(match))
print(match.group(0))

Output:

True
-3107
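To see where the pattern's boundaries lie, here is a small check against a few probe values. The values are made up for illustration and do not come from the question's file.

```python
import re

# Same pattern as in the snippet above.
pattern = re.compile(r'-[2-9]\d{3,}')

# Hypothetical probe values chosen to sit around the boundaries.
samples = ['-1999.9', '-2000', '-2000.5', '-9999', '-10000', '2500']

results = {text: bool(pattern.search(text)) for text in samples}
for text, matched in results.items():
    print(f'{text:>8}  matched={matched}')
```

The pattern matches -2000, -2000.5 and -9999, but not -1999.9, 2500 or -10000. The last case is worth noting: a first digit of 1 fails the [2-9] class, so five-digit values starting with 1 slip through.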

To handle the edge case of numbers that are exactly -2000 (which should not count as lower), you can use re.findall to collect every number the pattern matches, convert each one to float, and compare it against the target, much as you did in your original code:

import re


NEG_TARGET_RE = re.compile(r'-[2-9]\d{3,}(?:\.\d+)?')


def has_any_num_less_than_target(line, target=-2000) -> bool:
    for m in NEG_TARGET_RE.findall(line):
        if float(m) < target:
            return True

    return False


line = '{ "Position": { "X": -1660.313, "Y": -2000.795, "Z": 12.85458 }'
print(has_any_num_less_than_target(line))  # True

line = '{ "Position": { "X": -1660.313, "Y": -2000, "Z": -2000.000 }'
print(has_any_num_less_than_target(line))  # False
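If the threshold may change, or values such as -10000 can occur, a variant that does not encode the threshold in the pattern may be easier to maintain. The sketch below extracts every plain decimal number and leaves the comparison to float. NUMBER_RE and has_number_below are hypothetical names, and the pattern assumes there is no exponent notation and no digits inside key names that follow a hyphen.

```python
import re

# Assumption: numbers are plain decimals like -1660.313 (no "1e5" forms).
NUMBER_RE = re.compile(r'-?\d+(?:\.\d+)?')


def has_number_below(line, target=-2000.0):
    """Return True if any number found in the line is strictly below target."""
    return any(float(m) < target for m in NUMBER_RE.findall(line))


line = '{ "Position": { "X": -1660.313, "Y": -10000.5, "Z": 12.85458 }'
print(has_number_below(line))

line = '{ "Position": { "X": -1660.313, "Y": -2000, "Z": -2000.000 }'
print(has_number_below(line))
```

This trades some of the speed of the narrower pattern for generality, since every number on the line is converted rather than only the likely candidates.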

Performance Comparison

A quick benchmark on the three sample lines shows the regex approach running roughly 4x faster than the json.loads approach.

import json
from timeit import timeit

lines = """
{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 } }
{ "Position": { "X": -494.0083, "Y": 57.33647, "Z": 56.59263 } }
{ "Position": { "X": -1039.662, "Y": -2641.444, "Z": 36.96656 } }
"""

print('json:  ', timeit(r"""
for line in lines.strip().split('\n'):
    d = json.loads(line)
    if all(x >= -2000 for x in d['Position'].values()):
        ...
""", globals=globals(), number=1000))

print('re:    ', timeit(r"""
for line in lines.strip().split('\n'):
    if not has_any_num_less_than_target(line):
        ...
""", globals=globals(), number=1000))

Result:

json:   0.004367416957393289
re:     0.0011514590587466955