I'm trying to remove every line from a file that contains any number below -2000. I'm quite new to Python, and most likely I don't understand the re module; I'm also not sure about the method I am using.
Here is the sample file:
{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 }
{ "Position": { "X": -494.0083, "Y": 57.33647, "Z": 56.59263 }
{ "Position": { "X": -1039.662, "Y": -2641.444, "Z": 36.96656 }
And here is what I've got:
with open('file.json','r') as input:
    with open("temp.json", 'w') as output:
        for line in input:
            match = re.search(r'('-'\d+)', line)
            my_number = float(match.group())
            if my_number < -2000:
                output.write(line.strip())
As for now, I'm sure that in re.search(r'('-'\d+)', line) the '-' is wrong. I'm also not sure about the proper use of match.group().
If anyone could guide me in the right direction or suggest a different method, I would be grateful.
CodePudding user response:
Don't use a regular expression. Parse the line as JSON and test that.
import json

with open('file.json','r') as input, open("temp.json", 'w') as output:
    for line in input:
        d = json.loads(line)
        # keep the line only if every coordinate is at or above -2000
        if all(x >= -2000 for x in d['Position'].values()):
            output.write(line)
Also, don't write line.strip(). This will remove the newlines, so the output file will have everything on one long line, instead of a separate line for each item like the input file.
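To see the effect of strip() on the output, here is a small sketch that writes to in-memory buffers instead of real files. io.StringIO stands in for the output file, and the two sample lines are made up for illustration:

```python
import io

# two sample lines, each ending with a newline as they would when read from a file
sample_lines = ['{"a": 1}\n', '{"b": 2}\n']

stripped = io.StringIO()
unstripped = io.StringIO()
for line in sample_lines:
    stripped.write(line.strip())  # newline removed, records run together
    unstripped.write(line)        # newline kept, one record per line

stripped_text = stripped.getvalue()
unstripped_text = unstripped.getvalue()
print(repr(stripped_text))
print(repr(unstripped_text))
```

The first buffer ends up as a single line holding both records, while the second keeps one record per line.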
This answer assumes that you made a mistake when copying the lines into your question. To be valid JSON, each line would have to look like this:
{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 } }
with a second } at the end.
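If the file really does contain lines like the ones in the question, json.loads will raise json.JSONDecodeError on them. Here is a minimal sketch of one way to cope with that. The helper name keep_line is hypothetical, and keeping unparseable lines (rather than dropping them) is an assumption you may want to change:

```python
import json


def keep_line(line, threshold=-2000):
    """Return True if the line should be written to the output."""
    try:
        d = json.loads(line)
    except json.JSONDecodeError:
        # assumption: unparseable lines are kept rather than silently dropped
        return True
    return all(x >= threshold for x in d['Position'].values())


valid = '{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 } }'
broken = '{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 }'
print(keep_line(valid))   # False, Y is below -2000
print(keep_line(broken))  # True, the line is not valid JSON
```

In the loop you would then write the line only when keep_line(line) is true.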
CodePudding user response:
You can use a regex like r'-[2-9]\d{3,}', which matches a minus sign followed by a number of four or more digits whose first digit is 2-9. For coordinates of the size shown in the question, that means -2000 or lower.
Explanation: match - for negative, then a digit between 2 and 9, followed by 3 or more digits.
Why use regex? It's actually faster than an approach with json.loads (see below).
The downside, however, is that it also matches -2000 itself.
import re
# please note: not valid JSON here
line = '{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 }'
match = re.search(r'-[2-9]\d{3,}', line)
print(bool(match))
print(match.group(0))
Output:
True
-3107
To keep lines with numbers that are exactly -2000 (which is an edge case), you can use re.findall to find all numbers that are -2000 or below, then cast the matched values to float and do a comparison similar to how you had it above:
import re

# a minus sign, a 4+ digit integer part starting with 2-9, and an optional fraction
NEG_TARGET_RE = re.compile(r'-[2-9]\d{3,}(?:\.\d+)?')


def has_any_num_less_than_target(line, target=-2000) -> bool:
    for m in NEG_TARGET_RE.findall(line):
        if float(m) < target:
            return True
    return False


line = '{ "Position": { "X": -1660.313, "Y": -2000.795, "Z": 12.85458 }'
print(has_any_num_less_than_target(line))  # True

line = '{ "Position": { "X": -1660.313, "Y": -2000, "Z": -2000.000 }'
print(has_any_num_less_than_target(line))  # False
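The pattern above relies on the first digit being 2-9, so a value such as -10000 would slip through. A possible variation, offered as a sketch rather than a drop-in replacement, is to match every number on the line and leave the comparison entirely to float. The names NUMBER_RE and has_any_num_below are hypothetical, and the pattern assumes plain decimal numbers without exponents and no digits inside keys or string values:

```python
import re

# matches integers and decimals with an optional leading minus, e.g. -2000 or 12.85458
NUMBER_RE = re.compile(r'-?\d+(?:\.\d+)?')


def has_any_num_below(line, target=-2000) -> bool:
    # convert every number found on the line and compare it numerically
    return any(float(m) < target for m in NUMBER_RE.findall(line))


line = '{ "Position": { "X": -10000.5, "Y": 57.33647, "Z": 56.59263 }'
print(has_any_num_below(line))  # True

line = '{ "Position": { "X": -1660.313, "Y": -2000, "Z": -2000.000 }'
print(has_any_num_below(line))  # False
```

This does more float conversions per line than the narrower pattern, so it trades some speed for not depending on the number of digits.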
Performance Comparison
Quick benchmarks on the three sample lines show that the regex approach is ~4x faster than an approach with json.loads.
import json
from timeit import timeit

# has_any_num_less_than_target is the function defined above

lines = """
{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 } }
{ "Position": { "X": -494.0083, "Y": 57.33647, "Z": 56.59263 } }
{ "Position": { "X": -1039.662, "Y": -2641.444, "Z": 36.96656 } }
"""

print('json: ', timeit(r"""
for line in lines.strip().split('\n'):
    d = json.loads(line)
    if all(x >= -2000 for x in d['Position'].values()):
        ...
""", globals=globals(), number=1000))

print('re: ', timeit(r"""
for line in lines.strip().split('\n'):
    if not has_any_num_less_than_target(line):
        ...
""", globals=globals(), number=1000))
Result:
json: 0.004367416957393289
re: 0.0011514590587466955
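Timings this small on three lines can be noisy, so it may be worth repeating the measurement on a larger input. The following is only a sketch under stated assumptions: the three sample lines repeated 300 times stand in for a bigger file, filter_json and filter_re are hypothetical helper names, and no particular speedup is claimed for your machine:

```python
import json
import re
from timeit import repeat

NEG_TARGET_RE = re.compile(r'-[2-9]\d{3,}(?:\.\d+)?')


def has_any_num_less_than_target(line, target=-2000) -> bool:
    return any(float(m) < target for m in NEG_TARGET_RE.findall(line))


# assumption: the sample lines repeated 300 times approximate a larger file
sample = [
    '{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 } }',
    '{ "Position": { "X": -494.0083, "Y": 57.33647, "Z": 56.59263 } }',
    '{ "Position": { "X": -1039.662, "Y": -2641.444, "Z": 36.96656 } }',
] * 300


def filter_json():
    return [l for l in sample
            if all(x >= -2000 for x in json.loads(l)['Position'].values())]


def filter_re():
    return [l for l in sample if not has_any_num_less_than_target(l)]


# best of 5 runs of 20 calls each, to reduce the influence of noise
json_time = min(repeat(filter_json, number=20, repeat=5))
re_time = min(repeat(filter_re, number=20, repeat=5))
print(f'json: {json_time:.4f}s  re: {re_time:.4f}s')
```

Both helpers return the same filtered list for this input, so the comparison measures only the cost of each check.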