I want to extract text strings from a file using regex and add them to a list, so that I can create a new file with the extracted text. However, I can't separate the text I want to capture from the surrounding characters that the match also includes.
Example text:
#女
&10"「信号が乱れているみたい。聞こえる? アナタ?」"100("",3,2,0,0)
100("se003",0,2,0,0)
#男
&11"「──ポン太、もっと近づ────すぐ直す」"100("",4,2,0,0)
#女
&12"「……了解」"&13"またガニメデステーションに送られた通信信号と混線してしまったのだろう。別段慌てるような事ではなかった。"&14"作業船の方を確認した後、女はやるべき事を進めようとカプセルに視線を戻す。"52("_BGMName","bgm06")
42("BGM","00Sound.dat")
52("_GRPName","y12r1")42("DrawBG","00Draw.dat")#女
&15"「!?」"&16"睡眠保存カプセルは確かに止まっていたのに、その『中身』は止まっていなかった。"&17"スーツの外は真空状態で何も聞こえない。だが、その『中身』が元気よく泣いている事は見ればわかる。"100("",3,2,0,0)
100("se003",0,2,0,0)
#男
&18"「お──信号がまた──どうした!」"#女
&19"「信じられない。赤ちゃんよ。しかもこの子は……生きている。生きようとしてる!!」"100("",4,2,0,0)
I want to extract what is between &00"text to capture" and only keep what's between the quotation marks. I've tried various ways of writing the regex using non-capturing groups and lookahead/lookbehind, but Python always captures everything. What I currently have in the code below would work if the pattern only occurred once per line, but sometimes there are several per line, so I can't just add group 1 to the list as in #2 below.
In the code below #1 will append the corresponding string found on the line including the stuff I want to remove:
&10"「信号が乱れているみたい。聞こえる? アナタ?」"100("",3,2,0,0)
#2 will output what I actually want:
「信号が乱れているみたい。聞こえる? アナタ?」
but it only works if it occurs once per line so &13, &14 and &16, &17 disappear.
How can I add only the part I want to extract especially when it occurs multiple times per line?
# Code:
import re

def extract(filename):
    words = []
    with open(filename, 'r', encoding="utf8") as f:
        for line in f:
            if (re.search(r'(?<=&\d")(.+?"*)(?=")|(?<=&\d\d")(.+?"*)(?=")|(?<=&\d\d\d")(.+?"*)(?=")|(?<=&\d\d\d\d")(.+?"*)(?=")|(?<=&\d\d\d\d")(.+?"*)(?=")', line)):
                words.append(line)  #1 appends the whole matching line
                #2 words.append(re.split(r'(?<=&)\d+"(.+?)(?=")', line)[1])
    for line in words:
        print(line + "\n")
Answer:
You can shorten the pattern: match & followed by 1+ digits, and capture what is between the double quotes in group 1.
Read the whole file at once and use re.findall to get the capture group values.
&\d+"([^"]*)"
The pattern matches:
&\d+ : Match & and 1+ digits
" : Match the opening double quote
([^"]*) : Capture group 1, match any char except " (including newlines)
" : Match the closing double quote
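As a minimal sketch of how that could look in Python: the pattern is the one above, while the helper name `extract_from_text`, the optional `out_filename` parameter, and the shortened `sample` string are assumptions made for illustration and are not part of the original question. Because the pattern has exactly one capture group, `re.findall` returns only the text between the quotes, once for every match, even when several occur on the same line.

```python
import re

# & followed by one or more digits, then a double-quoted string.
# Group 1 holds only the text between the quotes.
PATTERN = re.compile(r'&\d+"([^"]*)"')


def extract_from_text(text):
    """Return every quoted string that directly follows an &<digits> marker."""
    return PATTERN.findall(text)


def extract(filename, out_filename=None):
    """Read the whole file at once and collect all captured strings.

    out_filename is a hypothetical convenience: when given, the captured
    strings are also written to that file, one per line.
    """
    with open(filename, "r", encoding="utf8") as f:
        words = extract_from_text(f.read())
    if out_filename is not None:
        with open(out_filename, "w", encoding="utf8") as out:
            out.write("\n".join(words) + "\n")
    return words


# Shortened sample with two &NN"..." entries on a single line.
sample = '&12"「……了解」"&13"別段慌てるような事ではなかった。"52("_BGMName","bgm06")'
for text in extract_from_text(sample):
    print(text)
```

The trailing `52("_BGMName","bgm06")` is skipped because its quoted strings are not preceded by & and digits.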