I am new to Python and would like to write a script that takes a .txt file as input and outputs the results to a .csv file.
The .txt files look as follows:
text:eub1
region:euboea
μενανδρεσεμεεποισε
I would like to write a script that creates a new row for each instance of μ or ν in the third line above. I also want each row to contain the text and region identifier. So the result should look like this:
text,region,letter
eub1,euboea,μ
eub1,euboea,ν
eub1,euboea,ν
eub1,euboea,μ
I don't really know where to start with the coding, so I'd be grateful for any advice on how to do this.
CodePudding user response:
Try:
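import pandas as pd

data = {}
# the text is Greek, so read the file explicitly as UTF-8
# (assumes the file is UTF-8 encoded)
with open("your_file.txt", "r", encoding="utf-8") as f_in:
    for line in map(str.strip, f_in):
        if line == "":
            continue
        if line.startswith("text:"):
            data["text"] = line.split(":", maxsplit=1)[-1]
        elif line.startswith("region:"):
            data["region"] = line.split(":", maxsplit=1)[-1]
        else:
            # keep every μ or ν, in order of appearance
            data["letter"] = [ch for ch in line if ch in "μν"]

# the scalar "text" and "region" values are repeated for each letter
df = pd.DataFrame(data)
print(df)

df.to_csv("data.csv", index=False)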
Prints:
   text  region letter
0  eub1  euboea      μ
1  eub1  euboea      ν
2  eub1  euboea      ν
3  eub1  euboea      μ
and saves data.csv:
text,region,letter
eub1,euboea,μ
eub1,euboea,ν
eub1,euboea,ν
eub1,euboea,μ
Content of your_file.txt:
text:eub1
region:euboea
μενανδρεσεμεεποισε
EDIT: To load from this file:
text:eub1
region:euboea
μενανδρεσεμεεποισε
text:eub2
region:xxx
μμμ
text:eub3
region:zzz
abc
you can try:
import pandas as pd

data = {}
# the text is Greek, so read the file explicitly as UTF-8
# (assumes the file is UTF-8 encoded)
with open("your_file.txt", "r", encoding="utf-8") as f_in:
    for line in map(str.strip, f_in):
        if line == "":
            continue
        if line.startswith("text:"):
            data.setdefault("text", []).append(line.split(":", maxsplit=1)[-1])
        elif line.startswith("region:"):
            data.setdefault("region", []).append(
                line.split(":", maxsplit=1)[-1]
            )
        else:
            # one list of matching letters per record
            data.setdefault("letter", []).append(
                [ch for ch in line if ch in "μν"]
            )

# explode() turns each list element into its own row
df = pd.DataFrame(data).explode("letter")
print(df)

df.to_csv("data.csv", index=False)
Prints:
   text  region letter
0  eub1  euboea      μ
0  eub1  euboea      ν
0  eub1  euboea      ν
0  eub1  euboea      μ
1  eub2     xxx      μ
1  eub2     xxx      μ
1  eub2     xxx      μ
2  eub3     zzz    NaN
and saves data.csv:
text,region,letter
eub1,euboea,μ
eub1,euboea,ν
eub1,euboea,ν
eub1,euboea,μ
eub2,xxx,μ
eub2,xxx,μ
eub2,xxx,μ
eub3,zzz,
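If you would rather not depend on pandas, the same idea can be sketched with only the standard library. This is a minimal sketch, not a drop-in replacement: the helper names `extract_rows` and `write_csv` are made up for illustration, and it assumes every record has its `text:` and `region:` lines before the line of letters. Unlike the pandas version above, it writes no row at all for a record with no μ or ν (such as eub3), which may or may not be what you want.

```python
import csv
import io


def extract_rows(lines, letters="μν"):
    """Yield (text, region, letter) for every matching character.

    Hypothetical helper: assumes "text:" and "region:" lines come
    before the line of letters they describe.
    """
    text = region = None
    for line in map(str.strip, lines):
        if not line:
            continue
        if line.startswith("text:"):
            text = line.split(":", maxsplit=1)[1]
        elif line.startswith("region:"):
            region = line.split(":", maxsplit=1)[1]
        else:
            # any other non-empty line is treated as the inscription text
            for ch in line:
                if ch in letters:
                    yield text, region, ch


def write_csv(rows, out):
    """Write a header plus one CSV row per (text, region, letter) tuple."""
    writer = csv.writer(out)
    writer.writerow(["text", "region", "letter"])
    writer.writerows(rows)


# In-memory demo using the sample data from the answer above.
sample = """text:eub1
region:euboea
μενανδρεσεμεεποισε
text:eub2
region:xxx
μμμ
text:eub3
region:zzz
abc
"""
buffer = io.StringIO()
write_csv(extract_rows(sample.splitlines()), buffer)
print(buffer.getvalue())

# With real files (file names are placeholders):
# with open("your_file.txt", encoding="utf-8") as f_in, \
#         open("data.csv", "w", newline="", encoding="utf-8") as f_out:
#     write_csv(extract_rows(f_in), f_out)
```

Because `extract_rows` is a generator, rows are written as they are found, so the whole file never has to be held in memory.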