I am currently try to have the data like this
noms sommets
0000 Abbesses
0001 Alexandre Dumas
...
coord sommets
0000 308 536
0001 472 386
0002 193 404
The data should be like this when converted to csv, the data should has three columns
nom | sommets | coord sommets |
---|---|---|
0000 | Abbesses | 308 536 |
However, everything in the data is a straight line and hard to deal with. What is the solution for this. I try to convert it from txt to csv.
CodePudding user response:
from pathlib import Path
import pandas as pd
f = Path("data.txt")
lines = [[], []]
file_num = 0
for line in f.read_text().split("\n"):
if not line:
continue
if line.startswith("coord"):
file_num = 1
lines[file_num].append(line.split(maxsplit=1))
def get_df(data):
df1 = pd.DataFrame(data)
df1.columns = df1.iloc[0]
df1 = df1.drop(index=0)
df1.columns.name = None
return df1
df1 = get_df(lines[0])
df2 = get_df(lines[1])
df2.columns = [df1.columns[0], " ".join(df2.columns)]
res = pd.merge(df1, df2, how="outer", on="noms")
# noms sommets coord sommets
# 0 0000 Abbesses 308 536
# 1 0001 Alexandre Dumas 472 386
# 2 0002 NaN 193 404
res.to_csv("data.csv")
Edit: to resolve the encoding issue pass the encoding you want to read_text()
.
for line in f.read_text(encoding="latin-1").split("\n"):
...
CodePudding user response:
without imports you can do this.
There's some safety checks due to the noise in the data.
Also, I'm using a dict as they are extremely fast when trying to find key/value pairs.
with open("metro", encoding="latin-1") as infile:
data = infile.read().splitlines()
nom_start = "noms sommets"
coord_start = "coord sommets"
end = "arcs values"
mode = None
# use a dict as lookups on dicts are stupidly fast.
result = {}
for line in data:
# this one is needed due to the first line
if mode == None:
if line == nom_start:
mode = nom_start
continue
line = line.strip()
# safety check
if line != "":
if line == end:
# skip the end data
break
key, value = line.split(maxsplit=1)
if mode == nom_start:
if line != coord_start:
result[key] = {"sommets": value}
else:
mode = coord_start
else:
result[key]["coord sommets"] = value
# CSV separator
SEP = ";"
with open("output.csv", "w", encoding="latin-1") as outfile:
# CSV header
outfile.write(f"noms{SEP}sommets{SEP}coord sommets\n")
for key, val in result.items():
outfile.write(f'{key}{SEP}{val["sommets"]}{SEP}{val["coord sommets"]}\n')
CodePudding user response:
Quite an interesting problem. I'm assuming the file contains more columns, or sets of key/variables, than just in the example. So you wouldn't want to hard-code the column names.
I would create an new empty dataframe, then read the input file line-by-line, check if it is the next new column name (not starting with digits), build a dictionary with those new values, and then keep merging that dictionary as a new columns into the new dataframe.
So I would do something like this:
import pandas as pd
# create an Empty DataFrame object
df_new = pd.DataFrame({"recordkey": []})
# read all input lines
inputfilename = "inputfile.txt"
file1 = open(inputfilename, 'r')
Lines = file1.readlines()
tmpdict = {}
colname = ""
# iterate through all lines
for idx in range(len(Lines)):
line = Lines[idx]
# this is assuming all keys are exactly 4 digits
iscolname = not (line[:4].isdigit())
if not iscolname:
# split on the first space for key and value
tmp = line.split(" ", 1)
getkey = tmp[0].strip()
getvalue = tmp[1].strip()
# add to dictionary
tmpdict[getkey] = getvalue
# new column or last line
if iscolname or idx == len(Lines)-1:
# new column (except skip for first line of file)
if colname != "":
# create new column from dictionary
df_tmp = pd.DataFrame(tmpdict.items(), columns=["recordkey", colname])
df_new = df_new.merge(df_tmp, how='outer', on='recordkey')
# keep new column name
colname = line.strip()
tmpdict = {}
# display dataframe
print(df_new)
# write dataframe to csv
fileoutput = "outputfile.csv"
df_new.to_csv(fileoutput, sep=",", index=False)