I'm dealing with a CSV file that repeats its header names within each row:
player: John Doe ; level: 45 ; last_login: 7854414174 ; coins: 7600
player: Anckx Uj ; level: 471 ; last_login: 7854418847 ; coins: 684111
I'd like to know how I can select only the values when importing it with pandas, so that the output looks like this:
Player level last_login coins
John Doe 45 7854414174 7600
Anckx Uj 471 7854418847 684111
I tried adding the header parameter, as I thought it would filter out the names repeated in the rows, without success.
import pandas as pd
df = pd.read_csv('base.txt', sep=';', header=None, names=['player', 'level', 'last_login', 'coins'])
This returns exactly the same content as the CSV (without the delimiter).
Any help would be appreciated.
CodePudding user response:
One solution might be to clean the rows after loading:
# strip everything up to and including the first colon, then trim whitespace
df = df.apply(lambda x: x.str.replace(r"^[^:]+:", "", regex=True).str.strip())
print(df)
Prints:
player level last_login coins
0 John Doe 45 7854414174 7600
1 Anckx Uj 471 7854418847 684111
And probably convert the level/coins columns to int:
df[["level", "coins"]] = df[["level", "coins"]].astype(int)
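For reference, here is a minimal end-to-end sketch of this approach. It assumes the two sample rows from the question and uses io.StringIO as a stand-in for base.txt; the variable names raw, names and numeric are only illustrative. It also converts last_login, which is my addition and not part of the answer above.

```python
import io
import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for base.txt.
raw = io.StringIO(
    "player: John Doe ; level: 45 ; last_login: 7854414174 ; coins: 7600\n"
    "player: Anckx Uj ; level: 471 ; last_login: 7854418847 ; coins: 684111\n"
)
names = ["player", "level", "last_login", "coins"]
df = pd.read_csv(raw, sep=";", header=None, names=names)

# Remove everything up to and including the first colon, then trim spaces.
df = df.apply(lambda col: col.str.replace(r"^[^:]+:", "", regex=True).str.strip())

# Every cell is still a string here, so convert the numeric columns explicitly.
# "int64" is spelled out because last_login does not fit in 32 bits.
numeric = ["level", "last_login", "coins"]
df = df.astype({name: "int64" for name in numeric})
print(df)
```

Passing regex=True explicitly matters on recent pandas versions, where str.replace treats the pattern as a literal string by default.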
CodePudding user response:
A proposition using pandas.DataFrame.pivot:
df = pd.read_csv("base.txt", header=None, names=["col"])
out = (
    # each match is one "name: value" pair
    df["col"].str.extractall(r"(\w+: \w+\s?\w+)")
    .reset_index(drop=True)[0]
    .str.split(":", expand=True)
    .assign(idx=lambda x: x.groupby(0).cumcount())
    .pivot(index="idx", columns=0)
    .reset_index(drop=True)
)
out.columns = out.columns.get_level_values(1)
# Output :
print(out)
0 coins last_login level player
0 7600 7854414174 45 John Doe
1 684111 7854418847 471 Anckx Uj
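One thing to watch with a pattern of the form \w+\s?\w+ is that it needs at least two word characters in the value, so a single-character value such as "level: 5" would not be captured. A possible variation, sketched here under the assumption that values never contain a ';', captures the name and the value as two named groups and restores the original column order after the pivot. The names raw, lines, pairs and order are only illustrative, and io.StringIO stands in for base.txt.

```python
import io
import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for base.txt.
raw = io.StringIO(
    "player: John Doe ; level: 45 ; last_login: 7854414174 ; coins: 7600\n"
    "player: Anckx Uj ; level: 471 ; last_login: 7854418847 ; coins: 684111\n"
)
# No commas in the data, so each line lands in a single column.
lines = pd.read_csv(raw, header=None, names=["col"])["col"]

# One (key, value) pair per match; the value runs up to the next ';'.
pairs = lines.str.extractall(r"(?P<key>\w+):(?P<value>[^;]*)")

# Keep the source line number of every pair, trim the values, flatten the index.
pairs = pairs.assign(
    row=pairs.index.get_level_values(0),
    value=pairs["value"].str.strip(),
).reset_index(drop=True)

out = pairs.pivot(index="row", columns="key", values="value")

# pivot sorts the columns alphabetically; put them back in file order.
order = pairs["key"].drop_duplicates().tolist()
out = out[order].reset_index(drop=True)
out.columns.name = None
print(out)
```

The values are still strings at this point, so a numeric conversion with astype would still be needed afterwards.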
CodePudding user response:
This seems like a row iteration problem, and I think the csv module makes this easy to understand and execute.
- Read the input file with the plain reader, which will give us a list of strings for each row.
- For each row:
  - create the empty dict new_row
  - iterate the columns and split on a colon (':') to get the header name and its value
  - build up new_row with the name-value pairs
  - append new_row to the list all_rows
- Use the DictWriter to convert all_rows into the final CSV
Here's the reading part:
import csv
all_rows = []
with open("input.csv", newline="") as f:
    reader = csv.reader(f, delimiter=";")
    for row in reader:
        new_row = {}
        # row looks like, ['player: John Doe ', ' level: 45 ', ' last_login: 7854414174 ', ' coins: 7600']
        for col in row:
            name, val = col.split(":", 1)
            new_row[name.strip()] = val.strip()
        all_rows.append(new_row)
print(all_rows)
That gives us:
[
{'player': 'John Doe', 'level': '45', 'last_login': '7854414174', 'coins': '7600'},
{'player': 'Anckx Uj', 'level': '471', 'last_login': '7854418847', 'coins': '684111'},
]
From that, we can use the DictWriter, giving it the first row as a sample of the fieldnames it should expect to find and write:
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, delimiter=";", fieldnames=all_rows[0])
    writer.writeheader()
    writer.writerows(all_rows)
Here's output.csv:
player;level;last_login;coins
John Doe;45;7854414174;7600
Anckx Uj;471;7854418847;684111
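Since the question asks for a pandas result, the list of dicts can also be handed to pandas directly instead of going through an intermediate file. This is only a sketch: it assumes the two sample rows from the question, uses io.StringIO in place of the input file, and the integer conversion at the end is my addition.

```python
import csv
import io
import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the input file.
raw = io.StringIO(
    "player: John Doe ; level: 45 ; last_login: 7854414174 ; coins: 7600\n"
    "player: Anckx Uj ; level: 471 ; last_login: 7854418847 ; coins: 684111\n"
)

all_rows = []
for row in csv.reader(raw, delimiter=";"):
    new_row = {}
    for col in row:
        # Split only on the first colon so values may contain colons themselves.
        name, val = col.split(":", 1)
        new_row[name.strip()] = val.strip()
    all_rows.append(new_row)

# A list of dicts maps directly onto DataFrame rows; the keys become the columns.
df = pd.DataFrame(all_rows)

# The csv module only produces strings, so convert the numeric columns.
df = df.astype({"level": "int64", "last_login": "int64", "coins": "int64"})
print(df)
```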