I have a table in a file and it looks like this:
+-------------+-----------------+---------------+---------------+--------------+
|number       |name             |very           |column4        |very long     |
|             |                 |long column3   |               |column 5      |
+-------------+-----------------+---------------+---------------+--------------+
|1.1          |Some             | Х             | Х             |2373833636.85 |
|             |position 1.1     |               |               |              |
+-------------+-----------------+---------------+---------------+--------------+
|1.2          |Some             | 0,5           | 0,70          |237383.90     |
|             |position 1.2     |               |               |              |
+-------------+-----------------+---------------+---------------+--------------+
|1.3          |Some             | Х             | 0,5           |2864583.90    |
|             |position 1.3     |               |               |              |
+-------------+-----------------+---------------+---------------+--------------+
The first row is the table header. The first two columns are strings; the other three columns are numbers (floats).
I need to convert this table to a pandas DataFrame:
number | name | very long column3 | column4 | very long column 5 |
---|---|---|---|---|
1.1 | Some position 1.1 | | | 2373833636.85 |
1.2 | Some position 1.2 | 0.5 | 0.7 | 237383.9 |
1.3 | Some position 1.3 | | 0.5 | 2864583.9 |
CodePudding user response:
The example you show us is a plain-text grid table. Pandas itself doesn't offer a way to parse (understand) that, and I don't know of another package that can break such a table down into its logical components.
The answer to your question is that it is not possible to parse that directly with pandas.
You could write your own Python code to parse it, but I wouldn't advise that.
My advice is to go back to your data provider and ask them to deliver the data in a machine-readable format, e.g. CSV.
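Just to illustrate what you would gain: if the provider delivered a clean CSV instead (hypothetical file name data.csv, dot as decimal separator, empty cells for missing values), the whole task would collapse into a single call:

import pandas as pd

# Hypothetical clean delivery: plain CSV, dot decimals, empty cells for missing values.
df = pd.read_csv("data.csv", dtype={"number": str, "name": str})
print(df.dtypes)  # the three numeric columns come out as float64 automatically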
CodePudding user response:
You could try this:
Step 1: Convert the file into a CSV file (adjust the file names accordingly):
import csv
from itertools import groupby

with open("file.txt", "r") as fin, open("file.csv", "w", newline="") as fout:
    writer = csv.writer(fout)
    # Group consecutive lines by whether they are "+---+---+" separator lines.
    for skip, lines in groupby(fin, lambda l: l.startswith("+")):
        if skip:
            continue
        # Split every physical line of the logical row into stripped cells.
        lines = (
            (part.strip() for part in line.strip().strip("|").split("|"))
            for line in lines
        )
        # Transpose with zip() and join the non-empty pieces of each column.
        line = (" ".join(filter(None, parts)) for parts in zip(*lines))
        writer.writerow(line)
- I'm using itertools.groupby() from the standard library to grab the lines that don't start with "+".
- Then the lines get .strip()ed of (1) the whitespace at the ends and (2) the "|" at the ends, .split() on "|", and the individual parts get .strip()ed again.
- Then the columns get built via zip(): only the truthy parts get " ".join()ed (a small standalone demo of this merging step follows after the CSV sample below).
- The lines built this way are written into a CSV file that looks like:
number,name,very long column3,column4,very long column 5
1.1,Some position 1.1,Х,Х,2373833636.85
1.2,Some position 1.2,"0,5","0,70",237383.90
1.3,Some position 1.3,Х,"0,5",2864583.90
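To make the merging step concrete, here is a minimal standalone sketch (the two sample lines are copied from the 1.2 row of the question; the variable names are just for illustration):

# Two physical lines that make up one logical row of the table.
lines = [
    "|1.2          |Some             | 0,5           | 0,70          |237383.90     |",
    "|             |position 1.2     |               |               |              |",
]
# Split each physical line into stripped cells ...
rows = [
    [part.strip() for part in line.strip().strip("|").split("|")]
    for line in lines
]
# ... then transpose with zip() and join the non-empty pieces per column.
merged = [" ".join(filter(None, parts)) for parts in zip(*rows)]
print(merged)
# ['1.2', 'Some position 1.2', '0,5', '0,70', '237383.90']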
Step 2: Read the CSV into a DataFrame and do some post-processing:
import pandas as pd

# chr(1061) is "Х" (Cyrillic capital Ha), used in the file as missing-value marker.
df = pd.read_csv("file.csv", na_values=[chr(1061)])
# Keep the "number" column as strings instead of floats.
df.iloc[:, 0] = df.iloc[:, 0].astype(str)
# Columns 3 and 4 use a decimal comma; convert them to proper floats.
for no in (2, 3):
    df.iloc[:, no] = df.iloc[:, no].str.replace(",", ".").astype(float)
- Use pd.read_csv() to read the CSV file into a dataframe, with the character with Unicode code point 1061 ("Х") added to the NaN values.
- Convert the first column to string.
- Replace the "," in columns 3 and 4 with a proper decimal point and convert them to float.
Result:
   number               name  very long column3  column4  very long column 5
0     1.1  Some position 1.1                NaN      NaN         2.373834e+09
1     1.2  Some position 1.2                0.5      0.7         2.373839e+05
2     1.3  Some position 1.3                NaN      0.5         2.864584e+06
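Optionally, a quick sanity check (a sketch, assuming the df built above):

# The first two columns should be object (strings), the last three float64,
# with NaN wherever the file had the "Х" placeholder.
print(df.dtypes)
print(df.isna().sum())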