I have a txt document with the following structure:
1:0.84722,0.52855;0.65268,0.24792;0.66525,0.46562
2:0.84722,0.52855;0.65231,0.24513;0.66482,0.46548
3:0.84722,0.52855;0.65197,0.24387;0.66467,0.46537
The first number with the colon is the index, and I don't know how to indicate it when I open the file. Indeed I would like to erase it. Then data is separated with commas and semicolons and I would like to have each number in a different column, regardless of whether the separator is a comma or a semicolon. How could I do it?
CodePudding user response:
Use the following to load the csv using pd.read_csv:
import pandas as pd
df = pd.read_csv("data.csv", # the file path, change it to your filename
sep="[,;:]", # the separator use a regular expression
engine="python", # need this to use regular expression as sep
usecols=range(1, 7), # use columns from [1, 7)
header=None # no header
)
print(df)
Output
1 2 3 4 5 6
0 0.84722 0.52855 0.65268 0.24792 0.66525 0.46562
1 0.84722 0.52855 0.65231 0.24513 0.66482 0.46548
2 0.84722 0.52855 0.65197 0.24387 0.66467 0.46537
Note
Once you load the file I advise to save it (using to_csv) as a proper csv file.
CodePudding user response:
First load it as a regular text (no .csv parser). Remove all the unwanted text, colums ... Replace the regular commas with semicolons.
To remove the first number, you can split the line by :
and keep only the second part.
To replace the rest, use replace(",", ";")
.
Now you have a nice .csv that you can read and parse immediatelly.
CodePudding user response:
As you are using pandas.read_csv already, simply have a look at its documentation for argument sep
:
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s ' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
So in your case, simply calling pandas.read_csv(..., sep='[,;:]')
should do the trick.
CodePudding user response:
For using multiple separators, you can pass a regex
pattern like :|,|;
that basically says any one of these.
For not using the first column - you can specify the columns with usecols
The engine="python"
specification is optional, as separators longer than 1 character will force the use of the python engine.
x = '''1:0.84722,0.52855;0.65268,0.24792;0.66525,0.46562
2:0.84722,0.52855;0.65231,0.24513;0.66482,0.46548
3:0.84722,0.52855;0.65197,0.24387;0.66467,0.46537'''
xf = io.StringIO(x)
df = pd.read_csv(xf, sep=":|,|;", engine='python', header=None, usecols=list(range(1, 7)))
print(df)
# 1 2 3 4 5 6
#0 0.84722 0.52855 0.65268 0.24792 0.66525 0.46562
#1 0.84722 0.52855 0.65231 0.24513 0.66482 0.46548
#2 0.84722 0.52855 0.65197 0.24387 0.66467 0.46537