Python open a csv document that have different types of separators-CodePudding

I have a txt document with the following structure:

1:0.84722,0.52855;0.65268,0.24792;0.66525,0.46562
2:0.84722,0.52855;0.65231,0.24513;0.66482,0.46548
3:0.84722,0.52855;0.65197,0.24387;0.66467,0.46537

The first number with the colon is the index, and I don't know how to indicate it when I open the file. Indeed I would like to erase it. Then data is separated with commas and semicolons and I would like to have each number in a different column, regardless of whether the separator is a comma or a semicolon. How could I do it?

CodePudding user response：

Use the following to load the csv using pd.read_csv:

import pandas as pd

df = pd.read_csv("data.csv",  # the file path, change it to your filename 
                 sep="[,;:]",  # the separator use a regular expression
                 engine="python",  # need this to use regular expression as sep
                 usecols=range(1, 7),  # use columns from [1, 7)
                 header=None  # no header
                 )
print(df)

Output

         1        2        3        4        5        6
0  0.84722  0.52855  0.65268  0.24792  0.66525  0.46562
1  0.84722  0.52855  0.65231  0.24513  0.66482  0.46548
2  0.84722  0.52855  0.65197  0.24387  0.66467  0.46537

Note
Once you load the file I advise to save it (using to_csv) as a proper csv file.

CodePudding user response：

First load it as a regular text (no .csv parser). Remove all the unwanted text, colums ... Replace the regular commas with semicolons.

To remove the first number, you can split the line by : and keep only the second part.

To replace the rest, use replace(",", ";").

Now you have a nice .csv that you can read and parse immediatelly.

CodePudding user response：

As you are using pandas.read_csv already, simply have a look at its documentation for argument sep:

Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s ' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.

So in your case, simply calling pandas.read_csv(..., sep='[,;:]') should do the trick.

CodePudding user response：

For using multiple separators, you can pass a regex pattern like :|,|; that basically says any one of these.

For not using the first column - you can specify the columns with usecols

The engine="python" specification is optional, as separators longer than 1 character will force the use of the python engine.

x = '''1:0.84722,0.52855;0.65268,0.24792;0.66525,0.46562
2:0.84722,0.52855;0.65231,0.24513;0.66482,0.46548
3:0.84722,0.52855;0.65197,0.24387;0.66467,0.46537'''
xf = io.StringIO(x)
df = pd.read_csv(xf, sep=":|,|;", engine='python', header=None, usecols=list(range(1, 7)))
print(df)
#         1        2        3        4        5        6
#0  0.84722  0.52855  0.65268  0.24792  0.66525  0.46562
#1  0.84722  0.52855  0.65231  0.24513  0.66482  0.46548
#2  0.84722  0.52855  0.65197  0.24387  0.66467  0.46537