Proper way of cleaning csv file-CodePudding

I've got a huge CSV file, which looks like this:

1. 02.01.18;"""2,871""";"""2,915""";"""2,871""";"""2,878""";""" 1,66 %""";"""57.554""";"""166.075 EUR""";"""0,044"""
2. 03.01.18;"""2,875""";"""2,965""";"""2,875""";"""2,925""";""" 1,63 %""";"""39.116""";"""114.441 EUR""";"""0,090"""
3. 04.01.18;"""2,915""";"""3,005""";"""2,915""";"""2,988""";""" 2,15 %""";"""58.570""";"""174.168 EUR""";"""0,090"""

In the end I only want to extract the date and ratio. The dataset should look like this:

1.02.01.18,  1,66 %
2.03.01.18,  1,63 %
3.04.01.18,  2,15 %

I tried this and until now I'm just getting more trouble:

import pandas as pd
df = pd.read_csv("Dataset.csv", nrows=0)
print(df)
data = []
for response in df:
    data.append(
       response.split(';')
    )
print(data[0])

Do you know some better way to clean up this dataset?

CodePudding user response：

Using pandas:

import pandas as pd

df = pd.read_csv('data.csv', sep=';', usecols=[0,5], names=['date', 'rate'])
df.rate = df.rate.str.strip('"')
print(df)

Result:

          date     rate
0  1. 02.01.18   1,66 %
1  2. 03.01.18   1,63 %
2  3. 04.01.18   2,15 %

As mentioned in the comments, you probably don't need the extra index in the date columns. Also the index and excessive quoting suggests the file was not created properly in the first place and the process should be fixed.

Note, right now both columns are type str, which is probably not what you want...

CodePudding user response：

You can use a regular expression for this:

regex = re.compile(r'([\d\. ] ).*([ -][\d, %] )')
date, ratio = regex.match(s).groups()
date = date.replace(' ', '')

Test:

>>> date
'2.03.01.18'

>>> ratio
' 1,63 %'