I need some help with pandas, I'm trying to clean up csv files. I have three types of CSV
- correct and expected csv
0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|
100 | 200 | 300 | 400 | 500 |
- type one clumped
0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|
100 | 200 | 300 400 | NaN | 500 |
- type two clumped
0 | 1 | 2 | 3 |
---|---|---|---|
100 | 200 | 300 400 500 | NaN |
I'm trying to correct the csv 2 and 3 so that it will become like csv 1
Code
import glob
import pandas as pd
dir = r'D:\csv_files'
file_list = glob.glob(dir '/*.csv')
files = []
for filename in file_list:
df = pd.read_csv(filename, header=None)
split = df.pop(2).str.split(' ', expand=True)
df.join(split, how='right', lsuffix = '_left', rsuffix = '_right')
print(df)
output:
0 1 2 3 4
0 100 200 300 400 500
0 1 3 4
0 100 200 NaN 500
0 3
0 100 NaN
Goal:
0 1 2 3 4
0 100 200 300 400 500
0 1 2 3 4
0 100 200 300 400 500
0 1 2 3 4
0 100 200 300 400 500
I printed out the split
and it's correct, however, I'm unable to find how can I put it back into the main data frame.
Thanks in advance
CodePudding user response:
You might find it easier to pre-parse the data using a standard Python csv.reader()
. This could be used to split up any 'clumped' values and then flatten them back into a single list.
For example:
import pandas as pd
from itertools import chain
import glob
import csv
data = []
for fn in glob.glob('rate*.csv'):
with open(fn) as f_input:
csv_input = csv.reader(f_input)
for row in csv_input:
values = chain.from_iterable(value.split(' ') for value in row[2:] if value)
data.append([row[0], row[1], *values])
df = pd.DataFrame(data, columns=range(6))
print(df)
This would give you a dataframe starting:
0 1 2 3 4 5
0 Montserrat Manzini 6 6 5 6
1 Madagascar San Juan 10 4 9 8
2 Botswana Tehran 2 10 9 10
3 Syrian Arab Republic Fairbanks 2 4 9 2
4 Guinea Punta Arenas 5 1 6 3