Pandas split column with inconsistent data and inconsistent columns-CodePudding

I need some help with pandas, I'm trying to clean up csv files. I have three types of CSV

correct and expected csv

0	1	2	3	4
100	200	300	400	500

type one clumped

0	1	2	3	4
100	200	300 400	NaN	500

type two clumped

0	1	2	3
100	200	300 400 500	NaN

I'm trying to correct the csv 2 and 3 so that it will become like csv 1

Code

import glob
import pandas as pd


dir = r'D:\csv_files'

file_list = glob.glob(dir  '/*.csv')
files = []
for filename in file_list:
    df = pd.read_csv(filename, header=None)
    split = df.pop(2).str.split(' ', expand=True)
    df.join(split, how='right', lsuffix = '_left', rsuffix = '_right')
    print(df)

output:

  0    1   2   3   4
0 100 200 300 400 500

  0    1   3   4
0 100 200 NaN 500

  0    3
0 100 NaN

Goal:

   0   1   2   3   4
0 100 200 300 400 500

   0   1   2   3   4
0 100 200 300 400 500

   0   1   2   3   4
0 100 200 300 400 500

I printed out the split and it's correct, however, I'm unable to find how can I put it back into the main data frame.

Thanks in advance

CodePudding user response：

You might find it easier to pre-parse the data using a standard Python csv.reader(). This could be used to split up any 'clumped' values and then flatten them back into a single list.

For example:

import pandas as pd    
from itertools import chain
import glob
import csv

data = []

for fn in glob.glob('rate*.csv'):
    with open(fn) as f_input:
        csv_input = csv.reader(f_input)
        
        for row in csv_input:
            values = chain.from_iterable(value.split(' ') for value in row[2:] if value)
            data.append([row[0], row[1], *values])

df = pd.DataFrame(data, columns=range(6))
print(df)

This would give you a dataframe starting:

                         0                                  1     2     3     4     5
0               Montserrat                            Manzini     6     6     5     6
1               Madagascar                           San Juan    10     4     9     8
2                 Botswana                             Tehran     2    10     9    10
3     Syrian Arab Republic                          Fairbanks     2     4     9     2
4                   Guinea                       Punta Arenas     5     1     6     3