Home > Blockchain >  Pandas split column with inconsistent data and inconsistent columns
Pandas split column with inconsistent data and inconsistent columns

Time:03-17

I need some help with pandas, I'm trying to clean up csv files. I have three types of CSV

  1. correct and expected csv
0 1 2 3 4
100 200 300 400 500
  1. type one clumped
0 1 2 3 4
100 200 300 400 NaN 500
  1. type two clumped
0 1 2 3
100 200 300 400 500 NaN

I'm trying to correct the csv 2 and 3 so that it will become like csv 1

Code

import glob
import pandas as pd


dir = r'D:\csv_files'

file_list = glob.glob(dir  '/*.csv')
files = []
for filename in file_list:
    df = pd.read_csv(filename, header=None)
    split = df.pop(2).str.split(' ', expand=True)
    df.join(split, how='right', lsuffix = '_left', rsuffix = '_right')
    print(df)

output:

  0    1   2   3   4
0 100 200 300 400 500

  0    1   3   4
0 100 200 NaN 500

  0    3
0 100 NaN

Goal:

   0   1   2   3   4
0 100 200 300 400 500

   0   1   2   3   4
0 100 200 300 400 500

   0   1   2   3   4
0 100 200 300 400 500

I printed out the split and it's correct, however, I'm unable to find how can I put it back into the main data frame.

Thanks in advance

CodePudding user response:

You might find it easier to pre-parse the data using a standard Python csv.reader(). This could be used to split up any 'clumped' values and then flatten them back into a single list.

For example:

import pandas as pd    
from itertools import chain
import glob
import csv

data = []

for fn in glob.glob('rate*.csv'):
    with open(fn) as f_input:
        csv_input = csv.reader(f_input)
        
        for row in csv_input:
            values = chain.from_iterable(value.split(' ') for value in row[2:] if value)
            data.append([row[0], row[1], *values])

df = pd.DataFrame(data, columns=range(6))
print(df)

This would give you a dataframe starting:

                         0                                  1     2     3     4     5
0               Montserrat                            Manzini     6     6     5     6
1               Madagascar                           San Juan    10     4     9     8
2                 Botswana                             Tehran     2    10     9    10
3     Syrian Arab Republic                          Fairbanks     2     4     9     2
4                   Guinea                       Punta Arenas     5     1     6     3
  • Related