Home > OS >  Pandas read_csv() delimiter used in a text
Pandas read_csv() delimiter used in a text

Time:09-29

I have a CSV file with ";" as separator made like this:

col1;col2;col3;col4 
4;hello;world;1;1
4;hi;1;1
4;hi;1;1
4;hi;1;1

Obviously by using ";" as sep it gives me the error about tokenizing data (obviously from the header less columns are expected), how can i obtain a dataframe like this:

col1 col2        col3 col4 
4    hello;world 1    1
4    hi          1    1
4    hi          1    1
4    hi          1    1

It could be read even with others packages and other data type (even if i prefer pandas because of the following operations in the code)

CodePudding user response:

import re

pattern = r"(?<=;)(?P<second_column>[\w] ;[\w] )(?=;)"

with open("bad_csv.csv") as file:
    text = file.readlines()

for i in range(len(text)):
    if text[i].count(';') == 4:
        text[i] = re.sub(pattern, '"'   r"\g<second_column>"   '"', text[i], 1)

with open("good_csv.csv", "w") as file:
    for row in text:
        file.writelines(row)


df = pd.read_csv("good_csv.csv", sep=';')

CodePudding user response:

You could split off the outer cols until you are left with the remaining col2. This could be done in Pandas as follows:

import pandas as pd

df_raw = pd.read_csv("input.csv", delimiter=None, header=None, skiprows=1)

df_raw[['col1', 'rest']] = df_raw[0].str.split(";", n=1, expand=True)
df_raw[['col2', 'col3', 'col4']] = df_raw.rest.str.rsplit(";", n=2, expand=True)
df =  df_raw[['col1', 'col2', 'col3', 'col4']]

print(df)

Giving df as:

  col1         col2 col3 col4
0    4  hello;world    1    1
1    4           hi    1    1
2    4           hi    1    1
3    4           hi    1    1

This uses n=1 to first split out just col1 using the ; delimiter. It then takes the left over rest and uses a reverse split with n=2 to split out the last two columns, with anything remaining being col2.

This assume that only col2 can have additional ; separators and the last two are fixed.

  • Related