I have a CSV file with ";" as separator made like this:
col1;col2;col3;col4
4;hello;world;1;1
4;hi;1;1
4;hi;1;1
4;hi;1;1
Obviously by using ";" as sep it gives me the error about tokenizing data (obviously from the header less columns are expected), how can i obtain a dataframe like this:
col1 col2 col3 col4
4 hello;world 1 1
4 hi 1 1
4 hi 1 1
4 hi 1 1
It could be read even with others packages and other data type (even if i prefer pandas because of the following operations in the code)
CodePudding user response:
import re
pattern = r"(?<=;)(?P<second_column>[\w] ;[\w] )(?=;)"
with open("bad_csv.csv") as file:
text = file.readlines()
for i in range(len(text)):
if text[i].count(';') == 4:
text[i] = re.sub(pattern, '"' r"\g<second_column>" '"', text[i], 1)
with open("good_csv.csv", "w") as file:
for row in text:
file.writelines(row)
df = pd.read_csv("good_csv.csv", sep=';')
CodePudding user response:
You could split off the outer cols until you are left with the remaining col2
. This could be done in Pandas as follows:
import pandas as pd
df_raw = pd.read_csv("input.csv", delimiter=None, header=None, skiprows=1)
df_raw[['col1', 'rest']] = df_raw[0].str.split(";", n=1, expand=True)
df_raw[['col2', 'col3', 'col4']] = df_raw.rest.str.rsplit(";", n=2, expand=True)
df = df_raw[['col1', 'col2', 'col3', 'col4']]
print(df)
Giving df
as:
col1 col2 col3 col4
0 4 hello;world 1 1
1 4 hi 1 1
2 4 hi 1 1
3 4 hi 1 1
This uses n=1
to first split out just col1
using the ;
delimiter. It then takes the left over rest
and uses a reverse split with n=2
to split out the last two columns, with anything remaining being col2
.
This assume that only col2
can have additional ;
separators and the last two are fixed.