Home > Software engineering >  How to split csv data
How to split csv data

Time:01-17

I have a problem where I got a csv data like this:

AgeGroup           Where do you hear our company from?       How long have you using our platform?
18-24              Word of mouth; Supermarket Product        0-1 years
36-50              Social Media; Word of mouth               1-2 years
18-24              Advertisement                              4 years

and I tried to make the file into this format through either jupyter notebook or from excel csv:

AgeGroup           Where do you hear our company from?
18-24              Word of mouth                             0-1 years
18-24              Supermarket Product                       0-1 years
36-50              Social Media                              1-2 years
36-50              Word of mouth                             1-2 years
18-24              Advertisement                              4 years

Let's say the csv file is Untitled form.csv and I import the data to jupyter notebook:

data = pd.read_csv('Untitled form.csv')

Can anyone tell me how should I do it?

I have tried doing it in excel csv using data-column but of course, they only separate the data into column while what I wanted is the data is separated into a row while still pertain the data from other column

CodePudding user response:

Anyway... I found another way to do it which is more roundabout. First I edit the file through PowerSource excel and save it to different file... and then if utf-8 encoding appear... I just add encoding='cp1252'

So it would become like this:

import pandas as pd

data_split = pd.read_csv('Untitled form split.csv', 
                          skipinitialspace=True, 
                          usecols=range(1,7), 
                          encoding='cp1252') 

However if there's a more efficient way, please let me know. Thanks

CodePudding user response:

I'm not 100% sure about your question since I think it might be two separate issues but hopefully this should fix it.


import pandas as pd

data = pd.read_fwf('Untitled form.csv')
cols = data.columns
data_long = pd.DataFrame(columns=data.columns)

for idx, row in data.iterrows():
    hear_from = row['Where do you hear our company from?'].split(';')
    hear_from_fmt = list(map(lambda x: x.strip(), hear_from))
    n_items = len(hear_from_fmt)
    d = {
        cols[0] : [row[0]]*n_items,
        cols[1] : hear_from_fmt,
        cols[2] : [row[2]]*n_items,
    }
    data_long = pd.concat([data_long, pd.DataFrame(d)], ignore_index=True)

Let's brake it down.

This line data = pd.read_fwf('Untitled form.csv') reads the file inferring the spacing between columns. Now this is only useful because I am not sure your file is a proper CSV, if it is, you can open it normally, if not that this might help.

Now for the rest. We are iterating through each row and we are selecting the methods someone could have heard your company from. These are split using ; and then "stripped" to ensure there are no spaces. A new temp dataframe is created where first and last column are the same but you have as many rows as the number of elements in the hear_from_fmt list there are. The dataframes are then concatenated together.

Now there might be a more efficient solution, but this should work.

  • Related