Home > Back-end >  How to Split a huge csv into multiple csv's based on column header
How to Split a huge csv into multiple csv's based on column header

Time:09-17

I'll try to represent my problem, basing on simple example below. I have below main csv and I am trying to split into 2 or more csv basing on column header, keeping the unique column id in intact in every csv file.

Below is the code I am trying to figure out, but not quite getting the result.

import pandas as pd

df = pd.read_csv('abc.csv')
df[['id','name','age']] = df['csv1'].str.split(' ', expand=True)




csv
       id     name      age    color  Gender
0     101     Jack      23     white    M
1     102     Mary      25     black    F
2     103     Tom       24     brown    M

Output required

csv1

       id     name    age
0     101     Jack    23
1     102     Mary    25
2     103     Tom     24 


csv2 -

       id   color  Gender
0     101   white    M
1     102   black    F
2     103   brown    M

CodePudding user response:

UPDATE
I found a better apporach with np.array_split

I used this example df:

               x     y          R         TR       x_c      y_c     xxx    yyy         RRR        TTTR      xxx_c     yyy_c
id                                                                                                                         
1256780.0  13989  6241   6.689222  20.986341  14050.83  6315.33  213989  36241   46.689222  520.986341  614050.83  76315.33
12000.0    14013  6278  53.152036   0.000000  14060.00  6288.00  214013  36278  453.152036    5.000000  614060.00  76288.00
1100.0     14111  6379  87.598357   5.000000  14070.55  7000.00  214111  36379  487.598357   55.000000  614070.55  76288.00 

which has 12 columns.

# the 4 means, split the df into 4 evenly sized chunks
chunks = np.array_split(df,4, axis=1)

Chunks is a list containing all seperate dataframes. Output:

# chunks[0]
                 x       y          R
id                                   
1256780.0  13989.0  6241.0   6.689222
12000.0    14013.0  6278.0  53.152036
1100.0     14111.0  6379.0  87.598357

# chunks[1]
                  TR       x_c      y_c
id                                     
1256780.0  20.986341  14050.83  6315.33
12000.0     0.000000  14060.00  6288.00
1100.0      5.000000  14070.55  7000.00

# chunks[2]
                xxx      yyy         RRR
id                                      
1256780.0  213989.0  36241.0   46.689222
12000.0    214013.0  36278.0  453.152036
1100.0     214111.0  36379.0  487.598357

# chunks[3]
                 TTTR      xxx_c     yyy_c
id                                        
1256780.0  520.986341  614050.83  76315.33
12000.0      5.000000  614060.00  76288.00
1100.0      55.000000  614070.55  76288.00

Old answer:

You could calculate half of the columns and then use iloc to split the dataframes into two parts.

df = df.set_index('id')
half = len(df.columns)//2
df1, df2 = df.iloc[:,:half], df.iloc[:,half:]
df1 = df1.reset_index()
df2 = df2.reset_index()

Output:

#df1
    id  name  age
0  101  Jack   23
1  102  Mary   25
2  103   Tom   24

#df2
    id  color Gender
0  101  white      M
1  102  black      F
2  103  brown      M

  • Related