Splitting a large pandas data file based on the data in one column

Time:05-05

I have a large-ish CSV file that I want to split into separate data files, based on the data in one of the columns, so that all related data can be analyzed together.

i.e. [name, color, number, state;
    bob, green, 21, TX;
    joe, red, 33, TX;
    sue, blue, 22, NY;
 ....]

I'd like to have it put each state's worth of data into its own data sub-file:

df[1] = [bob, green, 21, TX] [joe, red, 33, TX]

df[2] = [sue, blue, 22, NY]

Pandas seems like the best option for this, as the given CSV file is only about 500 lines long.

CodePudding user response:

You could try something like:

import pandas as pd

for state, df in pd.read_csv("file.csv").groupby("state"):
    df.to_csv(f"file_{state}.csv", index=False)

Here file.csv is your base file. If it looks like

name,color,number,state
bob,green,21,TX
joe,red,33,TX
sue,blue,22,NY

the output would be 2 files:

file_TX.csv:

name,color,number,state
bob,green,21,TX
joe,red,33,TX

file_NY.csv:

name,color,number,state
sue,blue,22,NY
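If you would rather keep the groups in memory instead of writing files (closer to the df[1] / df[2] sketch in the question), the same groupby can fill a dict keyed by state. A minimal sketch (the dfs name and the inline sample data are just for illustration):

```python
import pandas as pd

# Sample data standing in for pd.read_csv("file.csv")
df = pd.DataFrame({
    "name": ["bob", "joe", "sue"],
    "color": ["green", "red", "blue"],
    "number": [21, 33, 22],
    "state": ["TX", "TX", "NY"],
})

# One DataFrame per state, keyed by the state code
dfs = {state: group.reset_index(drop=True) for state, group in df.groupby("state")}

print(dfs["TX"])  # rows for bob and joe
print(dfs["NY"])  # row for sue
```

Each value in dfs is an ordinary DataFrame, so you can analyze one state's data with the usual pandas API.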

CodePudding user response:

There are different methods for reading CSV files. You may find an overview of them at the following link: (https://www.analyticsvidhya.com/blog/2021/08/python-tutorial-working-with-csv-file-for-data-science/)

Since you want to work with a dataframe, using pandas is indeed a practical choice. To start, you may do:

import pandas as pd
df = pd.read_csv(r"file_path")

Now let's assume after these lines, you have the following dataframe:

name  color  number  state
bob   green  21      TX
joe   red    33      TX
sue   blue   22      NY
...   ...    ...     ...

From your question, I understand that you want to dissect the information by state. The state data may be mixed (e.g. TX-NY-TX-DZ-TX). So sorting alphabetically and resetting the index may be a first step:

df.sort_values(by=['state'], inplace=True)  # sort_values does not modify df unless assigned or inplace
df.reset_index(drop=True, inplace=True)

Now, there are several methods we may use. From your question, I did not fully understand the df[1] = two lists, df[2] = one list notation. I am assuming you meant df[n] as a list of lists for a state. In that case, let's use the following method:

Method 1- Making List of Lists for Different States

First, let's get state list without duplicates:

s_list = list(dict.fromkeys(df.loc[:,"state"].tolist()))

Now we can use a list comprehension:

lol = [[df.iloc[i2, :].tolist() for i2 in range(df.shape[0])
        if state == df.loc[i2, "state"]] for state in s_list]

The lol (list of lists) variable is a list containing one inner list per state. Each inner list holds that state's rows, themselves stored as lists. So you may reach a state's rows by writing lol[0], lol[1], etc.
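For comparison, the same list of lists can be built more directly with groupby. This is an alternative sketch, not the answerer's code; the inline sample data is just for illustration:

```python
import pandas as pd

# Sample data standing in for the CSV contents
df = pd.DataFrame({
    "name": ["bob", "joe", "sue"],
    "color": ["green", "red", "blue"],
    "number": [21, 33, 22],
    "state": ["TX", "TX", "NY"],
})

# One list of row-lists per state, in order of first appearance
lol = [group.values.tolist() for _, group in df.groupby("state", sort=False)]

# lol[0] holds the TX rows, lol[1] the NY row
```

With sort=False the states keep their order of first appearance, so no prior sort/reset step is needed.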

Method 2- Making Different Dataframes for Different States

In this method, if there are 20 states, we need to get 20 dataframes. And we may combine dataframes in a list. First, we need state names again:

s_list = list(dict.fromkeys(df.loc[:,"state"].tolist()))

We need to get the row index values (as a list of lists) for the different states. (For example, NY may be in rows 3, 6, 7, ...)

r_index = [[i for i in range(df.shape[0])
            if df.loc[i, "state"] == state] for state in s_list]

Let's make different dataframes for different states: (and reset index)

dfs = [df.loc[rows, :].reset_index(drop=True) for rows in r_index]

Now you have a list which contains n (the number of states) dataframes. After this point, you may, for example, sort each dataframe by name.
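The index-collecting loop above can be collapsed with groupby into one line; a sketch that produces the same list of per-state DataFrames, with each sub-frame additionally sorted by name (the sample data is just for illustration):

```python
import pandas as pd

# Sample data standing in for the CSV contents
df = pd.DataFrame({
    "name": ["joe", "bob", "sue"],
    "color": ["red", "green", "blue"],
    "number": [33, 21, 22],
    "state": ["TX", "TX", "NY"],
})

# One DataFrame per state (alphabetical by state), each sorted by name
dfs = [g.sort_values("name").reset_index(drop=True)
       for _, g in df.groupby("state")]
```

groupby sorts the group keys alphabetically by default, so dfs[0] here holds the NY rows and dfs[1] the TX rows.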

Method 3 - My Recommendation

Firstly, I would recommend splitting the data by name, since it is a good identifier. But I am assuming you need to use the state information. I would add the state column as an index and make a nested dictionary:

import pandas as pd
df = pd.read_csv(r"path")
df = df.sort_values(by=['state'])
df.reset_index(drop = True, inplace = True) 
# we know state is in column 3
states = list(dict.fromkeys(df.iloc[:,3].tolist()))
rows = [[i for i in range(df.shape[0]) if df.iloc[i,3]==s] for s in states]
temp = [[i2 for i2 in range(len(rows[i]))] for i in range(len(rows))]
into = [inner for outer in temp for inner in outer]
df.insert(4, "No", into)
df.set_index(pd.MultiIndex.from_arrays([df.iloc[:,no] for no in [3,4]]),inplace=True)
df.drop(df.columns[[3,4]], axis=1, inplace=True)
dfs = [df.iloc[row,:] for row in rows]

for i in range(len(dfs)):
    dfs[i] = dfs[i].melt(var_name="app", ignore_index=False).set_index("app", append=True)

def call(df):
    if df.index.nlevels == 1: return df.to_dict()[df.columns[0]]
    return {key: call(df_gr.droplevel(0, axis=0)) for key, df_gr in df.groupby(level=0)}

data = {}
for i in range(len(states)): data.update(call(dfs[i]))

I may have made some typos, but I hope you understand the idea.

This code gives a nested dictionary such as:

  • the first key is the state (TX, NY, ...)
  • the next key is the within-state index (0, 1, 2, ...)
  • the next key is name, color, or number

Looking back at the number column in the CSV file: if the number column has no duplicates, you could use it directly and avoid creating the new index column.
