Loop to split a large dataframe to small dataframes-CodePudding

DF is a dataframe with columns [A1,B1,C1,A2,B2,C2,A3,B3,C3]

I want to split that 'DF' data frame into small dataframes DF1,DF2,DF3

DF1 to have [A1,B1,C1] as columns

DF2 to have [A2,B2,C2] as columns

DF3 to have [A3,B3,C3] as columns

The number in the name of the dataframe DF'3' should match with its columns [A'3',B'3',C'3']

i tried

for i in range(1,4):
    'DF{}'.format(i)=DF[DF['A{}'.format(i),'B{}'.format(i),'C{}'.format(i)]]

Getting the error

SyntaxError: cannot assign to function call

Is it possible to do this in a single loop?

CodePudding user response：

The error message stems from trying to assign a dataframe to the string format function, instead of a variable.

Dynamically creating variables from DF1 to DFN for N numbers can be a bit tricky. It is easy to create key-item pairs in dicts though. Try the following:

dfs = {}
for i in range(1,4):
    dfs["DF{}".format(i)] = DF[["A{}".format(i), "B{}".format(i), "C{}".format(i)]]

Instead of getting DF1, DF2 and DF3 variables, you get dfs["DF1"], dfs["DF2"], and dfs["DF3"]

CodePudding user response：

You can't dynamically change an object's name.
You can use a list comprehension with explicit definition of the dfs:

df1,df2,df3=[df[['A{}'.format(i),'B{}'.format(i),'C{}'.format(i)]] for i in range(1,4)]

Update based on ViettelSolutions' comment
Here is a more concise way of doing that: df1,df2,df3=[df[[f'A{i}',f'B{i}','C{i}']] for i in range(1,4)]

You can also use a list instead of explicitly name the dfs, and unpack them when needed.

n=4 # Define the number of dfs
dfs=[df[['A{}'.format(i),'B{}'.format(i),'C{}'.format(i)]] for i in range(1,n)]

CodePudding user response：

You could make it completely configurable:

def split_dataframe(df, letters, numbers):
    return [df[[f'{letter}{number}' for letter in letters]] for number in numbers]


letters = ("A","B","C")
numbers = range(1,4)

df1, df2, df3 = split_dataframe(df, letters, numbers)

You can make it the function even more general as follows:

import re

letters_pattern = re.compile("^\D ")
numbers_pattern = re.compile("\d $")

def split_dataframe(df):
    letters = sorted(set(letters_pattern.findall(x)[0] for x in df.columns))
    numbers = sorted(set(numbers_pattern.findall(x)[0] for x in df.columns))
    return [df[[x for x in [f'{letter}{number}' for letter in letters] if x in df.columns]] for number in numbers]

This method has 2 advantages:

you don't need to provide the letters and the numbers in advance, the method will discover what is available in the header and proceed
it will manage "irregular" situations - when, for example, d1 exists but d2 doesn't

To give a concrete example:

df = pd.DataFrame({"A1":[1,2], "B1":[2,3], "C1":[3,4], "D1":[4,5], "A2":[2,3], "B2":[10,11], "C2":[12,13]})

for sub_df in split_dataframe(df):
    print(sub_df)

OUTPUT

   A1  B1  C1  D1
0   1   2   3   4
1   2   3   4   5
   A2  B2  C2
0   2  10  12
1   3  11  13

The columns names discovery process could be set as optional if you pass letters and numbers you only want to consider, as follows:

def split_dataframe(df, letters=None, numbers=None):
    letters = sorted(set(letters_pattern.findall(x)[0] for x in df.columns)) if letters is None else letters
    numbers = sorted(set(numbers_pattern.findall(x)[0] for x in df.columns)) if numbers is None else numbers
    return [df[[x for x in [f'{letter}{number}' for letter in letters] if x in df.columns]] for number in numbers]

for sub_df in split_dataframe(df, letters=("B","C"), numbers=[1,2]):
    print(sub_df)

OUTPUT