Home > OS >  How to extract multiple datasets from a larger dataset
How to extract multiple datasets from a larger dataset

Time:02-21

In python I am trying to prepare my data set. I want to train an LSTM, and will do so by training the model on all of the data grouped by credit card number and ordered by time.

For my test set, however, I want to create a set of sequences, that is a sequences of transactions for each client credit card number. So then with this I can test the model for individual clients, and not just random transactions.

To do this, I first extracted all credit card numbers from the credit card number column and added them to a list.

cc_list = df['cc_num'].tolist()

After I then got rid of any duplicates in the list, so that I was left with a list of all possible distinct accounts:

all_accounts = list(set(cc_list))

Then I tried to to create a new Dataframe for each account number in the list (which is approximately 900)

for acc in all_accounts:
y = df.loc[df['cc_num'] == acc]\
    .sort_values(by=['trans_date_trans_time'],ascending=True)
print(y)

However, this only gives me a Dataframe for the last account number, and not all.

I have also tried to define a function

accounts = df['cc_num'].unique()

def grouping_ccn(df):
    subset_values = []
    for acc in accounts:
        x = df.loc[df['cc_num'] == acc]\
            .sort_values(by=['trans_date_trans_time'],ascending=True)
        subset_values.append(x)
return np.array(subset_values)
groups_cc = grouping_ccn(df)

but this also didn't work, it gives an error saying: "ValueError: cannot copy sequence with size 2028 to array axis with dimension 22". I see why I got this error, but I just don't know how to fix this.

  • How do I properly extract a Dataframe for each individual credit card number?

I am a complete noob at this, so any help would be greatly appreciated.

CodePudding user response:

Standard rule: if you work with list (or dict) then you need list (or dict) to keep all results.


If you want every account keep in separated dataframe then you should append them to list and later use for-loop to work with every dataframe separatelly (one-by-one).

Or you could add to dictionary as account: data and later it could be simpler get dataframe only for selected account.

To get unique accounts you can use unique() instead of set().

But it could be simpler to use groupby() and later use loop to put every group in list or directory.

If you want to convert list with all dataframes back to one dictionary then you can use concat().


Minimal working example with some data

import pandas as pd

data = {
    'account': ['A','B','A','C','B'], 
    'time': ['2022.02.20','2022.02.20','2022.02.19','2022.02.20','2022.02.10'], 
}

main_df = pd.DataFrame(data)

print('\n--- main df ---\n')
print(main_df)

print('\n--- unique accounts ---\n')
print(main_df['account'].unique())

print('\n--- groups ---\n')

sorted_groups = main_df.sort_values(['time']).groupby('account')
all_dfs_list = []
all_dfs_dict = {}
for value, group in sorted_groups:
    print('-', value, '-')
    #print(group)
    print(group.reset_index(drop=True))
    all_dfs_list.append(group.reset_index(drop=True))
    all_dfs_dict[value] = group.reset_index(drop=True)

print('\n--- list ---\n')
for df in all_dfs_list:
    print(df)
    print('-')

print('\n--- dict ---\n')
for account, df in all_dfs_dict.items():
    print('-', account, '-')
    print(df)
    print('-')

print('\n--- concat groups ---\n')
print(pd.concat(all_dfs_list).reset_index(drop=True))

print('\n--- sort df ---\n')
print(main_df.sort_values(['account', 'time']).reset_index(drop=True))

Result:

--- main df ---

  account        time
0       A  2022.02.20
1       B  2022.02.20
2       A  2022.02.19
3       C  2022.02.20
4       B  2022.02.10

--- unique accounts ---

['A' 'B' 'C']

--- groups ---

- A -
  account        time
0       A  2022.02.19
1       A  2022.02.20
- B -
  account        time
0       B  2022.02.10
1       B  2022.02.20
- C -
  account        time
0       C  2022.02.20

--- list ---

  account        time
0       A  2022.02.19
1       A  2022.02.20
-
  account        time
0       B  2022.02.10
1       B  2022.02.20
-
  account        time
0       C  2022.02.20
-

--- dict ---

- A -
  account        time
0       A  2022.02.19
1       A  2022.02.20
-
- B -
  account        time
0       B  2022.02.10
1       B  2022.02.20
-
- C -
  account        time
0       C  2022.02.20
-

--- concat groups ---

  account        time
0       A  2022.02.19
1       A  2022.02.20
2       B  2022.02.10
3       B  2022.02.20
4       C  2022.02.20

--- sort df ---

  account        time
0       A  2022.02.19
1       A  2022.02.20
2       B  2022.02.10
3       B  2022.02.20
4       C  2022.02.20

Doc:

  • Related