I have a large table of data in pandas that could possibly be better represented as multiple relational tables. Is there a way to pick the columns to use in a new dataframe and to index the leftover columns to allow a join to recreate the original table?
For example, let's say we have a dataset where some columns are repeated across several rows:
import pandas as pd

data = {
    '1': ['A', 'Abe', 'Bucket', '2022-01-01', 1, 0],
    '2': ['A', 'Abe', 'Bucket', '2022-01-02', 0, 1],
    '3': ['A', 'Abe', 'Bucket', '2022-01-03', 3, 3],
    '4': ['A', 'Abe', 'Bucket with Hole', '2022-01-04', 3, 1],
    '5': ['B', 'Ben', 'Jug', '2022-01-01', 2, 1],
    '6': ['C', 'Cat', 'Jug', '2022-01-01', 1, 2]}
joined_df = pd.DataFrame.from_dict(data,
                                   orient='index',
                                   columns=['Sample', 'Author', 'Instrument', 'Date', 'Data1', 'Data2'])
joined_df
returns
Sample Author Instrument Date Data1 Data2
1 A Abe Bucket 2022-01-01 1 0
2 A Abe Bucket 2022-01-02 0 1
3 A Abe Bucket 2022-01-03 3 3
4 A Abe Bucket with Hole 2022-01-04 3 1
5 B Ben Jug 2022-01-01 2 1
6 C Cat Jug 2022-01-01 1 2
How can I go about splitting the data so that I have a "Samples" dataframe, an "Instrument" dataframe, and a "Data" dataframe:
Data:
Data_ID Sample_ID Instrument_ID Date Data1 Data2
d1 s1 i1 2022-01-01 1 0
d2 s1 i1 2022-01-02 0 1
d3 s1 i1 2022-01-03 3 3
d4 s1 i2 2022-01-04 3 1
d5 s2 i3 2022-01-01 2 1
d6 s3 i3 2022-01-01 1 2
and
Samples:
Sample_ID Sample Name
s1 A Abe
s2 B Ben
s3 C Cat
Instrument:
Instrument_ID Instrument
i1 Bucket
i2 Bucket with Hole
i3 Jug
Joining the Samples, Instrument, and Data dataframes back together on the ID columns should recreate the original table.
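To be concrete, assuming the three split tables above were called data_df, samples_df and instrument_df (placeholder names, they don't exist yet), the round trip I have in mind would look roughly like:
# data_df, samples_df and instrument_df are only placeholder names for the
# three tables sketched above
recreated = (data_df
             .merge(samples_df, on='Sample_ID')
             .merge(instrument_df, on='Instrument_ID'))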
CodePudding user response:
This works for the dummy dataset, but I'm not sure about the performance on a large one; it might need testing with Parquet and a "real" dataset. The steps are: first create each relational table and drop duplicates, then merge back onto the original dataset to match up the indexes:
sample_df = joined_df[['Sample','Author']].drop_duplicates().rename_axis('SampleID').reset_index()
instrument_df = joined_df[['Instrument']].drop_duplicates().rename_axis('InstrumentID').reset_index()
sample_df:
SampleID Sample Author
0 1 A Abe
1 5 B Ben
2 6 C Cat

instrument_df:
InstrumentID Instrument
0 1 Bucket
1 4 Bucket with Hole
2 5 Jug
df3 = joined_df.merge(instrument_df, on=["Instrument"])
df3 = df3.merge(sample_df, on=["Sample","Author"])
df3 = df3.drop(columns=['Sample','Author','Instrument'])
df3
Date Data1 Data2 InstrumentID SampleID
0 2022-01-01 1 0 1 1
1 2022-01-02 0 1 1 1
2 2022-01-03 3 3 1 1
3 2022-01-04 3 1 4 1
4 2022-01-01 2 1 5 5
5 2022-01-01 1 2 5 6
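A quick sanity check on the toy data is to join everything back and compare against the original (reordering the columns, since the merges change the column order):
# Join the ID tables back on and drop the surrogate keys again
rebuilt = (df3
           .merge(sample_df, on='SampleID')
           .merge(instrument_df, on='InstrumentID')
           .drop(columns=['SampleID', 'InstrumentID']))

# Restore column order and row index; should print True for this toy data
rebuilt = rebuilt[joined_df.columns].reset_index(drop=True)
print(rebuilt.equals(joined_df.reset_index(drop=True)))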
CodePudding user response:
EDIT: the original question was framed differently; see below.
Original answer
Steps 1 and 2 that you ask for can be achieved in the following way. Let us begin with step 2. For brevity I call your dataframe joined_df simply df, and another one df2. You want unique groups of Sample and Sampler, each with a unique id, and a dataframe containing those, so you could do:
mygroups = df.groupby(['Sample', 'Sampler']).size()
### give an id label instead of the group size metric
for i in range(len(mygroups)):
    mygroups.iloc[i] = i + 1
df2 = pd.DataFrame(mygroups)
df2 = (df2.reset_index(level=0)).reset_index(level=0)
df2 = df2.rename(columns={0: "Sample_ID"})
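As a side note, the same id table can be built more compactly with GroupBy.ngroup() (available in reasonably recent pandas versions), which labels each (Sample, Sampler) group directly:
# Alternative to the loop above: number the groups 0..n-1, then shift to 1..n
ids = df.groupby(['Sample', 'Sampler']).ngroup() + 1
df2 = df[['Sample', 'Sampler']].assign(Sample_ID=ids).drop_duplicates()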
Then let us go to step 1. You want to slice only what you call the sample dataframe and combine it with the new unique group id:
sample_dataframe = df.iloc[:,2:4]
sample_dataframe["Sample_ID"] = ""
## actual combination step
new_df = pd.merge(df, df2)
Original answer stop
EDIT: For the record, the original question was framed around the following data:
import pandas as pd

data = {'1': ['A', 'Abe', 1, 0],
        '2': ['A', 'Abe', 0, 1],
        '3': ['B', 'Ben', 3, 3],
        '4': ['B', 'Ben', 3, 1],
        '5': ['B', 'Cat', 2, 1],
        '6': ['C', 'Cat', 1, 2]}

df = pd.DataFrame.from_dict(data,
                            orient='index',
                            columns=['Sample', 'Sampler', 'Data1', 'Data2'])
If combined with my answer, this will yield solutions to steps 1 and 2 of the question, even after the changes, but without the final joining part. That part can likely be achieved by a similar grouping and merge; you really just need to go through the steps backwards, as sketched below.
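A rough sketch of that backwards step, using the df2 and new_df built in the original answer above (untested beyond this toy data):
# Split off the pure data table, then join df2 back on Sample_ID
data_table = new_df.drop(columns=['Sample', 'Sampler'])
recovered = data_table.merge(df2, on='Sample_ID')[df.columns]
# recovered should match df again, up to row order and index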