I have a large table of data in pandas that could possibly be better represented as multiple relational tables. Is there a way to pick the columns to use in a new dataframe and to index the leftover columns to allow a join to recreate the original table?
For example, let's say we have a dataset where some columns are repeated across several rows:
import pandas as pd

data = {
    '1': ['A', 'Abe', 'Bucket', '2022-01-01', 1, 0],
    '2': ['A', 'Abe', 'Bucket', '2022-01-02', 0, 1],
    '3': ['A', 'Abe', 'Bucket', '2022-01-03', 3, 3],
    '4': ['A', 'Abe', 'Bucket with Hole', '2022-01-04', 3, 1],
    '5': ['B', 'Ben', 'Jug', '2022-01-01', 2, 1],
    '6': ['C', 'Cat', 'Jug', '2022-01-01', 1, 2]}
joined_df = pd.DataFrame.from_dict(data,
                                   orient='index',
                                   columns=['Sample', 'Author', 'Instrument', 'Date', 'Data1', 'Data2'])
joined_df
returns
Sample Author Instrument Date Data1 Data2
1 A Abe Bucket 2022-01-01 1 0
2 A Abe Bucket 2022-01-02 0 1
3 A Abe Bucket 2022-01-03 3 3
4 A Abe Bucket with Hole 2022-01-04 3 1
5 B Ben Jug 2022-01-01 2 1
6 C Cat Jug 2022-01-01 1 2
How can I go about splitting the data so that I have a "Samples" dataframe, an "Instrument" dataframe, and a "Data" dataframe:
Data:
Data_ID Sample_ID Instrument_ID Date Data1 Data2
d1 s1 i1 2022-01-01 1 0
d2 s1 i1 2022-01-02 0 1
d3 s1 i1 2022-01-03 3 3
d4 s1 i2 2022-01-04 3 1
d5 s2 i3 2022-01-01 2 1
d6 s3 i3 2022-01-01 1 2
and
Samples:
Sample_ID Sample Name
s1 A Abe
s2 B Ben
s3 C Cat
Instrument:
Instrument_ID Instrument
i1 Bucket
i2 Bucket with Hole
i3 Jug
Joining the Samples, Instrument, and Data dataframes back together on the ID columns should recreate the original table.
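To be concrete, assuming the three split tables above were called data_df, samples_df and instrument_df (placeholder names, they don't exist yet), the round trip I have in mind would look roughly like:
# data_df, samples_df and instrument_df are only placeholder names for the
# three tables sketched above
recreated = (data_df
             .merge(samples_df, on='Sample_ID')
             .merge(instrument_df, on='Instrument_ID'))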
CodePudding user response:
This works for the dummy dataset, but I'm not sure about the performance on a large one; it might need testing with Parquet and a "real" dataset. The steps are: first create each relational table and drop duplicates, then merge back onto the original dataset to match up the indexes:
sample_df = joined_df[['Sample','Author']].drop_duplicates().rename_axis('SampleID').reset_index()
instrument_df = joined_df[['Instrument']].drop_duplicates().rename_axis('InstrumentID').reset_index()
sample_df:
SampleID Sample Author
0 1 A Abe
1 5 B Ben
2 6 C Cat

instrument_df:
InstrumentID Instrument
0 1 Bucket
1 4 Bucket with Hole
2 5 Jug
df3 = joined_df.merge(instrument_df, on=["Instrument"])
df3 = df3.merge(sample_df, on=["Sample","Author"])
df3 = df3.drop(columns=['Sample','Author','Instrument'])
df3
Date Data1 Data2 InstrumentID SampleID
0 2022-01-01 1 0 1 1
1 2022-01-02 0 1 1 1
2 2022-01-03 3 3 1 1
3 2022-01-04 3 1 4 1
4 2022-01-01 2 1 5 5
5 2022-01-01 1 2 5 6
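A quick sanity check on the toy data is to join everything back and compare against the original (reordering the columns, since the merges change the column order):
# Join the ID tables back on and drop the surrogate keys again
rebuilt = (df3
           .merge(sample_df, on='SampleID')
           .merge(instrument_df, on='InstrumentID')
           .drop(columns=['SampleID', 'InstrumentID']))

# Restore column order and row index; should print True for this toy data
rebuilt = rebuilt[joined_df.columns].reset_index(drop=True)
print(rebuilt.equals(joined_df.reset_index(drop=True)))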
CodePudding user response:
EDIT: the original question was framed differently; see below.
Original answer
Steps 1 and 2 that you ask for can be achieved in the following way. Let us begin with step 2. For brevity I call your dataframe joined_df simply df, and another one df2. You want unique groups of Sample and Sampler, each with a unique id, and a dataframe containing those, so you could do:
mygroups = df.groupby(['Sample', 'Sampler']).size()
### give an id label instead of the group size metric
for i in range(len(mygroups)):
    mygroups.iloc[i] = i + 1
df2 = pd.DataFrame(mygroups)
df2 = (df2.reset_index(level=0)).reset_index(level=0)
df2 = df2.rename(columns={0: "Sample_ID"})
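As a side note, the same id table can be built more compactly with GroupBy.ngroup() (available in reasonably recent pandas versions), which labels each (Sample, Sampler) group directly:
# Alternative to the loop above: number the groups 0..n-1, then shift to 1..n
ids = df.groupby(['Sample', 'Sampler']).ngroup() + 1
df2 = df[['Sample', 'Sampler']].assign(Sample_ID=ids).drop_duplicates()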
Then let us go to step 1. You want to slice only what you call the sample dataframe and combine it with the new unique group id:
sample_dataframe = df.iloc[:,2:4]
sample_dataframe["Sample_ID"] = ""
## actual combination step
new_df = pd.merge(df, df2)
Original answer stop
EDIT: For the record, the original question was framed around the following data:
import pandas as pd

data = {'1': ['A', 'Abe', 1, 0],
        '2': ['A', 'Abe', 0, 1],
        '3': ['B', 'Ben', 3, 3],
        '4': ['B', 'Ben', 3, 1],
        '5': ['B', 'Cat', 2, 1],
        '6': ['C', 'Cat', 1, 2]}

df = pd.DataFrame.from_dict(data,
                            orient='index',
                            columns=['Sample', 'Sampler', 'Data1', 'Data2'])
If combined with my answer, this will yield solutions to steps 1 and 2 of the question, even after the changes, but without the final joining part. That part can likely be achieved by a similar grouping and merge; you really just need to go through the steps backwards, as sketched below.
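A rough sketch of that backwards step, using the df2 and new_df built in the original answer above (untested beyond this toy data):
# Split off the pure data table, then join df2 back on Sample_ID
data_table = new_df.drop(columns=['Sample', 'Sampler'])
recovered = data_table.merge(df2, on='Sample_ID')[df.columns]
# recovered should match df again, up to row order and index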