I am trying to unwrap?, explode?, a data frame with several columns into a new data frame with rows-CodePudding

I apologize for not knowing the correct terminology, but I am looking for a way in Pandas to transform a data frame with several similar columns into a data frame with rows that explode? into more rows. Basically for every column that starts with Line.{x}, I want to create a new row that has all the Line.{x} columns. Same for all columns with values in {x}, e.g. 1,2,3.

Here is an example of a data frame I'd like to convert from:

                    Column1     Column2     Column3     Column4      Line.0.a       Line.0.b    Line.0.c     Line.1.a       Line.1.b    Line.1.c     Line.2.a       Line.2.b    Line.2.c      Line.3.a       Line.3.b       Line.3.c
        0           the         quick       brown       dog          100               200           300        400             500          600       700              800        900           1000           1100          1200
        1           you         see         spot        run          101               201           301        401             501          601
        2           four        score       and         seven        102               202           302

I would like to convert it to this:

        Column1     Column2   Column3      Column4         Line.a           Line.b      Line.c
0          the       quick    brown        dog             100              200         300             
1          the       quick    brown        dog             400              500         600             
2          the       quick    brown        dog             700              800         900
3          the       quick    brown        dog            1000             1100        1200
4          you       see      spot         run             101              201         301 
5          you       see      spot         run             401              501         601
6          four      score    and          seven           102              202         302

Thank you in advance!

CodePudding user response：

One option is with pivot_longer from pyjanitor, where for this particular use case, you pass .value placeholder to names_to, to keep track of the parts of the column you want to retain as headers; you then pass a regular expression with matching groups to names_pattern:

# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(index='Col*', 
                names_to = (".value", ".value"), 
                names_pattern = r"(. )\.\d (. )") 
   Column1 Column2 Column3 Column4  Line.a  Line.b  Line.c
0      the   quick   brown     dog   100.0   200.0   300.0
1      you     see    spot     run   101.0   201.0   301.0
2     four   score     and   seven   102.0   202.0   302.0
3      the   quick   brown     dog   400.0   500.0   600.0
4      you     see    spot     run   401.0   501.0   601.0
5     four   score     and   seven     NaN     NaN     NaN
6      the   quick   brown     dog   700.0   800.0   900.0
7      you     see    spot     run     NaN     NaN     NaN
8     four   score     and   seven     NaN     NaN     NaN
9      the   quick   brown     dog  1000.0  1100.0  1200.0
10     you     see    spot     run     NaN     NaN     NaN
11    four   score     and   seven     NaN     NaN     NaN

You can get rid of the nulls with dropna:

(df
.pivot_longer(
    index='Col*', 
    names_to = (".value", ".value"), 
    names_pattern = r"(. )\.\d (. )")
.dropna()
)

  Column1 Column2 Column3 Column4  Line.a  Line.b  Line.c
0     the   quick   brown     dog   100.0   200.0   300.0
1     you     see    spot     run   101.0   201.0   301.0
2    four   score     and   seven   102.0   202.0   302.0
3     the   quick   brown     dog   400.0   500.0   600.0
4     you     see    spot     run   401.0   501.0   601.0
6     the   quick   brown     dog   700.0   800.0   900.0
9     the   quick   brown     dog  1000.0  1100.0  1200.0

Another option, as pointed out by @Mozway, is to convert the columns into a MultiIndex, and stack:

temp = df.set_index(['Column1', 'Column2', 'Column3','Column4'])
# this is where the MultiIndex is created
cols = temp.columns.str.split('.', expand=True)
temp.columns = cols
# now we stack
# nulls are dropped by default
temp = temp.stack(level=1).droplevel(-1)
temp.columns = temp.columns.map('.'.join)
temp.reset_index()

  Column1 Column2 Column3 Column4  Line.a  Line.b  Line.c
0     the   quick   brown     dog   100.0   200.0   300.0
1     the   quick   brown     dog   400.0   500.0   600.0
2     the   quick   brown     dog   700.0   800.0   900.0
3     the   quick   brown     dog  1000.0  1100.0  1200.0
4     you     see    spot     run   101.0   201.0   301.0
5     you     see    spot     run   401.0   501.0   601.0
6    four   score     and   seven   102.0   202.0   302.0

CodePudding user response：

Here is an approach that works. It uses melt and then joins. new_df contains what you need. the order of items might be different though. The fuction takes 3 parameters. The first is your data frame. Second is keys that remain static and third is convertion dict that tells what goes where.

import pandas as pd

def vars_to_cases(df:pd.DataFrame,keys:list,convertion_dict:dict):
    vals = list(convertion_dict.values())
    l = len(vals[0])
    if not all(len(item) == l for item in vals):
        raise Exception("Dictionary values don't have the same length")
    tempkeys = keys.copy()
    tempkeys.append("variable")
    df_data = pd.DataFrame()
    for short_name, my_list in convertion_dict.items():
        my_replace_dict = {}
        for count, item, in enumerate(my_list):
            my_replace_dict[item] = count

        mydf = pd.melt(df, id_vars=tempkeys[:-1], value_vars=my_list)

        mydf["variable"].replace(my_replace_dict, inplace=True)
        mydf.rename(columns={"value": short_name}, inplace=True)
        mydf = mydf.set_index(tempkeys)
        if df_data.empty:
            df_data = mydf.copy()
        else:
            df_data = df_data.join(mydf)
    return df_data

#here is the data
df=pd.DataFrame({'Column1': {0: 'the', 1: 'you', 2: 'four'},
             'Column2': {0: 'quick', 1: 'see', 2: 'score'},
             'Column3': {0: 'brown', 1: 'spot', 2: 'and'},
             'Column4': {0: 'dog', 1: 'run', 2: 'seven'},
             'Line.0.a': {0: 100, 1: 101, 2: 102},
             'Line.0.b': {0: 200, 1: 201, 2: 202},
             'Line.0.c': {0: 300, 1: 301, 2: 302},
             'Line.1.a': {0: 400.0, 1: 401.0, 2: None},
             'Line.1.b': {0: 500.0, 1: 501.0, 2: None},
             'Line.1.c': {0: 600.0, 1: 601.0, 2: None},
             'Line.2.a': {0: 700.0, 1: None, 2: None},
             'Line.2.b': {0: 800.0, 1: None, 2: None},
             'Line.2.c': {0: 900.0, 1: None, 2: None},
             'Line.3.a': {0: 1000.0, 1: None, 2: None},
             'Line.3.b': {0: 1100.0, 1: None, 2: None},
             'Line.3.c': {0: 1200.0, 1: None, 2: None}})



convertion_dict={"Line.a":["Line.0.a","Line.1.a","Line.2.a","Line.3.a"],
                 "Line.b":["Line.0.b","Line.1.b","Line.2.b","Line.3.b"],
                 "Line.c":["Line.0.c","Line.1.c","Line.2.c","Line.3.c"]}

keys=["Column1","Column2","Column3","Column4"]

new_df=vars_to_cases(df,keys,convertion_dict)
new_df=new_df.reset_index()
new_df=new_df.dropna()
new_df=new_df.drop(columns="variable")