I apologize for not knowing the correct terminology, but I am looking for a way in Pandas to transform a data frame with several similar columns into one where those columns "explode" into additional rows. Basically, for every column that starts with Line.{x}, I want to create a new row holding all of the Line.{x} columns, and do the same for each value of {x}, e.g. 0, 1, 2, 3.
Here is an example of a data frame I'd like to convert from:
  Column1 Column2 Column3 Column4  Line.0.a  Line.0.b  Line.0.c  Line.1.a  Line.1.b  Line.1.c  Line.2.a  Line.2.b  Line.2.c  Line.3.a  Line.3.b  Line.3.c
0     the   quick   brown     dog       100       200       300     400.0     500.0     600.0     700.0     800.0     900.0    1000.0    1100.0    1200.0
1     you     see    spot     run       101       201       301     401.0     501.0     601.0       NaN       NaN       NaN       NaN       NaN       NaN
2    four   score     and   seven       102       202       302       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
I would like to convert it to this:
  Column1 Column2 Column3 Column4  Line.a  Line.b  Line.c
0     the   quick   brown     dog     100     200     300
1     the   quick   brown     dog     400     500     600
2     the   quick   brown     dog     700     800     900
3     the   quick   brown     dog    1000    1100    1200
4     you     see    spot     run     101     201     301
5     you     see    spot     run     401     501     601
6    four   score     and   seven     102     202     302
Thank you in advance!
CodePudding user response:
One option is pivot_longer from pyjanitor. For this particular use case, you pass the .value placeholder to names_to to keep track of the parts of the column names you want to retain as headers; you then pass a regular expression with matching groups to names_pattern:
# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(index='Col*',
                names_to=(".value", ".value"),
                names_pattern=r"(.+)\.\d+(.+)")
    Column1 Column2 Column3 Column4  Line.a  Line.b  Line.c
0       the   quick   brown     dog   100.0   200.0   300.0
1       you     see    spot     run   101.0   201.0   301.0
2      four   score     and   seven   102.0   202.0   302.0
3       the   quick   brown     dog   400.0   500.0   600.0
4       you     see    spot     run   401.0   501.0   601.0
5      four   score     and   seven     NaN     NaN     NaN
6       the   quick   brown     dog   700.0   800.0   900.0
7       you     see    spot     run     NaN     NaN     NaN
8      four   score     and   seven     NaN     NaN     NaN
9       the   quick   brown     dog  1000.0  1100.0  1200.0
10      you     see    spot     run     NaN     NaN     NaN
11     four   score     and   seven     NaN     NaN     NaN
You can get rid of the nulls with dropna:
(df
 .pivot_longer(
     index='Col*',
     names_to=(".value", ".value"),
     names_pattern=r"(.+)\.\d+(.+)")
 .dropna()
)
   Column1 Column2 Column3 Column4  Line.a  Line.b  Line.c
0      the   quick   brown     dog   100.0   200.0   300.0
1      you     see    spot     run   101.0   201.0   301.0
2     four   score     and   seven   102.0   202.0   302.0
3      the   quick   brown     dog   400.0   500.0   600.0
4      you     see    spot     run   401.0   501.0   601.0
6      the   quick   brown     dog   700.0   800.0   900.0
9      the   quick   brown     dog  1000.0  1100.0  1200.0
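If you would rather not add a dependency, roughly the same reshaping can be sketched in plain pandas with melt, a regex extract, and pivot. This is only a sketch: a trimmed two-column frame stands in for the real data, and the helper column names stub/num/header are made up for illustration:

```python
import pandas as pd

# Trimmed-down stand-in for the question's frame
df = pd.DataFrame({
    "Column1": ["the", "you"],
    "Line.0.a": [100, 101],
    "Line.0.b": [200, 201],
    "Line.1.a": [400.0, None],
    "Line.1.b": [500.0, None],
})

# Melt everything, then split "Line.0.a" into stub, number, and suffix
long = df.melt(id_vars=["Column1"])
long[["stub", "num", "suffix"]] = long["variable"].str.extract(r"(.+)\.(\d+)\.(.+)")
long["header"] = long["stub"] + "." + long["suffix"]

# Pivot back so each (row, number) pair becomes one output row,
# dropping the all-NaN rows that padding produces
out = (long.pivot(index=["Column1", "num"], columns="header", values="value")
           .dropna(how="all")
           .reset_index()
           .drop(columns="num"))
out.columns.name = None
print(out)
```

Note that pivot sorts its index, so the row order matches the stack-based output below rather than the pivot_longer output above.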
Another option, as pointed out by @Mozway, is to convert the columns into a MultiIndex, and stack:
temp = df.set_index(['Column1', 'Column2', 'Column3','Column4'])
# this is where the MultiIndex is created
cols = temp.columns.str.split('.', expand=True)
temp.columns = cols
# now we stack
# nulls are dropped by default
temp = temp.stack(level=1).droplevel(-1)
temp.columns = temp.columns.map('.'.join)
temp.reset_index()
  Column1 Column2 Column3 Column4  Line.a  Line.b  Line.c
0     the   quick   brown     dog   100.0   200.0   300.0
1     the   quick   brown     dog   400.0   500.0   600.0
2     the   quick   brown     dog   700.0   800.0   900.0
3     the   quick   brown     dog  1000.0  1100.0  1200.0
4     you     see    spot     run   101.0   201.0   301.0
5     you     see    spot     run   401.0   501.0   601.0
6    four   score     and   seven   102.0   202.0   302.0
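One caveat if you are on a newer pandas: stack only drops the all-NaN rows automatically under the legacy behaviour. Under the future behaviour (the future_stack=True implementation, introduced in pandas 2.1 and, if I recall correctly, the default in pandas 3.x), those rows are kept, so it is safer to drop them explicitly. A minimal sketch with the explicit drop, which gives the same result under either behaviour (the frame here is a trimmed stand-in for the real data):

```python
import pandas as pd

# Trimmed stand-in: one key column, one stub with two numbered copies
df = pd.DataFrame({
    "Column1": ["the", "you"],
    "Line.0.a": [100, 101],
    "Line.1.a": [400.0, None],
})

temp = df.set_index(["Column1"])
temp.columns = temp.columns.str.split(".", expand=True)

# Legacy stack drops all-NaN rows itself; future stack keeps them,
# so the explicit dropna makes the result version-independent.
out = temp.stack(level=1).droplevel(-1).dropna(how="all")
out.columns = out.columns.map(".".join)
print(out.reset_index())
```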
CodePudding user response:
Here is an approach that works. It uses melt and then joins; new_df contains what you need, though the order of rows might differ. The function takes three parameters: your data frame, the keys that remain static, and a conversion dict that tells what goes where.
import pandas as pd

def vars_to_cases(df: pd.DataFrame, keys: list, convertion_dict: dict):
    vals = list(convertion_dict.values())
    length = len(vals[0])
    if not all(len(item) == length for item in vals):
        raise ValueError("Dictionary values don't have the same length")
    tempkeys = keys.copy()
    tempkeys.append("variable")
    df_data = pd.DataFrame()
    for short_name, my_list in convertion_dict.items():
        # map each long column name to its positional index, e.g. Line.1.a -> 1
        my_replace_dict = {item: count for count, item in enumerate(my_list)}
        mydf = pd.melt(df, id_vars=tempkeys[:-1], value_vars=my_list)
        mydf["variable"] = mydf["variable"].replace(my_replace_dict)
        mydf = mydf.rename(columns={"value": short_name})
        mydf = mydf.set_index(tempkeys)
        if df_data.empty:
            df_data = mydf.copy()
        else:
            df_data = df_data.join(mydf)
    return df_data
# here is the data
df = pd.DataFrame({'Column1': {0: 'the', 1: 'you', 2: 'four'},
                   'Column2': {0: 'quick', 1: 'see', 2: 'score'},
                   'Column3': {0: 'brown', 1: 'spot', 2: 'and'},
                   'Column4': {0: 'dog', 1: 'run', 2: 'seven'},
                   'Line.0.a': {0: 100, 1: 101, 2: 102},
                   'Line.0.b': {0: 200, 1: 201, 2: 202},
                   'Line.0.c': {0: 300, 1: 301, 2: 302},
                   'Line.1.a': {0: 400.0, 1: 401.0, 2: None},
                   'Line.1.b': {0: 500.0, 1: 501.0, 2: None},
                   'Line.1.c': {0: 600.0, 1: 601.0, 2: None},
                   'Line.2.a': {0: 700.0, 1: None, 2: None},
                   'Line.2.b': {0: 800.0, 1: None, 2: None},
                   'Line.2.c': {0: 900.0, 1: None, 2: None},
                   'Line.3.a': {0: 1000.0, 1: None, 2: None},
                   'Line.3.b': {0: 1100.0, 1: None, 2: None},
                   'Line.3.c': {0: 1200.0, 1: None, 2: None}})

convertion_dict = {"Line.a": ["Line.0.a", "Line.1.a", "Line.2.a", "Line.3.a"],
                   "Line.b": ["Line.0.b", "Line.1.b", "Line.2.b", "Line.3.b"],
                   "Line.c": ["Line.0.c", "Line.1.c", "Line.2.c", "Line.3.c"]}
keys = ["Column1", "Column2", "Column3", "Column4"]

new_df = vars_to_cases(df, keys, convertion_dict)
new_df = new_df.reset_index()
new_df = new_df.dropna()
new_df = new_df.drop(columns="variable")
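The core of the melt-then-join trick, boiled down to one key column and two stubs (the names k, a, and b are made up for the sketch):

```python
import pandas as pd

df = pd.DataFrame({"k": ["x", "y"],
                   "a.0": [1, 3], "a.1": [2, 4],
                   "b.0": [10, 30], "b.1": [20, 40]})

melted = {}
for stub, cols in {"a": ["a.0", "a.1"], "b": ["b.0", "b.1"]}.items():
    # one melt per stub, renaming "value" to the stub name
    m = df.melt(id_vars="k", value_vars=cols, var_name="variable", value_name=stub)
    m["variable"] = m["variable"].str.rsplit(".", n=1).str[-1]  # keep "0"/"1"
    melted[stub] = m.set_index(["k", "variable"])

# join the per-stub frames on (key, variable)
out = melted["a"].join(melted["b"]).reset_index()
print(out)
```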