I have the following Pandas DataFrame:
|Company | Sales | Location|
-----------------------------
|American | NaN| NaN|
|Express | NaN| NaN|
|Co. | 1,000| USA|
|Apple | 2,000| USA|
|Microsoft| NaN| NaN|
|Corp. | 1,500| USA|
If a company's name is too long, it has a row speficially to fit that name. I want to 'merge' the rows in a wat that if there is NaN sales, it will merge with the rows below until there is a sales record.
This is what I want the output to look like!
|Company | Sales | Location|
----------------------------------------
|American Express Co.| 1,000| USA|
|Apple | 2,000| USA|
|Microsoft Corp. | 1,500| USA|
I tried fillna, but this would just copy the sales amount, whereas I really want to merge so I can keep the entire company name in the Company column.
Thanks!
CodePudding user response:
here is one way to do it
first to backfill so, all the individual parts of the company names has the same sales and location. then using the groupby on sales and location, the names are combined.
Instead of apply, we can use sum for Company name, but only caveat being that there won't be space in between individual parts, unless there is a trailing or leading space in individually named parts.
df[['Sales', 'Location']]=df[['Sales', 'Location']].fillna(method='bfill')
df.groupby(['Sales','Location'])['Company'].apply(' '.join).reset_index()
alternately, to avoid a scenarios where sales and location is same for two companies, we assign a unique id to each company (where location is not null) and use that in consolidating
# assign an index to rows where location is not null, and we use that to combine the splitted company names
df['idx']=df['Location'].notna().cumsum()
df.loc[df['Location'].isna(),'idx']=np.nan
df[['idx', 'Sales', 'Location']]=df[['idx', 'Sales', 'Location']].fillna(method='bfill')
df.groupby(['idx', 'Sales', 'Location'])['Company'].apply(' '.join).reset_index().drop(columns='idx')
Sales Location Company
0 1,000 USA American Express Co.
1 1,500 USA Microsoft Corp.
2 2,000 USA Apple
CodePudding user response:
A solution is to introduce a criteria that defines which lines are assumed to be followed by more details and which are "terminal". Then, it can be used to rank the lines: assign same rank to all lines that belong to the same group. Finally, the rank can be used to group the lines together.
df["EndOfLine"] = ~df.Location.isna() & ~df.Location.isna()
df["Rank"] = df.EndOfLine.cumsum().shift(1).fillna(0)
df.groupby("Rank").agg(
{
"Company": lambda s: " ".join(s),
"Sales": lambda x: x.tail(1),
"Location": lambda x: x.tail(1)
}
)
How you define when the definition stops (EndOfLine
) is the most important and the most fragile part, the rest is a technicality.
To demonstrate how the rank works, here is a more verbose version:
df["EndOfLine"] = ~df.Location.isna() & ~df.Location.isna()
df["RawRank"] = df.EndOfLine.cumsum()
df["Rank"] = df.EndOfLine.cumsum().shift(1).fillna(0)
which produces:
Company Sales Location EndOfLine RawRank Rank
0 American NaN NaN False 0 0.0
1 Express NaN NaN False 0 0.0
2 Co. 1000.0 USA True 1 0.0
3 Apple 2000.0 USA True 2 1.0
4 Microsoft NaN NaN False 2 2.0
5 Corp. 1500.0 USA True 3 2.0