How can I merge rows with NAN in a Pandas Dataframe-CodePudding

I have the following Pandas DataFrame:

|Company  | Sales | Location| 
-----------------------------
|American |    NaN|      NaN|
|Express  |    NaN|      NaN|
|Co.      |  1,000|      USA|
|Apple    |  2,000|      USA|
|Microsoft|    NaN|      NaN|
|Corp.    |  1,500|      USA|

If a company's name is too long, it has a row speficially to fit that name. I want to 'merge' the rows in a wat that if there is NaN sales, it will merge with the rows below until there is a sales record.

This is what I want the output to look like!

|Company             | Sales | Location| 
----------------------------------------
|American Express Co.|  1,000|      USA|
|Apple               |  2,000|      USA|
|Microsoft Corp.     |  1,500|      USA|

I tried fillna, but this would just copy the sales amount, whereas I really want to merge so I can keep the entire company name in the Company column.

Thanks!

CodePudding user response：

here is one way to do it

first to backfill so, all the individual parts of the company names has the same sales and location. then using the groupby on sales and location, the names are combined.

Instead of apply, we can use sum for Company name, but only caveat being that there won't be space in between individual parts, unless there is a trailing or leading space in individually named parts.

df[['Sales', 'Location']]=df[['Sales', 'Location']].fillna(method='bfill')
df.groupby(['Sales','Location'])['Company'].apply(' '.join).reset_index()

alternately, to avoid a scenarios where sales and location is same for two companies, we assign a unique id to each company (where location is not null) and use that in consolidating

# assign an index to rows where location is not null, and we use that to combine the splitted company names

df['idx']=df['Location'].notna().cumsum()
df.loc[df['Location'].isna(),'idx']=np.nan

df[['idx', 'Sales', 'Location']]=df[['idx', 'Sales', 'Location']].fillna(method='bfill')
df.groupby(['idx', 'Sales', 'Location'])['Company'].apply(' '.join).reset_index().drop(columns='idx')

Sales   Location    Company
0   1,000   USA     American Express Co.
1   1,500   USA     Microsoft Corp.
2   2,000   USA     Apple

CodePudding user response：

A solution is to introduce a criteria that defines which lines are assumed to be followed by more details and which are "terminal". Then, it can be used to rank the lines: assign same rank to all lines that belong to the same group. Finally, the rank can be used to group the lines together.

df["EndOfLine"] = ~df.Location.isna() & ~df.Location.isna()
df["Rank"] = df.EndOfLine.cumsum().shift(1).fillna(0)

df.groupby("Rank").agg(
    {
        "Company": lambda s: " ".join(s),
        "Sales": lambda x: x.tail(1),
        "Location": lambda x: x.tail(1)
    }
)

How you define when the definition stops (EndOfLine) is the most important and the most fragile part, the rest is a technicality.

To demonstrate how the rank works, here is a more verbose version:

df["EndOfLine"] = ~df.Location.isna() & ~df.Location.isna()
df["RawRank"] = df.EndOfLine.cumsum()
df["Rank"] = df.EndOfLine.cumsum().shift(1).fillna(0)

which produces:

     Company   Sales Location  EndOfLine  RawRank    Rank
0   American     NaN      NaN      False        0     0.0
1    Express     NaN      NaN      False        0     0.0
2        Co.  1000.0      USA       True        1     0.0
3      Apple  2000.0      USA       True        2     1.0
4  Microsoft     NaN      NaN      False        2     2.0
5      Corp.  1500.0      USA       True        3     2.0