Merging pandas df rows if they start with the same int-CodePudding

I have a pandas dataframe that I need to work with. The format is like the following

  Label
0 | 70 hello
1 | 70 world
2 | 40 hi
3 | 40 I
4 | 45 panda

I need to merge rows of the df that start with the same integer (i.e. 70). Some can be two digit or three digit. The final format should be like this:

  Label
0 | 70 hello world
1 | 40 hi I
2 | 45 panda

I've tried this.

search = "43"
  
check_if_num = df["Label"].str.startswith(search, na = False)
check_if_num

But, it only returns a boolean value for 1 number. Don't know how to expand it to integers as a whole (the numbers that the row can start with is not limited to the example above).

How would I accomplish this?

I'm fairly new to working with pandas and Stack Overflow.

CodePudding user response：

Use str.extract to split digits and string then groupby digit and create the final string for each group:

>>> df['Label'].str.extract(r'^(\d )\s(.*)') \
               .groupby(0, sort=False)[1] \
               .apply(lambda x: f"{x.name} {' '.join(x)}") \
               .reset_index(drop=True).rename('Label').to_frame()

            Label
0  70 hello world
1         40 hi I
2        45 panda

After the extraction, there are 2 columns because, one for each capture group (...). The column 0 contains the digits rather than the column 1 contains the string.

x.name in apply is the name of the group (the number) and x the values (the string).

The last line is only for presentation to match the expect outcome, there is no logic inside

CodePudding user response：

Try this:

df= pd.DataFrame({"ID":[0,1,2,3,4], "Label":['70 hello', '70 world', '40 hi', '40 I', '45 panda']})

df['Num']=df.Label.apply(lambda x: re.search('(\d*)', x).group(0))
df['txt']=df.Label.apply(lambda x: re.search('([a-zA-Z] )', x).group(0))

df.groupby('Num')['txt'].transform(lambda x: ' '.join(x)).drop_duplicates( keep='first')

Output:

0    hello world
2           hi I
4          panda

CodePudding user response：

Something like this?

import pandas as pd

df = pd.DataFrame({"label":[70,70,40,40,45],"text":["hello","world","hi","I","panda"]})

print(df)

#   label   text
# 0 70  hello
# 1 70  world
# 2 40  hi
# 3 40  I
# 4 45  panda

df.groupby("label")["text"].apply(" ".join)

#label
#40           hi I
#45          panda
#70    hello world

EDIT: Thanks to Quang Hoang for pointing out that we just can parse " ".join into apply and don't need to do apply(lambda x: " ".join(x))