Home > Mobile >  Merging pandas df rows if they start with the same int
Merging pandas df rows if they start with the same int

Time:10-05

I have a pandas dataframe that I need to work with. The format is like the following

  Label
0 | 70 hello
1 | 70 world
2 | 40 hi
3 | 40 I
4 | 45 panda

I need to merge rows of the df that start with the same integer (i.e. 70). Some can be two digit or three digit. The final format should be like this:

  Label
0 | 70 hello world
1 | 40 hi I
2 | 45 panda

I've tried this.

search = "43"
  
check_if_num = df["Label"].str.startswith(search, na = False)
check_if_num

But, it only returns a boolean value for 1 number. Don't know how to expand it to integers as a whole (the numbers that the row can start with is not limited to the example above).

How would I accomplish this?

I'm fairly new to working with pandas and Stack Overflow.

CodePudding user response:

Use str.extract to split digits and string then groupby digit and create the final string for each group:

>>> df['Label'].str.extract(r'^(\d )\s(.*)') \
               .groupby(0, sort=False)[1] \
               .apply(lambda x: f"{x.name} {' '.join(x)}") \
               .reset_index(drop=True).rename('Label').to_frame()

            Label
0  70 hello world
1         40 hi I
2        45 panda

After the extraction, there are 2 columns because, one for each capture group (...). The column 0 contains the digits rather than the column 1 contains the string.

x.name in apply is the name of the group (the number) and x the values (the string).

The last line is only for presentation to match the expect outcome, there is no logic inside

CodePudding user response:

Try this:

df= pd.DataFrame({"ID":[0,1,2,3,4], "Label":['70 hello', '70 world', '40 hi', '40 I', '45 panda']})

df['Num']=df.Label.apply(lambda x: re.search('(\d*)', x).group(0))
df['txt']=df.Label.apply(lambda x: re.search('([a-zA-Z] )', x).group(0))

df.groupby('Num')['txt'].transform(lambda x: ' '.join(x)).drop_duplicates( keep='first')

Output:

0    hello world
2           hi I
4          panda

CodePudding user response:

Something like this?

import pandas as pd

df = pd.DataFrame({"label":[70,70,40,40,45],"text":["hello","world","hi","I","panda"]})

print(df)

#   label   text
# 0 70  hello
# 1 70  world
# 2 40  hi
# 3 40  I
# 4 45  panda

df.groupby("label")["text"].apply(" ".join)

#label
#40           hi I
#45          panda
#70    hello world

EDIT: Thanks to Quang Hoang for pointing out that we just can parse " ".join into apply and don't need to do apply(lambda x: " ".join(x))

  • Related