I have a pandas dataframe that I need to work with. The format is like the following
Label
0 | 70 hello
1 | 70 world
2 | 40 hi
3 | 40 I
4 | 45 panda
I need to merge rows of the df that start with the same integer (i.e. 70). Some can be two digit or three digit. The final format should be like this:
Label
0 | 70 hello world
1 | 40 hi I
2 | 45 panda
I've tried this.
search = "43"
check_if_num = df["Label"].str.startswith(search, na = False)
check_if_num
But, it only returns a boolean value for 1 number. Don't know how to expand it to integers as a whole (the numbers that the row can start with is not limited to the example above).
How would I accomplish this?
I'm fairly new to working with pandas and Stack Overflow.
CodePudding user response:
Use str.extract
to split digits and string then groupby digit and create the final string for each group:
>>> df['Label'].str.extract(r'^(\d )\s(.*)') \
.groupby(0, sort=False)[1] \
.apply(lambda x: f"{x.name} {' '.join(x)}") \
.reset_index(drop=True).rename('Label').to_frame()
Label
0 70 hello world
1 40 hi I
2 45 panda
After the extraction, there are 2 columns because, one for each capture group (...)
. The column 0 contains the digits rather than the column 1 contains the string.
x.name
in apply
is the name of the group (the number) and x
the values (the string).
The last line is only for presentation to match the expect outcome, there is no logic inside
CodePudding user response:
Try this:
df= pd.DataFrame({"ID":[0,1,2,3,4], "Label":['70 hello', '70 world', '40 hi', '40 I', '45 panda']})
df['Num']=df.Label.apply(lambda x: re.search('(\d*)', x).group(0))
df['txt']=df.Label.apply(lambda x: re.search('([a-zA-Z] )', x).group(0))
df.groupby('Num')['txt'].transform(lambda x: ' '.join(x)).drop_duplicates( keep='first')
Output:
0 hello world
2 hi I
4 panda
CodePudding user response:
Something like this?
import pandas as pd
df = pd.DataFrame({"label":[70,70,40,40,45],"text":["hello","world","hi","I","panda"]})
print(df)
# label text
# 0 70 hello
# 1 70 world
# 2 40 hi
# 3 40 I
# 4 45 panda
df.groupby("label")["text"].apply(" ".join)
#label
#40 hi I
#45 panda
#70 hello world
EDIT:
Thanks to Quang Hoang for pointing out that we just can parse " ".join
into apply
and don't need to do apply(lambda x: " ".join(x))