I have a dataframe containing information like the following:
import pandas as pd
data = {
"items": ["4.2 Paint", "4.2.1 Paint job", "4.2.1.10 Paint red", "3.2 Seats", "3.2.3.8 Seat belt"]
}
df = pd.DataFrame(data)
print(df)
items
0 4.2 Paint
1 4.2.1 Paint job
2 4.2.1.10 Paint red
3 3.2 Seats
4 3.2.3.8 Seat belt
How can I display just the following?
items
0 4.2.1.10 Paint red
1 3.2.3.8 Seat belt
CodePudding user response:
It's hard to work out what the criteria are here, but if you're looking for the fourth-level subgroups, filter for the rows that contain exactly 3 dots.
df[df['items'].apply(lambda x: x.count(".")==3)]
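The same filter can also be written with the vectorized string accessor. This is a minimal sketch that rebuilds the question's sample frame; `filtered` is just an illustrative name. `Series.str.count` takes a regex, so the dot has to be escaped. Note that either version counts dots anywhere in the string, so a description that itself contains a dot would throw the count off.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "items": ["4.2 Paint", "4.2.1 Paint job", "4.2.1.10 Paint red",
              "3.2 Seats", "3.2.3.8 Seat belt"]
})

# Keep rows whose label contains exactly three dots (four numeric levels).
# str.count interprets its argument as a regex, hence the escaped dot.
filtered = df[df["items"].str.count(r"\.") == 3]
print(filtered)
```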
-=-=-EDIT-=-==-
If you want the deepest level per group, then something like this would work:
- get the group number
- count the "." per line
- within each group, select the max
df['group'] = df['items'].apply(lambda x: pd.to_numeric(x.split('.', 1)[0]))
df['level'] = df['items'].apply(lambda x: x.count("."))
df.groupby('group').apply(lambda x: x.loc[x['level'] == x['level'].max()])
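The `groupby(...).apply(...)` result above carries a two-level index and the helper columns. If a flat frame with only `items` is wanted, one possible variant (a sketch, not the only way) is to compare each row's level with its group's maximum via `transform`. The names `group`, `level` and `deepest` are illustrative, and the dots are counted only in the numeric prefix before the first space, which assumes every label has the form "numbers, space, description".

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "items": ["4.2 Paint", "4.2.1 Paint job", "4.2.1.10 Paint red",
              "3.2 Seats", "3.2.3.8 Seat belt"]
})

# Group number: the integer before the first dot (4, 4, 4, 3, 3).
group = df["items"].str.split(".", n=1).str[0].astype(int)

# Level: number of dots in the numeric prefix (the part before the first space).
level = df["items"].str.split(" ", n=1).str[0].str.count(r"\.")

# Keep the rows whose level equals the maximum level within their group,
# then renumber the index so it starts at 0 as in the desired output.
deepest = df[level == level.groupby(group).transform("max")].reset_index(drop=True)
print(deepest)
```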
CodePudding user response:
Here is another way to do it, using a regex:
df[df['items'].str.match(r'^((\d+[.\s]){4})\D')]
^ : from the beginning of the string
(\d+[.\s]) : capture group of one or more digits followed by a period or whitespace
{4} : four occurrences of the previous group
\D : followed by a non-digit
items
2 4.2.1.10 Paint red
4 3.2.3.8 Seat belt
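To check the pattern outside pandas, here is a small sketch using only the standard `re` module on the question's sample strings. It assumes the quantifier after `\d` is `+` (one or more digits) and writes the character class as `[.\s]`, since a `|` inside brackets would be matched as a literal pipe character rather than acting as "or".

```python
import re

# Sample strings copied from the question.
items = ["4.2 Paint", "4.2.1 Paint job", "4.2.1.10 Paint red",
         "3.2 Seats", "3.2.3.8 Seat belt"]

# Four runs of digits, each followed by a dot or whitespace, then a non-digit.
pattern = re.compile(r"^(\d+[.\s]){4}\D")

# re.match anchors at the start of the string, like Series.str.match.
matches = [s for s in items if pattern.match(s)]
print(matches)  # ['4.2.1.10 Paint red', '3.2.3.8 Seat belt']
```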