I'm trying to remove rows of data that I don't need after importing from files and concatenating my list of dataframes. Here is what my current DataFrame looks like:
Best Movie
0 Movie: Orphan
1 2.
2 Movie: Avatar
3 3.
4 Movie: Inglourious Basterds
... ...
2371 Movie: The Deep End of the Ocean
2372 49.
2373 Movie: Drop Dead Gorgeous
2374 50.
2375 Movie: Go
I need to remove all rows that contain just a number, so the result looks like this:
Best Movie
0 Movie: Orphan
2 Movie: Avatar
4 Movie: Inglourious Basterds
... ...
2371 Movie: The Deep End of the Ocean
2373 Movie: Drop Dead Gorgeous
2375 Movie: Go
Thank you for your help!
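For anyone who wants to reproduce this, a toy version of the frame (built directly rather than imported and concatenated from files) could look like this:
import pandas as pd

# Toy stand-in for the concatenated result described above
df = pd.DataFrame({
    "Best Movie": [
        "Movie: Orphan", "2.",
        "Movie: Avatar", "3.",
        "Movie: Inglourious Basterds",
    ]
})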
CodePudding user response:
One solution using str.match:
mask = ~df["Best Movie"].str.match(r"^\s*\d+\.$")
res = df[mask]
print(res)
Output
Best Movie
0 Movie: Orphan
2 Movie: Avatar
4 Movie: Inglourious Basterds
5 Movie: The Deep End of the Ocean
7 Movie: Drop Dead Gorgeous
9 Movie: Go
UPDATE
To remove the "Movie: " prefix and reset the index, do:
res = df[mask].reset_index(drop=True)
res = res["Best Movie"].str.replace(r"^\s*Movie:\s*", "", regex=True)
print(res)
Output
0 Orphan
1 Avatar
2 Inglourious Basterds
3 The Deep End of the Ocean
4 Drop Dead Gorgeous
5 Go
Name: Best Movie, dtype: object
CodePudding user response:
You can do:
df.loc[~df['Best Movie'].str.match(r'^\d+\.$')]
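If the column can contain missing values, str.match returns NaN for those rows and the ~ negation will fail; passing na=False treats them as non-matches and keeps them. A minimal sketch (the na argument is standard pandas, not part of the original answer):
df.loc[~df['Best Movie'].str.match(r'^\d+\.$', na=False)]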
CodePudding user response:
Sample input
df = pd.DataFrame({
    "Best_Movie": ["Movie: Orphan", "2.", "Movie: Avatar", "3."]
})
Apply pd.to_numeric: rows that contain only a number are converted to float, while all other rows are marked as NaN.
df["nums"] = pd.to_numeric(df['Best_Movie'], errors='coerce')
Extract the rows that contain text (i.e. the rows marked as NaN):
df.loc[df.nums.isnull(), "Best_Movie"]
Sample output
0 Movie: Orphan
2 Movie: Avatar
Name: Best_Movie, dtype: object
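The same idea also works without the helper column, as a one-line sketch:
df.loc[pd.to_numeric(df["Best_Movie"], errors="coerce").isna(), "Best_Movie"]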
CodePudding user response:
Try the following. '|' basically means "or" here, so the pattern matches any row containing one of the digits 0-9, and ~ keeps the rest:
df[~df['Best Movie'].str.contains('|'.join(str(i) for i in range(10)))]
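Note that this also drops titles that happen to contain a digit (e.g. "Movie: 2012"). A stricter variant that only removes pure "number-dot" rows, sketched with str.fullmatch (available in pandas >= 1.1):
df[~df['Best Movie'].str.fullmatch(r'\s*\d+\.', na=False)]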