I have a data frame and the first column is populated with strings. I want to create rules such that the whole row of the data frame is deleted if the first word in the string of the first column only contains one letter, and for all other strings everything after the first space will be deleted (only in that column the row will be kept).
An example is as follows:
Column 1 | Many other columns |
---|---|
G Gsrnrf | xxxxxxx |
DNN fdffk | xxxxx |
OCGN fnjfdj | xxxxx |
In the first row there is only one letter so the whole row should be deleted.
In the second row, the first column should only contain 'DNN' (i.e. 'fdffk' is deleted and the rest of the row is kept)
In the third row, the first column should only contain 'OCGN' and delete 'fnjfdj'.
CodePudding user response:
You can extract the expected value from "Column 1" using the \S\S
regex (anything that is not space and of length >1).
Then use dropna
on the modified "Column 1" to delete the rows where there was no match:
df['Column 1'] = df['Column 1'].str.extract('^(\S\S )')
df = df.dropna(subset=['Column 1'])
output:
Column 1 Many other columns
1 DNN xxxxx
2 OCGN xxxxx
CodePudding user response:
Another solution (without regex):
df["Column 1"] = df["Column 1"].str.split(n=1).str[0]
df = df[df["Column 1"].str.len().ne(1)]
print(df)
Prints:
Column 1 Many other columns
1 DNN xxxxx
2 OCGN xxxxx