Home > Software design >  How do I create rules for strings in a column in a dataframe?
How do I create rules for strings in a column in a dataframe?

Time:09-16

I have a data frame and the first column is populated with strings. I want to create rules such that the whole row of the data frame is deleted if the first word in the string of the first column only contains one letter, and for all other strings everything after the first space will be deleted (only in that column the row will be kept).

An example is as follows:

Column 1 Many other columns
G Gsrnrf xxxxxxx
DNN fdffk xxxxx
OCGN fnjfdj xxxxx

In the first row there is only one letter so the whole row should be deleted.

In the second row, the first column should only contain 'DNN' (i.e. 'fdffk' is deleted and the rest of the row is kept)

In the third row, the first column should only contain 'OCGN' and delete 'fnjfdj'.

CodePudding user response:

You can extract the expected value from "Column 1" using the \S\S regex (anything that is not space and of length >1).

Then use dropna on the modified "Column 1" to delete the rows where there was no match:

df['Column 1'] = df['Column 1'].str.extract('^(\S\S )')
df = df.dropna(subset=['Column 1'])

output:

  Column 1 Many other columns
1      DNN              xxxxx
2     OCGN              xxxxx

CodePudding user response:

Another solution (without regex):

df["Column 1"] = df["Column 1"].str.split(n=1).str[0]
df = df[df["Column 1"].str.len().ne(1)]
print(df)

Prints:

  Column 1 Many other columns
1      DNN              xxxxx
2     OCGN              xxxxx
  • Related