How to drop columns if a row in a column has a "url" or "http" in it without know

Time:03-24

How can I drop column A in Python because it contains "https://" values?

Back story: I have a 500-column DataFrame where 250 of the columns hold "https://" links in their rows, explaining what the prior variable is.

The goal is to loop through the df and drop every column that contains "https://" links.

A B
https://mywebsite 25
https://mywebsite 42

My desired output is:

B
25
42

CodePudding user response:

The following code snippet should work, removing any columns that contain URLs:

to_drop = []
for column in df:
    try:
        # True if any value in the column starts with 'https://'
        has_url = df[column].str.startswith('https://').any()
    except AttributeError:
        continue  # dtype is not string, so skip this column

    if has_url:
        to_drop.append(column)

df.drop(columns=to_drop, inplace=True)

For each column, it checks whether any value starts with 'https://'. If so, the column is added to the 'to_drop' list. The columns in this list are then dropped in place.
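As a rough sketch of the same idea without an explicit loop, the check can be applied to every column at once. This version assumes that matching both "http://" and "https://" is wanted (the regular expression is my own choice, not part of the answer above) and casts each column to strings first so that numeric columns need no special handling. The sample data is the two-row frame from the question.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    "A": ["https://mywebsite", "https://mywebsite"],
    "B": [25, 42],
})

# For each column: cast to str, then test whether any value starts with
# http:// or https:// (str.match anchors the pattern at the start).
url_mask = df.apply(lambda col: col.astype(str).str.match(r"https?://").any())

# Keep only the columns that were not flagged.
result = df.loc[:, ~url_mask]
print(result)
```

Unlike the loop, this leaves the original df untouched and returns the filtered frame as a new object.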

The version below will only drop a column if more than 50% of the values in it are URLs:

to_drop = []
for column in df:
    try:
        # Share of values in the column that start with 'https://'
        has_url = df[column].str.startswith('https://').mean() > 0.5
    except AttributeError:
        continue  # dtype is not string, so skip this column

    if has_url:
        to_drop.append(column)

df.drop(columns=to_drop, inplace=True)

You can change 0.5 to another number between 0 and 1 to set how large a share of the values must be URLs for the column to be dropped.
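To make the effect of the threshold easier to see, here is a minimal sketch that wraps the check in a function. The name drop_url_columns, its threshold parameter, and the sample frame are all hypothetical and only for illustration. It also assumes that "http://" links should count as URLs, and it treats missing values as non-URLs.

```python
import pandas as pd

def drop_url_columns(df, threshold=0.5):
    """Return a new DataFrame without columns whose URL share exceeds threshold."""
    # Share of values per column that start with http:// or https://.
    # Casting to str means missing values count as non-URLs.
    url_share = df.apply(
        lambda col: col.astype(str).str.match(r"https?://").mean()
    )
    return df.loc[:, url_share <= threshold]

# Made-up data: 2 of 3, 1 of 3, and 0 of 3 values are URLs.
sample = pd.DataFrame({
    "mostly_urls": ["https://a", "http://b", "n/a"],
    "few_urls": ["https://a", "x", "y"],
    "numbers": [1, 2, 3],
})

print(list(drop_url_columns(sample, 0.5).columns))  # ['few_urls', 'numbers']
print(list(drop_url_columns(sample, 0.2).columns))  # ['numbers']
```

Lowering the threshold makes the filter stricter: at 0.2, a column with a single URL among three values is already dropped.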
