How can I drop column A because its values contain "https://" in Python?
Back story: I have a 500-column DataFrame where 250 of the columns hold "https://" links in their rows, explaining what the prior variable is.
The goal is to loop through the df and drop the columns that contain "https://".
A | B |
---|---|
https://mywebsite | 25 |
https://mywebsite | 42 |
My desired output is:
B |
---|
25 |
42 |
CodePudding user response:
The following snippet should work; it removes any column that contains URLs:
to_drop = []
for column in df:
    try:
        # na=False counts missing values as "not a URL"
        has_url = df[column].str.startswith('https://', na=False).any()
    except AttributeError:
        has_url = False  # column does not hold strings, so .str is unavailable
    if has_url:
        to_drop.append(column)
df.drop(columns=to_drop, inplace=True)
For each column, it checks whether any value starts with 'https://'. If at least one does, the column name is added to the 'to_drop' list. Columns that are not string-typed raise an AttributeError on the .str accessor and are skipped. Finally, the columns in the list are dropped in place.
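As a quick check of the logic above, here is a minimal self-contained sketch run against the two-column example from the question. The helper name `column_has_url` and its `prefix` parameter are made up for illustration; only `str.startswith`, `any`, and `DataFrame.drop` come from pandas.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": ["https://mywebsite", "https://mywebsite"],
    "B": [25, 42],
})

def column_has_url(col, prefix="https://"):
    """Return True if any value in the column starts with the prefix.

    Hypothetical helper: non-string columns are treated as "no URL".
    """
    try:
        # na=False makes missing values count as non-matches.
        return bool(col.str.startswith(prefix, na=False).any())
    except AttributeError:
        # The .str accessor is only available on string-like columns.
        return False

to_drop = [name for name in df.columns if column_has_url(df[name])]
result = df.drop(columns=to_drop)  # returns a new DataFrame, df is unchanged
print(result)
```

Returning a new DataFrame instead of using `inplace=True` keeps the original around, which can be handy while you are still checking that the right columns get removed.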
The version below only drops a column if more than 50% of its values are URLs:
to_drop = []
for column in df:
    try:
        # mean() of a boolean Series is the fraction of True values
        has_url = df[column].str.startswith('https://', na=False).mean() > 0.5
    except AttributeError:
        has_url = False  # column does not hold strings, so .str is unavailable
    if has_url:
        to_drop.append(column)
df.drop(columns=to_drop, inplace=True)
You can replace 0.5 with any number between 0 and 1 to set the fraction of values that must be URLs before the column is dropped.
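To make the effect of the cutoff easier to see, the loop can be wrapped in a function that takes the threshold as an argument. This is only a sketch: the function name `drop_url_columns`, its parameters, and the sample data are assumptions made up for the example, not part of the question.

```python
import pandas as pd

def drop_url_columns(df, threshold=0.5, prefix="https://"):
    """Return a copy of df without columns whose share of URL values exceeds threshold."""
    to_drop = []
    for column in df.columns:
        try:
            # Fraction of values starting with the prefix; missing values count as non-URLs.
            share = df[column].str.startswith(prefix, na=False).mean()
        except AttributeError:
            continue  # non-string column, nothing to check
        if share > threshold:
            to_drop.append(column)
    return df.drop(columns=to_drop)

# Hypothetical sample data.
sample = pd.DataFrame({
    "mostly_urls": ["https://a", "https://b", "n/a", "https://c"],  # 75% URLs
    "few_urls": ["https://a", "x", "y", "z"],                       # 25% URLs
    "numbers": [1, 2, 3, 4],
})

print(drop_url_columns(sample, threshold=0.5).columns.tolist())  # ['few_urls', 'numbers']
print(drop_url_columns(sample, threshold=0.2).columns.tolist())  # ['numbers']
```

With a threshold of 0.5 only the column that is 75% URLs is removed; lowering it to 0.2 also removes the column that is 25% URLs.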