I have an excel workbook which consists of a sheet having 4 columns viz. (traveller no., from, to, medium) and containing the routes travellers took to reach their destination. The traveller number can be repeated based on the number of changes a traveller made to reach their destination.
My question is, how do I only keep the relevant data while removing all other cell values having 'No information' using python.
Example worksheet:
traveller no. | from | to | medium |
---|---|---|---|
1 | manchester | london | train |
2 | derby | coventry | train |
2 | derby | coventry | train |
3 | liverpool | manchester | No information |
3 | manchester | london | flight |
3 | london | wembley | bus |
4 | No information | wolverhampton | No information |
5 | No information | No information | No information |
6 | No information | No information | No information |
7 | No information | No information | No information |
8 | No information | No information | No information |
CodePudding user response:
I am assuiming you are using the pandas library. In that case, you can loop through column titles, and the column that has the string: "No information", will be used to remove the entire row.
df = df[df["from|to|medium"].str.contains("No information") == False]
If the code doesn;t work, try playing with the below part (i.e only add "from").
"from|to|medium"
P.s the variable named df is the dataframe that you need to select.
CodePudding user response:
First of all, you need to load the data frame using pandas.
import pandas as pd
df = pd.read_excel('path/to/file.xlsx')
df
traveller no. from to medium
0 1 manchester london train
1 2 derby coventry train
2 2 derby coventry train
3 3 liverpool manchester No information
4 3 manchester london flight
5 3 london wembley bus
6 4 No information wolverhampton No information
7 5 No information No information No information
8 6 No information No information No information
9 7 No information No information No information
10 8 No information No information No information
After it, you need to specify the exact columns that you want to check and replace 'No information' with a more common NaN:
df = df[df != 'No information']
df
traveller no. from to medium
0 1 manchester london train
1 2 derby coventry train
2 2 derby coventry train
3 3 liverpool manchester NaN
4 3 manchester london flight
5 3 london wembley bus
6 4 NaN wolverhampton NaN
7 5 NaN NaN NaN
8 6 NaN NaN NaN
9 7 NaN NaN NaN
10 8 NaN NaN NaN
It will replace all instances of 'No information' with NaN. Now you can use more widespread approach of dropping NaN values with dropna()
function:
df = df.dropna(subset=['from', 'to', 'medium'])
df
traveller no. from to medium
0 1 manchester london train
1 2 derby coventry train
2 2 derby coventry train
4 3 manchester london flight
5 3 london wembley bus
The keyword subset
will remove all rows with at least one missing value in the specified columns. If you want a bit different behavior, please, address the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
CodePudding user response:
To remove rows from a pandas dataframe that contain the value 'No information' in any of the columns, you can use the DataFrame.drop method in combination with the DataFrame.isin method.
Here's an example of how you can do this:
import pandas as pd
# Load the data into a pandas dataframe
df = pd.read_excel('data.xlsx')
# Drop rows that contain 'No information' in any of the columns
df = df[~df.isin(['No information']).any(axis=1)]
This will drop all rows that contain the value 'No information' in any of the columns, and the resulting dataframe will only contain the relevant data.
If you only want to drop rows that contain 'No information' in specific columns, you can use the DataFrame.drop method in combination with the DataFrame.any method and specify the columns you want to include in the check:
# Drop rows that contain 'No information' in the 'from' and 'to' columns
df = df[~(df['from'].isin(['No information']) | df['to'].isin(['No information']))]
This will drop all rows that contain 'No information' in the 'from' and 'to' columns, while keeping rows that contain 'No information' in the other columns.