Remove values from excel sheet using python-CodePudding

I have an excel workbook which consists of a sheet having 4 columns viz. (traveller no., from, to, medium) and containing the routes travellers took to reach their destination. The traveller number can be repeated based on the number of changes a traveller made to reach their destination.

My question is, how do I only keep the relevant data while removing all other cell values having 'No information' using python.

Example worksheet:

traveller no.	from	to	medium
1	manchester	london	train
2	derby	coventry	train
2	derby	coventry	train
3	liverpool	manchester	No information
3	manchester	london	flight
3	london	wembley	bus
4	No information	wolverhampton	No information
5	No information	No information	No information
6	No information	No information	No information
7	No information	No information	No information
8	No information	No information	No information

CodePudding user response：

I am assuiming you are using the pandas library. In that case, you can loop through column titles, and the column that has the string: "No information", will be used to remove the entire row.

df = df[df["from|to|medium"].str.contains("No information") == False]

If the code doesn;t work, try playing with the below part (i.e only add "from").

"from|to|medium"

P.s the variable named df is the dataframe that you need to select.

CodePudding user response：

First of all, you need to load the data frame using pandas.

import pandas as pd
df = pd.read_excel('path/to/file.xlsx')
df
    traveller no.            from              to          medium
0               1      manchester          london           train
1               2           derby        coventry           train
2               2           derby        coventry           train
3               3       liverpool      manchester  No information
4               3      manchester          london          flight
5               3          london         wembley             bus
6               4  No information   wolverhampton  No information
7               5  No information  No information  No information
8               6  No information  No information  No information
9               7  No information  No information  No information
10              8  No information  No information  No information

After it, you need to specify the exact columns that you want to check and replace 'No information' with a more common NaN:

df = df[df != 'No information']
df
    traveller no.        from             to  medium
0               1  manchester         london   train
1               2       derby       coventry   train
2               2       derby       coventry   train
3               3   liverpool     manchester     NaN
4               3  manchester         london  flight
5               3      london        wembley     bus
6               4         NaN  wolverhampton     NaN
7               5         NaN            NaN     NaN
8               6         NaN            NaN     NaN
9               7         NaN            NaN     NaN
10              8         NaN            NaN     NaN

It will replace all instances of 'No information' with NaN. Now you can use more widespread approach of dropping NaN values with dropna() function:

df = df.dropna(subset=['from', 'to', 'medium'])
df
      traveller no.        from        to  medium
0              1  manchester    london   train
1              2       derby  coventry   train
2              2       derby  coventry   train
4              3  manchester    london  flight
5              3      london   wembley     bus

The keyword subset will remove all rows with at least one missing value in the specified columns. If you want a bit different behavior, please, address the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

CodePudding user response：

To remove rows from a pandas dataframe that contain the value 'No information' in any of the columns, you can use the DataFrame.drop method in combination with the DataFrame.isin method.

Here's an example of how you can do this:

import pandas as pd

# Load the data into a pandas dataframe
df = pd.read_excel('data.xlsx')

# Drop rows that contain 'No information' in any of the columns
df = df[~df.isin(['No information']).any(axis=1)]

This will drop all rows that contain the value 'No information' in any of the columns, and the resulting dataframe will only contain the relevant data.

If you only want to drop rows that contain 'No information' in specific columns, you can use the DataFrame.drop method in combination with the DataFrame.any method and specify the columns you want to include in the check:

# Drop rows that contain 'No information' in the 'from' and 'to' columns
df = df[~(df['from'].isin(['No information']) | df['to'].isin(['No information']))]

This will drop all rows that contain 'No information' in the 'from' and 'to' columns, while keeping rows that contain 'No information' in the other columns.