How to classify the rows based on a certain number of columns (delete the rows which have more columns)


Well, I have a very large dataset: in short, about 10K events (rows), and each event has a different number of stations (columns, at most 30 stations). I want to set a maximum number of columns (e.g. 20 columns) and exclude the events which have more than 20 columns.

My question is: how can I find the rows (among the 10K) which have more than 20 columns and then drop() them?

For example, in the picture below I want to set 7 stations (columns) as the limit, so the rows which have more than 7 stations should be deleted.

I do not know whether a loop like this would work:

for i in range(7, 10):
    # check whether any of the extra station columns hold a value
    if df.iloc[i][['station8', 'station9', 'station10']].notna().any():
        print(i, df.iloc[i][['station8', 'station9', 'station10']])

CodePudding user response:

Use the Pandas function read_csv with these parameters:

  • skiprows to determine which rows to skip
  • usecols to determine which columns to use

Code

import pandas as pd

def generate_df(filenm, number_columns):
    '''
    Returns a dataframe from a CSV file,
    ignoring rows that have more than number_columns columns.
    '''
    # Find the indexes of the rows to skip (the header at index 0 is kept)
    with open(filenm, 'r') as f:
        skip_rows = [i for i, line in enumerate(f)
                     if i > 0 and len(line.split(",")) > number_columns]

    # Create a dataframe with the desired columns and rows
    return pd.read_csv(filenm, usecols=range(number_columns), skiprows=skip_rows)

Example

File: test.csv

EventID,Station1,Station2,Station3,Station4
1,A,B
2,C,D,E
3,F,G
4,H,I,J
5,K,L

This will retrieve the first two stations:

number_stations = 2
number_columns = 1 + number_stations  # add 1 to include the EventID column
df = generate_df('test.csv', number_columns)

df

    EventID Station1    Station2
0   1   A   B
1   3   F   G
2   5   K   L
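
For the original problem the call would look like the minimal sketch below; the file name events.csv and the assumption that the first column is an EventID (as in test.csv) are mine, for illustration only:

# hypothetical call for the original question: keep events with at most 20 stations
number_stations = 20
number_columns = 1 + number_stations  # assumed: first column is EventID, as in test.csv
df = generate_df('events.csv', number_columns)

Note the design choice: filtering on the raw line length means the file is scanned twice, but the over-long rows never reach read_csv at all.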

CodePudding user response:

Instead of dropping the rows that have values in the extra columns, you can alternatively keep only the rows that have no values there.

Some test data (your example):

import pandas as pd
import numpy as np

value_counts = [7, 7, 10, 7, 7, 8, 9, 7, 9, 7]

df = pd.DataFrame([['x']*v + [np.nan]*(10-v) for v in value_counts],
                  columns=[f'Station{i}' for i in np.arange(10) + 1],
                  index=np.arange(10) + 1)

Result:

   Station1 Station2 Station3  ... Station8 Station9 Station10
1         x        x        x  ...      NaN      NaN       NaN
2         x        x        x  ...      NaN      NaN       NaN
3         x        x        x  ...        x        x         x
4         x        x        x  ...      NaN      NaN       NaN
5         x        x        x  ...      NaN      NaN       NaN
6         x        x        x  ...        x      NaN       NaN
7         x        x        x  ...        x        x       NaN
8         x        x        x  ...      NaN      NaN       NaN
9         x        x        x  ...        x        x       NaN
10        x        x        x  ...      NaN      NaN       NaN

Now you can keep the rows that only contain NaN in the extra columns:

columns_to_include = 7

df[df.iloc[:, columns_to_include:].isna().all(axis=1)]

Result:

   Station1 Station2 Station3  ... Station8 Station9 Station10
1         x        x        x  ...      NaN      NaN       NaN
2         x        x        x  ...      NaN      NaN       NaN
4         x        x        x  ...      NaN      NaN       NaN
5         x        x        x  ...      NaN      NaN       NaN
8         x        x        x  ...      NaN      NaN       NaN
10        x        x        x  ...      NaN      NaN       NaN 
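
Since the question asked specifically about drop(), here is a minimal equivalent sketch (using the same columns_to_include as above) that counts the non-null values per row and drops the too-long events:

# rows whose non-null count exceeds the limit are the events to remove
too_long = df[df.notna().sum(axis=1) > columns_to_include].index
df.drop(too_long)

Both forms give the same result here; the boolean mask above simply avoids building the intermediate index.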