I have a large dataset: roughly 10K events (rows), and each event has a varying number of stations (columns, up to about 30). I want to set a maximum number of columns (e.g. 20 columns) and exclude the events that use more than 20 columns.
My question is: how can I find the rows (among the 10K) that have values in more than 20 columns, and then drop() them?
For example, in the picture below I want to set 7 stations (columns) as the limit, so the rows that have more than 7 stations should be deleted.
I do not know whether a loop like this would work:
for i in range(len(df)):
    if df.iloc[i][['station8', 'station9', 'station10']].notna().any():
        print(i, df.iloc[i][['station8', 'station9', 'station10']])
CodePudding user response:
Use the Pandas function read_csv with these parameters:
- skiprows to specify which rows to skip
- usecols to specify which columns to use
Code
import pandas as pd

def generate_df(filenm, number_columns):
    '''
    Returns a dataframe from a CSV file,
    ignoring rows with more than number_columns fields.
    '''
    # Find the rows not to use
    with open(filenm, 'r') as f:
        # Get indexes of rows to skip (the header at index 0 is always kept)
        skip_rows = [i for i, line in enumerate(f)
                     if i > 0 and len(line.split(",")) > number_columns]
    # Create dataframe with the desired columns and rows
    return pd.read_csv(filenm, usecols=range(number_columns), skiprows=skip_rows)
Example
File: test.csv
EventID,Station1,Station2,Station3,Station4
1,A,B
2,C,D,E
3,F,G
4,H,I,J
5,K,L
This will retrieve the first two stations:
number_stations = 2
number_columns = 1 + number_stations  # add 1 to include the EventID column
df = generate_df('test.csv', number_columns)
df
df
EventID Station1 Station2
0 1 A B
1 3 F G
2 5 K L
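To see the skip-row computation in isolation, here is a self-contained sketch that recreates the test.csv contents from the example above and checks which row indexes get excluded:

```python
import pandas as pd

# Recreate the example file (same contents as test.csv above)
csv_text = """EventID,Station1,Station2,Station3,Station4
1,A,B
2,C,D,E
3,F,G
4,H,I,J
5,K,L
"""
with open('test.csv', 'w') as f:
    f.write(csv_text)

number_columns = 3  # EventID + 2 stations
with open('test.csv') as f:
    # 0-based line indexes of data rows with more than number_columns fields
    skip_rows = [i for i, line in enumerate(f)
                 if i > 0 and len(line.split(',')) > number_columns]

print(skip_rows)  # [2, 4]: the rows for events 2 and 4 exceed 3 fields

df = pd.read_csv('test.csv', usecols=range(number_columns), skiprows=skip_rows)
```

Passing the precomputed list to skiprows means read_csv never has to parse the over-long rows at all.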
CodePudding user response:
Instead of dropping rows that do have values in the extra columns, you can alternatively only keep the rows that have no values there.
Some test data (your example):
import pandas as pd
import numpy as np
value_counts = [7, 7, 10, 7, 7, 8, 9, 7, 9, 7]
df = pd.DataFrame([['x']*v + [np.nan]*(10-v) for v in value_counts],
                  columns=[f'Station{i}' for i in np.arange(10) + 1],
                  index=np.arange(10) + 1)
Result:
Station1 Station2 Station3 ... Station8 Station9 Station10
1 x x x ... NaN NaN NaN
2 x x x ... NaN NaN NaN
3 x x x ... x x x
4 x x x ... NaN NaN NaN
5 x x x ... NaN NaN NaN
6 x x x ... x NaN NaN
7 x x x ... x x NaN
8 x x x ... NaN NaN NaN
9 x x x ... x x NaN
10 x x x ... NaN NaN NaN
Now you can keep the rows that contain only NaN in the extra columns:
columns_to_include = 7
df[df.iloc[:, columns_to_include:].isna().all(axis=1)]
Result:
Station1 Station2 Station3 ... Station8 Station9 Station10
1 x x x ... NaN NaN NaN
2 x x x ... NaN NaN NaN
4 x x x ... NaN NaN NaN
5 x x x ... NaN NaN NaN
8 x x x ... NaN NaN NaN
10 x x x ... NaN NaN NaN
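If you prefer the drop() call mentioned in the question, the same condition can be inverted: flag the rows that have any value beyond the allowed columns and drop those. A minimal sketch on a hypothetical 3-column frame (not the 10-row example above):

```python
import numpy as np
import pandas as pd

# Hypothetical example: 3 station columns, limit of 2 stations per row
df = pd.DataFrame({
    'Station1': ['x', 'x', 'x'],
    'Station2': ['x', 'x', np.nan],
    'Station3': ['x', np.nan, np.nan],
}, index=[1, 2, 3])

columns_to_include = 2
# Rows with any non-NaN value beyond the allowed columns
too_long = df.iloc[:, columns_to_include:].notna().any(axis=1)
# drop() them, as asked in the question
df = df.drop(index=df.index[too_long])
```

Both forms select the same rows; keeping with a mask and dropping the inverse are interchangeable here.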