I am writing a program where I want to count the number of columns in each row as each file has a different number of columns. It means I want to check if any row is missing a cell, and if it does, then I want to highlight the cell number. I am using pandas for that to read the file. I have multiple gzip files which contain another CSV file. My code for reading the files:
#running this under loop
data = pd.read_csv(files,
compression='gzip'
on_bad_lines='warn'
low_memory=False,
sep=r'|',
header=None,
na_values=['NULL',' ','NaN'],
keep_default_na = False
)
I checked StackOverflow but there's no answer related to this situation. I would be really glad if someone can help me out here.
CodePudding user response:
Not sure if i'm interpreting this right but if you want to count the number of columns in each pandas dataframe within a loop, there are plenty of options.
1) data.shape[1]
2) len(data.columns)
3) len(list(data))
Here is a minimal reproducibility code. Replace "data = pd.DataFrame(dict)" with "data = pd.read_csv(...)"
# Import Required Libraries
import pandas as pd
import numpy as np
# Create dictionaries for the dataframe
dict1 = {'Name': ['Anne', 'Bob', 'Carl'],
'Age': [22, 20, 22],
'Marks': [90, 84, 82]}
dict2 = {'Name': ['Dan', 'Ely', 'Fan'],
'Age': [52, 30, 12],
'Marks': [40, 54, 42]}
for i in [dict1, dict2]:
# Read data
data = pd.DataFrame(dict1)
# Get columns
shape = data.shape # (3,3)
col = shape[1] # 3
# Printing Number of columns
print(f'Number of columns for file <>: {col}')
"This works fine, but after trying your suggestion I am getting the total number of columns that we have in our data frame. I want to print the number of columns each row contains. For eg: S.no Name 1 Adam 2 George 3 NULL so, 1st row will print 2, the second will be 2, but the third will print one."
– Ramoxx
Below is the updated answer(s) for your specification
Get counts of non-nulls for each row
data.apply(lambda x: x.count(), axis=1)
data:
A B C
0: 1 2 3
1: 2 nan nan
2: nan nan nan
output:
0: 3
1: 1
2: 0
Add counts of non-nulls for each row into dataframe
data['count'] = data.apply(lambda x: x.count(), axis=1)
result:
A B C count
0: 1 1 3 3
1: 2 nan nan 1
2: nan nan nan 0