Concat creates NaN values even after index reset

I want to create a CSV file that combines the train and test data with their labels, to use in a project. The problem is that after pd.concat(), even with the index reset, the labels are still NaN, and I don't understand what is wrong. The datasets are at this link: https://wetransfer.com/downloads/9f0562b7ec341ebb663262af78971b8020211228154538/84d58d

import pandas as pd
from sklearn.utils import shuffle
 
# remove first col from training dataset
data = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_data.csv')
first_column = data.columns[0]  
data = data.drop([first_column], axis=1)
data.to_csv('new1.csv', index=False)

# remove first col from testing dataset
data2 = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_data.csv')
first_column = data2.columns[0]  
data2 = data2.drop([first_column], axis=1)
data2.to_csv('new2.csv', index=False)

#read training labels
data_labels = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_label.csv')
#read testing labels
data2_labels = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_label.csv')

train = pd.concat([data_labels, data], axis=1, join='inner')
print(train.shape)
test = pd.concat([data2_labels, data2], axis=1, join='inner')
print(test.shape)
test.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)
frame = pd.concat([train, test], axis=0)
print(frame)

CodePudding user response：

I suspect you have duplicate index values going into the final concat(). (They may be duplicated only between the train and test sets, not necessarily within each set.) That might throw off concat(), which may be assuming unique index values, and it might compensate by setting some values to NaN. The reset_index() calls don't prevent this: each frame separately gets index values starting from 0, so the two frames still share labels.

To fix this: Set ignore_index=True in pd.concat(). From the docs:

ignore_index: bool, default False If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
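
A minimal sketch of what ignore_index changes, using two small made-up frames. The column name `gene_a` and the values are hypothetical and are not taken from the linked dataset:

```python
import pandas as pd

# Two toy frames that each carry their own 0-based index,
# as train and test do after reset_index(drop=True).
train = pd.DataFrame({"gene_a": [1.0, 2.0]})
test = pd.DataFrame({"gene_a": [3.0, 4.0, 5.0]})

# Default behaviour: the original row labels are kept, so 0 and 1 repeat.
kept = pd.concat([train, test], axis=0)

# ignore_index=True: the rows are relabeled 0, ..., n - 1.
relabeled = pd.concat([train, test], axis=0, ignore_index=True)

print(list(kept.index))
print(list(relabeled.index))
```

In this toy case neither result contains NaN, so repeated row labels alone may not be the whole story; if NaNs remain after setting ignore_index=True, the column names of the frames being stacked are worth comparing too.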

If that doesn't work, check: Do test & train have NaNs in the index before concatenation and after reset_index()? They shouldn't, but check. If they do, those will carry over into the concat.
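
One way to run that check, as a rough sketch. The helper name `index_report` and the toy frame are made up for illustration; in practice you would pass in `train` and `test`:

```python
import pandas as pd

def index_report(df: pd.DataFrame) -> dict:
    """Count NaN and repeated row labels in a frame's index."""
    return {
        "nan_labels": int(df.index.isna().sum()),
        "duplicate_labels": int(df.index.duplicated().sum()),
    }

# Hypothetical stand-in for `train` or `test` after reset_index(drop=True).
frame = pd.DataFrame({"label": ["B", "T", "NK"]}).reset_index(drop=True)
report = index_report(frame)
print(report)
```

Both counts should be zero for each frame on its own after reset_index(drop=True).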

CodePudding user response：

I did the concats in a different order and it worked. The NaNs came from not merging the labels correctly: instead of a single label column, I had created two half-empty ones, one holding the train labels and one holding the test labels.
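
To illustrate how two half-empty label columns can arise, here is a small sketch. It assumes the two label frames use different column names, which is a guess since the files are not shown; the names `train_label`, `test_label`, and `label` and the values are made up:

```python
import pandas as pd

# Hypothetical label frames whose single column is named differently.
train_labels = pd.DataFrame({"train_label": ["B", "T"]})
test_labels = pd.DataFrame({"test_label": ["NK"]})

# Stacking rows aligns on column names, so each frame fills only its own column
# and the other column is NaN for those rows.
split = pd.concat([train_labels, test_labels], ignore_index=True)

# Giving both frames the same column name first yields one complete column.
train_labels.columns = ["label"]
test_labels.columns = ["label"]
merged = pd.concat([train_labels, test_labels], ignore_index=True)

print(split)
print(merged)
```

Comparing `data_labels.columns` with `data2_labels.columns` before concatenating is a quick way to see whether this applies to the real files.
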

import pandas as pd
from sklearn.utils import shuffle
 
# remove first col from training dataset
data = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_data.csv')
first_column = data.columns[0]  
data = data.drop([first_column], axis=1)
print(data.shape)

# remove first col from testing dataset
data2 = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_data.csv')
first_column = data2.columns[0]  
data2 = data2.drop([first_column], axis=1)
print(data2.shape)

#read training labels
data_labels = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_label.csv')
print(data_labels.shape)
#read testing labels
data2_labels = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_label.csv')
print(data2_labels.shape)
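#concat data without labels
frames = [data, data2]
# ignore_index=True relabels the rows 0..n-1 so they line up with the labels below
d = pd.concat(frames, ignore_index=True)

#concat labels (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
l = pd.concat([data_labels, data2_labels], ignore_index=True)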

#create the original dataset 
print(d.shape, l.shape)
dataset = pd.concat([l, d], axis=1)
dataset = shuffle(dataset)
dataset
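
Since the goal was a CSV file, the last step could look roughly like this. It is only a sketch: the toy `dataset`, the column names, and the output file name `combined_dataset.csv` are placeholders, and it uses pandas' own sample() in place of sklearn's shuffle:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the combined `dataset` built above.
dataset = pd.DataFrame(
    {"label": ["B", "T", "NK", "B"], "gene_a": [0.1, 0.2, 0.3, 0.4]}
)

# sample(frac=1) returns every row in random order;
# random_state makes the order repeatable.
shuffled = dataset.sample(frac=1, random_state=0).reset_index(drop=True)

# Placeholder output location; index=False keeps the row labels out of the file.
out_path = os.path.join(tempfile.gettempdir(), "combined_dataset.csv")
shuffled.to_csv(out_path, index=False)
```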