I want to create a Dataframe that contains columns that have 'X' above them in the header, but I'm unable to find a way to get column index numbers.
I would select the right dataframe once I have the column index numbers.
df_right_columns = df_right_column[df_right_column.columns[column_numbers]]
Sample df:
df = pd.DataFrame({'X': ['column1', 'cell 1', 'cell 2', 'cell 3', 'cell 4'],
'X': ['column2', 'cell 2', 'cell 3', 'cell 4', 'cell 6'],
'': ['column3', 'cell 3', 'cell 4', 'cell 5', 'cell 7'],
'X': ['column4', 'cell 4', 'cell 5', 'cell 6', 'cell 8']})
X X X
column1 column2 column3 column4
cell 1 cell 2 cell 3 cell 4
cell 2 cell 3 cell 4 cell 5
cell 3 cell 4 cell 5 cell 6
cell 4 cell 5 cell 6 cell 7
cell 5 cell 6 cell 7 cell 8
I have tried running this dataframe through for loop to try get the index numbers but haven't had any luck so far. I did this by 1) locating the X header row, 2) running this row through for loop to check columns that contain 'X' in the df.iloc[0] row.
df = df.iloc[0]
for cell in df:
if 'X' in cell:
print(cell.index) #this will return an object - <built-in method index of str object at 0x7f23be18a9b0>
print(cell) #this will return the cell value not the index, X in this case
I'm very close and any help would be grealty appreciated, many thanks
Solution
df = pd.DataFrame({'X': ['column1', 'cell 1', 'cell 2', 'cell 3', 'cell 4'],
'X': ['column2', 'cell 2', 'cell 3', 'cell 4', 'cell 6'],
'': ['column3', 'cell 3', 'cell 4', 'cell 5', 'cell 7'],
'X': ['column4', 'cell 4', 'cell 5', 'cell 6', 'cell 8']})
print(df)
X X X
column1 column2 column3 column4
cell 1 cell 2 cell 3 cell 4
cell 2 cell 3 cell 4 cell 5
cell 3 cell 4 cell 5 cell 6
cell 4 cell 5 cell 6 cell 7
cell 5 cell 6 cell 7 cell 8
df_header = df.iloc[0]
column_number = []
i = 0
while i < len(df_header):
for column_index in df_header:
if 'X' in column_index:
column_number.append(i)
i = 1
df = df[df.columns[column_number]]
print(df)
X X X
column1 column2 column4
cell 1 cell 2 cell 4
cell 2 cell 3 cell 5
cell 3 cell 4 cell 6
cell 4 cell 5 cell 7
cell 5 cell 6 cell 8
CodePudding user response:
You can use a MultiIndex:
Xs = ['X', 'X', None, 'X']
df.columns = pd.MultiIndex.from_arrays([Xs, df.columns])
or, from a list of positions:
pos = (0,1,3)
Xs = ['X' if i in pos else '' for i in range(len(df.columns))]
df.columns = pd.MultiIndex.from_arrays([Xs, df.columns])
output:
X X NaN X
column1 column2 column3 column4
0 cell 1 cell 2 cell 3 cell 4
1 cell 2 cell 3 cell 4 cell 5
2 cell 3 cell 4 cell 5 cell 6
3 cell 4 cell 5 cell 6 cell 7
4 cell 5 cell 6 cell 7 cell 8
input:
df = pd.DataFrame({'column1': ['cell 1', 'cell 2', 'cell 3', 'cell 4', 'cell 5'],
'column2': ['cell 2', 'cell 3', 'cell 4', 'cell 5', 'cell 6'],
'column3': ['cell 3', 'cell 4', 'cell 5', 'cell 6', 'cell 7'],
'column4': ['cell 4', 'cell 5', 'cell 6', 'cell 7', 'cell 8']})
CodePudding user response:
try it:
table = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
{'a': 100, 'b': 200, 'c': 300, 'd': 400},
{'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
df = pd.DataFrame(table)
print("total size row: ", df.index.size)
for value in df.values:
print("size col: ", value.size, "value:", value)
output:
total size row: 3
size col: 4 value: [1 2 3 4]
size col: 4 value: [100 200 300 400]
size col: 4 value: [1000 2000 3000 4000]