Home > Back-end >  Pandas/Python Drop columns base on syntax ':' existence in cell
Pandas/Python Drop columns base on syntax ':' existence in cell

Time:06-22

Consider a dataframe from txt file.

Data txt file: txt

I removed the unwanted lines that contains "Iteration"

import pandas as pd
data = pd.read_csv('results.txt', sep=" ", header=None)
data = data[~data[1].astype(str).str.startswith('Iteration')]

Now its like this: pd

I wanted to remove all the columns that contains ":" to result something like: {2, 7}{2, 7}{2, 7}{2, 7}{1, 2, 7}{1, 2, 6, 7} {1, 2, 6, 7}{1, 2, 3, 6, 7}{1, 2, 3, 6, 7} {1, 3, 4, 6, 7}{1, 3, 4, 5, 6} {3, 4, 5, 6, 7}{1, 3, 4, 5, 7} {1, 4, 5, 6, 7}, .......etc.

and to merge columns such as: "{2," " 7}," into one column "{2,7}" Then I can replace all the column names with timestamps (6:30, 7:00 etc) to make it a proper table like :

6:30 7:00 7:30 8:00 8:30 9:00

{2,7} {2,7} {2,7} {2,7} {1,2,7} {1,2,6,7}

{5,6} {5,6} {5,6} {5,6} {1,5,6} {1,4,5,6}

....

My first step now is to remove columns with ":" , I tried these:

data.loc[:, ~(data[0:122].astype(str).apply(lambda x: x.str.contains(':'))).any()]
data.loc[:, ~(data.contains(':')).any()]

obviously they won't work since contains() cannot be use on data frames, I am not very familier with pandas. I searched for many other answers and wasn't able to get it right. I wonder what is the correct way to remove columns with cells that contains ":"

Or if my approch full of unnecessary steps, its there a better approach to turn the txt file into a dataframe of

6:30 7:00 7:30 8:00 8:30 9:00

{2,7} {2,7} {2,7} {2,7} {1,2,7} {1,2,6,7}

{5,6} {5,6} {5,6} {5,6} {1,5,6,} {1,4,5,6}

like sort it when reading the txt file at first place and loop through lines and make dictionaries instead? '6:30': {2, 7} ....

CodePudding user response:

Preprocess your file before create dataframe:

import pandas as pd
import ast

data = []
with open('results.txt') as fp:
    for line in fp:
        if line.startswith('Shifts:'):
            data.append(ast.literal_eval(line.split(':', maxsplit=1)[1].strip()))
df = pd.DataFrame.from_dict(data)

Important update: BUG FIX

ast.literal_eval works well here but I can guarantee the order of the sets after evaluation

>>> ast.literal_eval('{8, 2, 3, 7}')
{2, 3, 7, 8}  # We don't want that!

Maybe you can use re.findall to get the same output:

import pandas as pd
import re

data = []
with open('results.txt') as fp:
    for line in fp:
        if line.startswith('Shifts:'):
            data.append(dict(re.findall(r"'([^'] )': (\{[^\}] \})", d[1:-1])))
df = pd.DataFrame.from_dict(data)

The output still the same but the cell values are not set but str:

>>> df
       6:30    7:00    7:30    8:00       8:30  ...         16:30   17:00   17:30   18:00   18:30
0    {2, 7}  {2, 7}  {2, 7}  {2, 7}  {1, 2, 7}  ...  {8, 3, 4, 5}  {8, 5}  {8, 5}  {8, 5}  {8, 5}
1    {2, 7}  {2, 7}  {2, 7}  {2, 7}  {1, 2, 7}  ...  {8, 3, 4, 5}  {8, 5}  {8, 5}  {8, 5}  {8, 5}
2    {2, 7}  {2, 7}  {2, 7}  {2, 7}  {1, 2, 7}  ...  {8, 3, 4, 5}  {8, 5}  {8, 5}  {8, 5}  {8, 5}
3    {2, 7}  {2, 7}  {2, 7}  {2, 7}  {1, 2, 7}  ...  {8, 3, 4, 5}  {8, 5}  {8, 5}  {8, 5}  {8, 5}
4    {2, 7}  {2, 7}  {2, 7}  {2, 7}  {1, 2, 7}  ...  {8, 3, 4, 5}  {8, 5}  {8, 5}  {8, 5}  {8, 5}
..      ...     ...     ...     ...        ...  ...           ...     ...     ...     ...     ...
142  {2, 7}  {2, 7}  {2, 7}  {2, 7}  {1, 2, 7}  ...  {8, 3, 4, 5}  {8, 5}  {8, 5}  {8, 5}  {8, 5}
143  {2, 7}  {2, 7}  {2, 7}  {2, 7}  {1, 2, 7}  ...  {8, 3, 4, 5}  {8, 5}  {8, 5}  {8, 5}  {8, 5}
144  {2, 7}  {2, 7}  {2, 7}  {2, 7}  {1, 2, 7}  ...  {8, 3, 4, 5}  {8, 5}  {8, 5}  {8, 5}  {8, 5}
145  {2, 7}  {2, 7}  {2, 7}  {2, 7}  {1, 2, 7}  ...  {8, 3, 4, 5}  {8, 5}  {8, 5}  {8, 5}  {8, 5}
146  {2, 7}  {2, 7}  {2, 7}  {2, 7}  {1, 2, 7}  ...  {8, 3, 4, 5}  {8, 5}  {8, 5}  {8, 5}  {8, 5}

[147 rows x 25 columns]

CodePudding user response:

data = []
with open('results.txt') as f:
    for line in f:
        if line.startswith('Shifts: '):
            data.append(eval(line[8:]))
df = pd.DataFrame(data)
print(df.head())

Output:

     6:30    7:00    7:30    8:00       8:30          9:00          9:30  \
0  {2, 7}  {2, 7}  {2, 7}  {2, 7}  {1, 2, 7}  {1, 2, 6, 7}  {1, 2, 6, 7}   
1  {5, 6}  {5, 6}  {5, 6}  {5, 6}  {1, 5, 6}  {1, 4, 5, 6}  {1, 4, 5, 6}   
2  {8, 1}  {8, 1}  {8, 1}  {8, 1}  {8, 1, 7}  {8, 1, 2, 7}  {8, 1, 2, 7}   
3  {1, 6}  {1, 6}  {1, 6}  {1, 6}  {1, 5, 6}  {1, 4, 5, 6}  {1, 4, 5, 6}   
4  {3, 4}  {3, 4}  {3, 4}  {3, 4}  {3, 4, 5}  {2, 3, 4, 5}  {2, 3, 4, 5}   

             10:00            10:30            11:00  ...            14:00  \
0  {1, 2, 3, 6, 7}  {1, 2, 3, 6, 7}  {1, 3, 4, 6, 7}  ...  {1, 3, 4, 6, 7}   
1  {1, 4, 5, 6, 7}  {1, 4, 5, 6, 7}  {1, 4, 5, 6, 7}  ...  {1, 4, 5, 6, 7}   
2  {1, 2, 3, 7, 8}  {1, 2, 3, 7, 8}  {1, 2, 3, 6, 7}  ...  {1, 2, 3, 6, 7}   
3  {1, 3, 4, 5, 6}  {1, 3, 4, 5, 6}  {3, 4, 5, 6, 7}  ...  {1, 3, 4, 5, 7}   
4  {1, 2, 3, 4, 5}  {1, 2, 3, 4, 5}  {1, 2, 4, 5, 7}  ...  {1, 2, 3, 5, 7}   

             14:30               15:00            15:30            16:00  \
0  {1, 3, 4, 5, 6}  {1, 3, 4, 5, 6, 8}  {1, 3, 4, 5, 8}  {1, 3, 4, 5, 8}   
1  {1, 2, 3, 4, 7}  {1, 2, 3, 4, 7, 8}  {1, 2, 3, 7, 8}  {1, 2, 3, 7, 8}   
2  {2, 3, 5, 6, 7}  {2, 3, 4, 5, 6, 7}  {3, 4, 5, 6, 7}  {3, 4, 5, 6, 7}   
3  {3, 4, 5, 7, 8}  {2, 3, 4, 5, 7, 8}  {2, 3, 5, 7, 8}  {2, 3, 5, 7, 8}   
4  {1, 2, 5, 6, 7}  {1, 2, 5, 6, 7, 8}  {1, 5, 6, 7, 8}  {1, 5, 6, 7, 8}   

          16:30   17:00   17:30   18:00   18:30  
0  {8, 3, 4, 5}  {8, 5}  {8, 5}  {8, 5}  {8, 5}  
1  {8, 2, 3, 7}  {8, 3}  {8, 3}  {8, 3}  {8, 3}  
2  {3, 4, 5, 6}  {4, 5}  {4, 5}  {4, 5}  {4, 5}  
3  {8, 2, 3, 7}  {2, 7}  {2, 7}  {2, 7}  {2, 7}  
4  {8, 1, 6, 7}  {8, 7}  {8, 7}  {8, 7}  {8, 7}  

[5 rows x 25 columns]
  • Related