Import a text file with Pandas as a Dataframe where there are several records per date-CodePudding

The structure of the file is like this:

Time,Weight,Change,BMI,Body Fat,Skeletal Muscle Mass,Bone Mass,Body Water,
" Oct 10, 2020",
5:15 PM,82.7 kg,1.1 kg,22.4,17.4 %,35.6 kg,5.8kg,60.3 %,
" Nov 2, 2020",
11:40 PM,84.0 kg,0.1 kg,22.8,17.5 %,35.9 kg,6.0 kg,60.3 %,
10:20 AM,83.9 kg,0.5 kg,22.8,18.1 %,35.9 kg,5.9 kg,59.8 %,
13:40 AM,84.4 kg,0.8 kg,22.9,17.3 %,36.0 kg,6.0 kg,60.4 %,

The problem is that importing the file usually doesn't recognise the date as the same line, and some dates have several records.

Is there a way to import the file somehow so that each row has the corresponding date? I suppose there should be two delimiters: start of new line and comma and it has to be combined somehow.

The usual way I tried only gives the output where each date is in a separate line.

data = pd.read_csv('test.txt', sep=',')
data.columns = ["Date","Time", "Weight", "Change", "BMI", "Body Fat", "Skeletal Muscle Mass", "Bone Mass", "Body Water"]

Output:

Date	Time	Weight	Change	BMI	Body Fat	Skeletal Muscle Mass	Bone Mass	Body Water
Sep 21, 2020	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5:46 PM	76.5 kg	1.1 kg	22.4	17.4 %	35.6 kg	5.8 kg	60.3 %	NaN
Oct 12, 2020	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8:01 AM	75.3 kg	1.0 kg	22.2	16.8 %	33.2 kg	4.9 kg	50.8 %	NaN
10:00 AM	75.4 kg	1.1 kg	22.2	16.9 %	33.2 kg	4.9 kg	50.8 %	NaN

CodePudding user response：

Your file looks like not structured "well".

e.g., your date is in the one line and their corresponding time records with data are in a different line. Pandas is expecting each line is a record according to the column definition you provided.

So you need to correct the input file and append date infront of the time records and remove the date line. You may write a small script to correct that if the file is big. Afterwards, you can use pandas.

CodePudding user response：

Code:

with open('test.txt') as file:
        contents = file.read()                               #First, Read the file
        
li = contents.split('\n')                         #Convert to the list based on line
col = list(filter(None,li[0].split(',')))
col.insert(0, 'Date') if 'Date' not in col else col          #Insert Date 
col

Col list look like:

['Date',
 'Time',
 'Weight',
 'Change',
 'BMI',
 'Body Fat',
 'Skeletal Muscle Mass',
 'Bone Mass',
 'Body Water']

Now getting the data

D = None                                                     #Temporary variable
data=[]                                                      #New List to story data
for i, j in enumerate(li):
    if len(j) < 20:
        D = j.replace('"', '')                 #" Oct 10, 2020"
    else:
        if D is not None:
            data.append([D[:-1]] li[i][:-1].split(','))
data

data something look like this:

[[' Oct 10, 2020',
  '5:15 PM',
  '82.7 kg',
  '1.1 kg',
  '22.4',
  '17.4 %',
  '35.6 kg',
  '5.8kg',
  '60.3 %'],
 [' Nov 2, 2020',
  '11:40 PM',
  '84.0 kg',
  '0.1 kg',
  '22.8',
  '17.5 %',
  '35.9 kg',
  '6.0 kg',
  '60.3 %'],
 [' Nov 2, 2020',
  '10:20 AM',
  '83.9 kg',
  '0.5 kg',
  '22.8',
  '18.1 %',
  '35.9 kg',
  '5.9 kg',
  '59.8 %'],
 [' Nov 2, 2020',
  '13:40 AM',
  '84.4 kg',
  '0.8 kg',
  '22.9',
  '17.3 %',
  '36.0 kg',
  '6.0 kg',
  '60.4 %']]

Now merging the column and data lists and converting that to the DataFrame

data.insert(0, col) if col not in data else data

pd.DataFrame(data[1:],columns=data[0])

Output:

        Date    Time    Weight  Change  BMI Body Fat    Skeletal Muscle Mass    Bone Mass   Body Water
0   Oct 10, 2020 5:15 PM    82.7 kg 1.1 kg  22.4    17.4 %  35.6 kg 5.8kg   60.3 %
1   Nov 2, 2020 11:40 PM    84.0 kg 0.1 kg  22.8    17.5 %  35.9 kg 6.0 kg  60.3 %
2   Nov 2, 2020 10:20 AM    83.9 kg 0.5 kg  22.8    18.1 %  35.9 kg 5.9 kg  59.8 %
3   Nov 2, 2020 13:40 AM    84.4 kg 0.8 kg  22.9    17.3 %  36.0 kg 6.0 kg  60.4 %