Load csv files with multiple columns into several dataframes


I am trying to load some large csv files which appear to mix multiple column layouts, and I am struggling with it.

I don't know who designed these csv files, but each one appears to contain event data as well as log data. At the start of each csv file there are some initial status lines as well.

Everything is in separate rows. The Event data uses 2 columns (Date and an event comment). The Log data has multiple columns (Date plus 20 data columns).

I give an example of the type of data setup below:

Initial; [Status] The Zoo is Closed;
Initial; [Status] The Sun is Down;
Initial; [Status] Monkeys are sleeping;

Time;No._Of_Monkeys;Monkeys_inside;Monkeys_Outside;Number_of_Bananas
06:00; 5;5;0;10
07:00; 5;5;0;10
07:10;[Event] Sun is up
08:00; 5;5;0;10
08:30; [Event] Monkey Doors open and Zoo Opens
09:00; 5;5;0;10
08:30; [Event] Monkey Goes out
09:00; 5;4;1;10
08:30; [Event] Monkey Eats Banana
09:00; 5;4;1;9
08:30; [Event] Monkey Goes out
09:00; 5;5;2;9

Now what I want to do is to put the Log data into one data frame and the Initial and Event data into another.

Now I can read the csv files with csv.reader and go row by row, but this is proving very slow, especially when going through multiple files, each containing about 40k rows.

Below is the code I am using:

import os
import csv
import pandas as pd

EventLog = pd.DataFrame()
DataLog = pd.DataFrame()

csv_files = [f for f in os.listdir('.') if f.endswith('.log')]

for file in csv_files:
  # Open the CSV file in read mode
  with open(file, 'r') as csv_file:
    # Use the csv module to parse the file
    csv_reader = csv.reader(csv_file, delimiter=';')

    # Loop through the rows of the file
    for row in csv_reader:
      # Drop empty fields left by trailing separators
      row = [field for field in row if field != '']
      # If the row has event data
      if len(row) == 2:
        # Add the row to the EventLog
        EventLog = EventLog.append(pd.Series(row), ignore_index=True)
      # If the row has log data
      elif len(row) > 2:
        # The first such row becomes the column headers
        if DataLog.empty:
          DataLog = pd.DataFrame(columns=row)
        else:
          # Add the row to the DataLog DataFrame
          DataLog = DataLog.append(pd.Series(row), ignore_index=True)

Is there a better way to do this, preferably a faster one?

If I use pandas read_csv it seems to only load the Initial data, i.e. the first 3 lines of my data above. I can use skiprows to skip down to where the data is, and then it will load the rest, but I can't seem to figure out how to separate out the event and log data from there.

So I'm looking for ideas before I lose what little hair I have left.

CodePudding user response:

If I understood your data format correctly, I would do something like this:

import pandas as pd

# read the data as a single column, without headers or an index
# (sep=',' works here because the data itself contains no commas)
df = pd.read_csv("your_file_name.log", header=None, sep=',')
# split the values in this column on ";" (each row becomes a list of values)
tmp_df = df[0].str.split(";")

# drop the empty values left by the trailing ";" on the first 3 rows
tmp_df = tmp_df.map(lambda x: [y for y in x if y != ''])
# rows with 2 values go into one dataframe
EventLog = pd.DataFrame(tmp_df[tmp_df.str.len() == 2].to_list())
# the rest go into another dataframe (the first such row holds the column names)
data_log_tmp = tmp_df[tmp_df.str.len() != 2].to_list()
DataLog = pd.DataFrame(data_log_tmp[1:], columns=data_log_tmp[0])
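Since everything is parsed as text here, the DataLog values end up as strings. For the multiple files mentioned in the question, you can also apply the same steps per file and concatenate once at the end, which is much faster than appending row by row. A rough sketch along those lines, assuming (as above) that the raw lines contain no commas, that the files match the `.log` pattern from the question, and that every DataLog column except Time is numeric:

import os
import pandas as pd

event_parts, data_parts = [], []
for file in (f for f in os.listdir('.') if f.endswith('.log')):
    # one column per line, split on ";" and drop empty trailing fields
    tmp_df = (pd.read_csv(file, header=None, sep=',')[0]
                .str.split(';')
                .map(lambda x: [y for y in x if y != '']))
    # 2-value rows are Initial/Event data
    event_parts.append(pd.DataFrame(tmp_df[tmp_df.str.len() == 2].to_list()))
    # the rest are log data; the first such row is the header
    rows = tmp_df[tmp_df.str.len() != 2].to_list()
    data_parts.append(pd.DataFrame(rows[1:], columns=rows[0]))

EventLog = pd.concat(event_parts, ignore_index=True)
DataLog = pd.concat(data_parts, ignore_index=True)

# everything was parsed as strings; convert the numeric columns
num_cols = DataLog.columns.drop('Time')  # assumes the example's Time column
DataLog[num_cols] = DataLog[num_cols].apply(pd.to_numeric)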

CodePudding user response:

Here is an example of loading a CSV file, assuming that the Monkeys_inside field is always NaN in event data and always set in log data, because I used it as the condition to retrieve the event data:

import pandas as pd

df = pd.read_csv('huge_data.csv', skiprows=3, sep=';')
# log rows have every column filled in; event rows leave most columns NaN
log_df = df.dropna().reset_index(drop=True)
event_df = df[df['Monkeys_inside'].isnull()].reset_index(drop=True)

This also assumes that all your CSV files contain those 3 Status lines.
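If the number of initial status lines varies between files, you could locate the header row first instead of hardcoding skiprows=3. A small sketch, assuming the header line always starts with Time:

import pandas as pd

# find the row index of the header line instead of assuming it is line 3
with open('huge_data.csv') as f:
    header_row = next(i for i, line in enumerate(f) if line.startswith('Time'))

df = pd.read_csv('huge_data.csv', skiprows=header_row, sep=';')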

Keep in mind that the dataframes will hold duplicated rows if there are some in your csv files; to remove them, you just need to call the drop_duplicates function and you're good:

event_df = event_df.drop_duplicates()
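Note that with this approach each event row keeps its text in the second column (here No._Of_Monkeys) with NaN in the remaining columns. If you want a tidier event dataframe, one possible cleanup (the new column names are assumptions) is:

# keep only the two meaningful columns of the event rows and rename them
event_df = event_df.iloc[:, :2].copy()
event_df.columns = ['Time', 'Event']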