Sorting Numpy data frames and removing duplicate rows Python

Time:11-02

I am trying to combine three pandas data frames (data, data2, data3), sort them in chronological order by date, and remove all duplicate rows. No date value may appear more than once; however, the date '2021-10-21 00:03:00' is present in both data2 and data3, so there should be only a single row for it in the output. What can I add to the code to achieve the Expected Output?

Code:

import pandas as pd 

data = {'Unix Timesamp': [1444311600000, 1444311660000, 1444311720000], 
        'date': ['2015-10-08 13:40:00', '2015-10-08 13:41:00', '2015-10-08 13:42:00'],
        'Symbol': ['BTCUSD', 'BTCUSD', 'BTCUSD'],
        'Open': [10384.54, 10389.08,10387.15],
        'High': [10389.08, 10389.08, 10388.36],
        'Low': [10340.2, 10332.8, 10385]}

data2 = {'Unix Timesamp': [1634774460000, 1634774520000, 1634774580000], 
        'date': ['2021-10-21 00:01:00', '2021-10-21 00:02:00', '2021-10-21 00:03:00'],
        'Symbol': ['BTCUSD', 'BTCUSD', 'BTCUSD'],
        'High': [4939.97, 4961.75, 4964.33],
        'Open': [4939.95, 4959.18,4964.33]}

data3 = {'Unix Timesamp': [1634774640000, 1634774640000], 
        'date': ['2021-10-21 00:03:00', '2021-10-21 00:04:00'],
        'High': [4964.33, 4867.33],
        'Symbol': ['BTCUSD', 'BTCUSD'],
        'Open': [4964.33, 4800.2]}

dataset = pd.DataFrame.from_dict(data)
dataset2 = pd.DataFrame.from_dict(data2)
dataset3 = pd.DataFrame.from_dict(data3)

dataset.drop('Low',1).append([dataset2, dataset3], ignore_index=True).drop_duplicates()

Output:

enter image description here

Expected Output (The 6th row in Output should not exist):

enter image description here


Answer:

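The code below should satisfy your requirement. Pass subset=['date'] inside the parentheses of the .drop_duplicates() method so that rows are compared on the date column only. Without it, whole rows are compared, and the two '2021-10-21 00:03:00' rows are not treated as duplicates because their 'Unix Timesamp' values differ. Example: .drop_duplicates(subset=['date'])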

dataset.drop('Low',1).append([dataset2, dataset3],ignore_index=True).drop_duplicates(subset=['date'])

For more info, refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
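
On pandas 2.0 and later, DataFrame.append and the positional axis argument to drop are no longer available, so the one-liner above may need adjusting. A minimal sketch of the same idea using pd.concat is shown here; the names frames and combined, the conversion of 'date' to datetime, and the explicit sort are my additions and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (the 'Unix Timesamp' spelling is kept as-is).
data = {'Unix Timesamp': [1444311600000, 1444311660000, 1444311720000],
        'date': ['2015-10-08 13:40:00', '2015-10-08 13:41:00', '2015-10-08 13:42:00'],
        'Symbol': ['BTCUSD', 'BTCUSD', 'BTCUSD'],
        'Open': [10384.54, 10389.08, 10387.15],
        'High': [10389.08, 10389.08, 10388.36],
        'Low': [10340.2, 10332.8, 10385]}

data2 = {'Unix Timesamp': [1634774460000, 1634774520000, 1634774580000],
         'date': ['2021-10-21 00:01:00', '2021-10-21 00:02:00', '2021-10-21 00:03:00'],
         'Symbol': ['BTCUSD', 'BTCUSD', 'BTCUSD'],
         'High': [4939.97, 4961.75, 4964.33],
         'Open': [4939.95, 4959.18, 4964.33]}

data3 = {'Unix Timesamp': [1634774640000, 1634774640000],
         'date': ['2021-10-21 00:03:00', '2021-10-21 00:04:00'],
         'High': [4964.33, 4867.33],
         'Symbol': ['BTCUSD', 'BTCUSD'],
         'Open': [4964.33, 4800.2]}

# Build the three frames; 'Low' only exists in the first one, so drop it there.
frames = [
    pd.DataFrame(data).drop(columns='Low'),
    pd.DataFrame(data2),
    pd.DataFrame(data3),
]

# Stack the frames, keep the first row seen for each date, then sort by date.
combined = pd.concat(frames, ignore_index=True)
combined['date'] = pd.to_datetime(combined['date'])  # assumption: sort as real datetimes
combined = (
    combined.drop_duplicates(subset=['date'], keep='first')
            .sort_values('date')
            .reset_index(drop=True)
)

print(combined)
```

With keep='first', the '2021-10-21 00:03:00' row that survives is the one from data2, because data2 comes before data3 in the list. Use keep='last' if the row from data3 should win instead.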
