How to remove duplicates in a row that is separated by a space | Pandas

Time:10-23

I just finished cleaning my data and noticed that some columns have cells containing duplicate values separated by a space. What is the best way to keep only the first value and remove the space and everything after it?

There are some columns that have multiple duplicates such as 42.35 42.25 23.12 in the same cell.

The data is large with 1500 columns and around 4000 rows so there is no way I could manually fix the issue.

Thanks!

Here is an example

    PAG
0   54.36
1   50.3
2   46.81
3   47.71
4   42.35 42.35          <------ This is one of the many rows (cell) that have this duplicate. 
5   43.91
6   43.54

Here is what the data looks like, along with the dtypes:

MthCalDt      int64
A           float64
AA          float64
AAL         float64
AAON        float64
             ...   
ZD          float64
ZEN         float64
ZION        float64
ZTS         float64
ZWS         float64
Length: 1583, dtype: object
cleaned_data = pd.read_csv("Wrds_Data\wrds_data_clean.csv")
cleaned_data.head()

    MthCalDt    A         AA    AAL     AAON    AAP     AAPL    AAT     AAWW    AB  ...     YUMC    YY         Z      ZBH     ZBRA    ZD    ZEN     ZION    ZTS     ZWS
0   20170131    48.97   36.45   44.25   33.95   164.24  121.35  42.93   52.75   23.35   ... 27.48   41.08   35.38   118.33  83.67   83.81   23.93   42.19   54.94   22.09
1   20170228    51.30   34.59   46.36   33.65   156.61  136.99  44.00   56.85   23.70   ... 26.59   44.29   33.94   117.08  90.71   81.42   27.23   44.90   53.31   22.17
2   20170331    52.87   34.40   42.30   35.35   148.26  143.66  41.84   55.45   22.85   ... 27.20   46.11   33.67   122.11  91.25   83.91   28.04   42.00   53.37   23.08
3   20170428    55.05   33.73   42.62   36.65   142.14  143.65  42.83   58.00   22.90   ... 34.12   48.97   39.00   119.65  94.27   90.24   28.75   40.03   56.11   24.40
4   20170531    60.34   32.94   48.41   36.18   133.63  152.76  39.05   48.70   22.55   ... 38.41   58.34   43.52   119.21  104.34  84.62   25.98   40.07   62.28   22.80

What I tried so far comes up with the error "could not convert string to float". The error is caused by a duplicate with a space in one of the cells:

cleaned_data = pd.read_csv("Wrds_Data\wrds_data_clean.csv")
cleaned_data.astype(float)
cleaned_data.applymap(lambda x: x.split(" ")[0])


error: 
could not convert string to float: '126.49 126.49'    <--- This is referencing a duplicate cell that is considered a string since there is a space. 
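One way to make this attempt work is to swap the two steps and capture the result: split while every cell is still a string, then convert. A minimal sketch on made-up values in the same shape as the `PAG` column above:

```python
import pandas as pd

# Small stand-in for the real CSV (hypothetical values)
df = pd.DataFrame({"PAG": ["54.36", "50.3", "42.35 42.35"]})

# Split before converting: every cell is still a string here, so
# .split() is safe, and astype(float) only sees the cleaned values.
df = df.applymap(lambda x: str(x).split(" ")[0]).astype(float)
```

The original code failed because `astype(float)` ran first (on the still-duplicated strings) and because the `applymap` result was never assigned back.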

Here is newer code with a regex that tries to fix the cells with spaces, but it still fails on some duplicate cells:

cleaned_data = pd.read_csv("Wrds_Data\wrds_data_clean.csv")
columns = cleaned_data.columns.copy().drop('MthCalDt')
cleaned_data[columns] = cleaned_data[columns].replace(r"^(\d+\.\d+) .*", r"\1", regex=True).astype(float)
cleaned_data



Error: 

Cell In [114], line 3
      1 cleaned_data = pd.read_csv("Wrds_Data\wrds_data_clean.csv")
      2 columns = cleaned_data.columns.copy().drop('MthCalDt')
----> 3 cleaned_data[columns] = cleaned_data[columns].replace(r"^(\d+\.\d+) .*", r"\1", regex=True).astype(float)
      4 cleaned_data


169     # Explicit copy, or required since NumPy can't view from / to object.
--> 170     return arr.astype(dtype, copy=True)
    172 return arr.astype(dtype, copy=copy)

ValueError: could not convert string to float: '60 55.66'

CodePudding user response:

  • applymap: apply in every cell
  • x.split(" ")[0]: choose the first value split by a space
df.applymap(lambda x: float(str(x).split(" ")[0]))
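Applied to a frame shaped like the asker's (the values below are made up), with the `MthCalDt` column left untouched so it is not converted to float:

```python
import pandas as pd

# Toy frame mimicking the asker's data (hypothetical values)
df = pd.DataFrame({
    "MthCalDt": [20170131, 20170228],
    "A": ["48.97", "51.30 51.30"],
})

cols = df.columns.drop("MthCalDt")  # leave the date column alone
# Keep the first space-separated token of each cell, then convert
df[cols] = df[cols].applymap(lambda x: float(str(x).split(" ")[0]))
```

The `str(x)` wrapper matters because already-clean columns may have been parsed as floats, and a float has no `.split` method.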

CodePudding user response:

You could replace with a regex that looks for digits followed by a space.

df.replace(r"^(\d+\.\d+) .*", r"\1", regex=True, inplace=True)

You could also take the opportunity to change the values to floats while you are at it. In your example, you have many columns that should be floats and one that looks like it should stay like it is. So, just to keep life complicated, I'll drop the columns that should not be visited in the conversion.

columns = df.columns.copy().drop('MthCalDt')
df[columns] = df[columns].replace(r"^(\d+\.\d+) .*", r"\1", regex=True).astype(float)
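One caveat: `\d+\.\d+` requires a decimal point, so a cell like `'60 55.66'` (the value in the asker's follow-up error) never matches and survives the replace. A looser pattern that keeps everything up to the first whitespace should cover that case as well; a sketch on made-up data:

```python
import pandas as pd

# Hypothetical data including an integer-looking duplicate cell
df = pd.DataFrame({
    "MthCalDt": [20170131, 20170228],
    "PAG": ["42.35 42.35", "60 55.66"],
})

cols = df.columns.drop("MthCalDt")
# \S+ matches any run of non-space characters, so "60" is kept too
df[cols] = df[cols].replace(r"^(\S+)\s.*", r"\1", regex=True).astype(float)
```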