I just finished cleaning my data and I noticed that there are columns that have duplicate values in the rows that are separated by a space. I was wondering what is the best way to remove the space after the first value and then all values after it.
There are some columns that have multiple duplicates such as 42.35 42.25 23.12 in the same cell.
The data is large with 1500 columns and around 4000 rows so there is no way I could manually fix the issue.
Thanks!
Here is an example
PAG
0 54.36
1 50.3
2 46.81
3 47.71
4 42.35 42.35 <------ This is one of the many rows (cell) that have this duplicate.
5 43.91
6 43.54
Here is what the data looks like and the dtypes Dtype
MthCalDt int64
A float64
AA float64
AAL float64
AAON float64
...
ZD float64
ZEN float64
ZION float64
ZTS float64
ZWS float64
Length: 1583, dtype: object
cleaned_data = pd.read_csv("Wrds_Data\wrds_data_clean.csv")
cleaned_data.head()
MthCalDt A AA AAL AAON AAP AAPL AAT AAWW AB ... YUMC YY Z ZBH ZBRA ZD ZEN ZION ZTS ZWS
0 20170131 48.97 36.45 44.25 33.95 164.24 121.35 42.93 52.75 23.35 ... 27.48 41.08 35.38 118.33 83.67 83.81 23.93 42.19 54.94 22.09
1 20170228 51.30 34.59 46.36 33.65 156.61 136.99 44.00 56.85 23.70 ... 26.59 44.29 33.94 117.08 90.71 81.42 27.23 44.90 53.31 22.17
2 20170331 52.87 34.40 42.30 35.35 148.26 143.66 41.84 55.45 22.85 ... 27.20 46.11 33.67 122.11 91.25 83.91 28.04 42.00 53.37 23.08
3 20170428 55.05 33.73 42.62 36.65 142.14 143.65 42.83 58.00 22.90 ... 34.12 48.97 39.00 119.65 94.27 90.24 28.75 40.03 56.11 24.40
4 20170531 60.34 32.94 48.41 36.18 133.63 152.76 39.05 48.70 22.55 ... 38.41 58.34 43.52 119.21 104.34 84.62 25.98 40.07 62.28 22.80
What I tried So Far Comes up with the error could not convert a string into float. The error is caused by the duplicate with a space in one of the rows (Cell)
cleaned_data = pd.read_csv("Wrds_Data\wrds_data_clean.csv")
cleaned_data.astype(float)
cleaned_data.applymap(lambda x: x.split(" ")[0])
error:
could not convert string to float: '126.49 126.49' <--- This is referencing a duplicate cell that is considered a string since there is a space.
Here is the newer code with regex that tries to replace cells with spaces but there are still cells with duplicates.
cleaned_data = pd.read_csv("Wrds_Data\wrds_data_clean.csv")
columns = cleaned_data.columns.copy().drop('MthCalDt')
cleaned_data[columns] = cleaned_data[columns].replace(r"^(\d \.\d ) .*", r"\1", regex=True).astype(float)
cleaned_data
Error:
Cell In [114], line 3
1 cleaned_data = pd.read_csv("Wrds_Data\wrds_data_clean.csv")
2 columns = cleaned_data.columns.copy().drop('MthCalDt')
----> 3 cleaned_data[columns] = cleaned_data[columns].replace(r"^(\d \.\d ) .*", r"\1", regex=True).astype(float)
4 cleaned_data
169 # Explicit copy, or required since NumPy can't view from / to object.
--> 170 return arr.astype(dtype, copy=True)
172 return arr.astype(dtype, copy=copy)
ValueError: could not convert string to float: '60 55.66'
CodePudding user response:
applymap
: apply in every cellx.split(" ")[0]
: choose the first value split by a space
df.applymap(lambda x: float(str(x).split(" ")[0]))
CodePudding user response:
You could replace with a regex that looks for digits followed by a space.
df.replace(r"^(\d \.\d ) .*", r"\1", regex=True, inplace=True)
You could also take the opportunity to change the values to floats while you are at it. In your example, you have many columns that should be floats and one that looks like it should stay like it is. So, just to keep life complicated, I'll drop the columns that should not be visited in the conversion.
columns = df.columns.copy().drop('MthCalDt')
df[columns] = df[columns].replace(r"^(\d \.\d ) .*", r"\1", regex=True).astype(float)