My column has over 9000 rows of Mb and Kb objects appearing as numbers, that need to be converted all into Kb values (by multiplying Mb values to 1000). This is a snippet of the column values:
array(['5.3M', '47M', '556k', '526k', '76M', '7.6M', '59M', '9.7M', '78M',
'72M', '43M', '7.7M', '6.3M', '334k', '93M', '65M', '23k',
'7.3M', '6.5M', '1.5M', '7.5M'])
I used str.extract below, and it works in converting the objects into int, however I lose the M character and I cannot track which numbers to multiply to 1000
df.Size = df['Size'].str.extract('(\d )').astype(int)
I've also tried creating a list of the Mb values to then use a custom function with .apply() to multiply, but I believe there is something wrong with my code:
mblist = df['Size'].str.extract('(\d [M$])').astype(int)
I need the column to all become int and all in Kb metric, so I can conduct analysis on it. Is there a better pathway to get there?
CodePudding user response:
We can use np.where()
here along with str.extract
:
df["Size"] = np.where(df["Size"].str.contains(r'M$', regex=True),
1000.0*df["Size"].str.extract('(\d (?:\.\d )?)').astype(float),
df["Size"].str.extract('(\d (?:\.\d )?)').astype(float))
The above logic extracts the float (or integer) number from the size field, and multiplies by 1000, in the case that the unit be M
for megabytes. Data will then be reported everywhere as KB.