I am trying to extract an engine size from a pandas dataframe column. I have tried using df.query(), the replace function and this df['4Cyl'] = df['Engine'].str.contains('4 Cyl|4-Cyl', regex=True, na=True)
. The last one seems to work for what I am trying to do, it is just super time consuming to write because there can be a lot of other potential values as well:
- I4
- 4 Cylinder Engine
- V4
- v4
and then all the bigger engine sizes -- 6 to 12 cylinder -- in the same format. The engine size is only at the beginning of the string so that makes it nicer.
This way seems to be very time consuming but if this is the best way to do that, great. However I do feel like there is a way to do this all at once, or at least in a les time consuming way.
Here is a snippet of the data I am working with:
{'V6 Cyl Gas Fuel 3.5L': 30856, '4 Cyl Gas Fuel 2.0L Turbocharged': 23965, '4 Cyl Gas Fuel 2.4L': 22897, '4 Cyl Gas Fuel 2.5L': 22593, 'V6 Cyl Gas Fuel 3.6L': 19690, 'V6, 3.5 Liter': 16173, '4 Cyl Gas Fuel 2.0L': 16101, '4 Cyl Gas Fuel 1.8L': 13852, '4-Cyl, 2.5 Liter': 12204, '4-Cyl, Turbo, 2.0 Liter': 10037, '8 Cyl Gas Fuel 5.3L': 9686, '8 Cyl Gas Fuel 5.7L': 9592}
where the format is "Engine : count of how many times it appears"
I dont really care about the count at this point, I am just trying to extract the engine size.
Thanks for the help.
CodePudding user response:
Try:
df["Engine Type"] = df["Engine"].str.extract(r"^\D*?(\d )")
print(df)
Prints:
Engine Value Engine Type
0 4 Cyl Gas Fuel 1.8L 13852 4
1 4 Cyl Gas Fuel 2.0L 16101 4
2 4 Cyl Gas Fuel 2.0L Turbocharged 23965 4
3 4 Cyl Gas Fuel 2.4L 22897 4
4 4 Cyl Gas Fuel 2.5L 22593 4
5 4-Cyl, 2.5 Liter 12204 4
6 4-Cyl, Turbo, 2.0 Liter 10037 4
7 8 Cyl Gas Fuel 5.3L 9686 8
8 8 Cyl Gas Fuel 5.7L 9592 8
9 V6 Cyl Gas Fuel 3.5L 30856 6
10 V6 Cyl Gas Fuel 3.6L 19690 6
11 V6, 3.5 Liter 16173 6
If you want to have different columns for each type of engine you can use afterwards:
df = pd.concat(
[df, pd.crosstab(df.index, df["Engine Type"]).add_suffix("Cyl")], axis=1
)
print(df)
Prints:
Engine Value Engine Type 4Cyl 6Cyl 8Cyl
0 4 Cyl Gas Fuel 1.8L 13852 4 1 0 0
1 4 Cyl Gas Fuel 2.0L 16101 4 1 0 0
2 4 Cyl Gas Fuel 2.0L Turbocharged 23965 4 1 0 0
3 4 Cyl Gas Fuel 2.4L 22897 4 1 0 0
4 4 Cyl Gas Fuel 2.5L 22593 4 1 0 0
5 4-Cyl, 2.5 Liter 12204 4 1 0 0
6 4-Cyl, Turbo, 2.0 Liter 10037 4 1 0 0
7 8 Cyl Gas Fuel 5.3L 9686 8 0 0 1
8 8 Cyl Gas Fuel 5.7L 9592 8 0 0 1
9 V6 Cyl Gas Fuel 3.5L 30856 6 0 1 0
10 V6 Cyl Gas Fuel 3.6L 19690 6 0 1 0
11 V6, 3.5 Liter 16173 6 0 1 0
Input data used:
Engine Value
0 4 Cyl Gas Fuel 1.8L 13852
1 4 Cyl Gas Fuel 2.0L 16101
2 4 Cyl Gas Fuel 2.0L Turbocharged 23965
3 4 Cyl Gas Fuel 2.4L 22897
4 4 Cyl Gas Fuel 2.5L 22593
5 4-Cyl, 2.5 Liter 12204
6 4-Cyl, Turbo, 2.0 Liter 10037
7 8 Cyl Gas Fuel 5.3L 9686
8 8 Cyl Gas Fuel 5.7L 9592
9 V6 Cyl Gas Fuel 3.5L 30856
10 V6 Cyl Gas Fuel 3.6L 19690
11 V6, 3.5 Liter 16173