Home > front end >  How to match either 2 characters or 5 characters with regex in python
How to match either 2 characters or 5 characters with regex in python

Time:10-05

I am trying to extract an engine size from a pandas dataframe column. I have tried using df.query(), the replace function and this df['4Cyl'] = df['Engine'].str.contains('4 Cyl|4-Cyl', regex=True, na=True). The last one seems to work for what I am trying to do, it is just super time consuming to write because there can be a lot of other potential values as well:

  • I4
  • 4 Cylinder Engine
  • V4
  • v4

and then all the bigger engine sizes -- 6 to 12 cylinder -- in the same format. The engine size is only at the beginning of the string so that makes it nicer.

This way seems to be very time consuming but if this is the best way to do that, great. However I do feel like there is a way to do this all at once, or at least in a les time consuming way.

Here is a snippet of the data I am working with:

{'V6 Cyl Gas Fuel 3.5L': 30856, '4 Cyl Gas Fuel 2.0L Turbocharged': 23965, '4 Cyl Gas Fuel 2.4L': 22897, '4 Cyl Gas Fuel 2.5L': 22593, 'V6 Cyl Gas Fuel 3.6L': 19690, 'V6, 3.5 Liter': 16173, '4 Cyl Gas Fuel 2.0L': 16101, '4 Cyl Gas Fuel 1.8L': 13852, '4-Cyl, 2.5 Liter': 12204, '4-Cyl, Turbo, 2.0 Liter': 10037, '8 Cyl Gas Fuel 5.3L': 9686, '8 Cyl Gas Fuel 5.7L': 9592}

where the format is "Engine : count of how many times it appears"

I dont really care about the count at this point, I am just trying to extract the engine size.

Thanks for the help.

CodePudding user response:

Try:

df["Engine Type"] = df["Engine"].str.extract(r"^\D*?(\d )")
print(df)

Prints:

                              Engine  Value Engine Type
0                4 Cyl Gas Fuel 1.8L  13852           4
1                4 Cyl Gas Fuel 2.0L  16101           4
2   4 Cyl Gas Fuel 2.0L Turbocharged  23965           4
3                4 Cyl Gas Fuel 2.4L  22897           4
4                4 Cyl Gas Fuel 2.5L  22593           4
5                   4-Cyl, 2.5 Liter  12204           4
6            4-Cyl, Turbo, 2.0 Liter  10037           4
7                8 Cyl Gas Fuel 5.3L   9686           8
8                8 Cyl Gas Fuel 5.7L   9592           8
9               V6 Cyl Gas Fuel 3.5L  30856           6
10              V6 Cyl Gas Fuel 3.6L  19690           6
11                     V6, 3.5 Liter  16173           6

If you want to have different columns for each type of engine you can use afterwards:

df = pd.concat(
    [df, pd.crosstab(df.index, df["Engine Type"]).add_suffix("Cyl")], axis=1
)
print(df)

Prints:

                              Engine  Value Engine Type  4Cyl  6Cyl  8Cyl
0                4 Cyl Gas Fuel 1.8L  13852           4     1     0     0
1                4 Cyl Gas Fuel 2.0L  16101           4     1     0     0
2   4 Cyl Gas Fuel 2.0L Turbocharged  23965           4     1     0     0
3                4 Cyl Gas Fuel 2.4L  22897           4     1     0     0
4                4 Cyl Gas Fuel 2.5L  22593           4     1     0     0
5                   4-Cyl, 2.5 Liter  12204           4     1     0     0
6            4-Cyl, Turbo, 2.0 Liter  10037           4     1     0     0
7                8 Cyl Gas Fuel 5.3L   9686           8     0     0     1
8                8 Cyl Gas Fuel 5.7L   9592           8     0     0     1
9               V6 Cyl Gas Fuel 3.5L  30856           6     0     1     0
10              V6 Cyl Gas Fuel 3.6L  19690           6     0     1     0
11                     V6, 3.5 Liter  16173           6     0     1     0

Input data used:

                              Engine  Value
0                4 Cyl Gas Fuel 1.8L  13852
1                4 Cyl Gas Fuel 2.0L  16101
2   4 Cyl Gas Fuel 2.0L Turbocharged  23965
3                4 Cyl Gas Fuel 2.4L  22897
4                4 Cyl Gas Fuel 2.5L  22593
5                   4-Cyl, 2.5 Liter  12204
6            4-Cyl, Turbo, 2.0 Liter  10037
7                8 Cyl Gas Fuel 5.3L   9686
8                8 Cyl Gas Fuel 5.7L   9592
9               V6 Cyl Gas Fuel 3.5L  30856
10              V6 Cyl Gas Fuel 3.6L  19690
11                     V6, 3.5 Liter  16173

  • Related