How to have read_csv treat version numbers

Time:08-11

I have a heap of tab-separated log files like this that I need to analyze.

163.116.197.120 2022-03-01  00:00:00.592    9.8 2493044316  3   en  JPN public  10.0.19042  48  64-bit  Microsoft Windows 10 Pro    2.0.50727.4927SP2-3.0.30729.4926SP2-3.5.30729.4926SP1-4.8.04084C-4.8.04084  1
181.209.195.130 2022-03-01  00:00:07.049    9.10    2540301398  2   en  GTM public  10.0.19043  100 64-bit  Microsoft Windows 10 Home Single Language   2.0.50727.4927SP2-3.0.30729.4926SP2-3.5.30729.4926SP1-4.8.04084C-4.8.04084  1
106.117.110.195 2022-03-01  00:00:11.856    9.1 3489778528  3   zh-Hans CHN public  6.1.7601    1   64-bit  Microsoft Windows 7 ×0H     2.0.50727.5420SP2-3.0.30729.5420SP2-3.5.30729.5420SP1-4.7.03062C-4.7.03062  1

To get just the columns I need, I use

df = pandas.read_csv(in_file, sep="\t", usecols=[0,1,3,6,7], dtype={"IP_Addr": str, "Date": str, "Version": str, "Lang": str, "Country": str, })

But when I print this dataframe, I get this result:

163.116.197.120  2022-03-01  9.8       en  JPN
0  181.209.195.130  2022-03-01  9.1       en  GTM
1  106.117.110.195  2022-03-01  9.1  zh-Hans  CHN

So despite setting the datatype to "str", "9.10" becomes "9.1". Same happens with dtype=object.

What am I missing? Thanks.

CodePudding user response:

I forgot to concatenate the header to the beginning of the files, and that prevented the dtype dictionary from working. Thanks, @Stef!

CodePudding user response:

read_csv() treats the first row as a header containing the column names, so you have to pass header=None to change that:

df = pandas.read_csv(..., header=None)

The second problem is that your file doesn't have a header line

IP_Addr, Date, Version, Lang, Country

so read_csv() doesn't know which column is Version, and you may need to use column numbers instead of column names in dtype. Or you have to add a header line with the names.
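As a rough sketch of both points together, the snippet below reads a headerless tab-separated sample, keeps every selected column as a string, and only then assigns the names. The in-memory `sample` string is an assumption standing in for one of the log files (shortened to the first eight fields), and `dtype=str` for all columns is one simple choice rather than the only one.

```python
import io

import pandas as pd

# Hypothetical stand-in for one log file: first eight tab-separated fields only.
sample = (
    "163.116.197.120\t2022-03-01\t00:00:00.592\t9.8\t2493044316\t3\ten\tJPN\n"
    "181.209.195.130\t2022-03-01\t00:00:07.049\t9.10\t2540301398\t2\ten\tGTM\n"
    "106.117.110.195\t2022-03-01\t00:00:11.856\t9.1\t3489778528\t3\tzh-Hans\tCHN\n"
)

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,              # the file has no header line
    usecols=[0, 1, 3, 6, 7],  # select columns by position
    dtype=str,                # keep every selected column as text
)

# Name the columns after loading, in the same order as usecols.
df.columns = ["IP_Addr", "Date", "Version", "Lang", "Country"]

print(df["Version"].tolist())
```

With the real files, `io.StringIO(sample)` would be replaced by the file path. Because no column is parsed as a number, "9.10" should stay distinct from "9.1".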
