I have a heap of tab-separated log files like this that I need to analyze.
163.116.197.120 2022-03-01 00:00:00.592 9.8 2493044316 3 en JPN public 10.0.19042 48 64-bit Microsoft Windows 10 Pro 2.0.50727.4927SP2-3.0.30729.4926SP2-3.5.30729.4926SP1-4.8.04084C-4.8.04084 1
181.209.195.130 2022-03-01 00:00:07.049 9.10 2540301398 2 en GTM public 10.0.19043 100 64-bit Microsoft Windows 10 Home Single Language 2.0.50727.4927SP2-3.0.30729.4926SP2-3.5.30729.4926SP1-4.8.04084C-4.8.04084 1
106.117.110.195 2022-03-01 00:00:11.856 9.1 3489778528 3 zh-Hans CHN public 6.1.7601 1 64-bit Microsoft Windows 7 ×0H 2.0.50727.5420SP2-3.0.30729.5420SP2-3.5.30729.5420SP1-4.7.03062C-4.7.03062 1
To get just the columns I need, I use
df = pandas.read_csv(in_file, sep="\t", usecols=[0,1,3,6,7], dtype={"IP_Addr": str, "Date": str, "Version": str, "Lang": str, "Country": str, })
But when I print this dataframe, I get this result:
163.116.197.120 2022-03-01 9.8 en JPN
0 181.209.195.130 2022-03-01 9.1 en GTM
1 106.117.110.195 2022-03-01 9.1 zh-Hans CHN
So despite setting the dtype to str, "9.10" becomes "9.1". The same thing happens with dtype=object.
What am I missing? Thanks.
CodePudding user response:
I forgot to prepend the header line to the files, so the column names in the dtype dictionary didn't match anything and the dictionary had no effect. Thanks, @Stef!
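For anyone who hits the same thing: as far as I can tell, read_csv() silently ignores dtype keys that don't match any column label, which is why there was no error. A minimal sketch of the failure, using made-up inline data rather than the real log format:
import io
import pandas as pd

# Two fake rows and no header line. Because header=None is not passed,
# the first row is consumed as the header, so the column labels become
# "163.116.197.120", "2022-03-01", "9.8" -- not "IP_Addr", "Date", "Version".
data = "163.116.197.120\t2022-03-01\t9.8\n181.209.195.130\t2022-03-01\t9.10\n"

df = pd.read_csv(io.StringIO(data), sep="\t",
                 dtype={"IP_Addr": str, "Date": str, "Version": str})

print(df.dtypes)   # the third column is float64: the dtype keys matched nothing
print(df)          # and 9.10 has become 9.1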
CodePudding user response:
read_csv() treats the first row as a header with the column names, and you have to use header=None to change that:
df = pandas.read_csv(..., header=None)
The second problem is that your file has no header line with the names
IP_Addr, Date, Version, Lang, Country
so read_csv() doesn't know which column is Version. You may need to use column numbers instead of column names in the dtype dictionary, or you have to add a header line with the names.
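If you'd rather not edit the files, here is a sketch of the column-numbers approach (the file path and the renamed labels are just placeholders):
import pandas as pd

# With header=None the columns are labelled 0, 1, 2, ..., so both usecols
# and the dtype dict can refer to them by position.
df = pd.read_csv(
    "access.log",                      # placeholder path
    sep="\t",
    header=None,                       # the files have no header line
    usecols=[0, 1, 3, 6, 7],
    dtype={0: str, 1: str, 3: str, 6: str, 7: str},
)

# Optionally give the selected columns readable names afterwards.
df.columns = ["IP_Addr", "Date", "Version", "Lang", "Country"]
print(df.head())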