I have N number of .txt files each having thousands of IP addresses in a directory(for eg. /var/logs). The file's name starts with 'log', for example, log01.test.co.net, log02.hello.co.net, log01.domain.co.net, etc. I need to add the IPs from each .txt file as a column to an excel file and compare all the columns to find the common IPs. Then print the common IPs as another column called "Common IPs".
How can I achieve the requirement with the Python Pandas library? Can anyone please help?
I am new to Python and was able to find from the web that we can convert txt files into excel format using the Python Pandas library.
#!/usr/bin/env python3
import pandas as pd
df = pd.read_table('log01.txt')
df.to_excel('output.xlsx', 'Sheet2')
CodePudding user response:
Use pathlib
to get all log files and load them with pd.read_csv
. Extract common IP addresses using value_counts
then add them to common_ips
column:
import pandas as pd
import pathlib
# Parse all log files
data = {}
for logfile in pathlib.Path('/var/logs').glob('log*'):
df = pd.read_csv(logfile, squeeze=True).drop_duplicates() \
.sort_values().reset_index(drop=True)
data[logfile.name] = df
# Merge all data from logs
df = pd.concat(data)
# Find common IP addresses through all files
common_ips = df.value_counts().eq(df.index.levels[0].nunique()) \
.loc[lambda x: x].index.to_frame(index=False, name='common_ips')
# Create final dataframe and export it to Excel
df = df.unstack(level=0).join(common_ips)
df.to_excel('report.xlsx', index=False)
Sample output:
log02.hello.co.net log01.domain.co.net log01.test.co.net common_ips
0 1.114.104.134 1.156.243.176 1.131.142.79 99.19.144.176
1 1.124.207.151 1.243.206.110 1.207.189.107 NaN
2 1.13.108.19 1.62.221.173 1.249.176.100 NaN
3 1.147.63.14 1.63.85.25 1.252.71.6 NaN
4 1.90.129.211 1.147.154.71 1.27.141.156 NaN
.. ... ... ... ...
957 98.162.171.47 NaN NaN NaN
958 99.19.144.176 NaN NaN NaN
959 99.20.58.108 NaN NaN NaN
960 99.220.179.58 NaN NaN NaN
961 99.97.17.252 NaN NaN NaN
[962 rows x 4 columns]