Comparing dynamic files for common entries using Python Pandas-CodePudding

I have N number of .txt files each having thousands of IP addresses in a directory(for eg. /var/logs). The file's name starts with 'log', for example, log01.test.co.net, log02.hello.co.net, log01.domain.co.net, etc. I need to add the IPs from each .txt file as a column to an excel file and compare all the columns to find the common IPs. Then print the common IPs as another column called "Common IPs".

How can I achieve the requirement with the Python Pandas library? Can anyone please help?

I am new to Python and was able to find from the web that we can convert txt files into excel format using the Python Pandas library.

#!/usr/bin/env python3

import pandas as pd

df = pd.read_table('log01.txt')

df.to_excel('output.xlsx', 'Sheet2')

CodePudding user response：

Use pathlib to get all log files and load them with pd.read_csv. Extract common IP addresses using value_counts then add them to common_ips column:

import pandas as pd
import pathlib

# Parse all log files
data = {}
for logfile in pathlib.Path('/var/logs').glob('log*'):
    df = pd.read_csv(logfile, squeeze=True).drop_duplicates() \
           .sort_values().reset_index(drop=True)
    data[logfile.name] = df

# Merge all data from logs
df = pd.concat(data)

# Find common IP addresses through all files
common_ips = df.value_counts().eq(df.index.levels[0].nunique()) \
               .loc[lambda x: x].index.to_frame(index=False, name='common_ips')

# Create final dataframe and export it to Excel
df = df.unstack(level=0).join(common_ips)
df.to_excel('report.xlsx', index=False)

Sample output:

    log02.hello.co.net log01.domain.co.net log01.test.co.net     common_ips
0        1.114.104.134       1.156.243.176      1.131.142.79  99.19.144.176
1        1.124.207.151       1.243.206.110     1.207.189.107            NaN
2          1.13.108.19        1.62.221.173     1.249.176.100            NaN
3          1.147.63.14          1.63.85.25        1.252.71.6            NaN
4         1.90.129.211        1.147.154.71      1.27.141.156            NaN
..                 ...                 ...               ...            ...
957      98.162.171.47                 NaN               NaN            NaN
958      99.19.144.176                 NaN               NaN            NaN
959       99.20.58.108                 NaN               NaN            NaN
960      99.220.179.58                 NaN               NaN            NaN
961       99.97.17.252                 NaN               NaN            NaN

[962 rows x 4 columns]