My CSV is not uploading to Pandas (Python) - getting error message

I've searched exhaustively for a fix for whatever is wrong with my Pandas upload, but had no luck. I would greatly appreciate some help.

I'm trying to run a machine learning algorithm (apriori) in Python and I have a CSV file to upload.

[Image: what my CSV file looks like in Notepad]

Here is my Python code and resulting error message:

[Image: photo of Python code and error message]

I've tried pasting the code using CTRL K but it's not working.

CodePudding user response:

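You'll want to use the delim_whitespace=True parameter. Provided the lines contain no whitespace, this gives you one row per transaction in a single column, which you can then split, convert to a set, and feed into apriori. (In pandas 2.2 and later, delim_whitespace is deprecated in favor of sep=r'\s+', which behaves the same way.)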

Given a sample text file containing:

test
also,test

You can run the following:

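import pandas as pd
from apyori import apriori

# Whitespace-delimited read: a line without spaces becomes a single field.
df = pd.read_csv('Ready Apriori DWK.csv', header=None, delim_whitespace=True, names=['data'])
# Turn each row into a set of items and run apriori on the transactions.
results = list(apriori(df['data'].str.split().apply(set)))
print(results)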

Output

[RelationRecord(items=frozenset({'also,test'}), support=0.5, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'also,test'}), confidence=0.5, lift=1.0)]),
 RelationRecord(items=frozenset({'test'}), support=0.5, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'test'}), confidence=0.5, lift=1.0)])]
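
Note that 'also,test' is treated as a single item in the output above, because str.split() with no arguments splits on whitespace only. If each comma-separated value is meant to be its own item, one option is to build the transactions with the standard csv module, which tolerates rows of differing length. This is only a minimal sketch, assuming a plain comma-delimited file; the helper name read_transactions and the inline sample are illustrative and not part of the original answer:

```python
import csv
import io


def read_transactions(lines):
    """Return one set of items per non-empty row of a comma-delimited source."""
    transactions = []
    for row in csv.reader(lines):
        # Strip whitespace and drop empty fields (e.g. from trailing commas).
        items = {field.strip() for field in row if field.strip()}
        if items:
            transactions.append(items)
    return transactions


# Inline stand-in for the two-line sample file shown earlier.
sample = io.StringIO("test\nalso,test\n")
transactions = read_transactions(sample)
print(transactions)
```

With a real file, you would pass an open file object (opened with newline='') instead of the StringIO, and the resulting list of sets could be handed to apriori in place of the pandas Series.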

CodePudding user response:

This approach declares more columns than the file needs (20 here), reads the CSV, then drops the columns that are entirely empty using isna() and loc.

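import pandas as pd

# Reserve 20 columns so ragged rows parse, then keep only columns that are not all NaN.
df = pd.read_csv('your.csv', header=None, names=range(20)) \
    .loc[:,lambda x: ~x.isna().all()]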

print(df)

Result

                0             1
0        W010638C           NaN
1     07-3000-300    7-3000-300
2         W010665       W216962
3         W015015           NaN
4        W015183A           NaN
5        W001013J           NaN
6        W000102C           NaN
7        07-0017N       7-0017N
8        WC000286           NaN
9         W017221           NaN
10       W000120C           NaN
11        W017814           NaN
etc ...

Note that your data has more than 2 columns; my test subset had at most 2 per row.
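
Rather than guessing an upper bound such as 20, the widest row can be counted first and that number passed to names. The following is a sketch under the assumption that the file is plain comma-delimited with no quoted commas; the inline rows are copied from the result above and stand in for the real file, and the variable names are illustrative:

```python
import io

import pandas as pd

# Inline stand-in for the real file; rows have differing numbers of fields.
raw = "W010638C\n07-3000-300,7-3000-300\nW010665,W216962\nW015015\n"

# Count the fields in the widest row (assumes no quoted commas).
max_cols = max(line.count(",") + 1 for line in raw.splitlines())

# Naming exactly max_cols columns lets read_csv pad short rows with NaN.
df = pd.read_csv(io.StringIO(raw), header=None, names=range(max_cols), dtype=str)

# One list of items per row, with the NaN padding removed.
transactions = [row.dropna().tolist() for _, row in df.iterrows()]
print(transactions)
```

The transactions list built this way has one entry per row of the file, which is the shape apriori expects, so the all-NaN column filter is no longer needed.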