Read a csv file from bitbucket using Python and convert it to a df

Time:12-01

I am trying to read a CSV file from a Bitbucket URL into a pandas DataFrame (df) using Python. For the work I am doing, I cannot read it from a local copy; it has to come from Bitbucket every time.

Any ideas on how to do this? Thank you!

Here is my example:

import pandas as pd

url = 'https://bitbucket.EXAMPLE.com/EXAMPLE/EXAMPLE/EXAMPLE/EXAMPLE/raw/wpcProjects.csv?at=refs/heads/master'

colnames = ['project_id', 'project_name', 'gourmet_url']

df7 = pd.read_csv(url, names=colnames)

However, the output is not correct: instead of the expected DataFrame, I get some bad data.

CodePudding user response:

You have multiple options, but your question is really two separate questions:

  1. How to get a file (.csv in this case) from a remote location.
  2. How to load a csv into a "df" which is a pandas data frame.

For #2, you simply import pandas and call df = pandas.read_csv(). See the documentation! If the CSV file were in the current directory, you would call pandas.read_csv('myfile.csv').
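One thing worth checking with read_csv() when you pass names=, as the question does: if the file already has a header row, names= on its own keeps that header row as the first row of data. A minimal sketch using an in-memory sample (the sample rows are invented for illustration; only the column names come from the question):

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV; the rows are invented.
SAMPLE_CSV = (
    "id,name,url\n"
    "1,alpha,https://example.com/a\n"
    "2,beta,https://example.com/b\n"
)
colnames = ["project_id", "project_name", "gourmet_url"]

# names= alone: every line is treated as data, so the old header becomes row 0.
with_header_as_data = pd.read_csv(io.StringIO(SAMPLE_CSV), names=colnames)

# names= plus header=0: the file's own header row is replaced by colnames.
header_replaced = pd.read_csv(io.StringIO(SAMPLE_CSV), names=colnames, header=0)

print(len(with_header_as_data))  # counts the old header line as a row
print(len(header_replaced))      # counts only the real data rows
```

If your file has no header row at all, names= on its own is the right call.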

For #1, the CSV is on a server somewhere. In this case, it happens to be on Bitbucket's servers, accessed through their website. You can fetch it and save it locally, then read it, or you can fetch it to a temporary location, read it into pandas, and discard it. You could even read the data from the file into Python as a string. However, having a lot of options doesn't mean they are all useful; I am just listing them for completeness. Looking at the documentation, pandas already has remote fetching built into the read_csv() function: the path you pass in can be a URL, as long as it uses a scheme pandas supports. The documentation states:

"Valid URL schemes include http, ftp, s3, gs, and file".
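To see the URL handling without depending on a network connection, here is a small sketch that uses the file scheme from that list. The temporary file and its contents are made up for illustration; an http or https URL is passed to read_csv() in the same way.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny, invented CSV so there is something to point a URL at.
    csv_path = Path(tmp) / "sample.csv"
    csv_path.write_text("a,b\n1,2\n3,4\n", encoding="utf-8")

    # Turn the absolute path into a file:// URL and hand the URL to pandas.
    file_url = csv_path.resolve().as_uri()
    df_from_url = pd.read_csv(file_url)

print(file_url)
print(df_from_url)
```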

If you want to save it locally, you can use pandas once again, via the .to_csv() method of a data frame.
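Saving a local copy can be sketched with DataFrame.to_csv(). The data frame contents and the file name local_copy.csv below are placeholders, not taken from the question:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder data standing in for whatever was read from the remote CSV.
df = pd.DataFrame({"project_id": [1, 2], "project_name": ["alpha", "beta"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "local_copy.csv"
    # index=False keeps the row index out of the file.
    df.to_csv(out_path, index=False)
    # Read the file back to confirm the copy matches the original.
    round_trip = pd.read_csv(out_path)

print(round_trip)
```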

FOR BITBUCKET SPECIFICALLY: You need to make sure you link to the 'raw' file on Bitbucket. The link used to view the file in your web browser is not a direct link to the raw file; it is a web page that offers a view of that file. Get the raw file link, then pass that into pandas.
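If you are not sure whether a link returns the raw file or a web page, one rough check is to look at the first bytes of what comes back. The helper below is a hypothetical heuristic written for this answer, not part of pandas or Bitbucket; an HTML page (a file viewer, or possibly a login page) parsed as CSV would explain "bad data" in the resulting data frame.

```python
def looks_like_html(payload: bytes) -> bool:
    """Heuristic: True if the payload starts like an HTML document."""
    # Only the beginning matters; ignore leading whitespace and letter case.
    head = payload[:512].lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


# Invented payloads for illustration.
print(looks_like_html(b"<!DOCTYPE html><html><body>viewer</body></html>"))
print(looks_like_html(b"project_id,project_name\n1,alpha\n"))
```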

Code example: assume we want this file (a random CSV file I found on Bitbucket): https://bitbucket.org/pedrorijo91/nodejstutorial/src/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv?at=master

What you need is a link to the raw file! Clicking on ... and pressing 'open raw', we get:

https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv

Let's look at this in detail. The link is the same up to the project name: https://bitbucket.org/pedrorijo91/nodejstutorial/

After that, the raw file is under raw/ instead of src/

Then comes the same commit hash (the long string of letters and numbers): db4c991864e65c4d72e98a1dc94e33606e3adde9/

Finally, it's the same directory structure:

node_modules/levelmeup/data/horse_js.csv

The first link goes through src/ and ends with a ?at=master query string, which the web server parses to render the file viewer page. The second link, the actual link to the raw file, goes through raw/ and ends with .csv
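The comparison above can be sketched as a small helper. src_to_raw is a hypothetical function written for this answer, and it assumes the bitbucket.org path layout seen in the two links here (workspace, repository, src, commit, file path); a self-hosted Bitbucket may lay out its URLs differently.

```python
from urllib.parse import urlsplit, urlunsplit


def src_to_raw(view_url: str) -> str:
    """Turn a bitbucket.org 'src' viewer link into the matching 'raw' link."""
    parts = urlsplit(view_url)
    # Expected path: /<workspace>/<repo>/src/<commit>/<path to file>
    segments = parts.path.split("/")
    if len(segments) < 6 or segments[3] != "src":
        raise ValueError(f"not a recognised src link: {view_url}")
    segments[3] = "raw"
    # Drop the ?at=... query string; the commit hash already pins the version.
    return urlunsplit((parts.scheme, parts.netloc, "/".join(segments), "", ""))


view_url = (
    "https://bitbucket.org/pedrorijo91/nodejstutorial/src/"
    "db4c991864e65c4d72e98a1dc94e33606e3adde9/"
    "node_modules/levelmeup/data/horse_js.csv?at=master"
)
print(src_to_raw(view_url))
```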

import pandas as pd
RAW_Bitbucket_URL = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
df = pd.read_csv(RAW_Bitbucket_URL)

The above code works for me.

CodePudding user response:

You may need to download the entire file first: make the request with requests, save the response to disk, and then read that file with pandas.read_csv().

>>> import pandas as pd
>>> import requests
>>> url = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
>>> r = requests.get(url, allow_redirects=True)
>>> open('file.csv', 'wb').write(r.content)
>>> pd.read_csv('file.csv', encoding='utf-8-sig').head()

                   ID                                              Tweet                 Date                 Via
0  374667940827635712             So, yes, a 100% JS App is 100% awesome  08:59:32, 9-3, 2013                 web
1  374656867466637312  "vituperating priests" who rail against JavaSc...  08:15:32, 9-3, 2013                 web
2  374654221292806144    Node/Browserify/CJS folks, is there any benefit  08:05:01, 9-3, 2013  Twitter for iPhone
3  374640446955212800     100% JavaScript applications. You may get some  07:10:17, 9-3, 2013  Twitter for iPhone
4  374613490763169792       A node.js app that will order you a sandwich  05:23:10, 9-3, 2013                 web
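If the file must not be kept locally, a variation on the same idea is to parse the downloaded bytes in memory instead of writing file.csv. This is a sketch, not a tested recipe for your server: csv_bytes_to_df and read_remote_csv are hypothetical helper names, and the optional headers argument is only a placeholder for whatever your Bitbucket instance may require (for example an Authorization header), which is an assumption on my part.

```python
import io
from typing import Optional

import pandas as pd
import requests


def csv_bytes_to_df(payload: bytes) -> pd.DataFrame:
    """Parse CSV bytes in memory, without writing them to disk."""
    # utf-8-sig drops a leading byte-order mark if one is present.
    return pd.read_csv(io.BytesIO(payload), encoding="utf-8-sig")


def read_remote_csv(url: str, headers: Optional[dict] = None) -> pd.DataFrame:
    """Download a raw CSV link and return it as a data frame."""
    response = requests.get(url, headers=headers, allow_redirects=True, timeout=30)
    # Fail loudly on 4xx/5xx instead of parsing an error page as CSV.
    response.raise_for_status()
    return csv_bytes_to_df(response.content)
```

With the raw link from above, the call would be read_remote_csv(url).head().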