I am writing a Scrapy Python project that scrapes data for a future ML project. I decided to containerize the project with Docker. Below is my Dockerfile:
FROM python:3.9.12-slim-buster
WORKDIR /app
RUN apt-get update && apt-get install -y git
RUN pip3 install --upgrade pip
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
ADD . /app
I am able to run the following command and my scraper will run successfully:
docker run -it ufc-stats-scraper scrapy crawl ufc_future_fights -o future.csv -t csv
output:
....
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 53,
'scheduler/dequeued/memory': 53,
'scheduler/enqueued': 53,
'scheduler/enqueued/memory': 53,
'start_time': datetime.datetime(2022, 4, 20, 2, 4, 7, 365309)}
2022-04-20 02:04:08 [scrapy.core.engine] INFO: Spider closed (finished)
However, the scraped data is stored in future.csv, which is local to the container. I read online that I should use the -v flag to mount a host folder into the container. Below is the command that I am trying to use:
docker run -it -v ${PWD}:/app ufc-stats-scraper scrapy crawl ufc_future_fights -o future.csv -t csv
However, I am getting the error message below when running this command:
Scrapy 2.6.1 - no active project
Unknown command: crawl
Use "scrapy" to see available command
I'm fairly new to Docker but was wondering if anyone had any input on what I could be doing wrong here. Thanks!
Update:
I started a bash session in two containers from my Docker image, one without the mount and one with it. Without the mount, the app folder contained all of the repo's files. The weird thing is that the app folder is completely empty in the mounted session. This would explain why the error message reads "no active project". I am really confused about why the mounted folder is empty, though.
I feel like I might be misunderstanding how Docker bind mounts work.
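For context, a bind mount makes /app show whatever is in the host directory passed to -v, in place of the files that were copied into the image. An empty /app in the mounted session is therefore one possible sign that ${PWD} did not resolve to the repo on the host. Scrapy detects a project by finding scrapy.cfg in the working directory or one of its parents, so a quick host-side check before mounting could look like the sketch below. The helper name is_scrapy_project is hypothetical, and the sketch assumes scrapy.cfg sits at the repo root.

```shell
# Hypothetical helper: succeeds only if the given directory (default: $PWD)
# contains scrapy.cfg, the file Scrapy uses to detect a project.
is_scrapy_project() {
  dir="${1:-$PWD}"
  [ -f "$dir/scrapy.cfg" ]
}

# Check the directory that would be bind-mounted onto /app.
if is_scrapy_project "$PWD"; then
  echo "scrapy.cfg found in $PWD: mounting it on /app keeps the project visible"
else
  echo "no scrapy.cfg in $PWD: mounting it on /app would hide the project" >&2
fi
```

If the check fails in the directory where docker run is started, "no active project" is the expected result of mounting that directory over /app.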
CodePudding user response:
Hope you are enjoying your container journey!
Since I don't have enough information to reproduce your situation exactly, I will just suggest what could work for you:
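The error says "Unknown command: crawl". Note that this message is printed by Scrapy itself, not by the docker binary: together with "no active project", it means Scrapy found no project in its working directory, and crawl is only available inside a project. Still, to rule out a problem with how the arguments are passed to the container,
instead of running: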
docker run -it -v ${PWD}:/app ufc-stats-scraper scrapy crawl ufc_future_fights -o future.csv -t csv
you can try running your scrapy command within "bash -c", like this:
docker run -it -v ${PWD}:/app ufc-stats-scraper bash -c "scrapy crawl ufc_future_fights -o future.csv -t csv"
For your information, you can also add the VOLUME configuration and your entrypoint command directly to your Dockerfile. That way, the scrapy command no longer has to be typed out, and the container can be started with docker run -it (or -d) ufc-stats-scraper (cf. https://kapeli.com/cheat_sheets/Dockerfile.docset/Contents/Resources/Documents/index). Keep in mind that VOLUME only declares a mount point in the image, so writing the output to a specific host folder still requires -v at run time.
Hope this helps! bguess
CodePudding user response:
The issue was that I was mounting over the app/ directory, which hid all of the directory's files inside the container (the files in the image are not deleted, just masked by the mount). Instead of mounting the app directory, I created a new data folder within the app directory and mounted only that.
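The exact command is not shown above, so the following is only a sketch of what that fix could look like. It assumes the new folder is ./data on the host, that it is mounted at /app/data, and that the output path is changed to data/future.csv. The wrapper name run_scraper and the DOCKER override are hypothetical conveniences, not part of the original setup.

```shell
# Hypothetical wrapper: mount only data/ so the code copied into /app stays visible.
# Set DOCKER=echo to print the docker arguments instead of running them.
run_scraper() {
  host_data_dir="${PWD}/data"   # assumed location of the data folder on the host
  mkdir -p "$host_data_dir"
  "${DOCKER:-docker}" run -it \
    -v "${host_data_dir}:/app/data" \
    ufc-stats-scraper \
    scrapy crawl ufc_future_fights -o data/future.csv -t csv
}
```

Calling run_scraper from the repo root should then leave future.csv in ./data on the host, while DOCKER=echo run_scraper only prints the arguments that would be passed to docker.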