
I'm opening a new question as I'm having an issue with Scrapy and Channels in a Django application, and I would appreciate it if someone could guide me in the right direction.

The reason I'm using Channels is that I want to retrieve the crawl statuses from the Scrapyd API in real time, without having to rely on setInterval polling all the time, as this is supposed to become a SaaS service which could potentially be used by many users.

I've implemented Channels correctly; if I run:

python manage.py runserver

I can see that the system is now using ASGI:

System check identified no issues (0 silenced).
September 01, 2020 - 15:12:33
Django version 3.0.7, using settings 'seotoolkit.settings'
Starting ASGI/Channels version 2.4.0 development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.

Also, the client and server connect correctly via the WebSocket:

WebSocket HANDSHAKING /crawler/22/ [127.0.0.1:50264]
connected {'type': 'websocket.connect'}
WebSocket CONNECT /crawler/22/ [127.0.0.1:50264]
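
For reference, the consumer behind /crawler/<id>/ looks roughly like the sketch below (the class, group and route-kwarg names are illustrative, not my exact code). Each client joins a per-crawler group so that status updates can be pushed to the browser instead of polled:

import json

from channels.generic.websocket import AsyncWebsocketConsumer


class CrawlStatusConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        # One group per crawl, so every client watching this crawl
        # receives the same status broadcasts.
        crawler_id = self.scope["url_route"]["kwargs"]["crawler_id"]
        self.group_name = "crawler_%s" % crawler_id
        await self.channel_layer.group_add(self.group_name, self.channel_name)
        await self.accept()

    async def disconnect(self, close_code):
        await self.channel_layer.group_discard(self.group_name, self.channel_name)

    async def crawl_status(self, event):
        # Called for {"type": "crawl.status", ...} messages sent to the group.
        await self.send(text_data=json.dumps({"status": event["status"]}))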

So far so good. The problem comes when I run Scrapy via the Scrapyd API:

2020-09-01 15:31:25 [scrapy.core.scraper] ERROR: Error processing {'url': 'https://www.example.com'}
Traceback (most recent call last):
  File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/scrapy/utils/defer.py", line 157, in f
    return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
  File "/private/var/folders/qz/ytk7wml54zd6rssxygt512hc0000gn/T/crawler-1597767314-spxv81dy.egg/webspider/pipelines.py", line 67, in process_item
  File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/db/models/query.py", line 411, in get
    num = len(clone)
  File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/db/models/query.py", line 258, in __len__
    self._fetch_all()
  File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/db/models/query.py", line 1261, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/db/models/query.py", line 57, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 1150, in execute_sql
    cursor = self.connection.cursor()
  File "/Users/Andrea/anaconda3/envs/DjangoScrape/lib/python3.6/site-packages/django/utils/asyncio.py", line 24, in inner
    raise SynchronousOnlyOperation(message)
django.core.exceptions.SynchronousOnlyOperation: You cannot call this from an async context - use a thread or sync_to_async.

I think the error message is quite clear: "You cannot call this from an async context - use a thread or sync_to_async". I guess that by enabling ASGI there is a conflict with the Scrapy library that prevents it from working correctly.

Unfortunately, I cannot understand the reason behind this, nor where I should use a "thread or sync_to_async" as suggested.

Note that WebSockets are only used to check crawl status and nothing else.

Can anyone try to explain to me the reason behind this incompatibility and give me some hints on how to overcome this obstacle? I spent a lot of hours looking for an answer but could not find one.

Thanks a lot.

  • Nobody has a hint on how to solve this issue? I still have no clue how it should be set up and I would really appreciate some sort of guidance. Thank you – Askew Sep 07 '20 at 14:46

1 Answer


You can solve this error by going to your pipelines.py file and importing sync_to_async from asgiref.sync.

from asgiref.sync import sync_to_async

After importing sync_to_async, use it as a decorator on the method that stores data to the database. Django raises SynchronousOnlyOperation whenever the ORM is touched while an asyncio event loop is running in the current thread, which is the case inside an async pipeline; sync_to_async runs the decorated method in a worker thread instead, which is exactly the "use a thread or sync_to_async" the error message suggests.

For instance:

from asgiref.sync import sync_to_async
from itemadapter import ItemAdapter

from crawler.models import Movie


class MovieSpiderPipeline:
    # The decorator turns this synchronous method into a coroutine that
    # Scrapy can await, while the ORM calls themselves run in a thread
    # where Django's async safety check no longer applies.
    @sync_to_async
    def process_item(self, item, spider):
        movie = Movie(**ItemAdapter(item).asdict())
        movie.save()
        return item
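
If process_item needs to stay a coroutine (for instance to await other things), a variant is to wrap only the blocking calls instead of the whole method. A minimal sketch, assuming the same Movie model as above:

from asgiref.sync import sync_to_async
from itemadapter import ItemAdapter

from crawler.models import Movie


class MovieSpiderPipeline:
    async def process_item(self, item, spider):
        movie = Movie(**ItemAdapter(item).asdict())
        # Only the blocking ORM call is moved off the event loop;
        # the rest of the coroutine runs on it as usual.
        await sync_to_async(movie.save)()
        return item

This per-call wrapping keeps the method free to await other coroutines while still keeping Django's synchronous ORM off the event loop.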
