
I know how to load data into a Scrapy spider from an external source when working locally. But I struggle to find any info on how to deploy this file to Scrapinghub and what path to use there. Currently I use the approach from the SH documentation on deploying non-code files, but receive a None object.

import json
import pkgutil

import scrapy

class CodeSpider(scrapy.Spider):
    name = "code"
    allowed_domains = ["google.com.au"]

    def start_requests(self):
        f = pkgutil.get_data("project", "res/final.json")
        a = json.loads(f.read())

Thanks. My setup file:

from setuptools import setup, find_packages

setup(
    name         = 'project',
    version      = '1.0',
    packages     = find_packages(),
    package_data = {'project': ['res/*.json']},
    entry_points = {'scrapy': ['settings = au_go.settings']},

    zip_safe=False,

)

The error I got:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/tmp/unpacked-eggs/__main__.egg/au_go/spiders/code.py", line 16, in start_requests
    a = json.loads(f.read())
AttributeError: 'NoneType' object has no attribute 'read'
Billy Jhon
  • Have you checked and tried https://helpdesk.scrapinghub.com/support/solutions/articles/22000200416-deploying-non-code-files ? – paul trmbrth Aug 09 '17 at 10:24
  • @paul trmbrth Thanks. Not sure why I would need this file in settings, and how do I import it into my spider? Got an idea about it? – Billy Jhon Aug 09 '17 at 11:27
  • I tried, and it returns None for the imported content. The only difference is that I imported it directly in the spider and not in the settings file as shown in the example. – Billy Jhon Aug 09 '17 at 12:47
  • Do not bother with the reference to `settings.py`. The important points are the `package_data` section in `setup.py` and `pkgutil.get_data()` to access the data – paul trmbrth Aug 09 '17 at 16:24
  • Yeah, that's what I did. Returns None. So finally I just pasted the entire list inside the spider code. Not the best solution, but it works. I will edit the post with what it looks like now, for reference. – Billy Jhon Aug 09 '17 at 20:16
  • Shouldn't 'project' be replaced with 'au_go' in your setup.py? Does it change anything? – paul trmbrth Aug 09 '17 at 22:43
  • Yes, it does work this way. It is strange, as I would have assumed the project setup file would already have the correct project name. Please write your comment as an answer so I can accept it. Thanks for your help. – Billy Jhon Aug 10 '17 at 11:03

1 Answer


From the traceback you supplied, I assume that your project files look like this:

au_go/
  __init__.py
  settings.py
  res/
     final.json
  spiders/
      __init__.py
      code.py
scrapy.cfg
setup.py

With this assumption, the setup.py's package_data needs to refer to the package named au_go:

from setuptools import setup, find_packages

setup(
    name         = 'au_go',
    version      = '1.0',
    packages     = find_packages(),
    package_data = {
        'au_go': ['res/*.json']
    },
    entry_points = {'scrapy': ['settings = au_go.settings']},
    zip_safe=False,
)

And then you can use pkgutil.get_data("au_go", "res/final.json"). Note that pkgutil.get_data() returns the resource contents as a bytes string (or None if the resource cannot be found), not a file object, so pass it to json.loads() directly instead of calling .read() on it.
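For reference, here is a minimal, self-contained sketch of the corrected lookup. The throwaway package built at runtime only mimics the assumed au_go layout so the snippet runs anywhere; in the real project the package and resource file already exist, and only the last three lines apply:

```python
import json
import os
import pkgutil
import sys
import tempfile

# Build a throwaway package mimicking the assumed au_go layout:
#   au_go/__init__.py
#   au_go/res/final.json
root = tempfile.mkdtemp()
pkg = os.path.join(root, "au_go")
os.makedirs(os.path.join(pkg, "res"))
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "res", "final.json"), "w") as fh:
    json.dump([{"url": "https://www.google.com.au"}], fh)
sys.path.insert(0, root)

# get_data() returns the file contents as bytes (or None if the
# resource is missing) -- there is no .read() to call on the result.
data = pkgutil.get_data("au_go", "res/final.json")
urls = json.loads(data)
print(urls[0]["url"])
```

The key point is that the first argument to get_data() must match the importable package name (au_go here), which is why the name mismatch in the original setup.py produced None.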

paul trmbrth
  • More details about package data on the Scrapinghub support center: https://support.scrapinghub.com/support/solutions/articles/22000200416-deploying-non-code-files – Brown nightingale Dec 09 '18 at 09:48