How to produce a JSON tree with nested data from Scrapy

This was an interesting puzzle: creating a single, well-formed JSON document from a hierarchy of web pages. For example, the sporting goods hierarchy of an e-commerce site could be Categories, Brands, and Products, and you’d like to output JSON like this:

{
  "categories": [
    {
      "kind": "Category",
      "number": "101",
      "name": "Skateboards",
      "url": "https://sports.com/cat-101",
      "brands": [
        {
          "kind": "Brand",
          "number": "19",
          "name": "Landyachtz",
          "url": "https://sports.com/19-landyachtz",
          "products": [
            {
              "kind": "Product",
              "plu": "4736",
              "name": "Switch Longboard",
              "url": "https://sports.com/4736",

[Etc.]

As an aside, I like architecting my systems so that the first-level scrapers generate this kind of output: they simply mirror the source’s structure. And since the result is clean, well-formed JSON, the importing code that follows can stay simple.
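
To make that concrete, here is a minimal sketch of what such importing code could look like. It assumes the JSON shape shown above and a hypothetical output file named sports.json holding the single root object; the real importer will obviously do more than print:

import json

# Minimal sketch of downstream importing code, assuming the JSON shape
# shown above and a hypothetical output file named "sports.json" that
# holds the single root object.
with open("sports.json") as f:
    sports = json.load(f)

for category in sports["categories"]:
    for brand in category["brands"]:
        for product in brand["products"]:
            # Hypothetical stand-in for whatever the real importer does.
            print(category["name"], brand["name"], product["plu"], product["name"])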

My recipe

First, I give my Spider subclass a couple of instance variables:

def __init__(self):
    super().__init__()
    # A flag, set after post-processing is finished, to
    # avoid an infinite loop.
    self.data_submitted = False
    # The object to return for conversion to a JSON tree.
    # All the parse methods add their results to this
    # structure.
    self.sports = items.SportsInventory(categories=[])
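
For reference, the Item classes used here could be declared roughly like this in items.py. This is my own sketch, inferred from the fields that appear in the snippets and the JSON example; the actual definitions may differ:

import scrapy

class SportsInventory(scrapy.Item):
    # The root object: holds the whole tree.
    categories = scrapy.Field()

class Category(scrapy.Item):
    kind = scrapy.Field()
    number = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()
    brands = scrapy.Field()

class Brand(scrapy.Item):
    kind = scrapy.Field()
    number = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()
    products = scrapy.Field()

class Product(scrapy.Item):
    kind = scrapy.Field()
    plu = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()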

Next, my top-level parse method creates an Item for each scraped entry and adds it directly into that structure. It then hands the new Item to the next page’s parser by attaching it to the request’s meta dict:

# Create a new Category to hold the scraped info. Also,
# prepare it for holding its brands.
category = items.Category(number="…", name="…", url="…", brands=[])
# Save the category into the tree structure.
self.sports["categories"].append(category)
# Create a request for the Category's page, which
# will list all its Brands.
# Pass the Category Item in the meta dict.
request = scrapy.Request(category["url"], callback=self.parse_category_page)
request.meta["category"] = category
yield request
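
Put together, the whole top-level parse method might look something like the sketch below. The CSS selectors and attribute names are placeholders, not the site’s real markup, and the method assumes the usual import scrapy and project items module at the top of the spider file:

def parse(self, response):
    # Hypothetical selector; the real site will differ.
    for link in response.css("a.category"):
        # Create a new Category to hold the scraped info, and
        # prepare it for holding its brands.
        category = items.Category(
            kind="Category",
            number=link.attrib.get("data-number"),
            name=link.css("::text").get(),
            url=response.urljoin(link.attrib["href"]),
            brands=[],
        )
        # Save the category into the tree structure.
        self.sports["categories"].append(category)
        # Request the Category's page (which lists its Brands),
        # passing the Category Item along in the meta dict.
        request = scrapy.Request(category["url"], callback=self.parse_category_page)
        request.meta["category"] = category
        yield request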

In the parse method for the next level down, I do the same thing, except now I save the newly created Item in the passed-in Category:

# Pull the category back out of the meta dict.
parent_category = response.meta["category"]
# Create a new items.Brand with the scraped data.
# …
# Add the new brand to its parent in the tree.
parent_category["brands"].append(brand)
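
As a sketch with placeholder selectors, the whole method could look like this, chaining down one more level to a hypothetical parse_brand_page that appends Products to the Brand:

def parse_category_page(self, response):
    # Pull the category back out of the meta dict.
    parent_category = response.meta["category"]
    # Hypothetical selector; the real site will differ.
    for link in response.css("a.brand"):
        # Create a new items.Brand with the scraped data.
        brand = items.Brand(
            kind="Brand",
            number=link.attrib.get("data-number"),
            name=link.css("::text").get(),
            url=response.urljoin(link.attrib["href"]),
            products=[],
        )
        # Add the new brand to its parent in the tree.
        parent_category["brands"].append(brand)
        # Descend one more level, passing the Brand along so
        # parse_brand_page can append Products to it.
        request = scrapy.Request(brand["url"], callback=self.parse_brand_page)
        request.meta["brand"] = brand
        yield request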

That finishes up the scraping code. At this point, the spider would run but produce no output, because we never hand the root object, self.sports, back to Scrapy. So we need to return it, but only after all the scraping has finished. Here’s that code, also in the Spider subclass:

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    """Register to receive the spider_idle signal."""
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
    return spider

def spider_idle(self, spider):
    """Schedule a simple request in order to return the collected data."""
    if self.data_submitted:
        return
    # This is a hack: I don't yet know how to schedule a request to just
    # submit data _without_ also triggering a scrape. So I provide a URL
    # to a simple site that we're going to ignore.
    null_request = scrapy.Request("http://neverssl.com/", callback=self.submit_data)
    self.crawler.engine.schedule(null_request, spider)
    raise scrapy.exceptions.DontCloseSpider

def submit_data(self, _):
    """Simply return the collection of all the scraped data, ignoring the
    actual scraped content. I haven't figured out another way to submit the
    merged results.

    To be used as a callback when the spider is idle (i.e., has finished scraping).
    """
    self.data_submitted = True
    return self.sports
Reliable, but depends on a hack.

As you can see, this code relies on a hack: it schedules a scrape just so that it can return data, and it ignores the actual scrape results. Is there a more direct way to schedule a data return?

Finally, to get a proper JSON instance (with a hash at the top level), use the JSON Lines feed exporter. Because the spider yields only one item, the root self.sports, the JSON Lines exporter writes exactly one JSON object, rather than wrapping the items in a list the way the regular JSON exporter would.
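
For example, assuming a recent Scrapy (2.4 or later, for the FEEDS setting and the overwrite option) and an arbitrary output filename of my choosing, the feed can be configured in settings.py like this:

# settings.py
# Write every yielded item as one JSON object per line. Since this
# spider yields only the single root item, the file ends up holding
# exactly one top-level JSON object.
FEEDS = {
    "sports.json": {
        "format": "jsonlines",
        "overwrite": True,
    },
}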
