How to produce a JSON tree with nested data from Scrapy

This was an interesting puzzle: creating a single, well-formed JSON document from a hierarchy of web pages. For example, the sporting goods hierarchy of an e-commerce site could be Categories, Brands, and Products, and you’d like to output JSON like this:

{
  "categories": [
    {
      "kind": "Category",
      "number": "101",
      "name": "Skateboards",
      "url": "https://sports.com/cat-101",
      "brands": [
        {
          "kind": "Brand",
          "number": "19",
          "name": "Landyachtz",
          "url": "https://sports.com/19-landyachtz",
          "products": [
            {
              "kind": "Product",
              "plu": "4736",
              "name": "Switch Longboard",
              "url": "https://sports.com/4736",

[Etc.]

As an aside, I like architecting my systems so that the first-level scrapers generate this kind of output: they simply mirror the source’s structure. And since the result is clean, well-formed JSON, the importing code that follows can stay simple.
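
To make that concrete, here is a minimal sketch of what such importing code could look like. It assumes the JSON shape shown above and a hypothetical output file named sports.json holding the single root object; the real importer will obviously do more than print:

import json

# Minimal sketch of downstream importing code, assuming the JSON shape
# shown above and a hypothetical output file named "sports.json" that
# holds the single root object.
with open("sports.json") as f:
    sports = json.load(f)

for category in sports["categories"]:
    for brand in category["brands"]:
        for product in brand["products"]:
            # Hypothetical stand-in for whatever the real importer does.
            print(category["name"], brand["name"], product["plu"], product["name"])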

My recipe

First, I give my Spider subclass a couple of instance variables:

def __init__(self):
    super().__init__()
    # A flag, set after post-processing is finished, to
    # avoid an infinite loop.
    self.data_submitted = False
    # The object to return for conversion to a JSON tree.
    # All the parse methods add their results to this
    # structure.
    self.sports = items.SportsInventory(categories=[])
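
For reference, the Item classes used here could be declared roughly like this in items.py. This is my own sketch, inferred from the fields that appear in the snippets and the JSON example; the actual definitions may differ:

import scrapy

class SportsInventory(scrapy.Item):
    # The root object: holds the whole tree.
    categories = scrapy.Field()

class Category(scrapy.Item):
    kind = scrapy.Field()
    number = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()
    brands = scrapy.Field()

class Brand(scrapy.Item):
    kind = scrapy.Field()
    number = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()
    products = scrapy.Field()

class Product(scrapy.Item):
    kind = scrapy.Field()
    plu = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()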

Next, my top-level parse method creates an Item for each scraped entry and adds it directly into that structure. It then hands the new Item to the next page’s parser by attaching it to the request’s meta dict:

# Create a new Category to hold the scraped info. Also,
# prepare it for holding its brands.
category = items.Category(number="…", name="…", url="…", brands=[])
# Save the category into the tree structure.
self.sports["categories"].append(category)
# Create a request for the Category's page, which
# will list all its Brands.
# Pass the Category Item in the meta dict.
request = scrapy.Request(category["url"], callback=self.parse_category_page)
request.meta["category"] = category
yield request
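
Put together, the whole top-level parse method might look something like the sketch below. The CSS selectors and attribute names are placeholders, not the site’s real markup, and the method assumes the usual import scrapy and project items module at the top of the spider file:

def parse(self, response):
    # Hypothetical selector; the real site will differ.
    for link in response.css("a.category"):
        # Create a new Category to hold the scraped info, and
        # prepare it for holding its brands.
        category = items.Category(
            kind="Category",
            number=link.attrib.get("data-number"),
            name=link.css("::text").get(),
            url=response.urljoin(link.attrib["href"]),
            brands=[],
        )
        # Save the category into the tree structure.
        self.sports["categories"].append(category)
        # Request the Category's page (which lists its Brands),
        # passing the Category Item along in the meta dict.
        request = scrapy.Request(category["url"], callback=self.parse_category_page)
        request.meta["category"] = category
        yield request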

In the parse method for the next level down, I do the same thing, except now I save the newly created Item in the passed-in Category:

# Pull the category back out of the meta dict.
parent_category = response.meta["category"]
# Create a new items.Brand with the scraped data.
# …
# Add the new brand to its parent in the tree.
parent_category["brands"].append(brand)
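
As a sketch with placeholder selectors, the whole method could look like this, chaining down one more level to a hypothetical parse_brand_page that appends Products to the Brand:

def parse_category_page(self, response):
    # Pull the category back out of the meta dict.
    parent_category = response.meta["category"]
    # Hypothetical selector; the real site will differ.
    for link in response.css("a.brand"):
        # Create a new items.Brand with the scraped data.
        brand = items.Brand(
            kind="Brand",
            number=link.attrib.get("data-number"),
            name=link.css("::text").get(),
            url=response.urljoin(link.attrib["href"]),
            products=[],
        )
        # Add the new brand to its parent in the tree.
        parent_category["brands"].append(brand)
        # Descend one more level, passing the Brand along so
        # parse_brand_page can append Products to it.
        request = scrapy.Request(brand["url"], callback=self.parse_brand_page)
        request.meta["brand"] = brand
        yield request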

That finishes up the scraping code. At this point, the spider would run but produce no output, because we never hand the root object, self.sports, back to Scrapy. So we need to return it, but only after all the scraping has finished. Here’s that code, also in the Spider subclass:

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    """Register to receive the spider_idle signal."""
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
    return spider

def spider_idle(self, spider):
    """Schedule a simple request in order to return the collected data."""
    if self.data_submitted:
        return
    # This is a hack: I don't yet know how to schedule a request to just
    # submit data _without_ also triggering a scrape. So I provide a URL
    # to a simple site that we're going to ignore.
    null_request = scrapy.Request("http://neverssl.com/", callback=self.submit_data)
    self.crawler.engine.schedule(null_request, spider)
    raise scrapy.exceptions.DontCloseSpider

def submit_data(self, _):
    """Simply return the collection of all the scraped data, ignoring the
    actual scraped content. I haven't figured out another way to submit the
    merged results.

    To be used as a callback when the spider is idle (i.e., has finished scraping).
    """
    self.data_submitted = True
    return self.sports
Reliable, but depends on a hack.

As you can see, this code relies on a hack: it schedules a scrape just so that it can return data, and it ignores the actual scrape results. Is there a more direct way to schedule a data return?

Finally, to get a proper JSON instance (with a hash at the top level), use the JSON Lines feed exporter. Because the spider yields only one item, the root self.sports, the JSON Lines exporter writes exactly one JSON object, rather than wrapping the items in a list the way the regular JSON exporter would.
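
For example, assuming a recent Scrapy (2.4 or later, for the FEEDS setting and the overwrite option) and an arbitrary output filename of my choosing, the feed can be configured in settings.py like this:

# settings.py
# Write every yielded item as one JSON object per line. Since this
# spider yields only the single root item, the file ends up holding
# exactly one top-level JSON object.
FEEDS = {
    "sports.json": {
        "format": "jsonlines",
        "overwrite": True,
    },
}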
