How to produce a JSON tree with nested data from Scrapy
This was an interesting puzzle: creating a single well-formed JSON document from a hierarchy of web pages. E.g., the sporting goods hierarchy of an e-commerce site could be Categories, Brands, Products. And so you’d like to output JSON like this:
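The embedded example is missing from this copy of the post, so what follows is only a minimal sketch of the target shape, built as a plain Python dict and serialized with the standard `json` module. The key names (`categories`, `brands`, `products`) and all sample values are assumptions chosen to mirror the Categories, Brands, Products hierarchy.

```python
import json

# Hypothetical sample data: one root object, each level nested inside its parent.
sports_inventory = {
    "categories": [
        {
            "name": "Cycling",
            "brands": [
                {
                    "name": "Example Brand",
                    "products": [
                        {"name": "Road Bike", "price": "999.00"},
                    ],
                },
            ],
        },
    ],
}

# Serialize the whole tree as one JSON document.
inventory_json = json.dumps(sports_inventory, indent=2)
print(inventory_json)
```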
As an aside, I like architecting my systems to generate this kind of output from my first-level scrapers: they simply mirror the source’s structure. And because the result is clean, well-formed JSON, the importing code that follows can be simple.
My recipe
First, I give my Spider subclass a couple of instance variables:
Next, my top-level parse method returns its data by creating an Item and adding it directly into the structure. Finally, it yields a Request for the next page, passing the new Item along to that page’s parser:
In the parse method for the “next level down”, I do the same thing. Except now, I save the newly created Item in the passed-in Category:
That finishes up the scraping code. At this point, the spider would run but produce no output, because nothing transmits the root object, self.SportsInventory, back to Scrapy. So we need to hand it over, but only after all the page requests have finished. Here’s that code, also in the Spider subclass:
As you can see, this code uses a hack: it schedules a scrape just so that it can return data, and it ignores the actual scrape results. Is there a more direct way to schedule a data return?
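Finally, to get a proper JSON instance (with a hash at the top level), use the JSON Lines Feed Exporter. The regular JSON exporter wraps all items in an array, whereas JSON Lines writes one object per line, so with a single root item the output file is exactly one JSON object.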
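As a sketch, the exporter can be selected in the project's settings through the `FEEDS` setting (available in Scrapy 2.1 and later). The output file name is a placeholder.

```python
# settings.py fragment; the output file name is a placeholder.
FEEDS = {
    "sports_inventory.json": {
        # "jsonlines" writes each item as one JSON object on its own line,
        # so a single root item produces a file holding one JSON object.
        "format": "jsonlines",
        "encoding": "utf8",
    },
}
```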