This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
tesco_scraper [2022/05/02 15:23] admin created |
tesco_scraper [2022/05/02 15:38] (current) admin |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== Tesco Scraper ====== | ====== Tesco Scraper ====== | ||
I threw this together one afternoon because I was frustrated that, although Tesco displays the 'price-per-ml' for things like cans and bottles of Coca-Cola, it doesn't let you sort by 'price-per-ml'. So I wrote a quick script that downloads an entire subcategory of products, extracts their data, then writes it all to a CSV that I can filter/sort myself. | I threw this together one afternoon because I was frustrated that, although Tesco displays the 'price-per-ml' for things like cans and bottles of Coca-Cola, it doesn't let you sort by 'price-per-ml'. So I wrote a quick script that downloads an entire subcategory of products, extracts their data, then writes it all to a CSV that I can filter/sort myself. | ||
+ | |||
+ | It was easier than I expected (I didn't have to do much scraping), as the page's body tag carries a data attribute that holds all the product details in JSON format.
<code php tesco-to-csv.php> | <code php tesco-to-csv.php> | ||
Line 46: | Line 48: | ||
} | } | ||
- | $lines = array('Product Name', 'Tesco URL', 'Brand', 'Price', 'Price per unit', 'Unit Measure'); | + | $lines = array(); |
+ | $lines[] = array('Product Name', 'Tesco URL', 'Brand', 'Price', 'Price per unit', 'Unit Measure'); | ||
// read through each of the pages, parse the json and output as csv | // read through each of the pages, parse the json and output as csv | ||
echo "Parsing JSON to CSV\n"; | echo "Parsing JSON to CSV\n"; | ||
Line 53: | Line 56: | ||
$file_content = file_get_contents($output_file."-".$i.".html"); | $file_content = file_get_contents($output_file."-".$i.".html"); | ||
$doc= new DOMDocument(); | $doc= new DOMDocument(); | ||
- | $doc->loadHTML($file_content); | + | @$doc->loadHTML($file_content); |
$body = $doc->getElementsByTagName('body')->item(0); | $body = $doc->getElementsByTagName('body')->item(0); | ||
// get all product data on the page in JSON format (it's a data-attribute of the body element) | // get all product data on the page in JSON format (it's a data-attribute of the body element) | ||
Line 81: | Line 84: | ||
fputcsv($fp, $line); | fputcsv($fp, $line); | ||
} | } | ||
+ | |||
</code> | </code> | ||
+ | |||
+ | <code bash> | ||
+ | php tesco-to-csv.php | ||
+ | Downloading https://www.tesco.com/groceries/en-GB/shop/drinks/fizzy-and-soft-drinks/all?sortBy=price-descending&count=48 | ||
+ | Downloading https://www.tesco.com/groceries/en-GB/shop/drinks/fizzy-and-soft-drinks/all?sortBy=price-descending&count=48&page=2 | ||
+ | Downloading https://www.tesco.com/groceries/en-GB/shop/drinks/fizzy-and-soft-drinks/all?sortBy=price-descending&count=48&page=3 | ||
+ | Downloading https://www.tesco.com/groceries/en-GB/shop/drinks/fizzy-and-soft-drinks/all?sortBy=price-descending&count=48&page=4 | ||
+ | Downloading https://www.tesco.com/groceries/en-GB/shop/drinks/fizzy-and-soft-drinks/all?sortBy=price-descending&count=48&page=5 | ||
+ | Parsing JSON to CSV | ||
+ | Writing CSV | ||
+ | </code> | ||
+ | |||
+ | {{:pasted:20220502-153759.png}} |
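
For anyone who only wants the gist of the approach, here is a minimal, self-contained sketch of the two steps the diff touches: reading the JSON out of a data attribute on the body tag, and building the CSV with the header as the first row. The attribute name `data-products`, the JSON keys (`items`, `title`, `price`, `unitPrice`, `unitOfMeasure`) and the helper `extract_body_json()` are all made up for illustration; the names on the real page will differ. It also swaps the `@` suppression for `libxml_use_internal_errors()`, which is another way of silencing the parser warnings that real-world HTML tends to trigger.

```php
<?php
// Sketch only: 'data-products' and the JSON keys used below are hypothetical
// stand-ins, not the names used on the real Tesco page.

/**
 * Return the decoded JSON held in a data attribute of <body>,
 * or null if the attribute is missing or does not decode to an array.
 */
function extract_body_json(string $html, string $attribute): ?array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if (!($body instanceof DOMElement) || !$body->hasAttribute($attribute)) {
        return null;
    }
    $data = json_decode($body->getAttribute($attribute), true);
    return is_array($data) ? $data : null;
}

// Stand-in for a downloaded page: product JSON embedded in a body attribute
$sample = array('items' => array(
    array('title' => 'Cola 330ml Can', 'price' => '0.80', 'unitPrice' => '0.24', 'unitOfMeasure' => '100ml'),
    array('title' => 'Cola 2L Bottle', 'price' => '2.00', 'unitPrice' => '0.10', 'unitOfMeasure' => '100ml'),
));
$html = '<html><body data-products="'
    . htmlspecialchars(json_encode($sample), ENT_QUOTES)
    . '"><p>listing</p></body></html>';

$data = extract_body_json($html, 'data-products');

// The header is the first row of an array of rows, so every row reaches fputcsv() as an array
$lines = array();
$lines[] = array('Product Name', 'Price', 'Price per unit', 'Unit Measure');
foreach (($data['items'] ?? array()) as $item) {
    $lines[] = array($item['title'], $item['price'], $item['unitPrice'], $item['unitOfMeasure']);
}

// Write to an in-memory stream so the sketch leaves no files behind
$fp = fopen('php://memory', 'w+');
foreach ($lines as $line) {
    // Explicit delimiter, enclosure and (empty) escape; the empty escape needs PHP 7.4+
    fputcsv($fp, $line, ',', '"', '');
}
rewind($fp);
$csv = stream_get_contents($fp);
fclose($fp);

echo $csv;
```

Keeping the header inside `$lines` as its own array is the point of the `$lines[] = array(...)` change above: a flat array of strings would hand `fputcsv()` a string instead of a row on each pass through the loop.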