Channel: Scrapy and invalid cookie found in request - Stack Overflow

Scrapy and invalid cookie found in request


Web Scraping Needs

I want to scrape the titles of the events listed on the first page of an Eventbrite search.

Approach

The page does not use much JavaScript and its pagination is simple, so grabbing the titles for every event on the page is quite easy and I have no problems with this.

However, I see there's an API whose HTTP requests I want to re-engineer, for efficiency and more structured data.

Problem

I'm able to mimic the HTTP request using the requests Python package, with the correct headers, cookies and parameters. Unfortunately, when I use the same cookies with Scrapy, it complains about three keys in the cookie dictionary that are blank: 'mgrefby': '', 'ebEventToTrack': '', 'AN': ''. This is despite the fact that they are also blank in the HTTP request made with the requests package.
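As a minimal sketch of what "blank" means here, the snippet below takes an abbreviated subset of the cookie dictionary (the long token values are left out), lists the names whose value is an empty string, and serialises the dictionary as a plain name=value Cookie header in which blank values are kept as "name=". The helper names blank_cookie_names and cookie_header are hypothetical, and this is not a claim about how either library builds its header internally.

```python
# Abbreviated subset of the cookie dictionary from the question;
# the long token values are omitted for brevity.
cookies = {
    "mgrefby": "",
    "ebEventToTrack": "",
    "eblang": "lo%3Den_US%26la%3Den-us",
    "AN": "",
    "mgref": "typeins",
    "SERVERID": "djc9",
}


def blank_cookie_names(cookie_dict):
    """Return the names whose value is an empty string, in dict order."""
    return [name for name, value in cookie_dict.items() if value == ""]


def cookie_header(cookie_dict):
    """Serialise a dict as a Cookie header value, keeping blanks as 'name='."""
    return "; ".join(f"{name}={value}" for name, value in cookie_dict.items())


print(blank_cookie_names(cookies))
print(cookie_header(cookies))
```

The three names printed first are the same three that the Scrapy warnings report.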

Requests Package Code Example

import requests

cookies = {
    'mgrefby': '',
    'G': 'v%3D2%26i%3Dbff2ee97-9901-4a2c-b5b4-5189c912e418%26a%3Dd24%26s%3D7a302cadca91b63816f5fd4a0a3939f9c9f02a09',
    'ebEventToTrack': '',
    'eblang': 'lo%3Den_US%26la%3Den-us',
    'AN': '',
    'AS': '50c57c08-1f5b-4e62-8626-ea32b680fe5b',
    'mgref': 'typeins',
    'client_timezone': '%22Europe/London%22',
    'csrftoken': '85d167cac78111ea983bcbb527f01d2f',
    'SERVERID': 'djc9',
    'SS': 'AE3DLHRwcfsggc-Hgm7ssn3PGaQQPuCJ_g',
    'SP': 'AGQgbbkgEVyrPOfb8QOLk2Q893Bkx6aqepKtFsfXUC9SW6rLrY3HzVmFa6m91qZ6rtJdG0PEVaIXdCuyQOL27zgxTHS-Pn0nHcYFr9nb_gcU1ayxSx4Y0QXLDvhxGB9EMsou1MZmIfEBN7PKFp_enhYD6HUP80-pNUGLI9R9_CrpFzXc48lp8jXiHog_rTjy_CHSluFrXr2blZAJfdC8g2lFpc4KN8wtSyOwn8qTs7di3FUZAJ9BfoA',
}

headers = {
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'X-CSRFToken': '85d167cac78111ea983bcbb527f01d2f',
    'Content-Type': 'application/json',
    'Accept': '*/*',
    'Origin': 'https://www.eventbrite.com',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.eventbrite.com/d/ny--new-york/human-resources/?page=2',
    'Accept-Language': 'en-US,en;q=0.9',
}

data = '{"event_search":{"q":"human resources","dates":"current_future","places":\n["85977539"],"page":1,"page_size":20,"online_events_only":false,"client_timezone":"Europe/London"},"expand.destination_event":["primary_venue","image","ticket_availability","saves","my_collections","event_sales_status"]}'

response = requests.post('https://www.eventbrite.com/api/v3/destination/search/', headers=headers, cookies=cookies, data=data)

Scrapy Code example

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['eventbrite.com']
    start_urls = []

    cookies = {
        'mgrefby': '',
        'G': 'v%3D2%26i%3Dbff2ee97-9901-4a2c-b5b4-5189c912e418%26a%3Dd24%26s%3D7a302cadca91b63816f5fd4a0a3939f9c9f02a09',
        'ebEventToTrack': '',
        'eblang': 'lo%3Den_US%26la%3Den-us',
        'AN': '',
        'AS': '50c57c08-1f5b-4e62-8626-ea32b680fe5b',
        'mgref': 'typeins',
        'client_timezone': '%22Europe/London%22',
        'csrftoken': '85d167cac78111ea983bcbb527f01d2f',
        'SERVERID': 'djc9',
        'SS': 'AE3DLHRwcfsggc-Hgm7ssn3PGaQQPuCJ_g',
        'SP': 'AGQgbbkgEVyrPOfb8QOLk2Q893Bkx6aqepKtFsfXUC9SW6rLrY3HzVmFa6m91qZ6rtJdG0PEVaIXdCuyQOL27zgxTHS-Pn0nHcYFr9nb_gcU1ayxSx4Y0QXLDvhxGB9EMsou1MZmIfEBN7PKFp_enhYD6HUP80-pNUGLI9R9_CrpFzXc48lp8jXiHog_rTjy_CHSluFrXr2blZAJfdC8g2lFpc4KN8wtSyOwn8qTs7di3FUZAJ9BfoA',
    }

    headers = {
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'X-CSRFToken': '85d167cac78111ea983bcbb527f01d2f',
        'Content-Type': 'application/json',
        'Accept': '*/*',
        'Origin': 'https://www.eventbrite.com',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://www.eventbrite.com/d/ny--new-york/human-resources/?page=1',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    data = '{"event_search":{"q":"human resources","dates":"current_future","places":\n["85977539"],"page":1,"page_size":20,"online_events_only":false,"client_timezone":"Europe/London"},"expand.destination_event":["primary_venue","image","ticket_availability","saves","my_collections","event_sales_status"]}'

    def start_requests(self):
        url = 'https://www.eventbrite.com/api/v3/destination/search/'
        yield scrapy.Request(url=url, method='POST', headers=self.headers, cookies=self.cookies, callback=self.parse)

    def parse(self, response):
        print('request')

Output

2020-08-01 11:55:33 [scrapy.core.engine] INFO: Spider opened
2020-08-01 11:55:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-01 11:55:33 [test] INFO: Spider opened: test
2020-08-01 11:55:33 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\Aaron\projects\scrapy\eventbrite\.scrapy\httpcache
2020-08-01 11:55:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-01 11:55:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.eventbrite.com/robots.txt> (referer: None) ['cached']
2020-08-01 11:55:33 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'mgrefby', 'value': ''} ('value' is missing)
2020-08-01 11:55:33 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'ebEventToTrack', 'value': ''} ('value' is missing)
2020-08-01 11:55:33 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'AN', 'value': ''} ('value' is missing)
2020-08-01 11:55:33 [scrapy.core.engine] DEBUG: Crawled (401) <POST https://www.eventbrite.com/api/v3/destination/search/> (referer: https://www.eventbrite.com/d/ny--new-york/human-resources/?page=1) ['cached']
2020-08-01 11:55:33 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <401 https://www.eventbrite.com/api/v3/destination/search/>: HTTP status code is not handled or not allowed
2020-08-01 11:55:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-01 11:55:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1540,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 32163,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/401': 1,
 'elapsed_time_seconds': 0.187986,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 8, 1, 10, 55, 33, 202931),
 'httpcache/hit': 2,
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/401': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 12,
 'log_count/WARNING': 3,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 8, 1, 10, 55, 33, 14945)}
2020-08-01 11:55:33 [scrapy.core.engine] INFO: Spider closed (finished)
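
Both "Crawled" lines in this output carry a ['cached'] marker and the stats report 'httpcache/hit': 2, which suggests the 401 may have been replayed from the local HTTP cache rather than fetched fresh. Below is a minimal sketch, using only the standard library, for pulling the status, method, URL and cache marker out of such lines. The function name summarise_crawled is hypothetical.

```python
import re

# Two "Crawled" lines copied from the output above.
log_lines = [
    "2020-08-01 11:55:33 [scrapy.core.engine] DEBUG: Crawled (200) "
    "<GET https://www.eventbrite.com/robots.txt> (referer: None) ['cached']",
    "2020-08-01 11:55:33 [scrapy.core.engine] DEBUG: Crawled (401) "
    "<POST https://www.eventbrite.com/api/v3/destination/search/> "
    "(referer: https://www.eventbrite.com/d/ny--new-york/human-resources/?page=1) "
    "['cached']",
]

# Status code, HTTP method, URL, then whatever trails the closing '>'.
CRAWLED = re.compile(r"Crawled \((\d{3})\) <(\w+) (\S+)>(.*)$")


def summarise_crawled(line):
    """Return a small dict describing a 'Crawled' log line, or None."""
    match = CRAWLED.search(line)
    if match is None:
        return None
    status, method, url, rest = match.groups()
    return {
        "status": int(status),
        "method": method,
        "url": url,
        "cached": "['cached']" in rest,
    }


for line in log_lines:
    print(summarise_crawled(line))
```

If the cache is involved, Scrapy's HTTPCACHE_ENABLED setting is the switch that controls it, so a changed request might only reach the server once that is turned off.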

Attempts to solve issue

The 401 status seems to relate to authorisation, so I can only presume the API doesn't like the cookie I'm sending.

  1. I've set COOKIES_ENABLED = True with the same output as before
  2. I've set COOKIES_DEBUG = True and see output below

Output with cookies_debug=True

2020-08-01 12:05:15 [scrapy.core.engine] INFO: Spider opened
2020-08-01 12:05:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-01 12:05:15 [test] INFO: Spider opened: test
2020-08-01 12:05:15 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\Aaron\projects\scrapy\eventbrite\.scrapy\httpcache
2020-08-01 12:05:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-01 12:05:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.eventbrite.com/robots.txt> (referer: None) ['cached']
2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'mgrefby', 'value': ''} ('value' is missing)
2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'ebEventToTrack', 'value': ''} ('value' is missing)
2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] WARNING: Invalid cookie found in request <POST https://www.eventbrite.com/api/v3/destination/search/>: {'name': 'AN', 'value': ''} ('value' is missing)
2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <POST https://www.eventbrite.com/api/v3/destination/search/>
Cookie: G=v%3D2%26i%3Dbff2ee97-9901-4a2c-b5b4-5189c912e418%26a%3Dd24%26s%3D7a302cadca91b63816f5fd4a0a3939f9c9f02a09; eblang=lo%3Den_US%26la%3Den-us; AS=50c57c08-1f5b-4e62-8626-ea32b680fe5b; mgref=typeins; client_timezone=%22Europe/London%22; csrftoken=85d167cac78111ea983bcbb527f01d2f; SERVERID=djc9; SS=AE3DLHRwcfsggc-Hgm7ssn3PGaQQPuCJ_g; SP=AGQgbbkgEVyrPOfb8QOLk2Q893Bkx6aqepKtFsfXUC9SW6rLrY3HzVmFa6m91qZ6rtJdG0PEVaIXdCuyQOL27zgxTHS-Pn0nHcYFr9nb_gcU1ayxSx4Y0QXLDvhxGB9EMsou1MZmIfEBN7PKFp_enhYD6HUP80-pNUGLI9R9_CrpFzXc48lp8jXiHog_rTjy_CHSluFrXr2blZAJfdC8g2lFpc4KN8wtSyOwn8qTs7di3FUZAJ9BfoA
2020-08-01 12:05:15 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <401 https://www.eventbrite.com/api/v3/destination/search/>
Set-Cookie: SP=AGQgbbno_KHLNiLzDpLHcdI4kotUbRiTxMMY5N0t7VudPU_QGCm2Q0nH7-J99aoRZvGLxXfREH5YfPAtK52iiiLcEpnjh1G43ZBxKuo9qvJHykLV23ZIjaFK0iIr6ptOaczMoQhkaqE-7nJ8t2Ykt18CN196pKZ5QhFuXy6SnspZ0toEGChZcQgmrAAAVPfuoiiUmbTG_wJC8_KikL2sYl2s6-KWUOOpjRFJCko5RGgiyC2Osu9vxZ8; Domain=.eventbrite.com; httponly; Path=/; secure
Set-Cookie: G=v%3D2%26i%3D5cebebd2-2a7f-4638-9912-0abf19111a0c%26a%3Dd33%26s%3Df967e32d15dda2f06b392f22451af935d93f88d1; Domain=.eventbrite.com; expires=Sat, 31-Jul-2021 22:46:28 GMT; httponly; Path=/; secure
Set-Cookie: ebEventToTrack=; Domain=.eventbrite.com; expires=Sun, 30-Aug-2020 22:46:28 GMT; httponly; Path=/; secure
Set-Cookie: SS=AE3DLHRgTIL46n9XiOZiJRSkccGnNXSMkA; Domain=.eventbrite.com; httponly; Path=/; secure
Set-Cookie: eblang=lo%3Den_US%26la%3Den-us; Domain=.eventbrite.com; expires=Sat, 31-Jul-2021 22:46:28 GMT; httponly; Path=/; secure
Set-Cookie: AN=; Domain=.eventbrite.com; expires=Sun, 30-Aug-2020 22:46:28 GMT; httponly; Path=/; secure
Set-Cookie: AS=350def0c-ed27-45ab-b12c-02e9fb68a8ae; Domain=.eventbrite.com; httponly; Path=/; secure
Set-Cookie: SERVERID=djc44; path=/; HttpOnly; Secure
2020-08-01 12:05:15 [scrapy.core.engine] DEBUG: Crawled (401) <POST https://www.eventbrite.com/api/v3/destination/search/> (referer: https://www.eventbrite.com/d/ny--new-york/human-resources/?page=1) ['cached']
2020-08-01 12:05:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <401 https://www.eventbrite.com/api/v3/destination/search/>: HTTP status code is not handled or not allowed
2020-08-01 12:05:15 [scrapy.core.engine] INFO: Closing spider (finished)

  3. I've tried a custom Scrapy cookies downloader middleware for cookie persistence, with the same error as before
  4. I've considered using browser automation to grab a cookie, but I'm wary of this for future scrapes, as I don't want to have to continuously grab a fresh cookie.

What I don't understand is that, with the same cookies, headers and parameters, the requests package gets the JSON object back in the response, whereas Scrapy complains about blank dictionary values.

I would be grateful if anyone could look at the code to see whether I've made a glaring mistake, or explain why the cookie that the API endpoint accepts via requests does not seem to work in Scrapy.
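
One way to pin down the difference is to compare the cookie names supplied to the request with the names that appear in the "Cookie:" line of the COOKIES_DEBUG output. The following is a rough sketch under that assumption, with abbreviated values and hypothetical helper names (parse_cookie_header, dropped_cookies):

```python
def parse_cookie_header(header_value):
    """Split a 'name=value; name=value' string into a dict."""
    parsed = {}
    for pair in header_value.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, so values may contain '='.
        name, _, value = pair.partition("=")
        parsed[name] = value
    return parsed


def dropped_cookies(supplied, header_value):
    """Names present in the supplied dict but absent from the sent header."""
    sent = parse_cookie_header(header_value)
    return [name for name in supplied if name not in sent]


# Abbreviated subset of the supplied cookies (long tokens omitted).
supplied = {
    "mgrefby": "",
    "ebEventToTrack": "",
    "eblang": "lo%3Den_US%26la%3Den-us",
    "AN": "",
    "mgref": "typeins",
    "SERVERID": "djc9",
}

# Matching subset of the 'Cookie:' line from the debug output.
sent_header = "eblang=lo%3Den_US%26la%3Den-us; mgref=typeins; SERVERID=djc9"

print(dropped_cookies(supplied, sent_header))
```

On the abbreviated data, the names that come back are exactly the three blank cookies, which matches the debug log: they are absent from the Cookie header Scrapy sent.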

