Using urllib3 for Proxy Headers
When you’re preparing to scrape the web using a specific IP address, it’s common to obtain that address from a response header through a proxy tunnel. In this article we’ll suggest a useful approach:
- Open a proxy tunnel to get the IP, to be passed into the request in a custom header.
- Switch to urllib3 to send the request to the remote site.
Getting the Response
Generally, using http.client in Python, you can get the response that carries the IP address for the header with code that looks like this:
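    import http.client
    from typing import Dict
    from urllib.parse import urlsplit

    def connect_to_proxy(scrape_url: str, headers: Dict, proxy_url: str):
        # proxy_url has the form "host:port"
        host, port = proxy_url.split(':')
        target = urlsplit(scrape_url)
        # HTTPSConnection opens the tunnel through the proxy, then uses TLS to the target
        conn = http.client.HTTPSConnection(host, int(port), timeout=60)
        # set_tunnel() expects the target host name, not the full URL
        conn.set_tunnel(target.hostname, 443, headers)
        conn.connect()
        conn.request("GET", target.path or "/", None, headers)
        r = conn.getresponse()
        return r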
The headers you pass in should look something like this:
    {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36",
        "x-requested-with": "XMLHttpRequest",
        "x-requested-by-react": "true",
        "referer": "https://www.yelp.com/biz/",
        "Cache-Control": "no-cache",
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Proxy-Authorization": proxy_auth,
    }
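The proxy_auth value in the headers above is not defined in the snippet. Here is a minimal sketch of one way to build it, assuming your proxy accepts HTTP Basic authentication. The helper name basic_proxy_auth and the credentials are hypothetical placeholders, not part of the original code.

```python
import base64


def basic_proxy_auth(username: str, password: str) -> str:
    """Return a value for the Proxy-Authorization header (HTTP Basic scheme)."""
    # Basic auth is "username:password", base64-encoded.
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Placeholder credentials, for illustration only.
proxy_auth = basic_proxy_auth("USERNAME", "PASSWORD")
```

If your proxy authorizes you by IP address instead, you may not need this header at all.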
If you receive a 503 error such as "Tunnel connection failed: 503 server error", you can try the following alternative script, which retrieves the IP from httpbin. First, if necessary, authenticate your IP and authorize a proxy server; this example uses us-wa.proxymesh.com.
    import base64
    import http.client
    from urllib.parse import urlparse

    proxy_uri = "http://us-wa.proxymesh.com:31280"
    host = 'httpbin.org'
    port = 443

    url = urlparse(proxy_uri)
    conn = http.client.HTTPSConnection(url.hostname, url.port)
    headers = {}
    if url.username and url.password:
        # b64encode() works on bytes, so encode the credentials and decode the result
        auth = '%s:%s' % (url.username, url.password)
        headers['Proxy-Authorization'] = 'Basic ' + base64.b64encode(auth.encode()).decode()
    conn.set_tunnel(host, port, headers)
    conn.request("GET", "/ip")
    response = conn.getresponse()
    print(response.status, response.reason)
    output = response.read()
    print(output.decode())

This will output something like this:

    200 OK
    {
      "origin": "NN.NN.NNN.NN"
    }
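To reuse the address in a later request, you need it as a plain string rather than a JSON body. Here is a minimal sketch, assuming the body has the shape shown above. The helper name extract_origin_ip is hypothetical, and keeping only the first entry when several comma-separated addresses are listed is an assumption you may want to change.

```python
import json


def extract_origin_ip(body: bytes) -> str:
    """Pull the "origin" address out of an httpbin /ip response body."""
    data = json.loads(body.decode("utf-8"))
    # Assumption: if several comma-separated addresses are listed, keep the first.
    return data["origin"].split(",")[0].strip()


# Sample body with a documentation-range address, for illustration only.
sample = b'{ "origin": "203.0.113.7" }'
print(extract_origin_ip(sample))
```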
Passing in the Proxy Headers
To pass in the header and send your request, we recommend urllib3, a powerful, user-friendly HTTP client for Python. It offers thread safety, connection pooling, client-side SSL/TLS verification, file uploads with multipart encoding, helpers for retrying requests and handling HTTP redirects, support for gzip and deflate encoding, and proxy support for HTTP and SOCKS.
With urllib3, headers meant for the proxy are passed to the proxy manager when you create it, and the manager sends them to the proxy. Headers meant for the remote site are passed in the request function call. Here's some sample code:
    import urllib3

    # PROXY_IP is the address you retrieved earlier; URL is the page to fetch
    proxy = urllib3.ProxyManager(
        'http://PROXY.proxymesh.com:31280',
        proxy_headers={'X-ProxyMesh-IP': PROXY_IP},
    )
    r = proxy.request('GET', URL)
    print(r.status)
proxy_headers is a dictionary containing headers that will be sent to the proxy. For a plain HTTP request, they are sent with each request. For HTTPS, they are sent only once, with the CONNECT request that opens the tunnel. They can be used for proxy authentication or for passing in the X-ProxyMesh-IP header with the IP that you retrieved using the earlier code.
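The split between proxy headers and request headers can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the function names build_proxy_headers and fetch_with_ip are hypothetical, while ProxyManager, its proxy_headers argument, and the headers argument of request() are urllib3's own.

```python
from typing import Dict, Optional


def build_proxy_headers(proxy_ip: Optional[str] = None,
                        proxy_auth: Optional[str] = None) -> Dict[str, str]:
    """Collect the headers intended for the proxy, not the destination site."""
    headers: Dict[str, str] = {}
    if proxy_ip:
        headers["X-ProxyMesh-IP"] = proxy_ip
    if proxy_auth:
        headers["Proxy-Authorization"] = proxy_auth
    return headers


def fetch_with_ip(url: str, proxy_url: str, proxy_ip: str,
                  request_headers: Optional[Dict[str, str]] = None):
    """Fetch url through proxy_url, asking the proxy to use proxy_ip."""
    # urllib3 is third-party; it is imported here so the helper above
    # only needs the standard library.
    import urllib3

    proxy = urllib3.ProxyManager(
        proxy_url,
        proxy_headers=build_proxy_headers(proxy_ip=proxy_ip),
    )
    # Headers passed here are request headers, intended for the destination site.
    return proxy.request("GET", url, headers=request_headers)
```

A call might look like fetch_with_ip("https://httpbin.org/ip", "http://PROXY.proxymesh.com:31280", "NN.NN.NNN.NN"), with the placeholders replaced by your own proxy host and IP.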
For more details, please see the urllib3 ProxyManager documentation.
If you encounter other tunnel-related errors, such as 407 or 501, and need help with them, please contact Support.