Using urllib3 for Proxy Headers

When you’re preparing to scrape the web using a specific IP address, it’s common to obtain that address from a response header through a proxy tunnel. In this article we’ll suggest a useful approach:

  1. Open a proxy tunnel to get the IP address, which you will then pass along in a custom header.
  2. Switch to urllib3 to send the request to the remote site.

Getting the Response

Generally, using Python’s http.client, you can get the IP address for the header with code that looks like this:

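import http.client
from typing import Dict
from urllib.parse import urlsplit

def connect_to_proxy(scrape_url: str, headers: Dict[str, str], proxy_url: str):
    # proxy_url is "host:port"; the port must be an integer
    host, port = proxy_url.split(':')
    # HTTPSConnection sends CONNECT to the proxy, then starts TLS inside the tunnel
    conn = http.client.HTTPSConnection(host, int(port), timeout=60)
    target = urlsplit(scrape_url)
    # set_tunnel() expects the target hostname, not the full URL
    conn.set_tunnel(target.hostname, 443, headers)
    path = (target.path or "/") + ("?" + target.query if target.query else "")
    conn.request("GET", path, headers=headers)
    return conn.getresponse()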

The headers you pass in should look something like this:

{
  "user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36",
  "x-requested-with":"XMLHttpRequest",
  "x-requested-by-react":"true",
  "referer":"https://www.yelp.com/biz/",
  "Cache-Control":"no-cache",
  "Accept":"*/*",
  "Accept-Encoding":"gzip, deflate, br",
  "Connection":"keep-alive",
  "Proxy-Authorization": proxy_auth,
}
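The proxy_auth value above is not defined in this article. As a minimal sketch, assuming your proxy accepts HTTP Basic authentication (the alternative script below builds the same kind of value), it could be produced like this. The function name and the "user" / "pass" credentials are placeholders, not real values.

```python
import base64

def basic_proxy_auth(username: str, password: str) -> str:
    """Build a Proxy-Authorization value for HTTP Basic authentication."""
    # b64encode() works on bytes, so encode first and decode the result
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")

# Placeholder credentials, for illustration only
proxy_auth = basic_proxy_auth("user", "pass")
print(proxy_auth)  # Basic dXNlcjpwYXNz
```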

If you receive an error such as "Tunnel connection failed: 503 server error", you can try the following alternative script, which retrieves the IP from httpbin. First, if necessary, authenticate your IP and authorize a proxy server; in this example we’re using us-wa.proxymesh.com.

import base64
import http.client
from urllib.parse import urlparse

proxy_uri = "http://us-wa.proxymesh.com:31280"
host = 'httpbin.org'
port = 443
url = urlparse(proxy_uri)
# HTTPSConnection connects to the proxy, sends CONNECT, then starts TLS with the target
conn = http.client.HTTPSConnection(url.hostname, url.port)
headers = {}
if url.username and url.password:
    auth = '%s:%s' % (url.username, url.password)
    # b64encode() works on bytes, so encode the credentials and decode the result
    headers['Proxy-Authorization'] = 'Basic ' + base64.b64encode(auth.encode()).decode()
conn.set_tunnel(host, port, headers)
conn.request("GET", "/ip")
response = conn.getresponse()
print(response.status, response.reason)
output = response.read()
print(output.decode())

This will output something like this:

200 OK
{
  "origin": "NN.NN.NNN.NN"
}
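To reuse the address in a later request, the response body has to be parsed. The following is a small sketch of one way to do that, assuming the body is the JSON shown above; the function name and the sample address (taken from a documentation-only IP range) are illustrative.

```python
import json

def extract_origin_ip(body: bytes) -> str:
    """Return the "origin" field from an httpbin /ip response body."""
    data = json.loads(body.decode("utf-8"))
    # Defensive: if the field holds a comma-separated list, keep the first entry
    return data["origin"].split(",")[0].strip()

# Sample body shaped like the output above (the address is a placeholder)
sample = b'{\n  "origin": "203.0.113.7"\n}\n'
print(extract_origin_ip(sample))
```

In the script above, you would call it as extract_origin_ip(output).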

Passing in the Proxy Headers

To pass in the header and send your request, we recommend Python’s urllib3 HTTP client. urllib3 is a powerful, user-friendly HTTP client for Python. It offers thread safety, connection pooling, client-side SSL/TLS verification, file uploads with multipart encoding, helpers for retrying requests and dealing with HTTP redirects, gzip and deflate encoding, and proxy support for HTTP and SOCKS.

With urllib3, custom headers meant for the proxy are passed to the ProxyManager, which sends them to the proxy. Headers meant for the remote site are passed in the request() call. Here's some sample code:

import urllib3

def fetch_via_proxy(url, proxy_ip):
    # proxy_headers are sent to the proxy, not to the remote site
    proxy = urllib3.ProxyManager('http://PROXY.proxymesh.com:31280', proxy_headers={'X-ProxyMesh-IP': proxy_ip})
    return proxy.request('GET', url)

proxy_headers is a dictionary containing headers that will be sent to the proxy. For plain HTTP, they are sent with every request. For HTTPS, they are sent only once, on the CONNECT request that opens the tunnel. They can be used for proxy authentication or for passing in the X-ProxyMesh-IP header with an IP that you retrieved with the previous code.
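If you start from a single header dictionary like the one shown earlier, it may help to separate the proxy-bound headers from the rest, so that values such as Proxy-Authorization are not forwarded to the remote site. Here is a rough sketch of that idea. The PROXY_ONLY set and both function names are assumptions made for illustration; only ProxyManager, proxy_headers, and request() come from urllib3.

```python
# Header names intended for the proxy rather than the remote site
# (an assumption based on the two proxy headers this article mentions).
PROXY_ONLY = {"proxy-authorization", "x-proxymesh-ip"}

def split_headers(headers):
    """Split one header dict into (proxy_headers, request_headers)."""
    proxy_headers, request_headers = {}, {}
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lowercase
        target = proxy_headers if name.lower() in PROXY_ONLY else request_headers
        target[name] = value
    return proxy_headers, request_headers

def fetch(url, proxy_url, headers):
    """Send a GET through the proxy, routing each header to the right place."""
    import urllib3  # third-party; imported here so split_headers works without it
    proxy_headers, request_headers = split_headers(headers)
    proxy = urllib3.ProxyManager(proxy_url, proxy_headers=proxy_headers)
    return proxy.request("GET", url, headers=request_headers)
```

For example, fetch(URL, 'http://PROXY.proxymesh.com:31280', headers) would send X-ProxyMesh-IP and Proxy-Authorization to the proxy and everything else to the remote site.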

For more details, please see the urllib3 ProxyManager documentation.

If you encounter other tunnel-related errors, such as 407 or 501, and need help with them, please contact Support.
