CAPTCHAs and Data Requests
CAPTCHAs, often installed on remote sites, challenge visitors to prove they’re human and not bots. Bots have a bad reputation with remote sites; at best, they’re seen as appropriating competitive data from websites. To block bots, CAPTCHAs present challenges that bots typically can’t handle but humans can, such as reading random, distorted numbers or letters. Users who don’t pass the CAPTCHA test can’t complete their data research.
CAPTCHAs sometimes also appear to ProxyMesh users, for example on the World Proxy. This article describes ways you can bypass CAPTCHAs and do your research without their distraction.
Scripting human behavior
Most often, users encounter CAPTCHAs when logging in to a website or purchasing items online. Scrapers with a “sneaker proxy” may encounter CAPTCHAs too.
Typically, humans leave random intervals between actions on the web. If you program scraping bots or scripts to stagger web actions like humans, you stand a good chance of avoiding CAPTCHAs, and you can potentially remain undetected.
It’s good practice to code in a random delay of 15 to 20 seconds between queries, or to increase from that point if necessary.
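The random delay described above can be sketched in Python as follows. This is a minimal illustration, not a complete scraper: `human_delay`, `run_with_delays`, and the caller-supplied `fetch` function are hypothetical names, and the 15 to 20 second range is simply the starting point suggested above.

```python
import random
import time


def human_delay(min_seconds=15.0, max_seconds=20.0):
    """Return a random pause length, in seconds, within the given range."""
    return random.uniform(min_seconds, max_seconds)


def run_with_delays(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    `fetch` is a hypothetical callable supplied by the caller. `sleep` is
    injectable so the pacing can be checked without actually waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause only between queries, not before the first one.
            sleep(human_delay())
        results.append(fetch(url))
    return results
```

If CAPTCHAs still appear, widening the range passed to `human_delay` is the simplest adjustment.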
Working with the proxy
Here are some things you can do with the proxy server alone, that is, without add-on tools.
Rotate your IP addresses
When all of your requests come from one IP address, or a single IP stays active on a remote site for too long, the site may block that IP. Sending requests from several IP addresses makes them appear to come from different sources. When you sign up with ProxyMesh, you get a pool of IP addresses that rotate automatically for you. Combined with staggering your requests, IP address rotation lets you scrape more efficiently.
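If you work with more than one proxy endpoint, one way to alternate between them is sketched below using only the Python standard library. The host names, port, and credentials are placeholders rather than real ProxyMesh values, and `proxy_openers` is a hypothetical helper. With a single rotating proxy, the rotation happens on the proxy side and one endpoint is enough.

```python
import itertools
import urllib.request

# Placeholder proxy endpoints; substitute the host, port, and credentials
# from your own proxy account.
PROXY_URLS = [
    "http://USERNAME:PASSWORD@proxy-a.example.com:8080",
    "http://USERNAME:PASSWORD@proxy-b.example.com:8080",
]


def proxy_openers(proxy_urls):
    """Yield (proxy_url, opener) pairs endlessly, one proxy after another."""
    for proxy_url in itertools.cycle(proxy_urls):
        handler = urllib.request.ProxyHandler(
            {"http": proxy_url, "https": proxy_url}
        )
        yield proxy_url, urllib.request.build_opener(handler)
```

Each request would then take the next pair, for example `proxy_url, opener = next(openers)` followed by `opener.open(url, timeout=30)`.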
Rotate user agents
User agent strings tell websites how you’re visiting them, such as which application and operating system you use. When you send a large number of requests, it’s important to switch user agent strings so that servers are less likely to detect and block you. Be sure to use realistic user agents as well, since servers can easily detect suspicious ones. You can check many user agents at WhatIsMyBrowser.com.
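As a rough sketch of user agent rotation, the snippet below picks a random string for each request. The three strings are illustrative examples only and will go out of date, and `build_request` is a hypothetical helper name.

```python
import random
import urllib.request

# Example user agent strings; illustrative values only, keep your own list current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_request(url, user_agents=USER_AGENTS):
    """Return a urllib Request carrying a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(user_agents)}
    return urllib.request.Request(url, headers=headers)
```

The returned request can be passed to `urllib.request.urlopen` or to a proxy-aware opener.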
Proxy rate limit
Rate limiting regulates the amount of incoming and outgoing traffic on a remote site. If you’re using an API configured to allow 100 requests a minute, and your requests exceed that number, then you may be blocked.
You can also limit your own request rate to avoid triggering CAPTCHAs. Even when using different proxies simultaneously, you still need to limit the rate of your queries. Even with effective rate limiting, many similar queries can still raise red flags, wherever they appear to originate. So it’s best to leave at least 2-3 seconds between requests.
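A client-side limit of this kind can be sketched as a small helper that enforces a minimum gap between requests. `MinIntervalLimiter` is a hypothetical class name, and the 3-second default is taken from the range suggested above. The `clock` and `sleep` parameters exist only so the behavior can be verified without waiting.

```python
import time


class MinIntervalLimiter:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request; None before the first

    def wait(self):
        """Sleep just long enough to keep requests `min_interval` apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` immediately before each request keeps the spacing consistent even when the requests go out through different proxies.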
Below is a sampling of tools you can add on to the proxy server to increase effectiveness in bypassing CAPTCHAs.
Even if you program your scraping software to avoid CAPTCHAs, there are some you won’t be able to avoid. For those, you can integrate CAPTCHA solvers into your web scraping tools. Solvers resolve CAPTCHAs automatically, allowing your web crawlers to keep working. Services like Death by CAPTCHA and Bypass CAPTCHA let you connect via API to enable automatic CAPTCHA solving during the scraping process. These services come with varying features and price points, including reasonably priced options for large-scale scraping projects, so you can compare them and choose what’s right for you.
A residential proxy differs from other proxies in that its IP address comes from a regular Internet Service Provider (ISP). This means that you get a reliable proxy server that is linked to a physical location. Residential proxies work in a way that helps you get around CAPTCHAs and collect the data you need. Here is a summary of the benefits.
By using IP addresses linked to authentic residential addresses such as consumers use, residential proxies provide legitimacy. This enables you to fly under the radar and avoid getting flagged. So it’s a great method for getting around CAPTCHAs. The more genuine you appear, the easier it will be to use the Internet without constantly worrying about CAPTCHAs.
IP rotation is a feature of both residential and datacenter proxies. As described above, it provides another great way of imitating human behavior. As long as web servers perceive you to be a legitimate user, you are less likely to be flagged or to encounter CAPTCHAs.
Many websites now have geo-restricted content, which can make your market research a lot harder. However, a good-quality residential proxy gives you access to a proxy network that covers plenty of locations across the globe. With access to this network, reaching the content you need becomes much easier, regardless of location. This, too, can help keep CAPTCHAs away. That’s especially important for large-scale web scraping projects: you need legitimate IP addresses that won’t get you blocked.
Some tools specifically address CAPTCHAs through APIs. For example, ScraperAPI, designed to work with residential proxies, has an efficient CAPTCHA-resolving service and can also handle proxies and browsers. A simple API call enables you to get the HTML data from any web page. It supports scraping on multiple platforms including Bash, Node, Python/Scrapy, PHP, Ruby, and Java.
ScraperAPI provides special tools for managing CAPTCHAs. You may be able to get the HTML from web pages without having to pass a CAPTCHA challenge. If you do choose ScraperAPI, sign up with the code PR10 for a 10% discount.
Please see our blog article Bypassing CAPTCHAs with Scraper API.
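To give a sense of what a “simple API call” to a service like ScraperAPI might look like, here is a small sketch that only builds the request URL. The endpoint and the `api_key` and `url` parameter names are assumptions based on the service’s publicly documented usage and may change, so check the current ScraperAPI documentation before relying on them. `scraperapi_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names; confirm against ScraperAPI's current docs.
SCRAPERAPI_ENDPOINT = "https://api.scraperapi.com/"


def scraperapi_url(api_key, target_url):
    """Build the URL for fetching `target_url` through the scraping API."""
    query = urlencode({"api_key": api_key, "url": target_url})
    return f"{SCRAPERAPI_ENDPOINT}?{query}"
```

Requesting the resulting URL with any HTTP client should then return the HTML of the target page, provided the assumptions above hold.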
To avoid potential obstacles, practice good scraping etiquette so you won’t get blocked. A key point of etiquette is to follow the remote site’s robots.txt file. That file tells bots, including search engine crawlers, which pages or files they can or cannot request from a site. The file can’t actually enforce these rules, but ignoring robots.txt is widely considered unethical and impolite scraping. It can also quickly get you banned.
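Python’s standard library includes a robots.txt parser, so a scraper can check a URL before requesting it. The sketch below parses a made-up robots.txt so that it runs offline. The sample rules, the `my-scraper` agent name, and the `allowed` helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/public/page", "/private/data"):
        url = "https://example.com" + path
        print(path, allowed(SAMPLE_ROBOTS_TXT, "my-scraper", url))
```

Against a live site, you would instead call `set_url()` with the site’s robots.txt address and then `read()` before calling `can_fetch()`.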
The legal issues around robots.txt files aren’t entirely clear and can vary from region to region. But by “scraping politely,” you’ll stay on the safer side.