A user agent is a computer program representing a person, for example, a browser in a Web context.
Besides a browser, a user agent could be a bot scraping webpages, a download manager, or another app accessing the Web. Along with each request they make to the server, browsers include a self-identifying User-Agent HTTP header, whose value is called a user agent (UA) string. This string often identifies the browser, its version number, and its host operating system.
Spam bots, download managers, and some browsers often send a fake UA string to announce themselves as a different client. This is known as user agent spoofing.
A typical user agent string looks like this: "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0".
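To illustrate the structure, here's a minimal sketch (not from the original article) that pulls the product/version tokens out of that example string. Real-world UA strings are far messier than this, and production code typically relies on a dedicated parsing library.

```javascript
// Minimal sketch: extracting the "name/version" product tokens from the
// example user agent string above.
const ua =
  'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0';

// Product tokens follow the "name/version" form used throughout HTTP headers.
const products = ua.match(/[A-Za-z]+\/[\d.]+/g);

console.log(products); // [ 'Mozilla/5.0', 'Gecko/20100101', 'Firefox/35.0' ]
```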
Sending realistic user agent strings like this helps you avoid getting blocked by overzealous DDoS protection services, and allows you to successfully scrape the data that you're interested in while keeping site operators happy. Making realistic browsing patterns can get pretty complicated; we've previously explained some sophisticated techniques in articles like Making Chrome Headless Undetectable and It Is Not Possible to Detect and Block Chrome Headless.
The Intoli Smart Proxy Service goes far beyond the methods that those articles describe in order to create browsing patterns that are indistinguishable from those of human users. The only problem is that there aren't many great resources out there for generating realistic user agents.
There are a handful of paid solutions out there, but the free lists only offer a limited slice of data and usually become outdated very quickly (you can check out Wikipedia’s Usage share of web browsers article to see this for yourself). The situation with open source libraries for random user agent generation is even worse; they’re typically published once or twice and then never updated.
User agent statistics are only really useful for web scraping when they're up to date, and the few truncated lists that you find when you Google things like "most common user agents" are generally too limited to apply at scale. The Intoli website gets a pretty healthy amount of traffic, and we're big fans of open information, so this seemed like a natural opportunity for us to step in and provide the community with a useful resource for web scraping.
The basic idea is that we run a scheduled build on CircleCI every night that fetches the data from Google Analytics and digests it into an anonymized form that's then committed to the repository and published on NPM. Be sure to star the repository while you're over there; it lets us know that people use the project and encourages us to devote more developer resources towards it!
You don't need to understand these details in order to use the user-agents package, but we thought the process was interesting enough to share in case people are curious about how it works. Google Analytics tracks a variety of dimensions by default, but the browser user agent isn't one of them.
In order to track the user agent directly, we needed to add a custom dimension to our analytics. To start actually collecting this data, we simply needed to use the Google Analytics set command to specify that the custom dimension is equal to the value of the navigator.userAgent property.
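As a sketch, the tracking call might look like the following. The dimension index "dimension1" is an assumption on our part (the real index comes from wherever the custom dimension was registered in the Google Analytics admin panel), and the `ga` command queue and `navigator` object are stubbed here so that the snippet is self-contained.

```javascript
// In the browser, analytics.js provides a global `ga` command queue and the
// real `navigator` object; both are stubbed here so the sketch runs standalone.
const hits = [];
const ga = (...args) => hits.push(args); // stand-in for the analytics.js queue
const navigator = {
  userAgent:
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0',
};

// The actual tracking calls: attach the user agent to a custom dimension so it
// is reported alongside subsequent hits ("dimension1" is a placeholder index).
ga('set', 'dimension1', navigator.userAgent);
ga('send', 'pageview');
```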
We also added additional custom dimensions for related quantities like navigator.appVersion, but we'll skip over these in the code examples for brevity. If you would like to prevent analytics services from collecting this sort of information, then we highly recommend installing uBlock Origin in your browser to block tracking.
After we started tracking the data, we needed to be able to access it via an API so that we could automate the process of updating the user-agents package. Google has a concept of service accounts which can be used to allow exactly this sort of access.
Before we actually run the update-data script, we first populate the google-analytics-credentials.json file that is required to access the raw data from the Google Analytics API; its contents come from an environment variable configured in CircleCI. The contents of this environment variable are obviously sensitive, so we also disable providing secrets to forked builds in the CircleCI project settings.
With this in place, the publish-new-version job was able to commit the updated data, create a new patch version for the project, and push the changes up to GitHub. To allow automatically publishing new package versions, we added a CircleCI job called deploy.
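As a rough sketch, the relevant pieces of .circleci/config.yml might look something like the following; the job names, Docker image, and commands here are illustrative assumptions rather than the project's actual configuration.

```yaml
# Illustrative sketch only -- not the project's actual CircleCI configuration.
jobs:
  deploy:
    docker:
      - image: node:8
    steps:
      - checkout
      - run:
          name: Publish to NPM
          command: |
            # Assumes an NPM auth token is provided via a CircleCI env var.
            echo "//registry.npmjs.org/:_authToken=${NPM_TOKEN}" > ~/.npmrc
            npm publish

workflows:
  version: 2
  release:
    jobs:
      - build
      - test:
          requires:
            - build
      - deploy:
          requires:
            - test
```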
Finally, we define a release workflow that runs the checkout, build, test, and deploy jobs. Well, we hope that you've enjoyed learning about how we keep the user-agents package consistently up to date.
If your web scraping tasks are still getting blocked when you use random user agents, then be sure to check out the Intoli Smart Proxy Service. It integrates all of the web scraping best practices that we've learned over the years, and it works with pretty much any web scraping software that you might be using.
For those not familiar with the concept, the User-Agent (UA) is a string that contains information and details about the client's browser and the platform it's running on. That information includes the user's browser software and its version, the operating system, and even the platform architecture and model.
The UA is included in every request sent by the browser, and many sites depend on it in various ways and for various purposes. In many cases, third parties, such as ad-tech companies or security vendors, also use the UA string to fingerprint and track end users.
Google Chrome will expose the UA information with Client Hints headers. It is important to mention that those headers are supported only over HTTPS, which makes the mechanism more secure.
Additionally, it must be noted that third-party vendors will have to get permission from the first-party domain owner in order to collect the information. You can also set an Accept-CH-Lifetime header, which specifies how many seconds the Accept-CH value should persist; during that period, the client will send the Client Hints to the server with every request made to the same domain, even if the Accept-CH header is not included in every response.
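A hypothetical exchange might look like this; the exact header names follow the draft specification and could change in the final feature.

```http
# Server response opting in to additional hints (86400 seconds = 24 hours):
Accept-CH: Sec-CH-UA-Platform, Sec-CH-UA-Model
Accept-CH-Lifetime: 86400

# Subsequent client requests to the same origin then include headers such as:
Sec-CH-UA: "Chromium";v="84"
Sec-CH-UA-Mobile: ?0
Sec-CH-UA-Platform: "Linux"
```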
On the client side, the browser will include headers prefixed with "Sec-CH-" that carry the information requested by the server. The device model header is currently supported but may be removed in the final version of this feature, as Google has not yet decided whether providing the exact device model is truly necessary, or whether it should be treated as private information.
The new API will be accessible under the navigator object and will expose the same information as the Client Hints headers. The low-entropy values reveal whether the client is using a mobile device, along with the browser brand and its major version.
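For example, the low-entropy values could be read as sketched below. In a supporting Chromium-based browser the object would be navigator.userAgentData; a stand-in object with the same shape is used here so the snippet is self-contained.

```javascript
// Reads the low-entropy Client Hints values from a userAgentData-shaped
// object. In a supporting browser you would pass `navigator.userAgentData`.
function lowEntropyHints(uaData) {
  return {
    brands: uaData.brands, // browser brand(s) and major version only
    mobile: uaData.mobile, // whether the client is a mobile device
  };
}

// Stand-in for navigator.userAgentData so the sketch runs anywhere.
const sampleUaData = {
  brands: [{ brand: 'Chromium', version: '84' }],
  mobile: false,
};

console.log(lowEntropyHints(sampleUaData));
// { brands: [ { brand: 'Chromium', version: '84' } ], mobile: false }
```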
This change will provide better monitoring abilities, will let us know who is using this information, and will ultimately improve end users' privacy. The new implementation, of course, won't be bypass-proof, but it will require fraudsters and bot developers to work harder in order to hide themselves and impersonate others.