Content scraping or web scraping is the process of extracting valuable data from websites using automated scripts or bots. If your website contains data that is expensive to collect or compute (e.g., flight connections, real-estate listings, product prices, or user data), a bad actor or competitor could steal it and use it for nefarious purposes.
Bots vary in their ability to scrape content and avoid detection. Simple scripts using an HTTP library like wget can retrieve pages from a web server and parse information from the HTML response. They can be effective for scraping static sites but are less effective against client-rendered content. They are also easier to detect, as your website can easily test their inability to execute JavaScript.
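For illustration, here is a minimal sketch of such a script in Node.js; the URL and the markup pattern are hypothetical:

// Hypothetical example of a simple scraper: fetch a static page
// and pull prices out of the raw HTML without executing any JavaScript
const response = await fetch('https://example.com/flights?from=SFO&to=JFK');
const html = await response.text();

// Static sites embed the data directly in the HTML, so a regular
// expression or an HTML parser is enough to extract it
const prices = [...html.matchAll(/<span class="price">([^<]+)<\/span>/g)]
  .map((match) => match[1]);
console.log(prices);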
Headless browsers and browser automation tools like Puppeteer or Selenium are much more sophisticated. They can execute JavaScript, scroll, press buttons, wait for client-rendered content to load, and scrape it. They are full-featured browsers, only automated, which makes them more robust and harder to detect. Many also have “stealth” plugins, which try to make them resemble regular browsers.
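A sketch of such a scraper using Puppeteer might look like this (the target URL and the .flight-card selector are made up for illustration):

// Hypothetical Puppeteer scraper for a client-rendered page
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// Load the page and wait for the client-rendered results to appear
await page.goto('https://example.com/flights?from=SFO&to=JFK', { waitUntil: 'networkidle0' });
await page.waitForSelector('.flight-card');

// Extract the rendered content, just like a real visitor would see it
const flights = await page.$$eval('.flight-card', (cards) => cards.map((card) => card.innerText));
console.log(flights);

await browser.close();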
A web application firewall can provide an essential layer of rule-based protection, such as blocking IP ranges, countries, and data centers known to host bots. This first line of defense is helpful but sometimes insufficient, as scrapers can use proxies to cycle through different IP addresses.
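As a rough illustration of what such a rule looks like, here is a minimal application-level sketch using Express and the ipaddr.js package; the blocked ranges are hypothetical placeholders, and a real WAF would manage rules like this (plus country and data-center lists) for you:

// Application-level sketch of rule-based IP blocking
import express from 'express';
import ipaddr from 'ipaddr.js';

// Hypothetical ranges standing in for networks known to host bots
const BLOCKED_RANGES = ['203.0.113.0/24', '198.51.100.0/24'].map((cidr) => ipaddr.parseCIDR(cidr));

const app = express();
app.use((req, res, next) => {
  const ip = ipaddr.process(req.ip);
  const blocked = BLOCKED_RANGES.some(
    (range) => ip.kind() === range[0].kind() && ip.match(range)
  );
  if (blocked) {
    return res.status(403).json({ message: 'Requests from this network are not allowed.' });
  }
  next();
});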
You can ask your visitors to prove they are human by completing CAPTCHA challenges, like picking all the images that contain a sombrero. This is generally effective but also disruptive to the user experience. To fight bots without bothering humans, you can use a client-side library to detect bots at runtime by analyzing the visitor’s browser.
Fingerprint Pro Bot Detection collects vast amounts of browser data that bots leak (errors, network overrides, browser attribute inconsistencies, API changes, and more) to reliably distinguish real users from headless browsers, automation tools, their derivatives, and plugins.
It is based on BotD — a free and open-source library that detects simple bots running entirely in the client. Fingerprint Pro Bot Detection can detect a broader range of sophisticated bots and runs the analysis on the server side where it’s not vulnerable to tampering by bots themselves. See our documentation for a detailed comparison of BotD and Fingerprint Pro Bot Detection. The example below uses the non-open-source version.
First, sign up for a Fingerprint Pro account and contact our support to turn on Bot Detection. Bot Detection is currently in beta and limited to customers with annual contracts, but we look forward to a wider release soon.
Add the JavaScript agent to your website’s client-side code. Once enabled, you can use the same JavaScript agent for visitor identification and Bot Detection. We have client libraries for all major front-end frameworks, or you can load the script from our CDN as shown below:
// Initialize the agent
const fpPromise = import('https://fpjscdn.net/v3/<your-public-api-key>')
  .then(FingerprintJS => FingerprintJS.load({
    endpoint: 'https://metrics.yourdomain.com',
  }));
Note: We recommend using a subdomain integration to proxy requests to our API through your website’s subdomain — that’s what the endpoint parameter is for. This protects the requests from interference by ad blockers and browser extensions.
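Alternatively, if you prefer a package manager over the CDN snippet, the agent is also distributed as an NPM package; a sketch of the equivalent setup (using the same placeholder API key and endpoint) looks like this:

// npm install @fingerprintjs/fingerprintjs-pro
import * as FingerprintJS from '@fingerprintjs/fingerprintjs-pro';

// Initialize the agent with your public API key and proxy endpoint
const fpPromise = FingerprintJS.load({
  apiKey: '<your-public-api-key>',
  endpoint: 'https://metrics.yourdomain.com',
});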
Let’s use an airline website as an example. The visitor picks their destination and clicks “Search flights.” Before returning the results from the server, you want to make sure they are not a bot.
On the client, right before requesting the flight data, use the loaded fpPromise to send browser parameters to Fingerprint Pro API for analysis. You will get a requestId in the response. Include it in the search request you send to your server.
async function onClickSearchFlights(from, to) {
  // Collect browser signals for bot detection and send them
  // to Fingerprint Pro API. The response contains a requestId
  const { requestId } = await (await fpPromise).get();

  // Pass the requestId to your server alongside the flights query
  const response = await fetch(`/api/web-scraping/flights`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ from, to, requestId }),
  });
}
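Inside onClickSearchFlights, after the fetch call, you could then handle the server’s answer along these lines (a sketch; renderFlights and showError are hypothetical placeholders for your own UI code):

// Render the results, or surface the error returned by the server
const data = await response.json();
if (response.ok) {
  renderFlights(data.flights);
} else {
  showError(data.message);
}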
Note: To detect bots, Fingerprint Pro needs to collect signals from the browser. Therefore, it is best used to protect data endpoints that are accessible from your website, as demonstrated in this article. It is not designed to protect server-rendered or static content that is sent to the browser on the initial page load, as browser signals are not available during server-side rendering.
On the server, send the requestId to the Fingerprint Pro Server API to get your bot detection result. If the requestId is malformed or not found, your server should not return the flight results. You can call the Server API REST endpoint directly, or use one of our Server SDKs. Here is an example using the Node.js SDK:
import {
  FingerprintJsServerApiClient,
  Region,
} from "@fingerprintjs/fingerprintjs-pro-server-api";

export default async function getFlightsEndpoint(req, res) {
  const { from, to, requestId } = req.body;

  // A requestId in the wrong format can be rejected immediately
  if (!/^\d{13}\.[a-zA-Z0-9]{6}$/.test(requestId)) {
    res.status(403).json({
      message: "malformed requestId, potential spoofing detected",
    });
    return;
  }

  let botDetection;
  try {
    // Initialize the Server API client
    const client = new FingerprintJsServerApiClient({
      region: Region.Global,
      apiKey: "<YOUR_SERVER_API_KEY>",
    });
    // Get the analysis event from the Server API using the requestId
    const eventResponse = await client.getEvent(requestId);
    botDetection = eventResponse.products?.botd?.data;
  } catch (error) {
    // If getting the event fails, it's likely that the
    // requestId was spoofed, so don't return the results
    res.status(500).json({
      message: "requestId not found, potential spoofing detected",
    });
    return;
  }

  // continue processing the botDetection result...
}
The botDetection result returned from the Server API tells you if Fingerprint Pro detected a good bot (for example, a search engine crawler), a bad bot (an automated browser), or no bot at all.
{
  "bot": {
    "result": "bad", // or "good" or "notDetected"
    "type": "headlessChrome"
  },
  "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/110.0.5481.177 Safari/537.36",
  "url": "https://yourdomain.com/search",
  "ip": "61.127.217.15",
  "time": "2022-03-21T16:40:13Z"
}
If the visitor is a malicious bot, return an error. Optionally, you could also update your WAF rules to block the bot’s IP address in the future (a small sketch of this follows the snippet below).
if (botDetection?.bot.result === 'bad') {
  res.status(403).json({
    message: "Malicious bot detected, scraping flight data is not allowed."
  });
  return;
}
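If you also want to follow up on the WAF suggestion, you could record the offending IP inside the same if block before returning the 403. Here, saveToBlocklist is a hypothetical helper standing in for your WAF’s API or your own storage:

// Hypothetical: persist the bot's IP so your WAF rules or an
// application-level blocklist can reject future requests from it
await saveToBlocklist({ ip: botDetection.ip, botType: botDetection.bot.type });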
Now you know that the fingerprinting request is genuine and that Fingerprint Pro did not detect a malicious bot. But you still need to verify that the result actually belongs to this search request. The bot could have replaced the real requestId with an old one obtained manually some time ago. To protect against replay attacks, verify the freshness of the fingerprinting request:
// The fingerprinting event must be at most 3 seconds old
if (Date.now() - Number(new Date(botDetection.time)) > 3000) {
  res.status(403).json({
    message: "Old visit detected, potential replay attack.",
  });
  return;
}
You also want to verify that the origin of the fingerprinting request matches the origin of the search request itself. Usually, both will be coming from your website’s domain.
const fpRequestOrigin = new URL(botDetection.url).origin;
if (
  fpRequestOrigin !== req.headers["origin"] ||
  fpRequestOrigin !== "https://yourdomain.com" ||
  req.headers["origin"] !== "https://yourdomain.com"
) {
  res.status(403).json({
    message: "Origin mismatch detected, potential spoofing attack.",
  });
  return;
}
Finally, verify that the IP of the fingerprinting request matches the IP of the search request.
if (botDetection.ip !== req.headers["x-forwarded-for"]?.split(",")[0]) {
  res.status(403).json({
    message: "IP mismatch detected, potential spoofing attack.",
  });
  return;
}
Having verified the authenticity of the bot detection result, you can now confidently return the flights:
const flights = await getFlightResults(from, to);
res.status(200).json({ flights });
Visit the Web Scraping Prevention Demo we built to demonstrate the concepts above. You can explore the open-source code on GitHub or run it in your browser with StackBlitz. The core of the use case is implemented in this component and this endpoint.
To see Fingerprint Pro Bot Detection in action, you need to visit the use-case website as a bot. The easiest way is to use the Browserless debugger, which allows you to control an automated browser in the cloud from your own browser.
Go to Fingerprint’s Browserless instance.
Switch to the Web Scraping tab for a full bot example or just copy this snippet into the code editor:
export default async ({ page }: { page: Page }) => {
  await page.goto('https://fingerprinthub.com/web-scraping');
};
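If you want the bot to attempt actual scraping rather than just opening the page, you can extend the snippet along these lines (a sketch; the five-second pause is an arbitrary wait for the client-side checks and the flight search to finish):

export default async ({ page }: { page: Page }) => {
  await page.goto('https://fingerprinthub.com/web-scraping');
  // Give the page a few seconds to run its client-side logic
  await new Promise((resolve) => setTimeout(resolve, 5000));
  // Dump the rendered text; with Bot Detection enabled, the flight
  // results will not be in it
  console.log(await page.evaluate(() => document.body.innerText));
};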
Press the “Play” button in the top right to run the bot (and wait a few seconds).
You will see that the automated browser on the right-hand side cannot access the flight search results while Bot Detection is enabled.
If you prefer to explore and test locally, the demo contains end-to-end tests. Execute them to see that you can scrape the flight results with Bot Detection disabled, but not otherwise.
git clone https://github.com/fingerprintjs/fingerprintjs-pro-use-cases
cd fingerprintjs-pro-use-cases
yarn install
yarn dev
# in a second terminal window
yarn test:e2e:chrome e2e/scraping/protected.spec.js --debug
yarn test:e2e:chrome e2e/scraping/unprotected.spec.js --debug
If you have any questions, please reach out to our support.
Fingerprint’s open-source technology is supported by contributing developers across the globe. Stay up to date on our latest technical use cases, integrations, and updates.