What are third-party cookies? How do they aid in obtaining user data? Our experts, Konstantin Perikov, Chief Software Engineer II, and Vladimir Sergeev, Senior Data Scientist, shed some light on modern privacy on the web, cookie deprecation by Google, mechanisms of user tracking, and fingerprinting.
When it comes to data privacy on the web, we can say for sure that no one's data is completely safe. We all leave a huge trail of data as we move around the web: search queries, website visits, personal information, credentials, and more. The websites we visit daily quietly collect that trail. For example, Google tracks user activity on roughly 80% of the sites its users visit, while Facebook and Amazon track 24% and 19% of all sites, respectively.
With the right tools at hand, this scattered data can snowball into a significant body of information, enough for organizations like internet providers and other interested individuals and entities to identify users and determine their web behavior.
The default is that our actions on the web are not private. Let's review popular mechanisms that allow tech giants to track user data and behavior on their sites.
Even though the deprecation of third-party cookies has been discussed for a while, cookies remain the most common data-tracking method today. More than 40% of all websites use some type of cookie, either first-party or third-party. Cookies remain a controversial privacy issue because of the way they gather data.
So, what are cookies? Cookies are small blocks of data created by a web server while a user browses a website and placed on the user's computer or other device by the browser.
Generally, cookies store stateful information, such as items added to a shopping cart, browsing history, or authentication details. They are why we don't have to log in to our favorite websites every time we visit them. On the other hand, cookies are fundamentally at odds with privacy.
There are several types of cookies. The two most relevant here are first-party cookies, set by the site you are visiting, and third-party cookies, set by a different domain (typically an advertiser or analytics provider) whose content is embedded in the page.
Let's review the Google replacement initiatives that will accelerate the death of the cookie era.
The most recent addition is Google's proprietary technology FLoC, which stands for Federated Learning of Cohorts. FLoC aims to replace individual third-party cookies by tracking a cohort of users with similar browsing behavior or content. FLoC is still in development, and most major Chromium-based browsers have declined to support it over privacy concerns.
Another initiative is the Privacy Budget. The idea behind it is that some tracking capability is vital for a good user experience, but harmful tracking should be prevented. Privacy Budget would give each website a virtual budget to spend on specific fingerprinting capabilities.
With these promising technologies on the horizon, will we see third-party cookies going away soon? Only time will tell. Until then, we'll keep you posted on the latest updates in this field.
For now, let's review some of the commonly used ways to track users:
1x1 tracking pixel. This is a tiny, pixel-sized image that can be hidden anywhere from a web banner to an email. When the image loads, the hosting server records the request, allowing tracking of user behavior, site conversions, web traffic, and other metrics on the back end.
Beacon API. The Beacon API lets a page asynchronously send small amounts of analytics data to a server without waiting for a response, even as the page unloads, making it well suited for tracking user activity. Here is an example of JS code that sends data from your web beacon:

// returns true if the browser successfully queued the data for transfer
let queued = navigator.sendBeacon(url, JSON.stringify(data));
Account tracking. Many websites track user activities with user accounts. For example, after you log into a Facebook page, Facebook knows where you put likes and which pages you visit.
User tracking has its pros and cons. On the bright side, user tracking enables users to receive relevant ads, content, and eCommerce products that match their activities and interests. User tracking also helps companies measure revenue streams, monitor site usability, and gain insight into user behavior.
The mechanics of user tracking, however, remain unclear to most users. The majority don't even realize that they're being tracked, or for what purposes, and non-tech-savvy users are often unable to turn tracking off, which can degrade the user experience.
User data collected by tracking can be shared with third parties, sold for profit without user consent, and can contribute to a variety of cybersecurity threats.
Fingerprinting originated in the late nineties and has gained momentum in the wake of third-party cookies' departure. It draws on all the information that can be gathered from a user's interaction with a web browser. Fingerprinting is entirely legal in most jurisdictions; even the strict GDPR rules are commonly read as requiring user consent for cookie tracking but not for fingerprinting.
A fingerprint is a unique identifier derived from the configuration of the user's web browser and operating system. It is built from information about the software and hardware of the user's device (think MAC address, IP address, and many other attributes) for the purpose of identification. Companies and network providers also use browser fingerprints to prevent fraud and identity theft.
Typical fingerprinting signals include the user-agent string, installed fonts, canvas and audio rendering, attached devices, and supported media codecs.
As you can see, even when cookies are not being used, users are still a potential target for tracking and security attacks. Switching to incognito mode, enabling a VPN or an ad blocker, and clearing cookies or search history do not prevent fingerprinting. A privacy-focused browser such as Brave or Tor can distort some of the strong fingerprinting techniques, such as canvas or audio fingerprinting.
In the next section, we'll review a data science-based use case we implemented for one of the EPAM clients.
Now that we've addressed the fingerprinting concept, let's review data science approaches to fingerprinting problems. To begin, let's frame the problem from the data science perspective: once we've gathered the user-agent string and any available information about fonts, canvases, attached devices, media codecs, and so on, we want to answer one question: who is the user in front of us with this set of features?
One of the simplest solutions is to calculate a hash of the combined feature values. If two hashes are equal, we've found our user: a row in our history with the same hash suggests that two visitors are the same person, and lets us accumulate more information about that user.
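As a minimal sketch of this idea (the feature names and values below are invented for illustration), we can canonicalize the feature set and hash it, so the same features always map to the same identifier:

```python
import hashlib

def fingerprint_hash(features: dict) -> str:
    """Hash a dict of fingerprint features into a stable identifier."""
    # sort keys so the same features always produce the same string
    canonical = "|".join(f"{k}={features[k]}" for k in sorted(features))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# two visits reporting identical features, in different order
visit_a = {"user_agent": "Chrome/96.0", "fonts": "Arial,Verdana", "timezone": "UTC+1"}
visit_b = {"timezone": "UTC+1", "fonts": "Arial,Verdana", "user_agent": "Chrome/96.0"}
```

Because the keys are sorted before hashing, `visit_a` and `visit_b` produce the same hash and would be attributed to the same user, while any change in a single feature value yields a completely different hash.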
However, this approach has limitations. First, consider the "small changes problem": it occurs when we work with features that can change without any meaningful change in the target. Say we're analyzing a user-agent string that encodes the browser, the operating system, and the device. Different users will most likely have different devices or browsers, but what about browser versions or minor patches of the operating system? A browser version can bump by an insignificant amount without any action from the user's side. Keeping this in mind, we (or our model) should be prepared for unexpected changes.
At the modeling stage, the simplest solution could be to drop that feature, in the hope that the model "understands" that the exact minor version of a web browser is not descriptive. At the data preparation step, though, we'd keep such information, because later we may use it for other purposes, such as calculating distances between users. To solve the "small changes" issue, we can use hashing techniques that keep similar original values close in the hash space (simhash, for example). The trade-off is that the hashed values aren't human-readable and are hard to analyze.
We'll review this option using our original problem statement: we've gathered some features and want to predict which user they belong to. A classification approach may seem simple to anyone with even a little data science experience. With millions of users, however, classification turns into a problem with millions of classes, and the task becomes a challenge. After a user classification step, we may also want to identify user preferences based on the other data we gathered; the most common business case here is detecting user interests to serve more relevant advertisements.
The clustering approach helps with the enormous number of users. Suppose we have a huge number of similar users: we split them into groups with common interests, on the assumption that similar users have similar features. That's how we cluster our users. This approach works when we need to model user preferences at the group level; for example, we may predict that a user is close to a cluster of males aged 25-34 with an interest in car advertisements.
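As an illustration (the two-dimensional feature vectors and the naive "first k points" initialization are assumptions made for the sketch, not a production recipe), a bare-bones k-means clustering of user feature vectors might look like this:

```python
def nearest(point, centers):
    """Index of the closest center by squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])))

def kmeans(points, k, iters=10):
    """Tiny k-means; naive init assumes the first k points are spread out."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # move each center to the mean of its group (keep it if the group is empty)
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return [nearest(p, centers) for p in points]

# two obvious groups of "users" in a toy 2-D feature space
users = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels = kmeans(users, k=2)
```

Once users are grouped this way, group-level interests (say, the dominant ad categories in a cluster) can be attributed to every member, sidestepping per-user classification.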
After the approaches described above, the recommendation approach may seem a bit complicated. However, it is still appropriate in certain cases.
The recommendation approach requires information about the interest categories we want to predict, which is another challenge in itself. In the simplest solution, as we collect fingerprinting features, we also collect information about visited pages (predicting a page's context is yet another challenge, and a topic for another article). Once we've gathered the required data, we can build a user-category matrix and apply collaborative filtering (or any other recommendation model) to find similarities between users or predict which categories might interest a particular user.
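A toy version of that idea (the category names and visit counts below are invented for illustration): represent each known user as a row of per-category visit counts, find the row most similar to the newcomer by cosine similarity, and recommend that user's strongest category the newcomer hasn't engaged with yet:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(target, matrix, categories):
    """Suggest an unseen category from the most similar user's row."""
    best_row = max(matrix, key=lambda row: cosine(target, row))
    unseen = [i for i, count in enumerate(target) if count == 0]
    return categories[max(unseen, key=lambda i: best_row[i])]

categories = ["cars", "sports", "cooking"]
matrix = [
    [5, 1, 0],  # user A: mostly cars, a little sports
    [4, 0, 1],  # user B: cars and some cooking
    [0, 5, 5],  # user C: sports and cooking
]
newcomer = [3, 0, 0]  # only car pages visited so far
```

Here the newcomer is most similar to user A, whose strongest unseen category for the newcomer is sports, so that is what gets recommended. Real systems face the same logic at a vastly larger, sparser scale.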
This solution already sounds complicated, and there are hidden difficulties on top. One of them is the time and memory consumption of the recommender model: if we want to predict the interests of a user who has just landed on our site and show relevant advertisements, we need to make predictions on the front end while the page is still loading.
We've discussed possible data privacy solutions; each has its own pros, cons, and challenges.
As you can see, no one can assume the web is a safe place, but there are protection mechanisms and safety measures that can protect our data without degrading the user experience. Stay tuned for more content on this topic from our team.
Contributed by Konstantin Perikov, Chief Software Engineer II at EPAM in collaboration with Vladimir Sergeev, Senior Data Scientist at EPAM.