Pageviews User Guide - Media & Marketing

Executive Summary

This document describes the Gener8 Labs Pageview data product. It explains the analytics in detail, gives an overview of the collection methods, and lays out the steps performed to transform the collected data into the end product. The aim is to enable the user to deploy the Pageview datasets easily and effectively.

Client Support

Technical
[email protected]

About Gener8

Since its launch in 2018, Gener8 has been at the forefront of the “open data” movement: the belief that people should be able to control, and be rewarded for, their data. Gener8 makes it easy for its users to anonymously share their digital data in exchange for rewards through its mobile and desktop apps.

This clear and transparent value exchange means that Gener8 has access to a uniquely comprehensive dataset. Gener8 upholds the highest standards of informed consent on every data set a user chooses to share with us.

Introduction

Gener8 Labs’ analytics promptly delivers robust web traffic data from which pricing and forecasting indicators can be derived. These deliverables include information at multiple levels of granularity, catering to the needs of any institution and relevant to a wide range of applications. The data is provided at an agreed frequency (e.g., hourly or daily), with a maximum lag of 24 hours, ensuring seamless integration into quantitative models. With this capability, customers can use Gener8 Labs analytics to forecast stock returns, or incorporate it as a high-frequency variable in mixed-frequency models.

Gener8 Labs monitors the panel’s browsing activity in real time and stores the data with millisecond precision. This rich dataset is then transformed into a structured format, allowing easier integration into existing models. The provided datasets offer detailed insights into the domains visited, the time spent on web pages, and the exact date and time of each visit.

Gener8 Labs allows customers to connect each domain visited by panel members to its parent company, enabling both an independent analysis of the domain and a comprehensive examination of the entity related to it. This facilitates the investigation of the various revenue-generating activities within a conglomerate, while also offering a valuable view when predicting earnings announcements. Additionally, customers can use the dataset to explore the most sought-after sectors or to gauge the attention given to a specific company compared to others.

Pageviews Product

Gener8 Labs takes care of the needs of clients by creating a well-structured dataset. In doing so, Gener8 Labs is committed to adhering to robust industry standards, ensuring precision and seamless deployability.

The following section describes the schema of the Pageview product deliverable. For each column, the table reports its name (Name), which is used to access the corresponding observation; the data type of the observation (Type), so that each accessed row can be manipulated according to its column types; whether the column may contain an empty observation (Optional: Y if it may, N if it may not); and a description explaining how to interpret the observation (Description).

Raw Dataset

The Raw Dataset comprises the observations gathered from the Gener8 consumer products. Gener8 Labs formats and cleans this dataset to ensure that each observation is relevant and precise.

Schema

Name | Type | Optional | Description
---- | ---- | -------- | -----------
id | STRING | N | Permanent and unique ID for each pageview.
received_at | TIMESTAMP | N | UTC timestamp of when Gener8 received the pageview.
timestamp | TIMESTAMP | N | UTC timestamp of when the pageview occurred.
url_protocol | STRING | N | The protocol component of the pageview URL.
url_domain | STRING | N | The domain component of the pageview URL.
url_path | STRING | Y | The path component of the pageview URL.
url_query | STRING | Y | The query component of the pageview URL, if there was one.
title | STRING | Y | Webpage title, taken from the head element.
referrer_protocol | STRING | Y | The protocol component of the referring URL, if there was one.
referrer_domain | STRING | Y | The domain component of the referring URL, if there was one.
referrer_path | STRING | Y | The path component of the referring URL, if there was one.
referrer_query | STRING | Y | The query component of the referring URL, if there was one.
active_duration | INTEGER | Y | Time spent by the user actively browsing the webpage, reported in seconds.
device | STRING | Y | The type of device the pageview came from. Either Desktop or Mobile.
key_phrase | STRING | Y | The extracted search term from the request, available for a select number of domains, such as Google Search.
country | STRING | Y | Geocoded two-letter country code the pageview occurred in, based on client IP address at the time.
region | STRING | Y | Geocoded region the pageview occurred in, based on client IP address at the time.
city | STRING | Y | Geocoded city the pageview occurred in, based on client IP address at the time.
postal_code | STRING | Y | Geocoded postal code the pageview occurred in, based on client IP address at the time. Not available for all regions.
latitude | FLOAT | Y | Geocoded latitude, based on client IP address at the time.
longitude | FLOAT | Y | Geocoded longitude, based on client IP address at the time.
user_id | STRING | N | Permanent and unique user ID.
tab_id | INTEGER | Y | The tab identifier provided by the user's browser.
timezone | STRING | Y | The timezone the pageview was made from.
user_agent | STRING | Y | The user agent of the browser from which the pageview was received.
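
As a quick sanity check on a delivery, the columns marked N in the schema can be verified to be present in every row. The sketch below is illustrative only: it assumes each row has already been loaded as a Python dictionary keyed by column name (the delivery file format is not specified here), and the helper name and sample row are invented for the example.

```python
# Columns the schema marks as non-optional (Optional = N).
REQUIRED_FIELDS = (
    "id", "received_at", "timestamp",
    "url_protocol", "url_domain", "user_id",
)


def missing_required(row):
    """Return the required fields that are absent or empty in a row."""
    return [field for field in REQUIRED_FIELDS if row.get(field) in (None, "")]


# Hypothetical row, invented for illustration only.
row = {
    "id": "pv-001",
    "received_at": "2023-01-01T12:00:05Z",
    "timestamp": "2023-01-01T12:00:00Z",
    "url_protocol": "https",
    "url_domain": "example.com",
    "url_path": None,  # optional column, may legitimately be empty
    "user_id": "user-42",
}

print(missing_required(row))  # → []
```

Rows for which the helper returns a non-empty list would point to a loading or parsing problem rather than to expected gaps, since only the columns marked Y may be empty.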

Referrer Details and Tab ID

The Tab ID is an integer identifier that indicates the specific tab within a user's browser in which a pageview took place. The referrer details (referrer_protocol, referrer_domain, referrer_path and referrer_query) indicate, if available, which page a user was on just prior to the pageview represented in the URL details (url_protocol, url_domain, url_path and url_query).

With these two sets of information, it is possible to accurately reconstruct not only the journey a user takes within a single tab on a website, but also the additional tabs that a user opens and closes as they research or shop.
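
One possible way to rebuild per-tab journeys is to group pageviews by user_id and tab_id and order each group by timestamp. The following is a minimal sketch under stated assumptions: rows are dictionaries keyed by the schema's column names, timestamps are ISO 8601 UTC strings in a single format, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def journeys_by_tab(pageviews):
    """Return, per (user_id, tab_id), the visited URLs in chronological order."""
    grouped = defaultdict(list)
    for pv in pageviews:
        grouped[(pv["user_id"], pv["tab_id"])].append(pv)

    journeys = {}
    for key, views in grouped.items():
        # ISO 8601 UTC strings in one format sort chronologically as text.
        views.sort(key=lambda pv: pv["timestamp"])
        journeys[key] = [pv["url_domain"] + (pv.get("url_path") or "") for pv in views]
    return journeys


# Hypothetical pageviews, invented for illustration only.
pageviews = [
    {"user_id": "u1", "tab_id": 1, "timestamp": "2023-01-01T10:00:00Z",
     "url_domain": "shop.example", "url_path": "/"},
    {"user_id": "u1", "tab_id": 2, "timestamp": "2023-01-01T10:02:00Z",
     "url_domain": "reviews.example", "url_path": "/item"},
    {"user_id": "u1", "tab_id": 1, "timestamp": "2023-01-01T10:01:00Z",
     "url_domain": "shop.example", "url_path": "/item"},
]

for key, urls in journeys_by_tab(pageviews).items():
    print(key, urls)
```

Comparing the first pageview of a tab against its referrer fields would then suggest which earlier page, possibly in another tab, led to that tab being opened. Rows with an empty tab_id end up in a single group per user and are best treated separately.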

Key Phrase

When collecting and processing pageview URL data, we make a best-effort attempt to extract and clean any keywords from the URL that appear to result from a user searching on a search engine or within a website.

The keywords are reliably extracted from the most popular search engines and retail websites. As a catch-all, we also apply some wildcard extraction rules to extract searches from smaller or less well-known websites. This may have the unfortunate side effect of extracting phrases that are not search terms, but this phenomenon will only occur on smaller, less visited websites.
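
To make the two-tier approach concrete, the sketch below shows what domain-specific rules with a wildcard fallback could look like. It is not Gener8's extraction logic: the rule table, the wildcard parameter names and the function name are all assumptions made for illustration, and only the standard library's URL parsing is used.

```python
from urllib.parse import parse_qs

# Hypothetical per-domain rules: the query parameter that holds the search term.
DOMAIN_RULES = {"www.google.com": "q"}

# Hypothetical wildcard rules tried on any domain without a specific rule.
WILDCARD_PARAMS = ("q", "query", "search")


def extract_key_phrase(url_domain, url_query):
    """Return a cleaned search phrase from a query string, or None."""
    if not url_query:
        return None
    params = parse_qs(url_query.lstrip("?"))
    if url_domain in DOMAIN_RULES:
        candidates = (DOMAIN_RULES[url_domain],)
    else:
        candidates = WILDCARD_PARAMS
    for name in candidates:
        values = params.get(name)
        if values:
            # Basic cleaning: trim whitespace and lower-case the phrase.
            return values[0].strip().lower()
    return None


print(extract_key_phrase("www.google.com", "q=Running+Shoes&hl=en"))  # → running shoes
print(extract_key_phrase("small-shop.example", "q=3"))                # → 3
```

The second call illustrates the side effect described above: on a lesser-known site a parameter named q may carry something other than a search, such as a quantity, and a wildcard rule would still extract it.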

Active Duration

This integer value indicates the number of seconds for which the given pageview was in focus. For example, suppose a user opens the Gener8 home page and views it for 10 seconds, switches to a different tab for a minute, then switches back for another 10 seconds before clicking a link and navigating to a different page. The pageview for the Gener8 home page will be recorded with an Active Duration of 20 seconds; the minute spent on the other tab is not counted.
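
The arithmetic of this example can be written out as a short sketch. The representation of focus periods as (start, end) pairs in seconds is an assumption made for illustration; it is not a field in the dataset, which only carries the resulting total.

```python
def active_duration(focus_intervals):
    """Sum the seconds a page was in focus; time spent elsewhere is excluded."""
    return sum(end - start for start, end in focus_intervals)


# 10 s in focus, 60 s on another tab (not counted), then 10 s in focus again.
intervals = [(0, 10), (70, 80)]

print(active_duration(intervals))  # → 20
```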

It is also important to note that not all pageviews in the dataset will have a valid Active Duration. It is recorded on all first-party pageviews from our browser and extensions; however, the pageviews we extract from our users' Google Takeout data do not carry this information.
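
In practice this means that aggregates over active_duration should skip empty values rather than treat them as zero. A minimal sketch, assuming rows are dictionaries keyed by the schema's column names; the function name and sample rows are hypothetical.

```python
def mean_active_duration(pageviews):
    """Average active_duration over rows that have one; None if no row does."""
    durations = [
        pv["active_duration"]
        for pv in pageviews
        if pv.get("active_duration") is not None
    ]
    if not durations:
        return None
    return sum(durations) / len(durations)


# Hypothetical rows: the second one has no active_duration recorded.
sample = [
    {"id": "pv-1", "active_duration": 20},
    {"id": "pv-2", "active_duration": None},
    {"id": "pv-3", "active_duration": 40},
]

print(mean_active_duration(sample))  # → 30.0
```

Counting the empty row as zero would instead give 20.0, understating engagement for any domain with a large share of pageviews that lack this field.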

Geocoding Information and Timezone

The geocoding fields (country, region, city, postal_code, latitude and longitude) are inferred from the user's IP address through a third-party dataset provider. This dataset takes the latitude and longitude and translates this information into an interpretable location. The dataset is updated twice a week and provides approximately 99.8% accuracy at the country level and 68% at the city level. Note that if the user browses the web through a VPN, this information will reflect the VPN location and not the user's real location.

Timezone, by contrast, is inferred directly from the user's operating system. Since the two pieces of information are collected differently, they might not always be aligned. To make this column fully usable, some of the entries have been backfilled. Gener8 started collecting this information in June 2022, through an update of the client apps. To harmonise the dataset as much as possible, for observations prior to that date, or for users who did not update their apps, the timezone is determined from the user's IP address. As with country and city, if the user is using a VPN, or if there is an imprecision in the dataset from which Gener8 Labs collects this information, the value might be slightly inaccurate.

PII Scrubbing

Due to the nature of search terms and URLs, some of the raw data that we collect from our users may contain Personally Identifiable Information (PII). We have therefore analysed the risk areas, and we redact high-risk PII and sensitive information prior to delivery.

Our current implementation addresses the following:

  • We never share specific PII fields, such as IP address.
  • Any email address found in the data is replaced with <EMAIL_ADDRESS>.
  • We redact the pageviews of popular webmail clients, such as Gmail and Outlook, which contain email addresses and subject lines.
  • We redact the pageviews of popular productivity tools, such as Google Docs, Office 365, OneDrive and Dropbox, which may contain document names and IDs.
  • We redact searches on .gov.uk sites, where the risk of sensitive data is higher, e.g. document and ID numbers such as passport number, NI number, driver’s licence and vehicle registration.

These filters are applied to the following columns:

  • url_path
  • url_query
  • title
  • referrer_domain
  • referrer_path
  • referrer_query
  • key_phrase
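
To illustrate the email replacement across the columns listed above, the sketch below applies a simple pattern to each of them. It is not the production filter: the regular expression is a deliberately simplified assumption, the function name and sample row are invented, and the other redaction rules (webmail, productivity tools, .gov.uk searches) are not covered.

```python
import re

# Simplified email pattern, assumed for illustration; real filters may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Columns the filters are applied to, as listed above.
SCRUBBED_COLUMNS = (
    "url_path", "url_query", "title",
    "referrer_domain", "referrer_path", "referrer_query", "key_phrase",
)


def scrub_emails(row):
    """Return a copy of the row with email addresses replaced by a placeholder."""
    cleaned = dict(row)
    for column in SCRUBBED_COLUMNS:
        value = cleaned.get(column)
        if isinstance(value, str):
            cleaned[column] = EMAIL_RE.sub("<EMAIL_ADDRESS>", value)
    return cleaned


# Hypothetical row, invented for illustration only.
row = {
    "url_domain": "example.com",
    "url_query": "user=jane.doe@example.org&ref=1",
    "title": None,
}

print(scrub_emails(row)["url_query"])  # → user=<EMAIL_ADDRESS>&ref=1
```

A pattern like this only catches addresses written literally. An address that appears percent-encoded in a URL (with %40 in place of @) would need decoding first, which this sketch does not attempt.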