e-Receipts User Guide - Media & Marketing
Executive Summary
This document provides a description of the Gener8 Labs’ e-Receipt data product. It explains the collection methodology and gives an in-depth explanation of the variables composing the dataset, with the aim of enabling the user to easily and effectively deploy the e-Receipt dataset.
Client Support
Technical
[email protected]
About Gener8
Since its launch in 2018, Gener8 has been at the forefront of the “open data” movement: the belief that people should be able to control and be rewarded for their data. Gener8 makes it easy for its users to anonymously share their digital data in exchange for rewards through its mobile and desktop apps.
This clear and transparent value exchange means that Gener8 has access to a uniquely comprehensive dataset. Gener8 upholds the highest standards of informed consent on every data set a user chooses to share with us.
Gener8 allows users to connect their email accounts to automatically and anonymously share their e-Receipts in exchange for rewards. This works through an integration between Gener8 and the user’s email provider, such as Gmail or Outlook. Users can connect their accounts through Gener8’s mobile apps on iOS and Android, or through its website.
Introduction
Gener8 Labs’ e-Receipt product delivers robust transaction and basket-level information from a variety of merchants. This product is in continuous development, with various verticals being added, starting with e-Commerce and food delivery/takeaway. The data is provided at an agreed frequency (e.g., hourly or daily), with a maximum lag of 48 hours, so that it can be integrated into quantitative models. With this capability, customers can use the data as an indicator to forecast earnings announcements or as an additional high-frequency predictor in mixed-frequency models.
Gener8 Labs extracts the product information from the e-Receipts and stores it in a structured format, so clients do not need to apply Natural Language Processing (NLP) methodologies to extract text information themselves. Financial institutions can use this information to monitor the demand for goods and services, assess the state of the economy, evaluate sector rotation, and use it as a trading signal to generate alpha or manage risk exposure in a timely manner.
Gener8 scans and stores anonymised e-Receipts from Google and Outlook accounts. This means that Gener8 has access to a rich history of transactions stretching back to May 2018. After a user has connected their accounts, Gener8 scans for new e-Receipts every 24 hours to capture any new purchases from all the folders connected to the account (e.g., trash).
Although e-Receipts have predominantly been associated with online purchases, physical stores have more recently started to issue e-Receipts for in-store purchases as well. The Gener8 Labs e-Receipt product therefore also delivers insights into in-store purchases.
Product Overview
Gener8 Labs takes care of the needs of clients by creating a well-structured dataset. In doing so, Gener8 Labs is committed to adhering to robust industry standards, ensuring precision and seamless deployability.
Every table reports, for each column: its name (Name), which is used to access the specific observation; the format of the observation (Type), so that each accessed row can be manipulated according to the column type; whether the column might contain an empty observation (Y) or not (N) (Nullable, or Optional in the travel table); and finally the column's description (Description), which explains how to interpret the observation.
Inbox history table
Gener8 continuously grows its panel of e-Receipt users. This means that the history and future data points will change as new users join the panel. The inbox history table allows clients to fix the history based on INBOX_IDs. Users are able to connect multiple inboxes from different providers, so one user may have many inboxes.
The table consists of all users' inboxes and the signup date. The entries are null if the inbox is not connected at the given date and true otherwise. Hence, to fix the users’ panel, select the relevant date column and keep all dates earlier than the desired date, which can be the starting date of the live feed. Then select only the INBOX_IDs where at least one entry is True. The resulting INBOX_IDs are those that were active prior to the specified date.
The identified list can be used to filter the data before consuming it, giving a fixed INBOX_ID sample. This approach fixes the history and avoids any estimation error arising from changes to it.
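The panel-fixing steps described above can be sketched in Python as follows. This is a minimal illustration, not Gener8's own tooling: it assumes the inbox history has been loaded as one record per inbox and date, and the key names `DATE` and `CONNECTED`, the sample IDs, and the `fixed_panel` helper are all hypothetical.

```python
from datetime import date

def fixed_panel(inbox_history, cutoff):
    """Return the INBOX_IDs with at least one True entry dated before `cutoff`.

    Assumed layout: each record has an INBOX_ID, a DATE (datetime.date) and a
    CONNECTED flag that is True when connected and None otherwise.
    """
    return {
        row["INBOX_ID"]
        for row in inbox_history
        if row["DATE"] < cutoff and row["CONNECTED"] is True
    }

# Hypothetical inbox history records (IDs shortened for readability).
history = [
    {"INBOX_ID": "inbox-a", "DATE": date(2023, 1, 1), "CONNECTED": True},
    {"INBOX_ID": "inbox-a", "DATE": date(2023, 1, 3), "CONNECTED": True},
    {"INBOX_ID": "inbox-b", "DATE": date(2023, 1, 1), "CONNECTED": None},
    {"INBOX_ID": "inbox-b", "DATE": date(2023, 1, 3), "CONNECTED": True},
    {"INBOX_ID": "inbox-c", "DATE": date(2023, 1, 2), "CONNECTED": True},
]

# Inboxes active before the (hypothetical) live feed start date.
panel = fixed_panel(history, cutoff=date(2023, 1, 3))

# Filter the feed so that only the fixed INBOX_ID sample is consumed.
feed = [
    {"INBOX_ID": "inbox-a", "TRANSACTION_VALUE": 1250},
    {"INBOX_ID": "inbox-b", "TRANSACTION_VALUE": 899},
]
fixed_feed = [row for row in feed if row["INBOX_ID"] in panel]
```

In this example `inbox-b` is excluded because it only became active on the cutoff date itself, so its transactions do not alter the fixed history.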
Data Schema
The e-Receipt dataset comprises the observations gathered from the Gener8 app products. Gener8 Labs formats and cleans this dataset to ensure that each observation is relevant and precise.
| Name | Type | Nullable | Description |
|---|---|---|---|
| DATETIME | TIMESTAMP | N | The date and time (YYYY-MM-DD hh:mm:ss) that the users received the email in their inbox in Coordinated Universal Time (UTC). This is a proxy for when the transaction occurred. |
| ENTITY_NAME | STRING | N | The merchant of the transaction (e.g., Amazon). |
| ITEM_TITLE | STRING | Y | The name or description of the line item as displayed on the e-Receipt. If absent, it is represented as an empty string. |
| ITEM_QUANTITY | INTEGER | Y | The quantity of items purchased for the given ITEM_TITLE. |
| ITEM_PRICE | INTEGER | Y | The price of the item in the original currency as displayed on the e-Receipt, in minor currency units (e.g., cents). This represents the price per unit. This might or might not include product-level discounts and taxes. |
| ITEM_ORIGINAL_PRICE | INTEGER | Y | The price of the item in the original currency as displayed on the e-Receipt, in minor currency units (e.g., cents). This represents the raw price as presented by the merchant, which may be an aggregate or the unit price. This might or might not include product-level discounts and taxes. |
| IS_ITEMISED | BOOLEAN | N | Whether the e-Receipt is itemised (i.e., it reports the items in the basket) or not (i.e., it reports only the transaction without item-level information). If false, the ITEM_TITLE will be represented as an empty string, and ITEM_PRICE and ITEM_QUANTITY will be reported as 0. |
| BASKET_VALUE | INTEGER | Y | The sum, over all items reported on the e-Receipt, of each item's price (ITEM_PRICE) multiplied by its quantity (ITEM_QUANTITY). This calculation may or may not take into account product-level discounts and taxes. |
| TOTAL_TAX | INTEGER | Y | The amount of tax paid on the transaction as reported on the e-Receipt. If zero, no tax value was found in the e-Receipt. |
| DISCOUNT_PRICE | INTEGER | Y | The total discount applied to the transaction. If negative, it represents a discount; if positive, it indicates a surcharge (variability imposed by the merchant). If points of some kind can be used to pay for the item, they are represented as a discount after converting the points to currency. |
| SHIPPING_PRICE | INTEGER | Y | The delivery price of the items in the e-Receipt. If zero, no shipping charge was found in the e-Receipt. |
| GIFT_VOUCHER_PRICE | INTEGER | Y | The part of the total amount in the e-Receipt that was paid with a voucher. If zero, no voucher value was found in the e-Receipt. |
| TRANSACTION_VALUE | INTEGER | N | The total amount spent on the transaction in the local currency. |
| CURRENCY | STRING | Y | The currency that the transaction was made in. |
| ORDER_REFERENCE | STRING | Y | The merchant's identifier for the transaction. Also known as order number. |
| TRANSACTION_ID | STRING | N | A permanent and unique ID assigned by Gener8 to each unique email at the time of storing the purchased item information. It is composed of 36 alphanumeric characters. |
| ORDER_ID | STRING | N | The order identification of the e-Receipt, composed of the ORDER_REFERENCE and TRANSACTION_ID. Used to identify an order when multiple e-Receipts are observed in one email. |
| ITEM_ID | STRING | Y | A permanent and unique ID assigned by Gener8 at the time of storing the purchased item information, in the form ORDER_REFERENCE-integer or TRANSACTION_ID-integer. This is Null when IS_ITEMISED is FALSE. |
| INBOX_ID | STRING | N | A permanent and unique ID assigned by Gener8 to each unique inbox when the user enrols on any of the Gener8 products. This ID is composed of 36 alphanumeric characters and is linked to the email of the USER_ID. One single USER_ID might have multiple INBOX_IDs. |
| USER_ID | STRING | N | A permanent and unique user ID assigned by Gener8 at the time that the user enrols on any of the Gener8 products. It is composed of 36 alphanumeric characters. |
| RECEIVED_AT | TIMESTAMP | N | The date and time (YYYY-MM-DD hh:mm:ss) that Gener8 ingested the e-Receipt in Coordinated Universal Time (UTC). |
| PROCESSED_DATE | TIMESTAMP | N | The date and time (YYYY-MM-DD hh:mm:ss) that the receipt was processed. Note that receipts may be reprocessed multiple times as we improve accuracy and quality. |
N.B. Monetary values are always given in their minor currency unit, e.g., 1 USD is represented as 100.
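As an illustration of working with minor currency units, the sketch below recomputes a basket value from line items and converts it to major units. The `MINOR_UNIT_DIGITS` mapping, the helper names, and the sample items are assumptions made for this example; they are not part of the dataset, and the mapping would need extending for currencies that do not use two decimal places.

```python
from decimal import Decimal

# Decimal places per currency: an illustrative subset, not part of the dataset.
MINOR_UNIT_DIGITS = {"USD": 2, "GBP": 2, "EUR": 2}

def to_major_units(amount_minor, currency):
    """Convert an integer amount in minor units to a Decimal in major units."""
    digits = MINOR_UNIT_DIGITS[currency]
    return Decimal(amount_minor).scaleb(-digits)

def basket_value(items):
    """Sum ITEM_PRICE * ITEM_QUANTITY over the line items of one order."""
    return sum(item["ITEM_PRICE"] * item["ITEM_QUANTITY"] for item in items)

# Hypothetical line items, priced in cents.
items = [
    {"ITEM_TITLE": "Notebook", "ITEM_PRICE": 499, "ITEM_QUANTITY": 2},
    {"ITEM_TITLE": "Pen", "ITEM_PRICE": 150, "ITEM_QUANTITY": 3},
]

total_minor = basket_value(items)                 # 499*2 + 150*3 = 1448
total_major = to_major_units(total_minor, "USD")  # Decimal("14.48")
```

Keeping amounts as integers (or `Decimal`) until the final presentation step avoids floating-point rounding when aggregating many receipts.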
Travel and Bookings
For customers interested in travel and booking-related purchases, we provide additional fields containing specific information relevant to travel and bookings.
Travel and Booking Breakdown
Travel and booking merchants are categorised into three sub-categories:
- Mass Transit: Includes Plane, Train, Ferry, Coaches
- Accommodation: Includes Hotels, B&Bs, Flat/House Rentals
- Rideshare: Includes services like Uber, Lyft, and other ride-hailing platforms
The following additional fields are included and populated, where available for the merchant:
| Name | Type | Optional | Available For | Description |
|---|---|---|---|---|
| TRAVEL_ORIGIN_COUNTRY | STRING | Y | Mass Transit | The country where the journey begins. |
| TRAVEL_ORIGIN_CITY | STRING | Y | Mass Transit | The city where the journey begins. |
| TRAVEL_START_DATE | TIMESTAMP | Y | All | The precise start date of the journey. |
| TRAVEL_DESTINATION_COUNTRY | STRING | Y | Mass Transit, Accommodation | The country where the journey ends. |
| TRAVEL_DESTINATION_CITY | STRING | Y | Mass Transit, Accommodation | The city where the journey ends. |
| TRAVEL_END_DATE | TIMESTAMP | Y | All | The precise end date of the journey. |
| TRAVEL_CARRIER | STRING | Y | Mass Transit | The carrier associated with the journey leg (e.g., airline or train service). |
| TRAVEL_ORIGIN_NAME | STRING | Y | Mass Transit | The name of the travel origin location (e.g., station or airport). |
| TRAVEL_DESTINATION_NAME | STRING | Y | All | The name of the travel destination location (e.g., station or airport). |
| TRAVEL_JOURNEY_DISTANCE | FLOAT | Y | Rideshare | The distance of the journey in miles. |
| TRAVEL_JOURNEY_TIME | INTEGER | Y | Rideshare | The duration of the journey in seconds. |
| TRAVEL_ACCOMMODATION_NAME | STRING | Y | Accommodation | The name of the booked accommodation. |
| TRAVEL_ACCOMMODATION_ADDRESS | STRING | Y | Accommodation | The address of the booked accommodation. |
| TRAVEL_NUMBER_OF_TRAVELLERS | INTEGER | Y | All | The number of travellers for this journey. |
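Because the travel fields are optional, derived metrics need to tolerate missing values. As a small hedged example, the sketch below derives an average speed for a Rideshare journey from TRAVEL_JOURNEY_DISTANCE (miles) and TRAVEL_JOURNEY_TIME (seconds). The helper name and the sample row are hypothetical.

```python
def average_speed_mph(distance_miles, duration_seconds):
    """Average speed in miles per hour, or None when an input is missing or zero."""
    if distance_miles is None or not duration_seconds:
        return None
    # Convert seconds to hours before dividing.
    return distance_miles / (duration_seconds / 3600)

# Hypothetical Rideshare row: 4.5 miles in 900 seconds (15 minutes).
ride = {"TRAVEL_JOURNEY_DISTANCE": 4.5, "TRAVEL_JOURNEY_TIME": 900}
speed = average_speed_mph(ride["TRAVEL_JOURNEY_DISTANCE"], ride["TRAVEL_JOURNEY_TIME"])

# A row where the merchant did not report the distance.
missing = average_speed_mph(None, 900)
```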
Data Availability and Description
Gener8 extracts the values present in each e-Receipt verbatim to preserve the integrity of the original content. While the overarching goal is to represent accurate values for all fields pertaining to each purchase, not every e-Receipt contains all of the information needed to populate every field. Certain merchants, such as Amazon, deliberately obfuscate information on their e-Receipts to hinder attempts to derive insights from this data source. In scenarios where the e-Receipt lacks specific information, the value will be represented as zero. Gener8 Labs employs inference to fill in missing values in select cases where appropriate.
Some examples where data is unavailable:
- TOTAL_TAX: Some merchants will not include a tax breakdown on each item, or report the total tax amount. If a value of zero is reported, it implies that no tax value was reported at the transaction level. Hence, the tax might be included in the BASKET_VALUE or TRANSACTION_VALUE.
- ITEM_TITLE: Some merchants do not provide the description of the purchased item, and therefore the entry is represented as an empty string.
Gener8 Labs endeavours to normalise the data wherever possible, but some variability remains across merchants, or over time for the same merchant. For example, some merchants might include taxes or discounts in the BASKET_VALUE in certain periods and in the TRANSACTION_VALUE in others. The same variability is also observed across different countries for the same merchant.
Furthermore, Gener8 Labs applies a set of rules to identify the CURRENCY from the received emails. Whenever a three-letter ISO 4217 currency code is present, Gener8 Labs reports it as it is. However, in some cases only the dollar currency symbol is reported, and Gener8 Labs employs several rules to understand whether it refers to the Australian, Canadian or American dollar. In those instances, the location of the sender is used to determine the currency of the transaction. When these rules conflict, which happens on rare occasions, Gener8 Labs uses the US dollar as the default value.
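The currency rules described above can be pictured with the following simplified sketch. It is not Gener8 Labs' actual implementation: the function, its parameters, and the country-to-currency mapping are assumptions, and the real rule set is described only as "several rules" based on the sender's location.

```python
# Hypothetical mapping from the sender's country to the dollar currency in use.
DOLLAR_BY_COUNTRY = {"AU": "AUD", "CA": "CAD", "US": "USD"}
DEFAULT_DOLLAR = "USD"  # default stated in this guide for unresolved cases

def resolve_currency(iso_code, symbol, sender_country):
    """Return a currency code, preferring an explicit ISO 4217 code.

    iso_code: code found in the email, or None (assumed already extracted).
    symbol: currency symbol found in the email, or None.
    sender_country: assumed two-letter country of the sender, or None.
    """
    if iso_code:
        # An explicit code is reported as it is.
        return iso_code
    if symbol == "$":
        # Disambiguate the dollar sign via the sender's location, else default.
        return DOLLAR_BY_COUNTRY.get(sender_country, DEFAULT_DOLLAR)
    # Other symbols are outside the scope of this sketch.
    return None

explicit = resolve_currency("GBP", None, "GB")
canadian = resolve_currency(None, "$", "CA")
fallback = resolve_currency(None, "$", None)
```

Since the US dollar is the fallback, users analysing dollar-denominated merchants outside the United States may want to check CURRENCY against other signals they hold for those merchants.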
Price Normalisation
Across the dataset, merchants have represented the numbers that make up a receipt in various ways. For example, merchants will sometimes state line items as the unit price of the item being purchased, and other times will treat the line item price as the aggregate of price × quantity. To make the data easier to work with, we have normalised these values so ITEM_PRICE always represents the unit price of an item. We preserve how the merchant originally represented this value in the ITEM_ORIGINAL_PRICE field.
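To see how a given merchant originally represented its line items, the two price fields can be compared on the consumer side. The sketch below is one possible check, assuming the normalised and original prices are related exactly as described above; the function name and the category labels are hypothetical.

```python
def original_price_representation(row):
    """Classify how ITEM_ORIGINAL_PRICE relates to the normalised ITEM_PRICE."""
    unit = row["ITEM_PRICE"]
    quantity = row["ITEM_QUANTITY"]
    original = row["ITEM_ORIGINAL_PRICE"]
    if unit is None or quantity is None or original is None:
        return "unknown"
    is_unit = original == unit
    is_aggregate = original == unit * quantity
    if is_unit and is_aggregate:
        # With a quantity of 1 the two representations coincide.
        return "ambiguous"
    if is_unit:
        return "unit"
    if is_aggregate:
        return "aggregate"
    # For example, an aggregate that does not divide evenly by the quantity.
    return "other"

aggregate_row = {"ITEM_PRICE": 250, "ITEM_QUANTITY": 3, "ITEM_ORIGINAL_PRICE": 750}
unit_row = {"ITEM_PRICE": 250, "ITEM_QUANTITY": 3, "ITEM_ORIGINAL_PRICE": 250}
single_row = {"ITEM_PRICE": 250, "ITEM_QUANTITY": 1, "ITEM_ORIGINAL_PRICE": 250}

labels = [original_price_representation(r) for r in (aggregate_row, unit_row, single_row)]
```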
Adjustment Factor
Various factors may require the addition of an adjustment factor to receipts. For example, some merchants, like Shein, limit email receipts to six items, while others may omit order-level charges, discounts, or item-level costs.
To standardise our data, we validate and normalise each receipt to ensure that it reflects a simple calculation: the sum of unit cost × quantity across the line items equals the basket value or transaction value. When full reconciliation is not possible, we add an adjustment factor line item to represent the monetary delta needed to balance the receipt. This adjustment, positive or negative based on the direction of the delta, is labelled with an ITEM_TITLE of ADJUSTMENT_FACTOR and does not include an ITEM_ID.
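The reconciliation idea can be sketched as follows. This is an illustration under stated assumptions rather than the production logic: the helper name is hypothetical, the adjustment line is assumed to carry a quantity of 1, and the delta is assumed to be the target value minus the itemised total.

```python
def add_adjustment_factor(items, target_value):
    """Return the items, plus an ADJUSTMENT_FACTOR line if they do not balance."""
    itemised_total = sum(i["ITEM_PRICE"] * i["ITEM_QUANTITY"] for i in items)
    delta = target_value - itemised_total
    if delta == 0:
        return list(items)
    adjustment = {
        "ITEM_TITLE": "ADJUSTMENT_FACTOR",
        "ITEM_PRICE": delta,   # positive or negative, in minor currency units
        "ITEM_QUANTITY": 1,    # assumption made for this sketch
        "ITEM_ID": None,       # adjustment lines do not include an ITEM_ID
    }
    return list(items) + [adjustment]

# Hypothetical receipt showing 6 of 8 purchased items, with a basket value of 7500.
shown_items = [
    {"ITEM_TITLE": f"Item {n}", "ITEM_PRICE": 1000, "ITEM_QUANTITY": 1, "ITEM_ID": f"ref-{n}"}
    for n in range(1, 7)
]
balanced = add_adjustment_factor(shown_items, target_value=7500)

# Item-level analyses may want to exclude the adjustment line.
real_items = [i for i in balanced if i["ITEM_TITLE"] != "ADJUSTMENT_FACTOR"]
```

Whether to keep the adjustment line depends on the analysis: it is needed for basket totals to balance, but it does not correspond to a purchased product.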
Point in Time
Using the PROCESSED_DATE, it is possible to fix your view of historical transactions that may have been reprocessed. For example, you might want to consider only the first extraction of a receipt by selecting the MIN(PROCESSED_DATE) of a receipt, or alternatively always look at the latest revision by selecting the MAX(PROCESSED_DATE) instead. We recommend using the latest, as we are continuously improving the quality and accuracy of our dataset. However, this may not be suitable for every use case, so we provide the flexibility for you to filter the data as necessary.
It's important that you consider and apply the approach that works best for you, as the same receipt will appear in the feed multiple times if it is reprocessed. If you do not handle the versions correctly and then attempt to perform aggregate queries on the data, you will get an incorrect result, as the same values may be counted multiple times.
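A minimal sketch of the two point-in-time views is shown below. It assumes that TRANSACTION_ID identifies a receipt across reprocessing, that all rows from one revision share the same PROCESSED_DATE, and that timestamps are strings in the YYYY-MM-DD hh:mm:ss format (which sort chronologically). The helper name and sample rows are hypothetical.

```python
def select_revision(rows, latest=True):
    """Keep, per TRANSACTION_ID, only the rows from one PROCESSED_DATE.

    latest=True mirrors MAX(PROCESSED_DATE); latest=False mirrors MIN(PROCESSED_DATE).
    """
    pick = max if latest else min
    chosen = {}
    for row in rows:
        key = row["TRANSACTION_ID"]
        stamp = row["PROCESSED_DATE"]
        chosen[key] = pick(chosen.get(key, stamp), stamp)
    return [r for r in rows if r["PROCESSED_DATE"] == chosen[r["TRANSACTION_ID"]]]

# Hypothetical feed: txn-1 was reprocessed once, txn-2 was never reprocessed.
feed = [
    {"TRANSACTION_ID": "txn-1", "PROCESSED_DATE": "2024-01-05 08:00:00", "TRANSACTION_VALUE": 1000},
    {"TRANSACTION_ID": "txn-1", "PROCESSED_DATE": "2024-03-01 09:30:00", "TRANSACTION_VALUE": 1050},
    {"TRANSACTION_ID": "txn-2", "PROCESSED_DATE": "2024-01-06 10:15:00", "TRANSACTION_VALUE": 500},
]

naive_total = sum(r["TRANSACTION_VALUE"] for r in feed)  # double counts txn-1
latest_total = sum(r["TRANSACTION_VALUE"] for r in select_revision(feed))
first_total = sum(r["TRANSACTION_VALUE"] for r in select_revision(feed, latest=False))
```

Here the naive total is 2550, while the latest and first views give 1550 and 1500 respectively, which shows how unhandled revisions inflate aggregates. For itemised receipts, TRANSACTION_VALUE would also need aggregating at the order level rather than per line item.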