Receipts ingestion guide
Overview
Before reading this guide, you should be familiar with the dataset schema.
There are a number of nuances to working with the Receipts dataset that are inherent in the nature of the product. To transform the incremental feed into a single, coherent, easy-to-use dataset, the following steps must be taken:
- Handling Point-in-Time
- Unflattening
- De-duplicating receipts
We do not recommend working with the dataset in its 'raw' form, as there are a number of database technologies and ingestion strategies that will significantly improve the ergonomics of working with the data.
Handling Point-in-Time
The dataset supports Point-in-Time (PIT), so you are able to interrogate the data based on when it was processed. The two key fields for this are TRANSACTION_ID and PROCESSED_DATE. We recommend always keeping the dataset in its full form, that is, with all versions of each receipt, so that you can reference or rebuild views based on the full history.
Solutions
Depending on your use case there will be differences in the approach; however, the most common case will be representing only the newest version of each receipt. To do this you simply need to take the MAX(PROCESSED_DATE) for each receipt.
The following query uses a simple Common Table Expression (CTE) to create a view of the dataset based on the most recent PROCESSED_DATE:
-- assumes `purchases_raw` is the table with all records
-- first find the most recent version of each receipt
WITH latest_purchases AS (
    SELECT TRANSACTION_ID, MAX(PROCESSED_DATE) AS latest_date
    FROM purchases_raw
    GROUP BY TRANSACTION_ID
)
-- now join the result together and sort chronologically
SELECT p.*
FROM purchases_raw p
JOIN latest_purchases l ON
    p.TRANSACTION_ID = l.TRANSACTION_ID
    AND p.PROCESSED_DATE = l.latest_date
ORDER BY p.DATETIME;
We recommend making this a materialised view, or periodically regenerating a table from this query.
You can extend this query to fit your exact use case; for example, to use the first version of each receipt instead, simply replace MAX with MIN.
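As a minimal sketch, you could regenerate the cleaned table on a schedule like this (CREATE OR REPLACE TABLE ... AS is Snowflake/BigQuery-style syntax, and materialised view support and restrictions vary by engine, so adjust to your own database):

-- sketch: periodically regenerate a cleaned table holding only the
-- latest version of each receipt
CREATE OR REPLACE TABLE purchases AS
WITH latest_purchases AS (
    SELECT TRANSACTION_ID, MAX(PROCESSED_DATE) AS latest_date
    FROM purchases_raw
    GROUP BY TRANSACTION_ID
)
SELECT p.*
FROM purchases_raw p
JOIN latest_purchases l ON
    p.TRANSACTION_ID = l.TRANSACTION_ID
    AND p.PROCESSED_DATE = l.latest_date;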
Unflattening
The dataset is delivered in a 'flattened' format, so it requires pre-processing before it can easily be queried at a transaction level rather than at an item level.
To illustrate this, a receipt might originally 'look' like the following:
| Item | Quantity | Unit Price |
|---|---|---|
| Pen | 2 | 1.20 |
| Notebook | 1 | 5.50 |
| Subtotal | | 7.90 |
| Shipping | | 2.00 |
| Total | | 9.90 |
This would be flattened in the feed as follows:
| Item | Quantity | Price | Subtotal | Shipping | Total |
|---|---|---|---|---|---|
| Pen | 2 | 1.20 | 7.90 | 2.00 | 9.90 |
| Notebook | 1 | 5.50 | 7.90 | 2.00 | 9.90 |
Note that not all merchants itemise their receipts, as indicated by IS_ITEMISED. Also note that this is not consistent even within a merchant: there are periods of time when some merchants do not provide itemised receipts.
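If you want to understand itemisation coverage before choosing an approach, a quick sketch (assuming IS_ITEMISED is a boolean column) is to measure it per merchant:

-- sketch: share of receipts that are itemised, per merchant
SELECT
    MERCHANT_SLUG,
    AVG(CASE WHEN IS_ITEMISED THEN 1 ELSE 0 END) AS itemised_share
FROM purchases_raw
GROUP BY MERCHANT_SLUG;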
Solutions
This can be tackled in a few different ways depending on your use case and available technologies:
- Using nested fields for items
- Separate tables for `purchase` and `purchase_items`
Nested fields
A number of database technologies support the concept of 'nested' and 'repeated' fields, which is essentially an array of structs. The advantage of this approach is that the dataset is denormalised, which can reduce data storage and improve query performance.
To bring multiple items into a nested field for a single row, you need to group by both TRANSACTION_ID and PROCESSED_DATE.
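As an illustrative sketch in BigQuery-style SQL (ARRAY_AGG/STRUCT availability, and the ITEM, QUANTITY, PRICE and TOTAL column names, are assumptions to check against your schema and engine):

-- sketch: collapse item rows into one row per receipt version,
-- holding the items in a repeated field
SELECT
    TRANSACTION_ID,
    PROCESSED_DATE,
    -- transaction-level fields repeat on every item row,
    -- so picking any one value is safe
    ANY_VALUE(TOTAL) AS TOTAL,
    ARRAY_AGG(STRUCT(ITEM, QUANTITY, PRICE)) AS ITEMS
FROM purchases_raw
GROUP BY TRANSACTION_ID, PROCESSED_DATE;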
Separate tables
If your database does not support nested or repeated fields, then a more traditional relational model may be appropriate: a table for the core purchase fields, and a separate table for all the items within a purchase (if available).
The fundamental approach remains the same: grouping purchases by both TRANSACTION_ID and PROCESSED_DATE.
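A minimal sketch of the split (the column names beyond TRANSACTION_ID and PROCESSED_DATE are assumptions; defer to the dataset schema):

-- sketch: one row per receipt version in `purchases`
CREATE OR REPLACE TABLE purchases AS
SELECT DISTINCT
    TRANSACTION_ID,
    PROCESSED_DATE,
    MERCHANT_SLUG,
    DATETIME,
    TOTAL
FROM purchases_raw;

-- sketch: one row per item in `purchase_items`, keyed back to the purchase
CREATE OR REPLACE TABLE purchase_items AS
SELECT
    TRANSACTION_ID,
    PROCESSED_DATE,
    ITEM,
    QUANTITY,
    PRICE
FROM purchases_raw
WHERE IS_ITEMISED;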
De-duplicating receipts
Merchants will sometimes send multiple emails for a single transaction. We make a best effort to deduplicate these before including them in the feed, but there are some cases where this is not possible or practical.
Solutions
To deduplicate the feed we need to use both the ORDER_REFERENCE and the TRANSACTION_ID. The former is (usually, but not always) provided by the merchant; the latter is our unique identifier for an email.
We ignore any ORDER_REFERENCE shorter than 6 characters, as these are often used within physical stores (fast food chains especially) as monotonically incrementing integers that periodically reset. There are examples in the dataset of the same person having two receipts from the same merchant with the same order reference that are in fact distinct purchases. These references are therefore not unique enough to use for accurate deduplication.
We can use CTEs to accomplish this, assuming the dataset has already been flattened and PIT has been handled:
-- assumes `purchases_raw` is the table with all records
-- first derive a stable identifier for each purchase
WITH identified_purchases AS (
    SELECT
        p.*,
        COALESCE(
            CASE
                WHEN LENGTH(ORDER_REFERENCE) < 6 THEN NULL
                ELSE ORDER_REFERENCE
            END,
            TRANSACTION_ID
        ) AS IDENTIFIER
    FROM purchases_raw p
),
-- then keep only the most recent purchase for each identifier and merchant
latest_purchases AS (
    SELECT IDENTIFIER, MERCHANT_SLUG, MAX(DATETIME) AS LATEST_DATE
    FROM identified_purchases
    GROUP BY IDENTIFIER, MERCHANT_SLUG
)
-- now join together to deduplicate and sort chronologically
SELECT p.*
FROM identified_purchases p
JOIN latest_purchases l ON
    p.IDENTIFIER = l.IDENTIFIER
    AND p.MERCHANT_SLUG = l.MERCHANT_SLUG
    AND p.DATETIME = l.LATEST_DATE
ORDER BY p.DATETIME;
Reference architecture
You may wish to consider the following table architecture for the data:
- `purchases_raw` - An append-only table, containing the raw data before any transformations are applied. The incremental feed files can be loaded directly into here.
- `purchases` - A cleaned view of all purchases; this is regenerated on a frequent interval from the raw table. This is where the majority of analysis will likely be done.
- `purchase_items` - (Optional) If you do not use nested fields, then the purchase items can be loaded into a separate table to be queried relationally. This will also need to be regenerated on the same cadence as the main table, to maintain consistency and accuracy.
- `merchants` - (Optional) If you receive the additional fields for finance, you may wish to store these on a separate table and query them relationally.
Indexes
When planning your indexes or partition fields, there are a number of considerations:
- DATETIME is when the purchase was made and will be frequently queried. However, it should not be considered static: historical receipts are constantly being added, and you may receive a receipt in the feed today that is 3 years old.
- PROCESSED_DATE is when a receipt was last updated by Gener8 and can be considered static; a receipt that is reprocessed reappears in the feed with updated values and a new PROCESSED_DATE.
- ORDER_REFERENCE is provided by the merchant and is not guaranteed to be unique, even within a merchant, or within a merchant and customer.
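For example, in BigQuery-style DDL (the partitioning and clustering choices here are a sketch to validate against your own workload and engine):

-- sketch: partition on PROCESSED_DATE because historical partitions
-- stay static; DATETIME does not, as old receipts keep arriving
CREATE TABLE purchases_raw (
    TRANSACTION_ID STRING,
    MERCHANT_SLUG STRING,
    DATETIME TIMESTAMP,
    PROCESSED_DATE TIMESTAMP
    -- ... remaining columns per the dataset schema
)
PARTITION BY DATE(PROCESSED_DATE)
CLUSTER BY MERCHANT_SLUG;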
Assumptions
History is 5+ years
We currently look back 5 years for receipts when users connect; however, this may increase in the future, and there are already receipts in the dataset that are more than 5 years old.
Data is mutable
We are always striving to provide the most accurate and detailed reflection of reality, which means that we may present different versions of the same receipt in the future.
Monetary values use minor currency units
When we represent any monetary value, such as tax, shipping or sub-total, it is always in the minor form of the currency, i.e. 'cents'. For example, $1 USD would be represented as 100. This means that currency values are always whole numbers, never fractions or floats.
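For example, converting a total to major units for display (the TOTAL column and a two-decimal currency are assumptions; store the integer values and convert only at presentation time):

-- sketch: convert minor units to major units for display only
SELECT
    TOTAL,
    TOTAL / 100.0 AS TOTAL_MAJOR
FROM purchases;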
Known Issues
Snowflake
Our underlying infrastructure stores all timestamps with microsecond precision. When these are exported to Parquet they are encoded to an int64 of unix time, in microseconds.
To correctly parse this when transforming or loading from Snowflake, you should use the TO_TIMESTAMP function:
SELECT TO_TIMESTAMP(RECEIVED_AT, 6) AS RECEIVED_AT
FROM purchases;