Receipts ingestion guide

Overview

Before following this guide, you should be familiar with the dataset schema.

There are a number of nuances to working with the Receipts dataset that are inherent in the nature of the product. To transform the incremental feed into a single, coherent, easy-to-use dataset, the following steps must be taken:

  1. Handling Point-in-Time
  2. Unflattening
  3. De-duplicating receipts

We do not recommend working with the dataset in its 'raw' form, as there are a number of database technologies and ingestion strategies that will significantly improve the ergonomics of working with the data.

Handling Point-in-Time

The dataset supports Point-in-Time (PIT), so you are able to interrogate the data based on when it was processed. The two key fields for this are TRANSACTION_ID and PROCESSED_DATE. We recommend always keeping the dataset in its full form, that is, with all versions of each receipt, so that you can reference or rebuild views based on the full history.

Solutions

Depending on your use case there will be differences in approach; however, the most common case will be representing only the newest version of each receipt. To do this you simply need to take the MAX(PROCESSED_DATE) for each receipt.

The following query uses a simple Common Table Expression (CTE) that shows how you can create a view of the dataset based on the most recent PROCESSED_DATE:

--- assumes `purchases_raw` is the table with all records.
--- first find the most recent version of each receipt
with latest_purchases as (
	SELECT TRANSACTION_ID, MAX(PROCESSED_DATE) as latest_date
	FROM purchases_raw
	GROUP BY TRANSACTION_ID
)

--- Now join the result together and sort chronologically
SELECT p.*
FROM purchases_raw p 
JOIN latest_purchases l ON
	p.TRANSACTION_ID = l.TRANSACTION_ID
	AND p.PROCESSED_DATE = l.latest_date
ORDER BY p.DATETIME;

We recommend materialising this as a view, or periodically regenerating a table that stores the result.
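
For example, here is a minimal sketch using PostgreSQL-style syntax; materialised view support and syntax vary between databases, and the view name purchases_latest is illustrative:

--- wrap the latest-version query from above in a materialised view
CREATE MATERIALIZED VIEW purchases_latest AS
WITH latest_purchases AS (
	SELECT TRANSACTION_ID, MAX(PROCESSED_DATE) as latest_date
	FROM purchases_raw
	GROUP BY TRANSACTION_ID
)
SELECT p.*
FROM purchases_raw p
JOIN latest_purchases l ON
	p.TRANSACTION_ID = l.TRANSACTION_ID
	AND p.PROCESSED_DATE = l.latest_date;

--- refresh periodically, e.g. after each feed load
REFRESH MATERIALIZED VIEW purchases_latest;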

You can extend this query to fit your exact use case, such as selecting the first version of a receipt by simply replacing MAX with MIN.

Unflattening

The dataset is delivered in a 'flattened' format, so it requires pre-processing before it can easily be queried at a transaction level rather than at an item level.

To illustrate this, a receipt might originally 'look' like the following:

Item      Quantity  Unit Price
Pen       2         1.20
Notebook  1         5.50
Subtotal            7.90
Shipping            2.00
Total               9.30

This would be flattened in the feed as follows:

Item      Quantity  Price  Subtotal  Shipping  Total
Pen       2         1.20   7.90      2.00      9.30
Notebook  1         5.50   7.90      2.00      9.30

Note that not all merchants itemise their receipts, as indicated by IS_ITEMISED. Also note that this is not consistent within a merchant: there are periods of time when some merchants do not provide itemised receipts.

Solutions

This can be tackled in a few different ways depending on your use case and available technologies:

  1. Using nested fields for items
  2. Separate tables for purchase and purchase_items

Nested fields

A number of database technologies support the concept of 'nested' and 'repeated' fields, which is essentially an array of structs. The advantage of this approach is that the dataset is denormalised, which can reduce data storage and improve query performance.

To bring multiple items into a nested field on a single row, you need to group by both TRANSACTION_ID and PROCESSED_DATE.
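
The following is a minimal sketch using BigQuery-style ARRAY_AGG and STRUCT; the item-level column names (ITEM_NAME, QUANTITY, UNIT_PRICE) are assumptions for illustration and should be taken from the dataset schema:

--- collapse item-level rows into one row per receipt version,
--- with the items as a repeated field of structs
SELECT
	TRANSACTION_ID,
	PROCESSED_DATE,
	--- transaction-level fields repeat on every item row,
	--- so any value within the group is the same
	ANY_VALUE(MERCHANT_SLUG) as MERCHANT_SLUG,
	ANY_VALUE(TOTAL) as TOTAL,
	ARRAY_AGG(STRUCT(ITEM_NAME, QUANTITY, UNIT_PRICE)) as ITEMS
FROM purchases_raw
GROUP BY TRANSACTION_ID, PROCESSED_DATE;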

Separate tables

If your database does not support nested or repeated fields, then a more traditional relational model may be appropriate: one table for the core purchase fields, and a separate table for all the items within a purchase (if available).

The fundamental approach remains the same: group purchases by both TRANSACTION_ID and PROCESSED_DATE.
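
The following is a minimal sketch using CREATE TABLE AS; the column lists, including the item-level names, are illustrative and should be taken from the dataset schema:

--- one row per receipt version, dropping the duplicated item rows
CREATE TABLE purchases AS
SELECT DISTINCT
	TRANSACTION_ID, PROCESSED_DATE, MERCHANT_SLUG,
	DATETIME, SUBTOTAL, SHIPPING, TOTAL
FROM purchases_raw;

--- one row per line item, keyed back to the parent receipt version
--- via (TRANSACTION_ID, PROCESSED_DATE)
CREATE TABLE purchase_items AS
SELECT TRANSACTION_ID, PROCESSED_DATE, ITEM_NAME, QUANTITY, UNIT_PRICE
FROM purchases_raw
WHERE IS_ITEMISED;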

De-duplicating receipts

Merchants will sometimes send multiple emails for a single transaction. We make a best effort to deduplicate these before including them in the feed, but there are some cases where this is not possible or practical.

Solutions

To deduplicate the feed we need to use both the ORDER_REFERENCE and the TRANSACTION_ID. The former is (usually, but not always) provided by the merchant; the latter is our unique identifier for an email.

We ignore any ORDER_REFERENCE shorter than 6 characters, as these are often used within physical stores - fast food chains especially - as monotonically incrementing integers which periodically reset. There are examples in the dataset of the same person having two receipts from the same merchant, with the same order reference, which are in fact distinct purchases. These references are therefore not unique enough for accurate deduplication.

We can use CTEs to accomplish this, assuming PIT has already been handled and the dataset has been unflattened:

--- assumes `purchases_raw` is the table with all records.
--- first group together purchases based on the most recently received
WITH latest_purchases AS (
	SELECT
		COALESCE(
			CASE
				WHEN LENGTH(ORDER_REFERENCE) < 6 THEN NULL
				ELSE ORDER_REFERENCE
			END,
			TRANSACTION_ID
		) as IDENTIFIER,
		MERCHANT_SLUG,
		MAX(DATETIME) as LATEST_DATE
	FROM purchases_raw
	GROUP BY IDENTIFIER, MERCHANT_SLUG
)

--- now join together to deduplicate and sort chronologically 
SELECT
	p.*,
	COALESCE(
		CASE
			WHEN LENGTH(p.ORDER_REFERENCE) < 6 THEN NULL
			ELSE p.ORDER_REFERENCE
		END,
		p.TRANSACTION_ID
	) as IDENTIFIER
FROM purchases_raw p
JOIN latest_purchases l ON
	--- recompute the identifier; the outer SELECT's alias
	--- cannot be referenced in the join condition
	COALESCE(
		CASE
			WHEN LENGTH(p.ORDER_REFERENCE) < 6 THEN NULL
			ELSE p.ORDER_REFERENCE
		END,
		p.TRANSACTION_ID
	) = l.IDENTIFIER
	AND p.MERCHANT_SLUG = l.MERCHANT_SLUG
	AND p.DATETIME = l.LATEST_DATE
ORDER BY p.DATETIME;

Reference architecture

You may wish to consider the following table architecture for the data:

  • purchases_raw - An append-only table containing the raw data before any transformations are applied. The incremental feed files can be loaded directly into this table.
  • purchases - A cleaned view of all purchases, regenerated at a frequent interval from the raw table. This is where the majority of analysis will likely be done.
  • purchase_items - (Optional) If you do not use nested fields, then the purchase items can be loaded into a separate table to be queried relationally. This will also need to be regenerated on the same cadence as the main table, to maintain consistency and accuracy.
  • merchants - (Optional) If you receive the additional fields for finance, you may wish to store these on a separate table and query them relationally.
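
The following DDL is a minimal sketch of this layout; the column list is illustrative and the remaining columns should be taken from the full dataset schema:

--- append-only landing table for the incremental feed
CREATE TABLE purchases_raw (
	TRANSACTION_ID  TEXT,
	PROCESSED_DATE  TIMESTAMP,
	MERCHANT_SLUG   TEXT,
	ORDER_REFERENCE TEXT,
	DATETIME        TIMESTAMP,
	IS_ITEMISED     BOOLEAN,
	SUBTOTAL        BIGINT, --- minor currency units, see Assumptions
	SHIPPING        BIGINT,
	TOTAL           BIGINT
	--- ...remaining feed columns per the schema
);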

Indexes

When planning your indexes or partition fields, there are a number of considerations:

  • DATETIME is when the purchase was made and will be frequently queried. However, it should not be considered static: historical receipts are constantly being added, and you may receive a receipt in the feed today that is 3 years old.
  • PROCESSED_DATE is when a receipt was last updated by Gener8; for a given row it is static, but a receipt may be reprocessed and reappear in the feed with updated values and a new PROCESSED_DATE.
  • ORDER_REFERENCE is provided by the merchant and is not guaranteed to be unique, even within a single merchant or a merchant-customer pair.
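
For example, a minimal sketch using PostgreSQL-style indexes; the index names are illustrative, and partition or index syntax varies between databases:

--- PROCESSED_DATE is static per row, making it a good key for the raw table
CREATE INDEX idx_raw_processed_date ON purchases_raw (PROCESSED_DATE);

--- DATETIME is the common analysis filter on the cleaned table
CREATE INDEX idx_purchases_datetime ON purchases (DATETIME);

--- ORDER_REFERENCE is not unique, so use a plain (non-UNIQUE) index
CREATE INDEX idx_purchases_order_ref ON purchases (MERCHANT_SLUG, ORDER_REFERENCE);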

Assumptions

History is 5+ years
We currently look back 5 years for receipts when users connect; however, this may increase in the future, and there are receipts in the dataset that are more than 5 years old.

Data is mutable
We are always striving to provide the most accurate and detailed reflection of reality, which means that we may present different versions of the same receipt in the future.

Monetary values use minor currency units
When we represent any monetary value, such as tax, shipping or sub-total, it is always in the minor form of the currency, i.e. 'cents'. For example, $1 USD would be represented as 100. This means that currency values are always whole numbers, never fractions or floats.
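
For example, a hypothetical conversion for display purposes; dividing by 100 assumes a two-decimal currency such as USD:

--- a TOTAL of 930 stored in the feed represents $9.30
SELECT TOTAL, TOTAL / 100.0 as TOTAL_MAJOR
FROM purchases;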

Known Issues

Snowflake

Our underlying infrastructure stores all timestamps with microsecond precision. When these are exported to Parquet they are encoded as an int64 Unix timestamp, in microseconds.

To correctly parse this when transforming or loading from Snowflake you should use the TO_TIMESTAMP function:

SELECT TO_TIMESTAMP(RECEIVED_AT, 6) as RECEIVED_AT
FROM purchases;