e-Receipts query cookbook

Introduction

The e-Receipts dataset can be used to perform a wide range of analysis based on purchase and item level data.

This guide assumes that you have ingested the dataset to prepare it for efficient querying.

Aggregate purchases

To start, a simple aggregate of all receipts for a calendar year, grouped by month and currency for a specific merchant:

select
	date_trunc('month', DATETIME) as month,
	count(*) as transactions,
	CURRENCY as currency,
	sum(TRANSACTION_VALUE) as ttv
from purchases_prepared
where 
	DATETIME >= '2023-01-01'
	and DATETIME < '2024-01-01'
	and ENTITY_NAME = 'AMAZON.COM INC.'
group by month, currency
order by month, currency

Note that monetary values are always presented in their minor unit, i.e $1 USD = 100. Dividing by 100 may be appropriate, depending on your exact needs.

This uses the date_trunc function to group transactions by month, which is not available in all SQL dialects, however alternatives are available.

Filtering by cohorts of users

You may wish to define cohorts of users, based on different criteria. The USER_ID field allows filtering to include or exclude groups of users.

We do not recommend filtering by INBOX_ID, since users are able to connect multiple inboxes.

select *
from purchases_prepared
where USER_ID IN (
	select USER_ID
	from purchase_cohort_alpha
)

In this case a simple subquery selects from a hypothetical cohort table which you have prepared according to criteria relevant to your use case.

Item level analysis

Assuming that you have unflattened the dataset in some form, following the ingestion guide, you can query the dataset at an item level.

For example looking at the 6 month average price of a dozen eggs across different merchants:

select
	ITEM_TITLE as item,
	ENTITY_NAME as merchant,
	round(avg(ITEM_PRICE / 100), 2) as avg,
	count(*) as rows
from purchase_items
where
	DATETIME >= "2023-01-01" and DATETIME < "2023-07-01"
	and ITEM_TITLE like "%eggs%"
	and (ITEM_TITLE like "%12%" or ITEM_TITLE like "%dozen%")
	and CURRENCY = "GBP"
group by item, merchant
having rows > 1
order by rows desc, merchant, item

Note that in this example the count is of the number of rows, not the number of distinct transactions. Someone might buy one box of medium eggs as well as one box of large eggs in a single transaction.

As always with statistical analysis you should be cautious using averages and the median may often be a more representative aggregation.

Receipt versions

If you want to examine how a receipt has been represented over different versions, you can use the PROCESSED_DATE and TRANSACTION_ID to group them together. This must be run on a dataset that has not already been processed to deduplicate for point in time.

SELECT *,
	DENSE_RANK() OVER (
		PARTITION BY TRANSACTION_ID
		ORDER BY PROCESSED_DATE
	) AS version
FROM purchases_raw
ORDER BY DATETIME, TRANSACTION_ID, version

Fixed view of history

If you want to fix a historical view of the dataset, either for the purposes of training a model, or for viewing the dataset as it was at a specific point in time you can leverage the PROCESSED_DATE field.

-- find all the purchases for a year, processed before our cutoff date
with fixed_window as (
select *
from purchases_raw
where
	DATETIME >= "2023-01-01" and DATETIME < "2024-01-01"
	and PROCESSED_DATE < "2024-01-01" -- date we want to fix 
),

-- find the most recently processed version of each purchase
with latest_purchases as (
	select 
		TRANSACTION_ID,
		max(PROCESSED_DATE) as latest_date
	from fixed_window
	group by TRANSACTION_ID
)

-- join the dataset to get a full, filtered view
select p.*
from purchases
join latest_purchases la on
	p.TRANSACTION_ID = la.TRANSACTION_ID
	and p.PROCESSED_DATE = la.latest_date
order by DATETIME;