GenAI Events - User Guide
Client Support
Technical
[email protected]
About Gener8
Since its launch in 2018, Gener8 has been at the forefront of the “open data” movement: the belief that people should be able to control, and be rewarded for, their data. Gener8 makes it easy for its users to share their digital data anonymously, in exchange for rewards, through its mobile and desktop apps.
This clear and transparent value exchange means that Gener8 has access to a uniquely comprehensive dataset. Gener8 upholds the highest standards of informed consent for every dataset a user chooses to share.
Dataset
Schema
| Name | Type | Optional | Description |
|---|---|---|---|
| id | STRING | N | A permanent identifier for the event |
| user_id | STRING | N | Permanent and unique user ID. |
| user_agent | STRING | Y | The user agent of the browser from which the event was received |
| latitude | FLOAT | Y | Geocoded latitude, based on client IP address at the time |
| longitude | FLOAT | Y | Geocoded longitude, based on client IP address at the time |
| postal_code | STRING | Y | Geocoded postal code the event occurred in, based on client IP address at the time. Not available for all regions. |
| city | STRING | Y | Geocoded city the event occurred in, based on client IP address at the time. |
| region | STRING | Y | Geocoded region the event occurred in, based on client IP address at the time. |
| country | STRING | Y | Geocoded two-letter country code the event occurred in, based on client IP address at the time. |
| received_at | TIMESTAMP | N | UTC timestamp of when Gener8 received the event |
| event | STRING | N | The type of event, one of 'prompt' or 'response' |
| content_type | STRING | N | The MIME type of the content |
| content | STRING | N | The content of the event |
| conversation_id | STRING | Y | The unique conversation identifier, to group prompts and responses |
| vendor | STRING | N | The vendor of the product the event was collected from, e.g. 'OpenAI' or 'Google' |
| product | STRING | N | The common product name for the source of the events, e.g. 'ChatGPT' or 'AI Overview' |
| model | STRING | Y | The model name/version, as given by the vendor |
| timestamp | TIMESTAMP | Y | The UTC timestamp of when the event occurred, according to the user's device |
| timezone | STRING | Y | The timezone the event was made from, according to the user's device |
| sequence | INT | Y | The order of the message in the conversation, zero-indexed |
| session_id | STRING | Y | A unique session identifier for the user |
| package_name | STRING | Y | The app which the event was collected from |
| package_version | STRING | Y | The app version the event was collected from |
| source | STRING | N | The collection source of the event |
| sources | STRUCT | Y | A repeated field of structs containing sources from responses |
| sources.url | STRING | N | The URL of the source page |
| sources.title | STRING | N | The title from the source page |
| sources.summary | STRING | N | A short description of the source page |
| sources.author | STRING | Y | The author name, if available |
Geocoded fields
The geocoded fields (country, region, city, postal_code, latitude and longitude) are inferred from the user's IP address using a dataset from a third-party provider. This dataset takes the latitude and longitude and translates them into an interpretable location. The dataset is updated twice a week and is approximately 99.8% accurate at the country level and 68% accurate at the city level. Note that if a user browses the web through a VPN, these fields will reflect the VPN's location and not the user's real location.
Session IDs
A new session ID is created when more than 30 minutes have passed since the user's last event.
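The 30-minute rule can be illustrated with a minimal Python sketch. This is not Gener8's implementation: the function name `assign_sessions`, the integer session labels and the choice of which timestamp drives the gap are all assumptions made for illustration. Real session_id values are opaque strings.

```python
from datetime import datetime, timedelta

# Threshold taken from the rule above; everything else is illustrative.
SESSION_GAP = timedelta(minutes=30)


def assign_sessions(timestamps):
    """Return one illustrative session number per timestamp for a single user.

    Timestamps are processed in chronological order. A new session starts
    when MORE than 30 minutes separate an event from the previous one.
    """
    sessions = []
    current = 0
    previous = None
    for ts in sorted(timestamps):
        if previous is not None and ts - previous > SESSION_GAP:
            current += 1
        sessions.append(current)
        previous = ts
    return sessions


events = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 20),   # 20 minutes later: same session
    datetime(2024, 1, 1, 9, 50),   # exactly 30 minutes later: same session
    datetime(2024, 1, 1, 10, 21),  # 31 minutes later: new session
]
print(assign_sessions(events))  # [0, 0, 0, 1]
```

The same logic can be used to sanity-check session boundaries in an export, for example by confirming that consecutive events sharing a session_id are never more than 30 minutes apart.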
Response content types
Some collection sources are able to capture both the plain-text response and the HTML, making it possible to understand formatting and to extract inline links. These responses are represented as two rows in which all fields are identical, including the id, except for content_type and content. For example:
| id | event | content_type | content |
|---|---|---|---|
| 123 | prompt | text/plain | Hello :) |
| 456 | response | text/plain | Hi there! How can I help? |
| 456 | response | text/html | <p>Hi there! How can I help?</p> |
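Because a response can appear once per content type, counting rows directly can double-count events. The following Python sketch keeps a single row per id, preferring one content type. The helper name `one_row_per_event` and the dictionary-based row format are assumptions for illustration, not part of the Gener8 feed.

```python
def one_row_per_event(rows, preferred="text/plain"):
    """Keep a single row per event id, favouring the preferred content_type.

    If an id has no row with the preferred content_type, the first row
    seen for that id is kept.
    """
    chosen = {}
    for row in rows:
        current = chosen.get(row["id"])
        if current is None:
            chosen[row["id"]] = row
        elif row["content_type"] == preferred and current["content_type"] != preferred:
            chosen[row["id"]] = row
    return list(chosen.values())


rows = [
    {"id": "123", "event": "prompt", "content_type": "text/plain",
     "content": "Hello :)"},
    {"id": "456", "event": "response", "content_type": "text/html",
     "content": "<p>Hi there! How can I help?</p>"},
    {"id": "456", "event": "response", "content_type": "text/plain",
     "content": "Hi there! How can I help?"},
]
deduped = one_row_per_event(rows)
print([(r["id"], r["content_type"]) for r in deduped])
# [('123', 'text/plain'), ('456', 'text/plain')]
```

Passing `preferred="text/html"` instead keeps the HTML variant where one exists, which is useful when inline links are needed.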
Events without timestamps
Due to our collection methodology, we are able to collect historical events from some sources and vendors, but we are not able to identify when those events occurred. These events will have a null value in the timestamp field. The sequence field can be used to correctly order the messages in a conversation.
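As a rough illustration of ordering by sequence when timestamp is null, the Python sketch below groups rows by conversation_id and sorts each group. The function name `order_by_conversation` and the rule that rows without a sequence value go last are assumptions of this sketch, since both fields are optional in the schema.

```python
from collections import defaultdict


def order_by_conversation(rows):
    """Group rows by conversation_id and sort each group by sequence."""
    conversations = defaultdict(list)
    for row in rows:
        conversations[row["conversation_id"]].append(row)
    for messages in conversations.values():
        # Assumption: rows lacking a sequence value are placed last.
        messages.sort(key=lambda r: (r["sequence"] is None, r["sequence"] or 0))
    return dict(conversations)


rows = [
    {"conversation_id": "c1", "sequence": 1, "event": "response", "timestamp": None},
    {"conversation_id": "c1", "sequence": 0, "event": "prompt", "timestamp": None},
    {"conversation_id": "c2", "sequence": 0, "event": "prompt", "timestamp": None},
]
ordered = order_by_conversation(rows)
print([r["event"] for r in ordered["c1"]])  # ['prompt', 'response']
```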
Usage with pageviews
It's possible to combine GenAI events with Pageview events into a single feed to analyse online behaviour surrounding a conversation, such as clicking on links or making web searches. There are a number of viable approaches; this simple example creates a combined dataset using a UNION in SQL:
with genai_events as (
    select
        id,
        event,
        timestamp,
        timezone,
        content,
        '' as title,
        vendor,
        user_id
    from gener8_genai_events
    where date(received_at) >= current_date
        and date(timestamp) >= current_date
        and content_type = 'text/plain'
),
pageviews as (
    select
        id,
        'pageview' as event,
        timestamp,
        timezone,
        concat(url_domain, url_path, url_query) as content,
        title,
        'Browser' as vendor,
        user_id
    from gener8_pageviews
    where date(received_at) >= current_date
        and date(timestamp) >= current_date
        and user_id in (select user_id from genai_events)
),
merged_events as (
    select * from pageviews
    union all
    select * from genai_events
)
select *
from merged_events
order by user_id, timestamp