In the past year, we made some advanced reports for our revenue analysis which was quite challenging. In order to make the next one easier I have been trying to find patterns in all those analyses. Almost every analysis goes through the following phases:
- Extraction: This is where you select the tables and columns and their joins that you need for the analysis.
- Transformation: Defining the additional columns that are derived from the raw data.
- Aggregation: Grouping and aggregating the data.
- Structuring: Collecting the derived and grouped data into the final required structure.
- Slicing: Creating summary reports based on various dimensions.
To explain let me take a real example of finding the Net Revenue Retention for our business. First, we opened a new spreadsheet and started defining what the final data would look like. In this case, we listed down the following breakup of numbers:
Due Date | MRR Start | Expansion | Contraction | Upsell | Churn | Net Revenue |
---|---|---|---|---|---|---|
Jan | 5000 | 1000 | 700 | 800 | 900 | 5200 |
We then listed down dimensions to break down the above table like Territory, Is Partner, Is Multi-Site Team, Is Enterprise etc.
Extraction
The first step is to extract the data we need from the tables & columns that contain the raw data. The columns needed were:
- Invoice: Date, Team, Total
- Team: Email, Is Partner, Territory
Transformation
With a sample raw data, we figured out the formula for the Expansion, Contraction, Upsell & Churn
- Expansion = Invoice total of the first invoice of the team
- Contraction = Decrease in invoice total compared to the previous invoice of a team
- Upsell = Increase in invoice total compared to the previous invoice of a team
- Churn = Invoice total of the last invoice of the team
- MRR Start = Previous month's net revenue
Based on the above calculations we were able to create a single query to create raw invoice data and mark each invoice as Expansion or Contraction or Upsell but calculating the Churn value and month wasn't straightforward.
Churn is the Invoice Total of the invoices which aren't paid in the given month but were paid in the previous month. So, January churn would be the invoice paid amount in December month but not paid in January month.
To calculate this, we created a separate query to know the last paid invoices per team. Then we calculated the churned month i.e paid invoice date + 1 month
So we had these two view now,
View 1:
Inv.Date | Territory | Invoice Total | Change Type | Change Amount | |
---|---|---|---|---|---|
Jan | abc@example.com | India | $9 | Expansion | $9 |
Jan | def@example.com | USA | $10 | Expansion | $2 |
View 2:
Churned Date | Last Inv. Date | Territory | Invoice Total | Change Type | Change Amount | |
---|---|---|---|---|---|---|
Feb | Jan | jhs@example.com | India | $9 | Churn | $9 |
Feb | Jan | 135@example.com | USA | $10 | Churn | $10 |
The raw data preparation was done. Now, we needed to aggregate the data.
Aggregation
We bring both these views into a common structure
Pivoting View 1 on the "Change Type" column we got:
Month | Expansion | Contraction | Upsell |
---|---|---|---|
Jan | $9 | - | - |
Feb | $3 | $7 | $2 |
Then we summarized the Churned Teams (View 2) table to a monthly level and joined it with the summarized table above and got:
Month | Expansion | Contraction | Upsell | Churn |
---|---|---|---|---|
Jan | $9 | - | - | - |
Feb | $3 | $7 | $2 | $6 |
Structuring
The calculation for Net Revenue (NR) is - NR = Previous NR + Expansion - Contraction + Upsell - Churn
For this, we used LAG()
function to get the Previous NR and performed the calculation to get:
Month | Expansion | Contraction | Upsell | Churn | Net Revenue |
---|---|---|---|---|---|
Jan | $9 | $9 | |||
Feb | $3 | $1 | $2 | $2 | $11 |
With this, we knew the overall picture of the revenue growth. At this point, we had around 5-6 queries, which were tough to manage as we had to jump between different tabs for different queries. However, the next task of breaking down these numbers by dimensions was even more difficult.
Slicing
To filter a query based on a column, that column has to exist in the source of the query. We tried to push down each of the dimensions as a column in every query which broke the relative calculations based on row positions.
Let's take the Territory filter as an example and for better understanding I will refer to the above 5 tables as A, B, C and so on
- Table E has to be filtered by Territory. The source of Table E is Table D which doesn't have a Territory column so we can't filter
- To add Territory to Table D we need to add Territory to its source. The source of Table D was a join between two tables that are summarised versions of Table A & Table B. So Table D = Table A(s) <> Table B(s)
- Table A & Table B had a territory column, so we included that column in the summarized table as well.
- Table A(s) & Table B(s) were joined on the Month column, but after including Territory, the Month column wasn't a primary key so we had to include Territory as well in the join condition.
- Now Table D had a Territory column, so we included it and then Table E had access to it
Conclusion
I realized later, that the join between summary tables could have been avoided if the first two tables were union-ed together. But making that change itself is too much work because changing the dependencies of a query is very difficult.
Regardless, this was done for every dimension we had, which I was very much frustrated by. Not to mention, when dimensions are spread across various data sources then the analysis has to continue on a local database by storing the intermediate/raw data into a local database.
Going through such an analysis after months, without any documentation or any visual hierarchy of queries and making changes that aren't easily undoable has been a nightmare.
So, I am currently working on a major refactor, In an attempt to make such analysis easier. I am also trying to figure out a way to improve the filtering of a composite query in a much easier way.