Getting insights from raw data

In this article, I talk about my experience with a complex analysis and how I am trying to make it easier for the future.

Saqib Ansari

13 April 2024

min read

In the past year, we made some advanced reports for our revenue analysis which was quite challenging. In order to make the next one easier I have been trying to find patterns in all those analyses. Almost every analysis goes through the following phases:

Extraction: This is where you select the tables and columns and their joins that you need for the analysis.
Transformation: Defining the additional columns that are derived from the raw data.
Aggregation: Grouping and aggregating the data.
Structuring: Collecting the derived and grouped data into the final required structure.
Slicing: Creating summary reports based on various dimensions.

To explain let me take a real example of finding the Net Revenue Retention for our business. First, we opened a new spreadsheet and started defining what the final data would look like. In this case, we listed down the following breakup of numbers:

Due Date	MRR Start	Expansion	Contraction	Upsell	Churn	Net Revenue
Jan	5000	1000	700	800	900	5200

We then listed down dimensions to break down the above table like Territory, Is Partner, Is Multi-Site Team, Is Enterprise etc.

Extraction

The first step is to extract the data we need from the tables & columns that contain the raw data. The columns needed were:

Invoice: Date, Team, Total
Team: Email, Is Partner, Territory

Transformation

With a sample raw data, we figured out the formula for the Expansion, Contraction, Upsell & Churn

Expansion = Invoice total of the first invoice of the team
Contraction = Decrease in invoice total compared to the previous invoice of a team
Upsell = Increase in invoice total compared to the previous invoice of a team
Churn = Invoice total of the last invoice of the team
MRR Start = Previous month's net revenue

Based on the above calculations we were able to create a single query to create raw invoice data and mark each invoice as Expansion or Contraction or Upsell but calculating the Churn value and month wasn't straightforward.

Churn is the Invoice Total of the invoices which aren't paid in the given month but were paid in the previous month. So, January churn would be the invoice paid amount in December month but not paid in January month.

To calculate this, we created a separate query to know the last paid invoices per team. Then we calculated the churned month i.e paid invoice date + 1 month

So we had these two view now,

View 1:

Inv.Date	Email	Territory	Invoice Total	Change Type	Change Amount
Jan	abc@example.com	India	$9	Expansion	$9
Jan	def@example.com	USA	$10	Expansion	$2

View 2:

Churned Date	Last Inv. Date	Email	Territory	Invoice Total	Change Type	Change Amount
Feb	Jan	jhs@example.com	India	$9	Churn	$9
Feb	Jan	135@example.com	USA	$10	Churn	$10

The raw data preparation was done. Now, we needed to aggregate the data.

Aggregation

We bring both these views into a common structure

Pivoting View 1 on the "Change Type" column we got:

Month	Expansion	Contraction	Upsell
Jan	$9	-	-
Feb	$3	$7	$2

Then we summarized the Churned Teams (View 2) table to a monthly level and joined it with the summarized table above and got:

Month	Expansion	Contraction	Upsell	Churn
Jan	$9	-	-	-
Feb	$3	$7	$2	$6

Structuring

The calculation for Net Revenue (NR) is - NR = Previous NR + Expansion - Contraction + Upsell - Churn

For this, we used LAG() function to get the Previous NR and performed the calculation to get:

Month	Expansion	Contraction	Upsell	Churn	Net Revenue
Jan	$9				$9
Feb	$3	$1	$2	$2	$11

With this, we knew the overall picture of the revenue growth. At this point, we had around 5-6 queries, which were tough to manage as we had to jump between different tabs for different queries. However, the next task of breaking down these numbers by dimensions was even more difficult.

Slicing

To filter a query based on a column, that column has to exist in the source of the query. We tried to push down each of the dimensions as a column in every query which broke the relative calculations based on row positions.

Let's take the Territory filter as an example and for better understanding I will refer to the above 5 tables as A, B, C and so on

Table E has to be filtered by Territory. The source of Table E is Table D which doesn't have a Territory column so we can't filter
To add Territory to Table D we need to add Territory to its source. The source of Table D was a join between two tables that are summarised versions of Table A & Table B. So Table D = Table A(s) <> Table B(s)
Table A & Table B had a territory column, so we included that column in the summarized table as well.
Table A(s) & Table B(s) were joined on the Month column, but after including Territory, the Month column wasn't a primary key so we had to include Territory as well in the join condition.
Now Table D had a Territory column, so we included it and then Table E had access to it

Conclusion

I realized later, that the join between summary tables could have been avoided if the first two tables were union-ed together. But making that change itself is too much work because changing the dependencies of a query is very difficult.

Regardless, this was done for every dimension we had, which I was very much frustrated by. Not to mention, when dimensions are spread across various data sources then the analysis has to continue on a local database by storing the intermediate/raw data into a local database.

Going through such an analysis after months, without any documentation or any visual hierarchy of queries and making changes that aren't easily undoable has been a nightmare.

So, I am currently working on a major refactor, In an attempt to make such analysis easier. I am also trying to figure out a way to improve the filtering of a composite query in a much easier way.

Published by

Saqib Ansari

13 April 2024

Success!

Error

Comments

JayaPrakash J

May 6, 2024

Working with Data has always fascinated me, and I'm in awe and admiration of your willingness to go back and refactor such complex queries. Wishing you luck in attaining those moments of clarity where the query just clicks.

Discussion

Paul Mugambi

3 days

ago

Beautiful read, and an insight into an individual I respect and have learned a lot from. Am inspired to trust the process and never give up.

Anna Dane

5 days

ago

I must say this is a really amazing post, and for some of my friends who provide Best British Assignment Help, I must recommend this post to them.

Add your comment

Comment