Frappe Technologies
Screenshot 2023-10-25 at 10.48.41 PM.png
Getting insights from raw data
In this article, I talk about my experience with a complex analysis and how I am trying to make it easier for the future.
image7f7d4e.png

By

Saqib Ansari

·

13 April 2024

·

5

min read

In the past year, we made some advanced reports for our revenue analysis which was quite challenging. In order to make the next one easier I have been trying to find patterns in all those analyses. Almost every analysis goes through the following phases:

  1. Extraction: This is where you select the tables and columns and their joins that you need for the analysis.
  2. Transformation: Defining the additional columns that are derived from the raw data.
  3. Aggregation: Grouping and aggregating the data.
  4. Structuring: Collecting the derived and grouped data into the final required structure.
  5. Slicing: Creating summary reports based on various dimensions.

To explain let me take a real example of finding the Net Revenue Retention for our business. First, we opened a new spreadsheet and started defining what the final data would look like. In this case, we listed down the following breakup of numbers:

Due Date MRR Start Expansion Contraction Upsell Churn Net Revenue
Jan 5000 1000 700 800 900 5200

We then listed down dimensions to break down the above table like Territory, Is Partner, Is Multi-Site Team, Is Enterprise etc.

Extraction

The first step is to extract the data we need from the tables & columns that contain the raw data. The columns needed were:

  • Invoice: Date, Team, Total
  • Team: Email, Is Partner, Territory

Transformation

With a sample raw data, we figured out the formula for the Expansion, Contraction, Upsell & Churn

  • Expansion = Invoice total of the first invoice of the team
  • Contraction = Decrease in invoice total compared to the previous invoice of a team
  • Upsell = Increase in invoice total compared to the previous invoice of a team
  • Churn = Invoice total of the last invoice of the team
  • MRR Start = Previous month's net revenue

Based on the above calculations we were able to create a single query to create raw invoice data and mark each invoice as Expansion or Contraction or Upsell but calculating the Churn value and month wasn't straightforward.

Churn is the Invoice Total of the invoices which aren't paid in the given month but were paid in the previous month. So, January churn would be the invoice paid amount in December month but not paid in January month.

To calculate this, we created a separate query to know the last paid invoices per team. Then we calculated the churned month i.e paid invoice date + 1 month

So we had these two view now,

View 1:

Inv.Date Email Territory Invoice Total Change Type Change Amount
Jan abc@example.com India $9 Expansion $9
Jan def@example.com USA $10 Expansion $2

View 2:

Churned Date Last Inv. Date Email Territory Invoice Total Change Type Change Amount
Feb Jan jhs@example.com India $9 Churn $9
Feb Jan 135@example.com USA $10 Churn $10

The raw data preparation was done. Now, we needed to aggregate the data.

Aggregation

We bring both these views into a common structure

Pivoting View 1 on the "Change Type" column we got:

Month Expansion Contraction Upsell
Jan $9 - -
Feb $3 $7 $2

Then we summarized the Churned Teams (View 2) table to a monthly level and joined it with the summarized table above and got:

Month Expansion Contraction Upsell Churn
Jan $9 - - -
Feb $3 $7 $2 $6

Structuring

The calculation for Net Revenue (NR) is - NR = Previous NR + Expansion - Contraction + Upsell - Churn

For this, we used LAG() function to get the Previous NR and performed the calculation to get:

Month Expansion Contraction Upsell Churn Net Revenue
Jan $9 $9
Feb $3 $1 $2 $2 $11

With this, we knew the overall picture of the revenue growth. At this point, we had around 5-6 queries, which were tough to manage as we had to jump between different tabs for different queries. However, the next task of breaking down these numbers by dimensions was even more difficult.

Slicing

To filter a query based on a column, that column has to exist in the source of the query. We tried to push down each of the dimensions as a column in every query which broke the relative calculations based on row positions.

Let's take the Territory filter as an example and for better understanding I will refer to the above 5 tables as A, B, C and so on

  • Table E has to be filtered by Territory. The source of Table E is Table D which doesn't have a Territory column so we can't filter
  • To add Territory to Table D we need to add Territory to its source. The source of Table D was a join between two tables that are summarised versions of Table A & Table B. So Table D = Table A(s) <> Table B(s)
  • Table A & Table B had a territory column, so we included that column in the summarized table as well.
  • Table A(s) & Table B(s) were joined on the Month column, but after including Territory, the Month column wasn't a primary key so we had to include Territory as well in the join condition.
  • Now Table D had a Territory column, so we included it and then Table E had access to it

Conclusion

I realized later, that the join between summary tables could have been avoided if the first two tables were union-ed together. But making that change itself is too much work because changing the dependencies of a query is very difficult.

Regardless, this was done for every dimension we had, which I was very much frustrated by. Not to mention, when dimensions are spread across various data sources then the analysis has to continue on a local database by storing the intermediate/raw data into a local database.

Going through such an analysis after months, without any documentation or any visual hierarchy of queries and making changes that aren't easily undoable has been a nightmare.

So, I am currently working on a major refactor, In an attempt to make such analysis easier. I am also trying to figure out a way to improve the filtering of a composite query in a much easier way.

Published by

Saqib Ansari

on

13 April 2024
1

Share

Add your comment

Success!

Error

Comments

J
JayaPrakash J

· 

May 6, 2024

Working with Data has always fascinated me, and I'm in awe and admiration of your willingness to go back and refactor such complex queries. Wishing you luck in attaining those moments of clarity where the query just clicks.

Discussion

image7f7d4e.png

Paul Mugambi

·

3 days

ago

Beautiful read, and an insight into an individual I respect and have learned a lot from. Am inspired to trust the process and never give up.

image4c43d6.png

Anna Dane

·

5 days

ago

I must say this is a really amazing post, and for some of my friends who provide Best British Assignment Help, I must recommend this post to them.

Add your comment

Comment