Building Frappe Wiki
An open source wiki application that works at scale
ERPNext has lots of documentation that is used by internal and external users as a manual for implementing ERPNext. One consultant at Frappe even went ahead to say that reading ERPNext documentation feels like working towards an MBA degree. ERPNext documentation was written in markdown files, served via Frappe, and stored in a GitHub Repository.
Turns out our internal consultants and external business users considered the contribution flow unwelcoming. It involved setting up a local Frappe instance (Bench + Frappe + ERPNext Documentation) and then raising a PR via GitHub (and waiting for review and deploy). We started out by trying to design a solution that would provide a simpler way to contribute to and deploy our documentation site. Since there was no good tool that could do all this, we built Frappe Wiki. It powers the current documentation site and is open source.
I will try to cover the following topics related to the Frappe Wiki app in this blog:
- Architecting the Wiki
- Data Structures/ Schema
- Migration Tool
- Editing Wiki
- Optimizations in the Frappe Framework
- Future Scope
1. Architecting the Wiki
While designing the solution, two approaches were proposed. One was to let the contributor edit the markdown of the page in their browser and then on submission, the app would raise a PR on the contributor’s behalf. Another solution was to solve the problem via a wiki-based approach. In this case, docs do not stay on GitHub; they are stored and served right from the database.
In one of our quarterly review meetings, the team settled on the wiki-based approach for the following reasons:
- The approach using GitHub would still require the contributor to have a sound understanding of Git - which means half the problem stays
- Integrations break more often
You may have a look at the POC of the GitHub-based solution, which allowed editing any content file in the www/ directory, in this repository.
2. Data Structures/ Schema
The number of tables (DocTypes) required for the entire application is just 5. Right now I would like to discuss only the tables required for the render mode, ignoring the tables required for the edit flow and for displaying revisions.
ERPNext documentation was built over time and was not uniform in terms of layout. By layout, I mean the way in which the topics are arranged into folders. For example, the root folder included modules such as Accounting, Manufacturing, Buying, etc. Each of these modules had a folder inside the root folder. Inside a module folder there can be topics, or there can be another set of folders. There was no bound on the number of levels that might be required (currently there are 3 levels, but at one point we had thought of arranging topics into 5 levels). This structure (similar to a tree) had to be represented in the database and also had to be present in the sidebar on each page. This relationship can be represented in multiple ways. I tried the Nested Set approach and the Adjacency List approach and chose the latter.
Nested Set Approach
Nested Set is a way to store trees. It involves giving a pair of numbers (lft, rgt) to each node in the tree. A depth-first traversal of the tree hands out the numbers in ascending order, so each node's range encloses the ranges of its descendants. On each insertion or deletion, the numbers need to be reassigned. Nested Set is used to implement the tree DocType in Frappe.
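To make the numbering concrete, here is a minimal sketch of how (lft, rgt) pairs could be assigned with one depth-first walk. The function names and the sample tree are made up for illustration; this is not the code Frappe uses for its tree DocTypes.

```python
def assign_nested_set(tree, root):
    """Assign (lft, rgt) to every node with one depth-first walk.

    `tree` maps a node name to the ordered list of its children.
    """
    bounds = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        lft = counter  # number handed out on the way down
        for child in tree.get(node, []):
            visit(child)
        counter += 1  # number handed out on the way back up
        bounds[node] = (lft, counter)

    visit(root)
    return bounds


def descendants(bounds, node):
    """Everything whose range sits strictly inside the node's range."""
    lft, rgt = bounds[node]
    inside = [name for name, (l, r) in bounds.items() if lft < l and r < rgt]
    return sorted(inside, key=lambda name: bounds[name][0])


# Hypothetical sample tree, loosely modelled on the docs layout.
tree = {
    "Docs": ["Accounting", "Buying"],
    "Accounting": ["Journal Entry", "Payment Entry"],
}
bounds = assign_nested_set(tree, "Docs")
print(bounds["Accounting"], descendants(bounds, "Accounting"))
```

Fetching a subtree is a simple range check, but note that adding one node under "Accounting" would shift the numbers of "Buying" and "Docs" as well.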
The relationship between Wiki Page and Wiki Sidebar would look like this. We need the mapping table since it is possible to add a single Wiki Page to multiple sidebars.
Nested Set would require an entire blog to explain, so I will not go into the details and will jump straight to the reasons why I decided not to go with it. The article on Wikipedia for Nested Set is pretty informative.
- Wiki Pages do not get covered in Nested Set mappings and would require a separate mapping table (Many to Many relationships) to map them to Wiki Sidebars.
- When the sidebars are reordered, the rgt values of all the affected nodes have to be updated. The method to achieve this would become very complex, and updating would be very expensive.
- The logic to build the nested sidebar from the Nested Set is not simple; the data structure is more suited to ordered (depth-first) traversal.
- Handling two types of mapping - Between Sidebars and Between Sidebar and Page, would mean handling two types of parent-child relationships which is difficult to understand (Simple is better than complex - Zen of Python)
- It is not possible to reuse sidebars in this method
Adjacency List Approach
Another approach is to store the sidebar references with wiki pages. The type parameter decides whether the name points to a Page or a Sidebar. The method is pretty straightforward to understand: just links and no complex math (although I do like math, sadly this turned out to be just basic foreign keys). A single query for a sidebar brings in all the sidebars and the pages. Updating the order of items in a sidebar just requires updating the idx column, and moving an item between sidebars means updating its parent value.
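As an illustration of how little logic the adjacency list needs, here is a rough sketch that turns flat rows into a nested sidebar. The row layout (parent, type, item, idx) and the sample names are assumptions made for this example, not the actual Frappe Wiki schema.

```python
from collections import defaultdict

# Hypothetical rows of the adjacency table: (parent sidebar, type, item, idx).
ROWS = [
    ("docs", "Wiki Sidebar", "accounting", 1),
    ("docs", "Wiki Page", "introduction", 0),
    ("accounting", "Wiki Page", "payment-entry", 1),
    ("accounting", "Wiki Page", "journal-entry", 0),
]


def build_sidebar(rows, root):
    """Group rows by parent, then expand nested sidebars recursively."""
    children = defaultdict(list)
    for parent, item_type, item, idx in rows:
        children[parent].append((idx, item_type, item))

    def expand(sidebar):
        items = []
        # Sorting the tuples orders the entries by idx first.
        for _idx, item_type, item in sorted(children[sidebar]):
            if item_type == "Wiki Sidebar":
                items.append({"group": item, "items": expand(item)})
            else:
                items.append({"page": item})
        return items

    return expand(root)


sidebar = build_sidebar(ROWS, "docs")
print(sidebar)
```

Reordering is just a change of idx on the affected rows, and moving an item is a change of its parent; no other rows need to be touched.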
Patches and Revisions
When a user enters edit mode, a document of type Wiki Page Patch is created internally. It tracks the changes the user requested. On acceptance (merge), the changes are applied to the database (pages and sidebars). The revisions are stored with attribution to the patch creator.
3. Migration Tool
ERPNext Docs had to be migrated to the Frappe Wiki application, and this required writing a patch. Although the folder structures were not uniform throughout, the separation between media and assets was done nicely. The documentation was available in multiple languages and versions. So I started by writing a patch that would help me move the docs from the file system to the newly created DocTypes. Since there were unknowns, and I did not want to write another patch when we wanted to move some other docs, I decided to make it configurable. A single DocType called Migrate to Wiki was added that takes parameters such as the root directories of media and pages, and also the constants needed to create all the required documents.
Broadly, it takes the following steps:
- Go to the assets/media directory and create a File doc for each image. Store the mapping between the filesystem path and the newly created file path.
- Walk through the page directories and create a Wiki Sidebar for each folder and a Wiki Page for each file. The link between a folder and the files and folders inside it needs to be recorded in the adjacency table.
- The file content also needs some processing:
  - Remove any stray Jinja variables
  - Replace the image links using the map from step 1
  - Replace the hyperlinks with the new ones if they are getting changed
- If a contents or introduction page comes up, use the index file to create a TOC page
- Handle exceptions (I do not mean the Python ones ;p)
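The content-processing and directory-walking steps above can be sketched roughly as follows. This is a simplified illustration and not the actual Migrate to Wiki code: the regular expressions, the function names and the ".md" extension are all assumptions, and creating the File, Wiki Page and Wiki Sidebar documents is left out.

```python
import os
import re

JINJA_VARIABLE = re.compile(r"\{\{.*?\}\}")
MARKDOWN_IMAGE = re.compile(r"(!\[[^\]]*\]\()([^)\s]+)(\))")


def clean_content(markdown, image_map):
    """Drop stray Jinja variables and point image links at their new paths."""
    markdown = JINJA_VARIABLE.sub("", markdown)

    def swap(match):
        old_path = match.group(2)
        # Leave the link untouched when the image was not migrated.
        return match.group(1) + image_map.get(old_path, old_path) + match.group(3)

    return MARKDOWN_IMAGE.sub(swap, markdown)


def collect_adjacency(root):
    """Return (parent folder, type, name) rows for folders and markdown files."""
    rows = []
    for folder, subfolders, files in os.walk(root):
        parent = os.path.relpath(folder, root)  # "." for the root folder
        for name in sorted(subfolders):
            rows.append((parent, "Wiki Sidebar", name))
        for name in sorted(files):
            if name.endswith(".md"):
                rows.append((parent, "Wiki Page", name[:-3]))
    return rows


image_map = {"/assets/a.png": "/files/a.png"}  # old path -> new File path
print(clean_content("{{ docs_base_url }}![shot](/assets/a.png) text", image_map))
```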
4. Editing Wiki
Editing is one of the core features of a wiki. We added an edit button at the bottom of the page; clicking it opens a URL of this format ( <domainname>/<wikipage>/edit ). The markdown code of the page is displayed in the code editor, and the second tab displays the live preview. The user can also view the diff in the third tab. Users who revisit a contribution they made earlier can talk with the reviewer using the comment system.
A Rich Text Editor (with drag-and-drop images) is also supported, which lowers the barrier to contribute even further.
The sidebar can be edited by dragging and dropping the components and new components can be added using the controls.
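For the diff tab mentioned above, a unified diff between the current markdown and the proposed markdown is one simple way to show what changed. The snippet below is only a sketch using Python's standard difflib; it makes no claim about how Frappe Wiki actually computes its diff, and the function name is made up.

```python
import difflib


def page_diff(original, edited):
    """Return a unified diff between two versions of a page's markdown."""
    lines = difflib.unified_diff(
        original.splitlines(),
        edited.splitlines(),
        fromfile="current",
        tofile="proposed",
        lineterm="",
    )
    return "\n".join(lines)


diff = page_diff("# Title\nOld line\n", "# Title\nNew line\n")
print(diff)
```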
So I went ahead, created a new site on Frappe Cloud, and migrated all the ERPNext docs to the new Frappe Wiki app. Sadly, each page load took around 10 seconds, which is unbearably slow for a static documentation site. In our older site, we had full markup caching, which would cache each page according to its route. This would not work in our case for the following reasons:
- We now have a sidebar on each page. A reordering of the sidebar would involve invalidating all pages linked to the sidebar. In ERPNext Docs that would involve invalidating the cache of all pages of the language, since there is only one sidebar per language in the cache.
- The sidebar markup is very big, storing it with each page is not acceptable!
The solution was to separate the sidebar and webpage caching. The standard Frappe caching does not support this, so I had to write a custom renderer (really grateful for the Website Rendering Refactor) which gets the page and the sidebar from the cache separately and then uses a simple regex to merge the result. The page is cached on each invocation, and so is the sidebar. The page cache is invalidated on changes to the page, whereas the sidebar cache is invalidated on changes to the sidebar. This method is similar to fragment caching.
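The idea can be sketched in a few lines. This is not the actual renderer: plain dictionaries stand in for the real cache, and the placeholder comment, function names and demo render functions are assumptions made for illustration.

```python
import re

page_cache = {}     # route -> rendered page body
sidebar_cache = {}  # sidebar name -> rendered sidebar markup

SIDEBAR_SLOT = re.compile(r"<!--\s*sidebar\s*-->")  # hypothetical placeholder


def render(route, sidebar, render_page, render_sidebar):
    """Fetch both fragments from their own caches and merge them."""
    if route not in page_cache:
        page_cache[route] = render_page(route)
    if sidebar not in sidebar_cache:
        sidebar_cache[sidebar] = render_sidebar(sidebar)
    # A function replacement keeps backslashes in the markup literal.
    return SIDEBAR_SLOT.sub(lambda _m: sidebar_cache[sidebar], page_cache[route], count=1)


def on_page_update(route):
    page_cache.pop(route, None)  # only this page is re-rendered


def on_sidebar_update(sidebar):
    sidebar_cache.pop(sidebar, None)  # cached pages stay untouched


# Demo render functions that count how often they are called.
calls = {"page": 0, "sidebar": 0}


def demo_page(route):
    calls["page"] += 1
    return f"<main>{route}</main><!-- sidebar -->"


def demo_sidebar(name):
    calls["sidebar"] += 1
    return f"<nav>{name} v{calls['sidebar']}</nav>"


first = render("/docs/a", "docs", demo_page, demo_sidebar)
render("/docs/b", "docs", demo_page, demo_sidebar)
on_sidebar_update("docs")
after_reorder = render("/docs/a", "docs", demo_page, demo_sidebar)
```

After the sidebar is invalidated, only the sidebar is rendered again; both pages are still served from the page cache.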
6. Optimizations in Frappe Framework
Caching was inevitable, but it did not solve the problem fully. Page loads were still slow even when the entire page was loaded from the cache. I firmly believe optimizations should be done only with proper data, so I went ahead and profiled the entire cycle. After going through the cumulative time per function, it was clear that most of the time was being spent inside the framework code, specifically this function.
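For readers who want to try the same kind of measurement, here is a small sketch that uses Python's built-in cProfile and pstats to list functions by cumulative time. The helper name and the stand-in workload are made up; the profiling setup actually used for this investigation is not described here.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under the profiler and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so expensive call chains float to the top.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def fake_render():
    # Stand-in for a page render; replace with the call you want to measure.
    return sum(i * i for i in range(10000))


result, report = profile_call(fake_render)
print(report)
```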
The init method of BeautifulSoup was the real culprit. I had no clue what this function did. On checking the original PR, I found that it facilitates HTTP/2 Server Push: essentially flagging all assets to tell the server to push them before the browser asks for them. More on it here. So this isn't stray code, and I had to optimize it. After going through BeautifulSoup, I found that I could reduce the processing by telling Beautiful Soup to build the tree only for specific tags. SoupStrainer became the simple utility that reduced load times by a huge factor.
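A minimal sketch of the SoupStrainer technique is shown below. It is not the code from the PR: the function name, the choice of tags (script, link, img) and the use of the html.parser backend are assumptions for illustration. It requires the third-party beautifulsoup4 package.

```python
from bs4 import BeautifulSoup, SoupStrainer

ASSET_TAGS = ["script", "link", "img"]  # assumed set of tags worth preloading


def find_assets(html):
    """Collect asset URLs while building a tree only for the asset tags."""
    only_assets = SoupStrainer(ASSET_TAGS)
    # parse_only makes Beautiful Soup skip everything the strainer rejects.
    soup = BeautifulSoup(html, "html.parser", parse_only=only_assets)
    urls = []
    for tag in soup.find_all(ASSET_TAGS):
        url = tag.get("src") or tag.get("href")
        if url:
            urls.append(url)
    return urls


html = (
    '<html><head><link rel="stylesheet" href="/assets/wiki.css">'
    '<script src="/assets/wiki.js"></script></head>'
    '<body><p>A very long page body</p><img src="/files/a.png"></body></html>'
)
print(find_assets(html))
```

The paragraphs, headings and other body markup never become tree nodes, which is where the saving comes from on large pages.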
I raised a separate PR for it.
Another problem I faced was that after a small JSON change, my migration took hours to complete. I was clueless. On profiling, I found out that we render each page (rendering means: call get_context, add the context to the markup using Jinja, and then index it for search). This is an expensive operation. I had close to 2000 pages in my Wiki Page DocType. My pages were all in markdown and required no context to index. The simple solution was to neither get the context nor load Jinja (loading Jinja is expensive too), and just convert my markdown to HTML and give it to the indexer. This reduced the migration time by a factor of 2.5 and solved the migration problem. I raised a separate PR for this too.
7. Future Scope
The Frappe Wiki app turned out to be generic enough to be used for any content-heavy site, but it still has some rough edges:
- Editing UX can be improved.
- The number of SQL queries to build the sidebar can be reduced.
Since the project is open source and hosted on GitHub (frappe/wiki), anyone can improve it by raising a PR.
This project was full of learnings for me. One of the major ones was to always profile before optimizing; bottlenecks are usually in places where you least expect them. Frappe Wiki has its own product page and docs that are written on Frappe Wiki itself!