Debugging Network Connectivity Issues
We often forget that the Internet is a utility that runs on cables, pipes, power and a lot more, and that it can go down sometimes too.
I woke up to an email alert on my phone. It was from Prakash: "Most Urgent - ERPNext not working!" Okay - panic mode activated. I booted my laptop and opened two tabs in Chrome, one for erpnext.com and one for google.com. Google was working, ERPNext was not.
Finding The Problem
I tried logging in to our server with no luck. First thought - the server might've crashed. But I also couldn't log in to our backup server. Further, I was unable to access the website of Hetzner (our server host). So, the issue might be with Hetzner's network. To confirm this, I used my phone's internet connection to access erpnext.com and hetzner.de - it worked! So the problem was not with Hetzner after all, but with my Internet Service Provider (ISP) - MTNL. Nothing to worry about, customers weren't affected. Or so I thought.
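The manual check above (try a few sites from one connection, then repeat from another) can be sketched as a small script. This is only an illustration, not what I ran that morning: the function names `can_connect` and `summarize` are made up for this example, and port 443 is an assumption that each site serves HTTPS.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 443 is an assumption (HTTPS); pass a different port if needed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def summarize(hosts, check=can_connect):
    """Map each host to True (reachable) or False (unreachable)."""
    return {host: check(host) for host in hosts}


# Example usage, to be run once per connection (ISP line, then phone):
#   for host, ok in summarize(["google.com", "erpnext.com", "hetzner.de"]).items():
#       print(host, "reachable" if ok else "unreachable")
```

If a host is unreachable from one connection but reachable from another, the server itself is probably fine and the fault lies somewhere on the path between you and it.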
I logged into our ERPNext account and found that three of our customers were unable to access their accounts. They were from Mumbai, Delhi and Sri Lanka. The problem was not just with MTNL. Most probably, an under-sea cable or a router was malfunctioning.
A few days back, I learnt a Unix command called "traceroute". When you type erpnext.com in your browser's address bar and click on Go, your request to see the home page of erpnext.com is sent in the form of a packet with the IP address of erpnext.com stamped on it (220.127.116.11). This packet is passed on from one router to another. Routers are electronic networking devices that act as couriers. They try to deliver this packet to the address mentioned on it. "traceroute" helps you detect which couriers are handling your packet and maps the path taken by the packet.
Here's how it looks when erpnext.com is working:
Anands-MacBook-Pro:~ anandpdoshi$ traceroute erpnext.com
traceroute to erpnext.com (18.104.22.168), 64 hops max, 52 byte packets
 1  192.168.2.1 (192.168.2.1)  13.686 ms  2.120 ms  1.169 ms
 2  192.168.1.1 (192.168.1.1)  2.706 ms  2.440 ms  2.435 ms
 3  triband-mum-22.214.171.124.mtnl.net.in (126.96.36.199)  36.411 ms  42.400 ms  36.267 ms
 4  static-mum-188.8.131.52.mtnl.net.in (184.108.40.206)  37.901 ms  36.800 ms  57.553 ms
 5  aes-static-220.127.116.11.airtel.in (18.104.22.168)  186.944 ms  184.137 ms  183.051 ms
 6  22.214.171.124 (126.96.36.199)  186.071 ms  184.546 ms  183.587 ms
 7  linx-1.init7.net (188.8.131.52)  183.628 ms  184.900 ms  184.228 ms
 8  r1nue1.core.init7.net (184.108.40.206)  198.058 ms  197.872 ms  198.276 ms
 9  gw-hetzner.init7.net (220.127.116.11)  200.627 ms  201.439 ms  199.985 ms
10  hos-bb2.juniper1.rz14.hetzner.de (18.104.22.168)  202.934 ms  326.391 ms  201.765 ms
11  hos-tr2.ex3k1.rz14.hetzner.de (22.214.171.124)  205.560 ms  306.970 ms  307.137 ms
12  static.126.96.36.199.clients.your-server.de (188.8.131.52)  201.694 ms  202.293 ms  309.299 ms
Here's how it looked at 9:00 AM today, when using MTNL's internet connection:
Anands-MacBook-Pro:~ anandpdoshi$ traceroute erpnext.com
traceroute to erpnext.com (184.108.40.206), 64 hops max, 52 byte packets
 1  192.168.2.1 (192.168.2.1)  1.463 ms  1.350 ms  1.408 ms
 2  192.168.1.1 (192.168.1.1)  2.586 ms  2.302 ms  2.282 ms
 3  triband-mum-220.127.116.11.mtnl.net.in (18.104.22.168)  37.350 ms  45.194 ms  37.013 ms
 4  static-mum-22.214.171.124.mtnl.net.in (126.96.36.199)  36.141 ms  36.944 ms  37.395 ms
 5  aes-static-188.8.131.52.airtel.in (184.108.40.206)  37.437 ms  41.538 ms  37.605 ms
 6  220.127.116.11 (18.104.22.168)  63.176 ms  63.945 ms  62.497 ms
 7  22.214.171.124 (126.96.36.199)  64.798 ms  63.947 ms  63.756 ms
 8  * * *
 9  188.8.131.52 (184.108.40.206)  66.929 ms  62.303 ms  220.127.116.11 (18.104.22.168)  63.521 ms
10  * * *
11  22.214.171.124 (126.96.36.199)  67.956 ms  188.8.131.52 (184.108.40.206)  65.184 ms  220.127.116.11 (18.104.22.168)  63.194 ms
12  * * *
The packet was unable to progress after reaching the router with IP address 22.214.171.124. "whois", a command that reveals the ownership of an IP address, showed that it belonged to Bharti Airtel Limited. So a router belonging to Airtel was unable to send the packet further along its path.
I am not a customer of Airtel. So I dialed MTNL's customer support. After exchanging a few phone calls and emails, they helped me escalate the issue to the person responsible for their network. However, before I could send him the traceroute, the issue was resolved. Airtel had re-routed their traffic via a router with IP address 126.96.36.199.
ERPNext was inaccessible to some users in the Indian Sub-continent for about two hours, owing to a malfunctioning router belonging to an ISP in India. I learnt that I could help by registering a complaint with the right people. Also, MTNL has a very co-operative staff. I appreciate them for their effort and time.
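Spotting the last router that answered can also be done with a few lines of Python. This is a minimal sketch, assuming the output has one hop per line, as traceroute prints it in a terminal. The helper names `parse_hops` and `last_responding_hop` are made up for this example, and the sample uses reserved documentation addresses, not the real ones from that morning.

```python
import re

# A hop line starts with the hop number; the header line does not.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")
# traceroute prints each responder's address in parentheses.
IP_IN_PARENS = re.compile(r"\((\d{1,3}(?:\.\d{1,3}){3})\)")


def parse_hops(output):
    """Return a list of (hop_number, [ip, ...]) from traceroute text output."""
    hops = []
    for line in output.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue  # header or blank line
        ips = IP_IN_PARENS.findall(match.group(2))
        hops.append((int(match.group(1)), ips))  # "* * *" gives an empty list
    return hops


def last_responding_hop(hops):
    """Return the last (hop_number, ips) that got any reply, or None."""
    for hop_number, ips in reversed(hops):
        if ips:
            return hop_number, ips
    return None


# Hypothetical sample output (documentation address ranges).
SAMPLE = """\
traceroute to example.com (203.0.113.10), 64 hops max, 52 byte packets
 1  192.168.2.1 (192.168.2.1)  1.463 ms  1.350 ms  1.408 ms
 2  198.51.100.1 (198.51.100.1)  37.350 ms  45.194 ms  37.013 ms
 3  * * *
 4  198.51.100.7 (198.51.100.7)  66.929 ms  62.303 ms
 5  * * *
"""

print(last_responding_hop(parse_hops(SAMPLE)))
```

A row of stars does not always mean a broken router, since some routers simply do not reply to these probes. The last responding hop is a starting point for "whois", not proof of where the fault is.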
Anand is the Chief Technology Officer at ERPNext. He reads fiction, dabbles in photography and is always on the watch for the best ToDo app.
@Anand - You are right. Our setup isn't big enough for such complications. But it is a good sugge
I would recommend setting up nagios from a local office system. Checkout remote nrpe plugins. Nag
It was indeed a problem of Airtel and all those ISPs using the undersea cable o
Good blog. Why not use a network monitoring tool such as Nagios or Zabbix to notify you of the connectivity