Understanding DPDzero
DPDzero is your trusted partner in scaling collections effortlessly for banks, fintechs, NBFCs, and MFIs. We facilitate the recovery of capital across a wide range of loan products and stages, making it easier for financial institutions to achieve their goals.
The Challenge
One of our core functionalities is sending a large number of messages daily. As our business expanded, the volume of messages we sent also increased significantly, leading to the following challenges:
1. Burst requests
Since campaigns were run at different intervals throughout the day, we observed frequent bursts of requests.
2. Rate-limits by the provider(s)
Another significant challenge was dealing with the rate limits imposed by the message provider(s). Because of the frequent bursts, message processing was not uniform and the system breached rate limits very often, delaying overall processing and leading to poor utilisation.
Initially we had tried to build a queuing mechanism within the main service itself, but because resources such as the cache, queue, and DB were shared with other functionalities, we were facing resource load issues. To mitigate this, we would switch from one provider to another as soon as one of them returned a rate-limit error. However, due to pricing differences between vendors, our costs were higher than they needed to be. We had also committed volumes to some vendors in return for better commercials, but we could not meet those volume commitments because of the frequent redirections.
Also, for some types of services such as IVR, one vendor’s pricing was fixed while the other’s was volume based, which led to frequent rate limits with the fixed-price vendor. So we wanted to optimise our cost by utilising vendors’ services effectively without hitting frequent rate limits.
3. Handling massive message volumes
The main challenge was to have our system handle a massive number of requests with minimal delay and reprocess them in case of failure. Most of the work in our application happens asynchronously, and scaling Celery was proving to be expensive and challenging. During rate limits, Celery would simply re-queue the messages to a different vendor; at times this would overwhelm the queue, and we would end up spawning a lot of Celery worker instances just to drain it.
4. Lack of granularity and separation of concerns
Since the functionality was tightly coupled with the main service, it was difficult to have fine-grained control over individual messages.
To address the above-mentioned challenges, we decided to separate the vendor messaging functionality from the main service and build a microservice-based messaging system in Go. Robust concurrency support and scalability were the main reasons for opting for Go.
Initially we implemented a proxy service in Gofiber to handle all the messaging requests. For handling burst requests we made use of AWS SQS (Simple Queue Service) and designed a consumer to poll messages from the queue and process them. To meet our scalability requirements we had to process as many as 10k requests per minute, but SQS returns at most 10 messages per poll. To address this limitation, we implemented workers using goroutines and designed the system to poll SQS and process messages concurrently.
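The post does not include the consumer code, but a minimal sketch of this pattern might look roughly like the following, assuming the AWS SDK for Go v2, a placeholder queue URL, an illustrative worker count of 20, and a hypothetical `dispatch` helper standing in for the actual vendor call:

```go
package main

import (
	"context"
	"log"
	"sync"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	client := sqs.NewFromConfig(cfg)
	queueURL := "https://sqs.ap-south-1.amazonaws.com/000000000000/messages" // placeholder

	// Each worker runs its own receive loop, so effective throughput is
	// roughly (workers x 10 messages) per poll cycle despite the SQS cap.
	const workers = 20
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			poll(context.Background(), client, queueURL, id)
		}(i)
	}
	wg.Wait()
}

func poll(ctx context.Context, client *sqs.Client, queueURL string, id int) {
	for {
		out, err := client.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: 10, // SQS returns at most 10 messages per call
			WaitTimeSeconds:     10, // long polling to avoid empty receives
		})
		if err != nil {
			log.Printf("worker %d: receive failed: %v", id, err)
			continue
		}
		for _, msg := range out.Messages {
			if err := dispatch(ctx, aws.ToString(msg.Body)); err != nil {
				// Leave the message on the queue; the visibility timeout
				// will expose it again for reprocessing.
				log.Printf("worker %d: dispatch failed: %v", id, err)
				continue
			}
			if _, err := client.DeleteMessage(ctx, &sqs.DeleteMessageInput{
				QueueUrl:      aws.String(queueURL),
				ReceiptHandle: msg.ReceiptHandle,
			}); err != nil {
				log.Printf("worker %d: delete failed: %v", id, err)
			}
		}
	}
}

// dispatch stands in for the call that hands the message to a vendor.
func dispatch(ctx context.Context, body string) error { return nil }
```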
To avoid being rate-limited by the providers, we implemented a self-limit mechanism in such a way that the maximum possible number of requests is processed without breaching the rate limit. Initially we implemented the self-limit with a fixed-window algorithm using Redis as the caching service. Even though the self-limit was enforced, clock synchronisation issues with the provider(s) pushed us towards a more reliable algorithm. Instead of keeping the self-limit logic inside the service, we moved it into a Lua script and implemented a sliding-window algorithm with millisecond-level precision, which gave us a robust self-limit.
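The actual script is not shown in the post; below is a rough sketch of how such a check could look, with the Lua body embedded in a Go helper and executed atomically on Redis via go-redis. The key naming, the UUID member, and the `Allow` signature are illustrative assumptions, not the production implementation:

```go
package ratelimit

import (
	"context"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

// slidingWindow is cached on the Redis server and runs atomically,
// so the check and the insert cannot race across workers.
var slidingWindow = redis.NewScript(`
-- KEYS[1] = per-vendor key, ARGV[1] = window in ms, ARGV[2] = limit,
-- ARGV[3] = current time in ms, ARGV[4] = unique member id
local window = tonumber(ARGV[1])
local limit  = tonumber(ARGV[2])
local now    = tonumber(ARGV[3])

-- drop requests that have fallen out of the window
redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, now - window)

if redis.call('ZCARD', KEYS[1]) < limit then
  redis.call('ZADD', KEYS[1], now, ARGV[4])
  redis.call('PEXPIRE', KEYS[1], window)
  return 0
end

-- over the limit: return ms until the oldest entry leaves the window
local oldest = redis.call('ZRANGE', KEYS[1], 0, 0, 'WITHSCORES')
return (tonumber(oldest[2]) + window) - now
`)

// Allow returns 0 when the request can be sent immediately, otherwise the
// duration to wait before checking again.
func Allow(ctx context.Context, rdb *redis.Client, vendor string, limit int, window time.Duration) (time.Duration, error) {
	waitMs, err := slidingWindow.Run(ctx, rdb,
		[]string{"ratelimit:" + vendor},
		window.Milliseconds(), limit, time.Now().UnixMilli(), uuid.NewString(),
	).Int64()
	if err != nil {
		return 0, err
	}
	return time.Duration(waitMs) * time.Millisecond, nil
}
```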
Flow diagram
- Proxy: A new proxy service that handles all incoming messaging requests and pushes them to AWS SQS, along with supporting APIs.
- Producer: The producer is the main service, which calls the proxy to push messaging requests to SQS.
- Consumer
  - The consumer was designed to process individual messages with granular control. Since SQS allows at most 10 messages per poll, we leveraged Go’s robust concurrency and implemented multiple workers to concurrently poll messages from SQS and process them.
  - To prevent rate limits we needed to perform the rate-limit check for each request with millisecond-level precision. Since Lua works great with Redis and guarantees atomicity, we implemented the sliding-window algorithm as a Lua script cached on the Redis server, giving us an effective self-limit and wait time per request (see the sketch after this list).
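To illustrate how these pieces could fit together, here is a hypothetical worker-side flow: before dispatching, the worker asks the limiter for a wait time and sleeps it out rather than bouncing the message to another vendor. The `Message` type, the 100 requests-per-second cap, the import path, and `sendToVendor` are all placeholders rather than the actual implementation:

```go
package consumer

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"

	"example.com/messaging/ratelimit" // hypothetical module path for the limiter sketched above
)

// Message is a simplified stand-in for the payload a worker pulls off SQS.
type Message struct {
	Vendor string
	Body   string
}

// processMessage checks the vendor's sliding-window limit before dispatching.
// If the vendor is saturated, the worker waits out the suggested delay rather
// than immediately failing over to another provider.
func processMessage(ctx context.Context, rdb *redis.Client, msg Message) error {
	for {
		// hypothetical cap of 100 requests per second for this vendor
		wait, err := ratelimit.Allow(ctx, rdb, msg.Vendor, 100, time.Second)
		if err != nil {
			return err
		}
		if wait == 0 {
			return sendToVendor(ctx, msg)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
			// window may have freed up; re-check before sending
		}
	}
}

// sendToVendor is a placeholder for the actual vendor API call.
func sendToVendor(ctx context.Context, msg Message) error { return nil }
```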
With this in place, we now have fine-grained control over each messaging provider, better handling of burst requests, and better utilisation of vendor services.