THE 11.11 GLOBAL SHOPPING GALA in China – also known as the “Singles’ Day” sales – in many ways represents the single biggest logistical challenge for China’s e-commerce giant in terms of cloud processing, logistical manoeuvring and Internet traffic management.
Over years of handling the massive demand and traffic the festival brings to its sites, Alibaba has developed new technologies to help it cope with the staggering load on its infrastructure and servers.
According to Zhang Liping, principal system software engineer with Alibaba Infrastructure Services, the company’s systems have to cope with ever larger numbers of orders while still ensuring a consistent consumer experience.
“In 2009, at peak times we processed 400 orders per second, but last year we processed 17,000 per second, so you can see how epic our scale is compared to other online events,” he said in a phone interview with Tech Wire Asia prior to the event.
After Singles’ Day came and went, it emerged that the number had increased dramatically. The 2017 iteration of the 11.11 festival saw as many as 325,000 orders processed per second at peak times, marking a 48 percent increase from the previous year. Zhang said that the business had grown increasingly complex, which has increased pressure on the company’s infrastructure in terms of computing power, ordering and payment capabilities and logistical arrangements.
“11.11 is becoming more of an online shopping festival, so now we have more entertainment related to the event,” Zhang said. “The overall IT infrastructure is facing really complicated challenges to support this, both in terms of massive scale and in terms of complex tasks.”
Zhang walked Tech Wire Asia through some of the key innovations that Alibaba has made in order to cope with these new challenges. Alibaba has invested in two key technology areas: cloud computing management and wider adoption of artificial intelligence.
“These are key enablers for us to manage our infrastructure more reliably, improve our automation capabilities, reduce our resource use, and provide better user experience. Those are our key goals,” he said.
The ability to run a finely tuned machine is critical for a single-day event like 11.11, as a disrupted payment or browsing process could spell the difference between Alibaba’s staggering US$25 billion revenue this year and a poorer-than-expected outcome.
User experience sits at the forefront of Alibaba’s thinking and planning for the event, not just in terms of an overall smooth experience but also a personalized one. Zhang described how machine learning algorithms embedded in the platforms personalize customers’ experience, learning from user histories to shape future product recommendations.
“Compared to years ago, when everyone saw the same recommendations, the same products when they open the Tmall or Taobao applications, now it’s more personalized,” he said, adding that users could even see their friends’ recommendations.
“We can see in recent years that we have had more personalized recommendations for customers, which imposes more challenges on our infrastructure, but on the other hand everyone has a different way of interacting with our system.”
Lessons from previous years have shown Alibaba not just how to improve its services, but also what kind of capacity to expect.
“The two main challenges are, first, how do you keep up with the ever increasing demands while still keeping IT costs on par even as logic systems keep getting more complicated?” he said.
According to Zhang, what’s been really important for the company has been the adoption of more automation into their systems, especially in the case of automated end-to-end load testing. Testing is crucial to make sure that a system can handle what takes place during 11.11, all across Alibaba’s multiple platforms. The worst thing that could happen is that the entire system infrastructure collapses due to demand, cutting off Alibaba from its consumers.
The end-to-end tool used to load test the system includes anonymized simulations of traffic data from all systems, including Alibaba’s overall production and network systems. The traffic data is drawn from real user traffic recorded over the years, with identifying details replaced.
“We will replay those simulated traffics at different levels of load. During the testing, the machine intelligence and auto-scaling systems will react to those loads, adjust the resource allocation and make sure the system stabilizes into that structure,” he explained.
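The replay-and-react loop Zhang describes can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `ToyAutoscaler` class, the `replay_at_levels` function, the per-server capacity and the 70 percent utilization target are hypothetical and do not describe Alibaba’s actual tooling. It only shows the general idea of replaying recorded traffic at several load levels and checking that an auto-scaler settles at a stable allocation for each.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyAutoscaler:
    """Hypothetical auto-scaler: provisions enough servers for the offered load."""
    per_server_capacity: int          # requests/second one server handles (assumed)
    target_utilization: float = 0.7   # assumed headroom target
    servers: int = 1

    def react(self, offered_rps: int) -> int:
        # Provision so each server runs at or below the target utilization.
        usable = self.per_server_capacity * self.target_utilization
        self.servers = max(1, math.ceil(offered_rps / usable))
        return self.servers


def replay_at_levels(recorded_peak_rps, levels, scaler):
    """Replay recorded traffic scaled by each factor and record the scaler's response."""
    results = []
    for factor in levels:
        offered = int(recorded_peak_rps * factor)
        servers = scaler.react(offered)
        utilization = offered / (servers * scaler.per_server_capacity)
        results.append((factor, offered, servers, round(utilization, 3)))
    return results


if __name__ == "__main__":
    scaler = ToyAutoscaler(per_server_capacity=1000)
    # Replay at half, full and double the recorded peak.
    for row in replay_at_levels(10_000, [0.5, 1.0, 2.0], scaler):
        print(row)
```

In a real load test the offered load would come from recorded, anonymized request traces rather than a single scaled number, and the check would be on observed latency and error rates rather than a computed utilization.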
Should there be a spike in traffic, the machine algorithms will reroute any spare capacity into processing payments and orders. Zhang said that there are certain processes, such as non-essential batch analytics jobs, that can be quickly downgraded to make space. These reallocation systems are run by Alibaba’s “Daling” technology, which is part of Alibaba Cloud’s data center capabilities.
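The idea of downgrading deferrable work to make room for orders and payments can be sketched as follows. This is a simplified illustration and not a description of how Alibaba’s scheduler works: the `Job` type, the `free_capacity` function, the job names and the largest-first selection rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    critical: bool  # True for order/payment services, False for deferrable batch work


def free_capacity(jobs, cores_needed):
    """Choose deferrable jobs to pause, largest first, until cores_needed is covered.

    Returns (names_of_paused_jobs, cores_freed). Critical jobs are never paused,
    so cores_freed may fall short of cores_needed.
    """
    paused, freed = [], 0
    deferrable = sorted(
        (j for j in jobs if not j.critical),
        key=lambda j: j.cores,
        reverse=True,
    )
    for job in deferrable:
        if freed >= cores_needed:
            break
        paused.append(job.name)
        freed += job.cores
    return paused, freed


if __name__ == "__main__":
    cluster = [
        Job("payments", 40, critical=True),
        Job("nightly-report", 16, critical=False),
        Job("log-compaction", 8, critical=False),
        Job("orders", 32, critical=True),
    ]
    # A traffic spike needs 20 more cores for order and payment processing.
    print(free_capacity(cluster, 20))
```

Pausing the largest deferrable jobs first keeps the number of interrupted jobs small; a production scheduler would also weigh job priority, restart cost and data locality.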
Because the system can run autonomously, it takes much of the burden off engineers who would otherwise have to manage it manually. As a result, testing can take place as many times as needed throughout the year, so Alibaba can make constant, incremental updates. The key factor here is that Alibaba’s infrastructure is becoming increasingly automated, which is reducing the company’s dependence on manpower.
The online system itself, including key tools like auto-scaling and co-location technology, works autonomously or requires only minimal human interaction. Network operations have also integrated fully automated error detection and recovery tools to isolate network traffic issues. He said that work that would previously have taken 10 minutes now gets done in 10 seconds.
“We compared load testing this year to last year, and we estimate that we saved as many as 1,000 man-hours of manual work in preparing the system.”
“These basic efforts make the overall production system more automated, and require significantly less human intervention in the room. They’re more reliable, and this frees up the engineers’ time to improve the machine intelligence.”
The sheer scale of the challenge becomes clear when you realize that the 11.11 festival’s popularity has spread beyond China and spilled onto the global stage. The presence of international stars from the US and UK suggests that the Singles’ Day sale will have to cope with foreign participation as well as China’s own outsized domestic market.
Zhang said that the load testing system also has to account for the complex logic and systems involved in handling logistical arrangements. Alibaba’s infrastructure has to be able to engage with external partners, banks, and customs regulations because of international export demand.
“As we set the customers’ satisfaction and system’s smooth operation as priority, we expect improved stability of the system and a reduced operational burden,” said Zhang.