Why you need to be proactive against data scraping
DATA is the currency of the modern business. For organizations big and small alike, data now plays a big part in ensuring that a business can optimize its operations, correctly target its marketing, properly engage its customers and enable employees to collaborate. With the prevalence of mobile data connections, the Internet of Things, connected workflows and social networks, organizations are now more capable of building actionable intelligence around customer and operations data.
With such access to data, however, there is always the concern about security – in particular about the integrity of corporate and user data. According to a 2015 study by the Ponemon Institute and IBM, businesses incur an average cost of US$154 per record lost or leaked, up 6 percent from the previous year. For a large enterprise, these costs grow with the size of the database and can run into the millions of dollars. For a small business, any data leakage might result in a breach of customer confidence.
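To put the per-record figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the cost scales linearly with the number of records exposed, which is a simplification of an average figure, and the record counts are hypothetical examples rather than numbers from the study.

```python
# Average per-record cost cited above (2015 Ponemon Institute / IBM study).
COST_PER_RECORD_USD = 154


def estimated_breach_cost(records_exposed: int,
                          cost_per_record: float = COST_PER_RECORD_USD) -> float:
    """Linear estimate: exposed records multiplied by the average per-record cost.

    Assumption: real incidents do not scale perfectly linearly; this is
    only an order-of-magnitude illustration.
    """
    if records_exposed < 0:
        raise ValueError("records_exposed must be non-negative")
    return records_exposed * cost_per_record


# Hypothetical database sizes, chosen purely for illustration.
for records in (1_000, 50_000, 1_000_000):
    cost = estimated_breach_cost(records)
    print(f"{records:>9,} records -> US${cost:,.0f}")
```

Even under this simplified model, a database of 50,000 records already implies an exposure of several million dollars.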
According to EMC, China leads the way in the number of businesses that rank ahead of the curve in the data maturity matrix, at 30 percent. However, a vast majority of businesses, at 87 percent, rank in the bottom two categories, which means that most businesses globally are not yet prepared to properly manage and secure their data.
What is data scraping?
Data scraping involves gathering either structured or unstructured data from digital sources – such as the web, databases, or other digital repositories – for the purpose of incorporating it into another database or putting it to other uses. For example, you might have data published on your website, and other parties can easily pull out this data and publish it as their own.
This usually involves bots that crawl websites or databases and parse what they find into their own content. While content scraping might be straightforward, some scrapers are capable of going deeper and scraping content from supposedly private databases through security flaws.
Why it’s becoming a serious concern
The rising popularity of cloud platforms and distributed infrastructure brings about increased difficulty in mitigating risks that can arise from data being transported across both encrypted and open networks. This primarily emanates from the nature of enterprise collaboration today. For example, popular BYOD policies in businesses might result in corporate data leaking through personal devices or personal connections.
Social engineering attacks are another potential vector, which can lead to attackers gaining access to business data through a legitimate user’s credentials. Data can then be scraped piecemeal and reconstituted later on.
The obvious repercussions here involve other parties gaining access to possibly confidential or proprietary content. For example, a competitor might get hold of your customer list or other proprietary data. However, malicious entities can also take your data hostage, sell it to another party, or leak it to the public. Take, for example, the Sony Pictures leak in 2014, which resulted in millions of customer and employee records being leaked, along with email messages that led to a costly PR nightmare for the entertainment company.
According to Juniper Research, cybercrime will cost businesses a whopping US$2.1 trillion by 2019, mostly from attacks orchestrated by organized cybercrime groups. In fact, such activities are becoming more and more profitable for cybercriminals, given the importance that businesses place on data today. Hacker groups can either sell the data or hold it for ransom, using the prospect of leakage to blackmail businesses into paying huge fees, or even simply locking down data on a user’s computer in exchange for payment.
How should I address data scraping?
Perhaps the most straightforward way to protect one’s data would be to harden the infrastructure against unwanted data extraction while still allowing legitimate scrapers to access your content. For example, you can filter scrapers at several levels, which can prevent them from reaching your database. However, you will need to let legitimate bots through, such as Google’s search crawlers.
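One way to let legitimate crawlers through without trusting a spoofable User-Agent header is a reverse DNS lookup followed by a confirming forward lookup. The sketch below is a minimal illustration of that idea, not a complete filter: the hostname suffixes are an assumption that should be checked against the crawler operator's current documentation, the function names are hypothetical, and the forward lookup used here only covers IPv4.

```python
import socket
from typing import Callable, List

# Assumption: hostname suffixes used by Google's crawlers. Verify these
# against the operator's published guidance before relying on them.
TRUSTED_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Return the primary hostname that the IP address resolves to."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Return the IPv4 addresses that the hostname resolves to."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_crawler(ip: str,
                        reverse: Callable[[str], str] = reverse_lookup,
                        forward: Callable[[str], List[str]] = forward_lookup) -> bool:
    """Check a claimed crawler IP with a reverse, then forward, DNS lookup.

    The resolvers are injectable so the logic can be exercised without
    touching the network.
    """
    try:
        hostname = reverse(ip)
        if not hostname.lower().endswith(TRUSTED_CRAWLER_SUFFIXES):
            return False
        # The hostname must map back to the same IP, otherwise the
        # reverse record could simply have been forged.
        return ip in forward(hostname)
    except OSError:
        # Lookup failures are treated as "not verified".
        return False


def classify(ip: str, user_agent: str, **resolvers) -> str:
    """Label a request that claims (or does not claim) to be Googlebot."""
    if "googlebot" not in user_agent.lower():
        return "unclassified"  # hand over to other checks
    return "verified-crawler" if is_verified_crawler(ip, **resolvers) else "spoofed-crawler"
```

A request labelled "spoofed-crawler" is a strong candidate for blocking, while "unclassified" traffic would still need to pass whatever other filters are in place. In practice the lookup results would be cached, since DNS queries on every request are slow.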
This will involve an approach based on analytics – how does your system know whom to block and whom to let through? Some solutions use a challenge-based approach to blocking traffic, while others use heuristics, analyzing bot behavior to determine its intent.
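As a simple illustration of a behavioral heuristic, the sketch below counts requests per client in a sliding time window and suggests a challenge once a threshold is exceeded. The class name, the thresholds and the "allow"/"challenge" labels are all hypothetical; production systems typically combine many more signals than request rate alone.

```python
from collections import defaultdict, deque


class RateHeuristic:
    """Flag clients that send too many requests within a sliding window.

    Assumption: timestamps for a given client arrive in non-decreasing order.
    """

    def __init__(self, max_requests: int = 100, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent request times

    def decide(self, client_id: str, timestamp: float) -> str:
        """Record one request and return 'allow' or 'challenge'."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Discard requests that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while hits and hits[0] <= cutoff:
            hits.popleft()
        return "challenge" if len(hits) > self.max_requests else "allow"


# Hypothetical thresholds: more than 3 requests in 10 seconds looks bot-like.
heuristic = RateHeuristic(max_requests=3, window_seconds=10.0)
for t in (0.0, 1.0, 2.0, 3.0):
    print(t, heuristic.decide("client-a", t))
```

Returning "challenge" rather than blocking outright ties the two approaches together: suspicious clients get a test that a human can pass, which lowers the cost of a false positive.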
Another potential solution is to establish safeguards in your network topology so that bots never reach your database. Edge-based defenses such as content delivery networks, reverse proxies and web application firewalls will also help protect against network overloads or even DDoS attacks, to some extent.
The emerging trend in data leakage prevention is a shift from manual prevention towards automatically mitigating breaches even before they happen. Chad Carr, director of Cyber Threat Detection at PricewaterhouseCoopers, says that this will involve automation: “Integrated intelligent platforms designed to mimic the training, capabilities, and methodologies of security professionals and threat actors alike – capable of fusing end-to-end intelligence (external-to-perimeter-to-end point), all tipping-and-queuing each other, and feeding logic into active control defenses; essentially removing the human from the action loop.”
The key here is to be proactive against data scraping, leakage and loss. If you have any data to protect, you should not be passive and simply react when an incident occurs. Don’t wait for an attack to happen before acting to protect your enterprise assets. Instead, you will need to harden your infrastructure, establish policies for ensuring data integrity, and use intelligence and analytics to your advantage.