Hey fake news, machine learning has your number
A YouGov survey published earlier this year showed that the majority of APAC residents distrust the Internet as a source of news, with concerns about fake news much higher than for radio and TV.
This situation, and the prevalence of fake news in general, is due to several factors:
- It is really easy to set up a website
- Promoting a website is quick and easy
- Selling advertising on the Internet is easy enough that almost any site can earn at least a small income from it
- Social media is a great leveler – any opinion has potentially the same weight as another
- Social media posts are free and trivial to automate
- Automated advertising algorithms don’t differentiate well between real and fake sites
While a majority of headlines have concerned the effect of fake news on US domestic politics, the underlying issues with fake news reach much further.
Products or services can be promoted online in a genuine (and often expensive) manner, but conflicting messages can be created just as easily and, in the egalitarian world of the Internet, carry equal weight.
That means it is very easy for commercial competitors to potentially distribute negative stories and materials about others in their particular market. Due to the nature of the Internet, they can do so anonymously: plausible deniability is all but assured.
A quite different effect is also apparent. Often, enterprises and organizations contract third parties to place their advertising materials online, and these are propagated by various automata, such as programmatic marketing software.
This means that products or services can be advertised quite easily on sites which contain fake news, and the brands are therefore often held to be “guilty by association”: if the site’s “organic” content is fake, the commercial messages are likewise easily held to be false.
"I realized then that truth doesn’t depend on who runs a country; it depends on who runs a country’s newspaper." Great piece by @rjvogt31 on the inside story of the muzzling of the Myanmar Times #Myanmar #pressfreedom https://t.co/k7GA8ZerhC
— Euan Black (@euanblackwrites) December 17, 2017
However, an undergraduate researcher from Finland has recently made headlines with his work on fake news sites, initially announcing his findings on the popular chat/sharing platform Reddit.
In his post, he outlines how machine learning can be leveraged to determine whether websites are fake or genuine.
The method comprises two stages: data collection and machine learning.
The data collection routines gather publicly available information about a site, including its registrant, its popularity according to Alexa rankings, the number of ads on the site, the presence of viruses, the underlying web platform and the number of advertising aggregation services supplying the site.
The machine learning script then classifies each site. Of the sites collected, 80 percent were used to train the model and the remaining 20 percent to validate its results.
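To make the 80/20 idea concrete, here is a minimal sketch in plain Python. It is not the researcher's code: the `Site` fields, the nearest-centroid classifier and the toy data are all assumptions made up for illustration, and the real project used its own features and five different algorithms.

```python
import random
from dataclasses import dataclass


@dataclass
class Site:
    # Hypothetical feature record; field names are illustrative only.
    ad_count: int
    ad_networks: int
    hides_whois: bool
    is_fake: bool


def split_80_20(sites, seed=0):
    """Shuffle a copy of the data and return (train, validation) at 80/20."""
    shuffled = list(sites)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def features(site):
    """Turn a Site into a numeric feature vector."""
    return [float(site.ad_count), float(site.ad_networks), float(site.hides_whois)]


def train_centroids(train):
    """Nearest-centroid model: the mean feature vector of each class."""
    centroids = {}
    for label in (True, False):
        rows = [features(s) for s in train if s.is_fake == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, site):
    """Return True (fake) or False (real), whichever centroid is closer."""
    x = features(site)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, sites):
    """Fraction of sites whose predicted label matches the true label."""
    hits = sum(predict(centroids, s) == s.is_fake for s in sites)
    return hits / len(sites)


# Synthetic toy data, invented for this sketch and deliberately easy to separate.
rng = random.Random(42)
sites = []
for i in range(50):
    fake = i % 2 == 0
    sites.append(Site(
        ad_count=rng.randint(10, 15) if fake else rng.randint(1, 4),
        ad_networks=rng.randint(1, 2) if fake else rng.randint(3, 6),
        hides_whois=fake,
        is_fake=fake,
    ))

train, validation = split_80_20(sites)
model = train_centroids(train)
print(len(train), len(validation))
print(accuracy(model, validation))
```

Because the toy data is trivially separable, the score it prints says nothing about real websites. The point is only the workflow: the model never sees the validation sites during training, so the validation score is a fairer estimate of how it would handle unseen sites.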
Of the five predictive algorithms used, the least effective still correctly predicted whether a website was fake in 88.5 percent of cases. The most successful reached 94.7 percent accuracy. Only one real news site was misclassified, and all the fake news sites were detected.
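For readers unfamiliar with how such figures are tallied, the sketch below counts correct and incorrect predictions and derives accuracy from them. The labels are made up for illustration and are not the study's results. Treating "fake" as the positive class is also an assumption; under the opposite convention the false positive and false negative labels swap.

```python
def confusion_counts(actual, predicted):
    """Count outcomes, treating 'fake' (True) as the positive class."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for is_fake, flagged in zip(actual, predicted):
        if is_fake and flagged:
            counts["tp"] += 1      # fake site correctly flagged
        elif not is_fake and not flagged:
            counts["tn"] += 1      # real site correctly passed
        elif not is_fake and flagged:
            counts["fp"] += 1      # real site wrongly flagged as fake
        else:
            counts["fn"] += 1      # fake site that slipped through
    return counts


def accuracy(counts):
    """Share of all predictions that were correct."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


def recall(counts):
    """Share of fake sites that were actually detected."""
    return counts["tp"] / (counts["tp"] + counts["fn"])


# Made-up labels: three fake sites, five real ones, one real site misflagged.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, True, False, False, True, False, False]

counts = confusion_counts(actual, predicted)
print(counts)            # {'tp': 3, 'tn': 4, 'fp': 1, 'fn': 0}
print(accuracy(counts))  # 0.875
print(recall(counts))    # 1.0
```

In this toy case every fake site is caught (recall of 1.0) while one real site is wrongly flagged, which is the same kind of pattern the article describes.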
While the code base, the tester’s methods and the raw results for each site will appeal to readers of a technical bent, several takeaways are of interest to a broader, commercially minded audience:
- There were more unique advertising service providers for real news sites than fake news sites. This is particularly interesting, because most fake news sites seem to exist purely to sell advertising, rather than to disseminate propaganda
- Fake news sites do not last long; in fact, 50 percent of the sites tested had disappeared only a few months after initial scans
- All the fake sites were based on WordPress(!), which was often poorly configured
- Some fake news sites hid behind the guise of being satirical, and many of them hid their WHOIS details
- Use of TLS/SSL by sites’ web servers was no indication of a genuine or fake site
- Some fake news sites used dubious methods to get Facebook likes
What the research shows is that, because of the egalitarian nature of Internet content, all content can appear equally trustworthy to a human reader at first glance, yet the application of technology can reveal the egregious nature of some sites.
There is also probably a space in the market to exploit the distrust of fake news sources. Organizations could use these types of tools to judge the efficacy of their advertising agencies (checking commercial messages’ final destinations, for example). A good business model for a startup might be a service that detects where fake sites appear, which messages are promoted, and, possibly, who is behind the propagation of the “lies”.