the codist - programmerthink

Given Enough Data, Could You Build an Internet Lie Detector?

Published: 09/04/2012

Publishing untruthful or inaccurate information has been possible since the dawn of writing, but on the internet it's trivial.

Google News and Techmeme, among other sites, have algorithms that relate stories to one another and group them by common topic.

I've always wondered whether, with enough information published over time combined with curated real facts, you could develop an algorithm that could (1) identify a timeline of published facts, (2) compare them to provable information and (3) derive a way to score sources on how truthful they really are. Clearly a huge task, but is it even possible?

Some months ago I read a long comment stream on reddit.com about the fatal shooting of Trayvon Martin by George Zimmerman, where the commenters were debating the relative sizes of the two principals. Various articles and blog posts had postulated a wide variety of combinations of their sizes and thus confused many people. The comments were all over the spectrum of possibilities.

The size of both people was clearly an absolute fact that could be verified easily, but for many reasons the information available on the internet, and also in print and TV news, was not so clear. People, both in blogs and in news reports, deliberately posted misinformation to support their particular bias (in one case a blogger used a picture of a large football player to make the victim look scarier). Combine the deliberate misinformation with copy errors, poor fact checking, random speculation and outright fiction, add the ease with which information is distributed over the internet, and truth becomes a complex concept to identify.

Would it be possible to take a huge trove of related information, possible publish dates, curated facts and a lot of analysis and build a system that could give you some kind of measurement of how likely an article is to be truthful or accurate? If you could figure this out, running the same algorithm on a series of articles found on a single website might give you an idea of how truthful or accurate the site could be. Clearly the output of this system would be fuzzy; articles are always a collection of topics and facts, some of which might be true, some sketchy and some totally wrong. You might not be able to distinguish between errors, guesses and outright lies.
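To make the idea a little more concrete, here is a toy sketch in Python. It assumes the hard parts are already done: every article has somehow been broken down into simple (subject, attribute, value) claims, and someone has curated a table of verified facts. The names `CURATED_FACTS`, `score_article` and `score_site`, along with the placeholder data, are all made up for illustration. This is nowhere near a real lie detector, just the arithmetic of scoring an article and then averaging over a site.

```python
from statistics import mean

# Hypothetical curated facts: (subject, attribute) -> verified value.
# The entries are placeholders, not real data.
CURATED_FACTS = {
    ("person_a", "height_cm"): 180,
    ("person_b", "height_cm"): 170,
}


def score_article(claims, facts=CURATED_FACTS):
    """Return the fraction of checkable claims that match the curated facts.

    `claims` is a list of (subject, attribute, value) tuples. Claims with no
    curated fact to compare against are skipped. Returns None when nothing
    in the article could be checked.
    """
    checkable = [(s, a, v) for s, a, v in claims if (s, a) in facts]
    if not checkable:
        return None
    agreeing = sum(1 for s, a, v in checkable if facts[(s, a)] == v)
    return agreeing / len(checkable)


def score_site(articles, facts=CURATED_FACTS):
    """Average the article scores for one site, ignoring unscorable articles."""
    scores = [score_article(claims, facts) for claims in articles]
    scores = [s for s in scores if s is not None]
    return mean(scores) if scores else None
```

Even this toy shows the fuzziness: a score of 0.5 says half the checkable claims were wrong, but not whether they were errors, guesses or outright lies.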

Even if you could build something like this, would people believe your conclusions? If history tells us anything it's that people can be made to believe anything and discount everything that doesn't jibe with their biases. I've always said that some people can be made to believe that the sun comes up in the west and even if you point them east in the morning they still would refuse to believe you. So sufficiently "proving" that your fancy algorithm can identify mistruths may not be enough to convince the hardened believer.

Joseph Goebbels' masterful propaganda campaign during the Nazi era relied on controlling everything the people heard, which wasn't very hard since the only information they could get was from newspapers, speeches, radio and rumors, all of which the Propaganda Ministry controlled. The internet today, with its ease of publishing, would seem immune to the highly controlled information of the Nazi era. Yet it appears to me that it's actually easier today to spread mistruths, deliberate or accidental, despite the openness. Rather than having no way to obtain alternative sources of information, today we have way too many. How do you tell if the blog post you just read is a lie? Or the news story on TV was based on inaccurate speculation? Can we devote enough time to testing everything before believing it?

The difference between the Nazi limitation of information and today's information lost in the noise of too many sources is actually fairly small. It's always easier to give up and just believe something than to hide a secret BBC radio and risk dying, or today to try to read enough different articles to distinguish fact from fiction. Today's equivalents to Dr. Goebbels are perfectly aware of how to put forth information they would like people to just believe (just watch a TV commercial).

So how do you go about building an algorithm, or more likely a whole system, that can look at a whole series of related articles and determine how likely they are to be truthful?

Truthfully I have no clue.

To build something this complex would first require an enormous trove of articles, a way to compare them to determine similarity, research to find facts that are verifiable, and a method to break articles down into facts; then stir and add eye of newt. Maybe not that last bit, but this isn't a simple idea one programmer can put together. Only someone like Google has the access to enough information to even begin something like this.
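The one ingredient in that list I can at least sketch is comparing articles for similarity. A crude way, and certainly not what Google News or Techmeme actually do, is Jaccard similarity: treat each article as a set of words and divide the number of words they share by the number of distinct words across both. The helper names `tokens` and `jaccard` below are my own, chosen for illustration.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two texts: shared words / all distinct words."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # two empty texts: define similarity as zero
    return len(ta & tb) / len(ta | tb)
```

Two articles scoring above some threshold could be treated as covering the same story. A real system would need far more than word overlap, since two articles can share nearly every word and still disagree on the one fact that matters.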

Maybe someone is already building this. I'd sure like to see it. Imagine being able to get a 'truthiness' score for an article or your favorite website, or identify propaganda before you even read it, or flag misinformation automatically, the way spam gets flagged on Gmail or websites that might harm your computer get identified before you access them.

Even if someone built this, how do you independently verify their analysis? If Google, for example, put red flags on suspected untruthful articles, how would you know that their system wasn't wrong, or even that the system wasn't deliberately creating misinformation itself?

Am I paranoid or is the system actually out to lie to me?

That's the problem with truth in the enormous expanse of information we have today. Do you believe what you read or hear? Can you even believe a system that purports to tell you what to believe? Do we need a system to analyze the system that tells us what the truth is?

In the New Testament Pontius Pilate is quoted as asking "What is Truth?" Even today truth is hard to define, hard to identify, and often hard to take. At no time in human history has so much information appeared in readily consumable form as on the internet today: truth, lies, errors and everything in between. It would seem that there must be a way to take all of that information and distill from the relationships, the language, the history and the intent some way to measure how likely the content is to be truthful.

I don't think this is easy, but who would have imagined the world of today 20 years ago when the first HTML page was created? Before then information took effort to publish, analysis was painfully manual, and both lies and truth were far less common. Today the magnitude of everything published every day probably exceeds the sum total of everything written in the world before I was born.

In a funny TV commercial of recent vintage a character says "they can't put anything on the internet that isn't true". Of course they can and do; can we build some way to discover the untruths?

If I tell you I've already built this, would you believe me? Chalk up one more thing on the internet that isn't true.
