More Fun with Big Data

As Problogger noted, citing Define Media Group, Buzzfeed’s recent picture of traffic referral sources may be slightly skewed. Buzzfeed claims that Facebook generates nearly triple the traffic referrals that Google does. It’s an interesting statistic, but the methodology and data sources are opaque. The problem is compounded when publications such as Recode and The Atlantic propagate that data without verifying it.

Good vs. Evil, Facebook vs. Google, DMG vs. Buzzfeed

But could it even be possible? Facebook has 1.24 billion active users and Google has almost 12 billion monthly searches, so yeah, I guess it’s possible that highly active users post and refer more traffic. Again, I’m dubious: Buzzfeed, a player in the social arena, understandably wants to promote social media, since social media promotes their services.

Reading Recode’s original article about the Buzzfeed phenomenon, it’s hard to tell where the data comes from: “BuzzFeed’s pretty darn big, and its network has some 200 other sites in it, so while we’re not looking at all of the Web here, we’re at least looking at a good-sized chunk of it.” DMG adds more about the data sources, but not much: “According to BuzzFeed their data gathering is done via a tracking code across their network of sites of which ‘represent an audience of more than 300 million people globally.'”


via Define Media Group

Define Media Group, on the other hand, is a marketing firm that provides both search and social media marketing consulting. DMG is very explicit about their methodology and their data sources. Their data suggests almost the opposite relationship between search and social referrals. In my mind, transparent methodology and data sources certainly lend DMG the upper hand here.

Hype and manipulated statistics have been around for a long time, but in the internet age, they tend to go viral and make big waves.

Surfing and Wiping Out

In “The Golden Age of Bullshit,” Bob Hoffman’s notorious speech slamming new-school marketing pundits, he brought up the Pepsi Refresh Project.

A few years ago, to much fanfare, Pepsi dropped its marketing campaign in favor of a complete shift to social media marketing. And, after 2010, corporate social media spending climbed 64% each year for several years running, according to stats I found at Hootsuite.

We’re clearly living in a new age, right? An age of conversation, engagement, and buzz?

According to Hoffman, one estimate has it that the Pepsi Refresh Project cost the company between $50 million and $100 million. The popular soft drink dropped from the second best-selling drink to third and lost 5% market share before slinking back to its former paid advertising practices.

The same research companies that had proclaimed the death of traditional advertising turned around and stated that social media was a “barely negligible source of sales.”

Hoffman cites Forrester Research, which had foretold the beginning of a new age of social media marketing and “the end of the era of mass marketing” just a few years earlier. They later changed their position, and stated that email marketing was nearly forty times as effective as Facebook and Twitter combined.

What does this tell you about big data?

Big Data = Statistics

Big data is just statistics, only more of them. It can be insightful and truthful, or it can be skewed and manipulative. Transparency in both methodology and data sources is vital if we are to make any useful sense of the statistics that are thrown our way. Publications such as The Atlantic and Recode (not to mention anyone else wielding statistics) have a responsibility to do some fact-checking and verification before propagating such big bad data.

If I had to pick one of the two data sets mentioned above, it would be DMG’s, because they are open about their methodology and statistics. With Buzzfeed’s info, we literally just have a picture, without understanding the methodology or numbers behind it, just as with Google Trends.

Google Trends Says Laotians Love Japanese Girls

If you already understand how Google Trends works, you can skip to the “Why Google Trends is Stupid” section.

Everyone else, welcome to my article.

For those who don’t know, Google Trends is a Google tool that allows you to examine the relative “interest” in search terms, search topics, where those terms and topics are most popular, other related searches, and other related data.

This type of data, of course, is extremely valuable for internet marketers engaged in research…or would be if it weren’t so sketchy.

Miley, You Lose

Let’s compare the literary genre of science fiction, the anime Neon Genesis Evangelion, the search term “hunger games,” the anime genre, and the term “miley cyrus.”

Not only does worldwide “interest” in anime consistently top all other searches, it even outperforms all the others combined at least 95% of the time. On the one hand, we never really think of anime as being so popular, but when you include the entire world’s search results, you can see how it compares to other genres and titles that garner so much attention from mainstream media.

When you look at the actual charts, Google’s site correlates popularity spikes with news events for you, so you can see that Miley Cyrus’s biggest spike coincided with her MTV music awards appearance. Examine the “regional interest” section and you’ll see that her biggest fans aren’t in the United States, but in Guyana, the Faroe Islands, Guam, Belize…in fact, the USA is #8 on the list.

Interesting, or confusing?

The Japanese-Loving Laotians

While first looking at some Japan-related search trends, I noticed something else: Laos tops the search term volume for “japan,” followed by Cambodia, Myanmar, Mongolia, and so on.

At the bottom of the Laos-specific search page, you will see related searches.

Now, anyone who has spent any time in southeast Asia doesn’t need to blink twice to know something’s wrong with this picture.

The “100” next to Laos means that it has the highest search volume in the world, and the other 1-to-100 numbers are calculated against that…or so I thought, based on Google’s unclear help bubble language, which says, “Numbers represent search volume relative to the highest point on the map which is always 100. Click on any region/point to see more details on the search volume there.”

The reason for these odd-looking results?

Google normalizes its data (see below), but, even in a post titled “How Trends Data is Normalized,” it doesn’t tell you how Trends data is normalized, it just explains what the normalized results look like.

We aren’t told what the search volume is, so I went over to Google AdWords to look at numbers. There, we find that Laos manages a paltry 1,360 searches per month vs. the USA’s 199,640. When I checked search volumes for “japan girl” plus “japan girls,” I found that Laos came up with 206 searches, and when you add “google japan” to that list, you only come up with 278, vs. the USA’s 43,422 for all three search terms.
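To put those AdWords figures side by side, here is a rough sketch in Python. It takes the two monthly counts quoted above and applies the naive reading of the help bubble, where raw volume is scaled so the biggest region gets 100. The `scale_to_max` function is my own invention for illustration; it is an assumption about what the bubble seems to say, not Google’s actual method.

```python
# Monthly search counts quoted above (from Google AdWords).
monthly_searches = {"Laos": 1_360, "USA": 199_640}


def scale_to_max(volumes):
    """Naive reading of the help bubble: raw volume scaled so the top region is 100.

    This is an assumption for illustration, not Google's documented method.
    """
    top = max(volumes.values())
    return {region: round(100 * v / top) for region, v in volumes.items()}


naive_index = scale_to_max(monthly_searches)
usa_to_laos = monthly_searches["USA"] / monthly_searches["Laos"]

print(naive_index)
print(f"USA searches per Laos search: {usa_to_laos:.1f}")
```

On raw volume alone, the USA would sit at 100 and Laos would barely register, so whatever Trends is doing to put Laos on top, it isn’t this.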

Why Google Trends is Stupid

No numbers and no pre-normalization information means no meaningful picture.

Google really needs to work on its social skills. Its inability to successfully promote its social network Google+ is one glaring example of this, and the opaque Trends interface is another. Behind the scenes I’m sure they’re working on an artificial brain that will predict the future, but we’re all left in the dark with “interest” charts. I suppose that’s normal in this age of Big Data-hoarding.

Data Normalization

Wow, guess those Aussies like GoT. Too bad there are only 22 million of them. How good is their internet infrastructure, I wonder? And what about demographic data?

To normalize a data set means that, according to the map-making software folks at AlignStar, you “transform the data so it may be compared in a meaningful way.” In the example they give on the AlignStar site, we see two maps of unemployment rates. One shows absolute values within a US state, and the other shows normalized values.

Each map paints a different picture.

If, for example, you wanted to compare the counties of a given state to see which have higher unemployment rates, you would measure the absolute number of unemployed against the total workforce, which is what AlignStar did in their second map. This shows a couple of counties that had relatively high unemployment rates. They pointed out:

The maps above portray a very different picture of the same information. Each map could prove useful depending on the point that the map creator was trying to make.  It is important to keep this in mind when creating thematic maps. Sometimes a very small change can result in a very different picture.
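A minimal sketch of that difference, using two made-up counties (the names and numbers are hypothetical, not AlignStar’s data). One ranking uses the absolute count of unemployed people, the other divides that count by the workforce:

```python
# Hypothetical counties: (unemployed people, total workforce). Not AlignStar's data.
counties = {
    "Big County": (50_000, 1_000_000),
    "Small County": (2_000, 10_000),
}


def unemployment_rate(unemployed, workforce):
    """Normalize an absolute count by the size of the workforce."""
    return unemployed / workforce


# Rank by the raw count, then by the normalized rate.
by_count = max(counties, key=lambda name: counties[name][0])
by_rate = max(counties, key=lambda name: unemployment_rate(*counties[name]))

print("Most unemployed people:", by_count)
print("Highest unemployment rate:", by_rate)
```

Same data, two different “winners,” depending on which question you ask.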

What is a Trend?

We don’t know what data goes into these graphs or how it is being processed.

Not normalizing the data would make many Trends rather boring, however, since the USA is the biggest user of Google and has one of the most powerful — if not the most powerful — telecommunications infrastructures in the world. It would probably look like the first map on the AlignStar website.

But what do Trends’s post-processed pictures actually tell us?

I’m no statistician, but there are some pretty obvious questions that come up as to how valid or useful this tool is. In the case of Game of Thrones, we see many first-world countries popping up on the map, so it is more reasonable to assume some relative popularity correlations between countries such as the USA and Australia. But without the raw data we can’t verify anything for ourselves.

Look at Laos and Cambodia. The vast majority of the population doesn’t even have internet access.

So, once you dig a little deeper, you realize that Google’s geographical “normalization” can, at times, be misleading, pointless, and wrong. Guyana’s and the Faroe Islands’ supposedly vast interest in Miley Cyrus, for example, doesn’t tell us how many people in said countries have access to the internet, have smartphones, speak English, use Google, use other search engines, or have ever seen a computer.

In Cambodia, Japan’s second biggest fan, most people live in rural areas with no internet access or electricity, and will likely go their whole lives without ever seeing a computer except that one time that one white guy came to take pictures of an ox with his smartphone.

When you take such a ridiculously small search sample from small countries with small populations that live the same way they have for the past thousand years, Google’s one-size-fits-all normalization clearly tells us absolutely nothing, except maybe that some travelers, Japanese expats, or other rich folks there search for “japan” with more relative frequency than people in other countries do.
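Here is a hedged sketch of how that could happen. It guesses at one plausible scheme, where each region’s count for a term is divided by that region’s total searches before scaling to 100. The `share_index` function and the totals are my own inventions for illustration; only the term counts come from the AdWords figures quoted earlier in this post.

```python
def share_index(term_searches, total_searches):
    """Scale each region's share of its own searches so the top share is 100.

    A guess at one plausible normalization, not Google's documented method.
    """
    shares = {r: term_searches[r] / total_searches[r] for r in term_searches}
    top = max(shares.values())
    return {r: round(100 * s / top) for r, s in shares.items()}


# Term counts are the AdWords figures quoted earlier in this post.
term_searches = {"Laos": 1_360, "USA": 199_640}
# Totals are invented purely for illustration.
total_searches = {"Laos": 100_000, "USA": 1_000_000_000}

index = share_index(term_searches, total_searches)
print(index)
```

With a tiny enough denominator, a handful of searchers can push a small country to the top of the map, which is exactly why the hidden totals matter.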

Maybe, though, that’s just the data you’re looking for.

More Fun with Big Data

It’s just Big Data, and I hate Big Data, mostly because I don’t have any.

As Jaron Lanier has pointed out, and as I will probably write about again, that sacrosanct elixir of the techtopians has got a tenuous-at-best causal relationship between the input and the output. When you hide the quantities and use unknowns to algorithmically define terms like “popularity” or “interest,” without including (in this case) vital geographical and demographic factors such as economic status, internet infrastructure, population of said country, and so forth, then you start getting unverifiable and meaningless statistics. Bad data is even worse than bad science.

As I like to say, “No! No, Big Data, no. Bad Big Data. That’s a bad, bad Big Data.”

Without the ability to see and play around with absolute values ourselves and without knowing how those values are normalized, we are left only with pretty pictures and graphs. As with the Google algorithm, we just have to take their word for it. And with Google’s attitude toward the world’s data, do you really feel like doing that?