How much Big Data is sufficient?

Data reveals insight. Historically, the Census conducted every 10 years has revealed information about education levels, income levels, the number of school-going children, the number of people employed, the number of dependents, and so on. From all this data, the government can frame policies directed at segments of interest. Similarly, market surveys conducted by consumer research organisations reveal information about product preferences, critical price points and more.

Today, with data being generated from mobile phones, instant messaging systems, emails, websites and IoT devices, the volume is huge. Tech giants are using this data, along with a mix of good old statistics and neural networks running on specialised hardware, to brew the magic of AI.

AI is everywhere, but this blog is not about AI per se. The question is: “Is there a minimum amount of Big Data that can do the job?” Or, is it true that “the bigger the data, the more powerful the AI”?

This is not a trivial question, and it is already “hurting the big guys”. Google has started informing users when they exceed the free 15 GB storage limit. WhatsApp has introduced Disappearing Messages. Facebook has permitted posts with a timer limit. Is there a pattern here?

If one were to read all the emails of a particular person, obtain data about every transaction she makes, read all her messages on various messaging platforms, every picture she posts, every place she visits, dines or shops, note every message that she likes, does not like, or ignores:

  1. How long would it take to create a profile of the person covering her place of residence, tastes, work, socio-economic status, preferences in life, network, and political and social views?
  2. As time passes and more and more data is created, is there a point where the algorithm learns incrementally less each time another big chunk of data comes along? Is there a point of diminishing returns?

I believe the answer is a BIG YES.

The early bits of information about any person reveal a huge amount of information (read: insight). As more and more pieces of data come in, a pattern starts developing, where fresh data does not give any further insight beyond what was already known.
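As a rough illustration of this flattening curve, here is a minimal sketch on synthetic data using scikit-learn; the dataset, the model and the sample sizes are illustrative assumptions, not anything specific to the platforms mentioned above:

```python
# Minimal sketch of diminishing returns: how much extra test accuracy
# each additional chunk of training data buys. Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

for n in (100, 500, 2_000, 10_000, 40_000):
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{n:>6} samples -> test accuracy {acc:.3f}")

# Typically the jump from 100 to 500 samples is large, while the jump
# from 10,000 to 40,000 is barely visible: the curve flattens.
```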

Most readers would agree with me up to this point. The next big question is “How much is enough?”

This blog is only about determining whether a size limit exists.

This also suggests that an efficient profile-building algorithm would be able to “know a target” with very small amounts of data, processing each piece as it is generated, refining the profile continuously with every input, and finding that beyond a certain point fresh data does not change the model coefficients significantly. In other words, at that point the algorithm knows practically all that is required to be known. A rough sketch of that stopping rule follows.
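The sketch below is again on synthetic data, using scikit-learn's SGDClassifier for incremental updates; the batch size and the stabilisation threshold are arbitrary illustrative choices, not values anyone has validated:

```python
# Minimal sketch: stop ingesting data once fresh batches no longer move
# the model's coefficients by more than a small relative amount.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

model = SGDClassifier(random_state=0)
classes = np.unique(y)
batch_size = 500          # illustrative chunk of "fresh data"
tolerance = 1e-3          # hypothetical "nothing new learned" threshold
prev_coef = None

for start in range(0, len(X), batch_size):
    Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
    model.partial_fit(Xb, yb, classes=classes)  # incremental update

    if prev_coef is not None:
        shift = np.linalg.norm(model.coef_ - prev_coef) / np.linalg.norm(prev_coef)
        if shift < tolerance:
            print(f"Coefficients stabilised after ~{start + batch_size} samples")
            break
    prev_coef = model.coef_.copy()
```

The point at which the coefficient shift drops below the threshold is, in effect, the “how much is enough” mark for this particular model.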

While the focus here has been on an individual subject, aggregating such subjects will similarly yield information about the residents of a town, a city, a state or a country; the principle is the same.

Is there a clue in this blog, as to how the next generation of AI systems will be built? I would like to hear your views.
