Dino Pedreschi, Professor of Computer Science at the University of Pisa and co-lead of KDD Lab - Knowledge Discovery and Data Mining Laboratory, talks about big data, machine learning and intelligent systems.
Big data is essentially the digital tracks of our daily activity: the recordings of whatever we do every day using digital technologies, which leave tracks somewhere of what we have done. These may be phone calls made with our smartphones, any kind of interaction on social media and social networks, transactions for paying or withdrawing money with our ATM cards, or driving around in our cars with navigation devices.
Everything that we use that is connected to the internet and is able to create recordings will keep track of some of our activities at a very detailed level. Summing up all our data with the data disseminated by everybody else (we are several billion people on this planet), it is easy to imagine that we leave an incredibly large and detailed footprint of our social activity, in a way that is totally unprecedented in our history. Why does this matter? Why is it important? Well, the point is that, for the first time in history, we have a very detailed portrayal of how society works. We have a new microscope that allows us to see ourselves like bacteria in a culture, doing whatever they do, and we can be studied for our behavior, our choices and our decisions at a scale that is totally unprecedented. The term “Big Data” is due to the fact that these data are really extremely large.
We generate petabytes of data every day, which is a really big number of bits encoding this information. This is extremely new, not only for the size, the volume, but also for the velocity and the variety of this data. Velocity because this data arrives at an incredibly high speed: imagine the videos that we upload every day on YouTube, or any other video platform. Imagine all the messages that we exchange using social media or chat services. This sums up to an incredible amount of information uploaded every day, which also creates new challenges for digital technologies: they need the power and the capacity to manage such a large amount and variety of information. Then the next question is: what to do with this? How can this be useful? Certainly, there are many different purposes for which this data can be used that actually go beyond the original scope for which the data were collected.
Why does this incredible availability of data make a difference for science and society? The point is that the level of observation of complex phenomena that we have been acquiring in the last few years, thanks to big data and to what is now called data science, is really affecting, very profoundly, the way we do science, the way we do business and the way we relate to each other in society. Talking about science: science is no longer only about validating in practice the theories that we have thought up in our offices or in our brains. It's actually more than that. It's finding patterns, finding scientific hypotheses in data, that can fuel the definition of novel theories that are better representations of reality. This is true in the social sciences, in the economic sciences, but also in biology and in medicine today.
Just to make one example, if we are able to find lifestyle patterns that are related to diseases in a surprising way, this has the possibility of creating novel hypotheses for research in medicine and fostering advances that might not have been found otherwise. And similarly for the social and economic sciences. Industries, too, are really affected by the availability of data in many different ways, and certainly the most striking example is the advertising industry, which has found new, totally different ways of managing marketing and advertising, ways that are extremely effective at offering each individual person the services and products that are most likely to be of interest to that person.
This is a kind of dream for marketing people, who have always somehow wanted to have, at the global scale, the same attitude as the neighborhood shop close to our house: understanding completely the preferences of the nearby customers who come into the shop every day. This of course is no longer physically possible in the global market, but big data has actually made it possible in a different way. Because the desires, the interests, the aspirations of any one of us can actually be very well portrayed in data, and therefore exploited to make very well targeted marketing offers in a very personalized way. But this is only one example.
It is probably the most important example so far of the applications of Big Data, but certainly not the only example for industry in general. Now we are really facing Industry 4.0 and the AI revolution for industry. The AI revolution is mostly a revolution based on data, based on the ability of intelligent machines to learn intelligent behavior from examples that are represented precisely in the Big Data that machines or people create every day.
There's a very strong link connecting Artificial Intelligence, Big Data and data science. Think of the amazing development of AI products: the automatic recognition of images, robotic perception, driverless cars, domestic appliances that can assist people in intelligent ways, the automated translation between different natural languages. All these are examples of Artificial Intelligence that is fueled by data.
In all these examples, machines learn to be intelligent by generalizing from many examples, created by people who generated labels for the data to learn from. Just to make one example: how can a machine tell that there is a cat in a certain image? Well, the way to do that is to present many different images that some human has labeled as containing cats or not containing cats. Once these images are sufficiently many and sufficiently representative of the overall situation, machine learning, which is a technique developed to learn from data, is eventually able to generalize a systematic way of telling whether a given image contains a cat or not, without the need for further human intervention.
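The idea of learning from labeled examples can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used by any real image recognition system: each "image" is reduced to two made-up numeric features, the labels "cat" and "not cat" stand in for the human annotations, and a simple nearest-centroid rule (the hypothetical `train` and `predict` functions) stands in for a real machine learning technique.

```python
from collections import defaultdict
from math import dist


def train(examples):
    """Average the feature vectors of each label into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the new feature vector."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical data: two invented features per image, labeled by a human.
labeled_examples = [
    ([0.9, 0.8], "cat"),
    ([0.8, 0.9], "cat"),
    ([0.7, 0.7], "cat"),
    ([0.1, 0.2], "not cat"),
    ([0.2, 0.1], "not cat"),
    ([0.3, 0.3], "not cat"),
]

# The human effort ends with the labels; the rest is automatic.
model = train(labeled_examples)
new_label = predict(model, [0.85, 0.75])
print(new_label)
```

Once `train` has seen the labeled examples, `predict` assigns a label to an unseen feature vector with no further human input, which is the kind of generalization described above. Real systems learn from the raw pixels of very many images and use far richer models.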
The human intervention is in providing the examples from which the machine learns. This is something that occurs in all the striking applications of data science that we see today. This is going to disrupt many industries and also many services that we are already using or will be using continuously, very often without even noticing that this AI learning occurs behind the scenes. And this is also why our data are so important, because they provide the fuel with which the artificially intelligent tools can learn to be intelligent and can produce intelligent artifacts, intelligent machines, and intelligent services.