
We speak with Adrian S.W. Tam, Ph.D., Director of Data Science at New York-based IT firm Synechron, about how prepared today's corporate data strategists are to use AI and gain the maximum benefit from it.

What is the most common problem ill-prepared companies come across when it comes to integrating data into AI?

AI, or Machine Learning in particular, is a combination of data and a model, and either can go wrong without notice. Data without sufficient cleaning and preprocessing may give the wrong signal to the model, and a model built on the wrong assumptions cannot describe the data properly. On the data side, for example, I once worked on an exercise using time series modeling to predict the price movement of stocks directly from the prices. The result was worse than using technical analysis, because the price itself is not really informative; a better choice would be to use moving averages over different time periods. On the model side, I believe every textbook on linear regression has a chapter describing how a model may not fit the data well even if some of the metrics look good. That is a case where the model makes erroneous assumptions about the data. Conversely, if the model makes too many assumptions and becomes very complex, it may remember too much about the past but fail to work well on future data. This is what we call "overfitting".
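As a rough illustration of the feature choice mentioned here, the sketch below derives moving averages over several windows from a price series instead of feeding the raw prices to a model. The data is synthetic and the window lengths are arbitrary, not those used in the actual exercise.

```python
# A minimal sketch: summarize a price series with moving averages of
# different periods rather than using the raw price as the only feature.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic daily "price" series: a random walk around 100
price = pd.Series(100 + rng.normal(0, 1, 500).cumsum(), name="price")

# Moving averages over different windows capture the recent trend,
# which the raw price alone does not convey.
features = pd.DataFrame({
    "ma_5": price.rolling(5).mean(),
    "ma_20": price.rolling(20).mean(),
    "ma_60": price.rolling(60).mean(),
}).dropna()

print(features.tail())
```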

How important is it to have the right people in place when correcting this? Who should these people be and what skills/training do they need?

It is quite common to see people use a placeholder to mark missing data. In a classroom, we may use -1 for a student's test score to mean she was absent, because a test score should be positive. The same strategy seemed fine for financial data, such as interest rates, until a few years back, when we began to see negative interest rates; that is no longer hypothetical. But if you do not know that the European Central Bank introduced negative interest rates in 2014, you will not know the right way to preprocess the data. This is precisely why we need people with domain knowledge when we implement AI solutions. In simple words, AI is merely a mathematical model; to make sense of it, you need some intuition beyond the equations. Therefore, no one should trust a medical diagnosis provided by AI if there is not an engineer with a medical degree behind it. On the other hand, a doctor without an engineering degree may not know how to convert her ideas into equations, so that will not work either.
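To make the placeholder problem concrete, here is a minimal sketch with made-up rates, contrasting a -1 sentinel with explicitly marked missing values.

```python
# A sentinel value like -1 only works while real values can never be negative.
import numpy as np
import pandas as pd

# Interest rates in percent, with -1 used as a "missing" placeholder
rates_with_sentinel = pd.Series([0.25, -1, 0.10, -0.40, -1])

# Once genuinely negative rates exist (e.g. -0.40), the sentinel is ambiguous.
# Marking missing values explicitly avoids the problem entirely.
rates_explicit = pd.Series([0.25, np.nan, 0.10, -0.40, np.nan])

print(rates_with_sentinel.mean())  # distorted by the -1 placeholders
print(rates_explicit.mean())       # NaN values are simply skipped
```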

What are some examples of how rectifying the problem can create real-world results?

I think we have heard such stories for decades. Look at macroeconomic data: we always use some "corrected" version for analysis, whether it is seasonally adjusted or corrected for some policy change. Economists have long known that the analysis will be wrong if we take the data literally without incorporating what happened in the real world. In this pandemic year, I believe a lot of data and models are going to break, but things will be back to normal in the near future. If we are going to build a model next year, I would try not to use this year's data in the same way as the data from the years before.
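As a rough sketch of the kind of "correction" mentioned here, the following decomposes a synthetic monthly series and removes its seasonal component; in practice, statistical agencies publish already-adjusted figures using their own methodology.

```python
# Seasonal adjustment of a synthetic monthly series (illustrative only).
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
# Trend + yearly seasonality + noise
raw = pd.Series(
    np.linspace(100, 130, 72)
    + 5 * np.sin(2 * np.pi * np.arange(72) / 12)
    + rng.normal(0, 1, 72),
    index=idx,
)

result = seasonal_decompose(raw, model="additive", period=12)
seasonally_adjusted = raw - result.seasonal
print(seasonally_adjusted.head())
```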

Whose responsibility is it for this data to be prepared?

My experience dealing with financial data tells me that even big-name data providers ship a lot of typos and flaws. However, they never deny this. They will tell you what data went into their system, and as a user, you can decide how to make sense of it. Therefore, I believe the leaders in the model development role need to take responsibility for devising a data preprocessing pipeline that corrects known errors and warns the model users of possible unknown errors. We have to admit that we can never have perfectly correct data.
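A minimal sketch of such a pipeline might look like the following, assuming a hypothetical price feed; the column names, the known-typo table, and the checks are illustrative, not any provider's actual format.

```python
# Correct errors we know about, and warn about records that merely look
# suspicious rather than silently altering them.
import pandas as pd

KNOWN_TICKER_FIXES = {"APPL": "AAPL"}  # hypothetical known typo in the feed

def preprocess(raw: pd.DataFrame) -> tuple[pd.DataFrame, list[str]]:
    df = raw.copy()
    warnings: list[str] = []

    # Step 1: apply corrections for known errors
    df["ticker"] = df["ticker"].replace(KNOWN_TICKER_FIXES)

    # Step 2: flag values that look wrong, for the model users to judge
    suspicious = df[df["price"] <= 0]
    if not suspicious.empty:
        warnings.append(f"{len(suspicious)} rows with non-positive prices")

    dupes = df.duplicated(subset=["ticker", "date"]).sum()
    if dupes:
        warnings.append(f"{dupes} duplicate ticker/date rows")

    return df, warnings

raw = pd.DataFrame({
    "ticker": ["APPL", "MSFT", "MSFT"],
    "date": ["2021-03-01", "2021-03-01", "2021-03-01"],
    "price": [121.4, -1.0, -1.0],
})
clean, issues = preprocess(raw)
print(issues)
```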

What is the role of the CEO/C-Suite in this?

To get the right people on the team and trust them. Trusting your team should be the easier part, but knowing who the right people are may be hard if you are not an expert in the particular problem. This is especially true if you are working on a very innovative project, because you will never know in advance what the correct approach is. My advice is to talk to the team and encourage them to think about what additional talent would benefit them.

What risks are there from dirty data that hasn’t been fully cleaned for duplicates and repetition?

You can imagine what can happen, but it is always a matter of scale. I believe every system should tolerate such issues at a small scale, but once the dirty data becomes significant, the AI model will give misleading results. The real risk is relying on those results for decisions. Doing a good job of cleaning the data and preventing this is, of course, the best approach, but we can never get it perfectly clean. I would like to see a secondary model that can provide confirmation, using alternative data or an alternative methodology, or even audit the result to give me more confidence in the output. However, this also means extra work has to be done in building the system.
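As a small illustration of scale, the sketch below shows how repeated records skew a simple average, and how a deduplicated figure could serve as the kind of secondary confirmation described here; the data is made up.

```python
# How duplicates distort a result, and a cheap cross-check against it.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 2, 3],   # order 2 was ingested three times
    "amount":   [10.0, 500.0, 500.0, 500.0, 20.0],
})

naive_avg = orders["amount"].mean()                               # inflated by duplicates
deduped_avg = orders.drop_duplicates("order_id")["amount"].mean()

print(f"with duplicates: {naive_avg:.2f}, deduplicated: {deduped_avg:.2f}")
# A secondary check could compare the two figures and raise an alert
# whenever they diverge beyond a tolerance.
```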

What HR and legal ramifications and concerns should be considered?

While it is fashionable to say you are using AI or Machine Learning, I want to quote what my professor told me many years ago: "If we don't understand what human intelligence is, on what grounds can we understand artificial intelligence?" As AI technology has progressed in recent years, I am seeing people build up fantasies about it, just like the Hollywood movies of the '70s and '80s, believing the computer is almighty. The correct attitude is to see AI as a tool that does what humans can do. Humans make mistakes; so can an AI model. Therefore, I would not be surprised to see the results of AI come with a legal disclaimer. The users of those results should also apply their own intuition to spot anything that looks off.

By Adrian Tam

Adrian is a data scientist who deals with difficult problems related to data, machine learning, and AI. He also helps clients with computer networks and system design issues.
