It’s time to build a new programming language for big data and AI

Data and AI are becoming important assets for many companies. Data represents a company’s accumulated experience in a specific field and serves as its moat, while AI capability reflects how deeply the company puts its data assets to use. The two are inseparable. Data without AI is like raw material that cannot be processed into anything of value. AI without data is like mastering the art of dragon slaying when there is no dragon to slay. Only by treating the two as a whole can a company get the most value from them.

Problems in the fusion of big data and AI

Data sits upstream of AI, and the two fields emerged at different times. As a result, they have evolved into two relatively independent systems, which causes two major problems:

Big data and AI have different technology stacks

At the language level, big data is dominated by Java / Scala / SQL: SQL is the interactive language, and Java / Scala are the languages in which big data systems are built. AI is dominated by Python / C++: Python is the interactive language, and C++ is the language in which algorithm systems are built. In terms of audience, Java / Scala serve data engineers, while SQL serves analysts, operations staff, and product managers. As SQL has matured, however, more and more data engineers now use SQL directly to solve their own problems. Python is aimed more at algorithm engineers. Of course, analysts usually know some Python as well, and Python users naturally write some SQL too.

Naturally, the simpler a programming language is, the more easily it becomes popular. After all, popularity means expanding the user base, and only a language with a low barrier to entry can grow its share of users. This is how SQL and Python gradually became the standard interactive languages for big data and algorithms. Both languages have in fact existed for a long time; their characteristics and simplicity are what carried them into the new era.

In many companies, the data platform and the AI platform are two separate platforms, sometimes maintained by two different teams. This isolation causes serious interoperability problems, the biggest of which is data interoperability.

To realize the value of data, we must first process the data and then hand it to AI for learning. The differences in technology stack and platform raise the cost for AI to obtain data. Users may need to go to the data platform and write a data processing script (for example, based on Spark) to extract the data, then transfer it to the AI platform or to storage shared by the AI and data platforms, and then go to the AI platform and write Python or use a library to complete the training. Of course, as the division of labor grows finer, a single user usually cannot do all of this alone. They often have to break the work into steps and hand each step to a small supporting team, each of which schedules its own part before the whole chain is finally complete. Whether the company is large or small, this is very costly.
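The handoff described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a temporary directory stands in for the shared storage, a CSV file stands in for the exported data, and a least-squares line fit stands in for model training. The function names (extract_features, train_linear_model, run_pipeline) are hypothetical and do not come from Spark or any AI framework.

```python
import csv
import tempfile
from pathlib import Path


def extract_features(raw_rows, out_path):
    """Data-platform step: clean raw records and export them to shared storage."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["x", "y"])
        for row in raw_rows:
            # Drop incomplete records, as a data processing script typically would.
            if row.get("x") is None or row.get("y") is None:
                continue
            writer.writerow([row["x"], row["y"]])


def train_linear_model(in_path):
    """AI-platform step: re-read the exported file and fit y = a*x + b."""
    xs, ys = [], []
    with open(in_path, newline="") as f:
        for row in csv.DictReader(f):
            xs.append(float(row["x"]))
            ys.append(float(row["y"]))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def run_pipeline(raw_rows):
    """Chain both steps through a file in a (temporary) shared directory."""
    with tempfile.TemporaryDirectory() as shared_dir:
        handoff = Path(shared_dir) / "features.csv"
        extract_features(raw_rows, handoff)
        return train_linear_model(handoff)
```

Even in this toy version, the data is serialized, written out, and parsed again just to cross the boundary between the two steps. In a real company each step may also belong to a different team and a different platform, which is where the cost comes from.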

Building data and AI platforms is still difficult

Even today, building data and AI platforms remains difficult for many companies. They either need people who understand countless components, have extensive experience, and know SQL, Java, Scala, and Python development as well as analysis and algorithms, or they need to understand and combine various cloud products to achieve the desired result. Together these difficulties lead to extremely high development costs. And because of the nature of IT, even after these investments the result may still be full of pitfalls. For example, with countless kinds of low-level storage, implementing effective permission control has left many developers scratching their heads.

Moreover, data and AI should be accessible to everyone. In my opinion, they should not be confined to data scientists (including analysts and algorithm engineers). Anyone in product or operations, or indeed anyone with access to data, should be able to develop value from that data. Such development can be as small as viewing data or exporting it to Excel, or as large as exposing a query interface or providing a precise recommendation system, and all of it should be relatively easy to accomplish.

Python is still difficult for most people, and it does not integrate well with the big data ecosystem. SQL is simple, but it cannot meet the complex needs of algorithms, and it is hard to use the large ecosystem of the AI field from SQL.

We hope to complete big data batch processing and AI training and prediction in one language. We hope to build a data security mechanism at the language level to solve access control across countless kinds of low-level storage, so that developers no longer have to scratch their heads. We also hope that the built-in functions of this language are easy to extend, and that extensions can be written in the construction languages of big data and AI, Java / Scala / C++, so the execution engine of this language should have a powerful plug-in capability.
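To make the last two wishes more concrete, here is a minimal sketch of an execution engine that checks permissions in one central place and accepts new operations as plug-ins. Everything in it is an assumption made for illustration: the Engine class, the grants format, and the plug-in signature are hypothetical and are not the design or API of MLSQL or SQLFlow. Python is used only to keep the sketch short; the text above expects real extensions to be written in Java / Scala / C++.

```python
class PermissionDenied(Exception):
    """Raised when a user lacks the required grant on a table."""


class Engine:
    def __init__(self, grants):
        # grants maps a user name to a set of (table, action) pairs.
        self._grants = grants
        self._plugins = {}

    def register(self, name, func):
        """Add a new operation without modifying the engine itself."""
        self._plugins[name] = func

    def _check(self, user, table, action):
        # Single enforcement point, independent of where the table is stored.
        if (table, action) not in self._grants.get(user, set()):
            raise PermissionDenied(f"{user} may not {action} {table}")

    def run(self, user, plugin, table, tables):
        """Check the grant first, then apply the named plug-in to the table."""
        self._check(user, table, "read")
        return self._plugins[plugin](tables[table])


engine = Engine({"alice": {("orders", "read")}})
engine.register("count", len)
tables = {"orders": [{"id": 1}, {"id": 2}, {"id": 3}]}
row_count = engine.run("alice", "count", "orders", tables)
```

Because every statement passes through the same check before any plug-in runs, the author of a plug-in does not need to know how each storage system handles permissions. A user without a grant on orders would get PermissionDenied regardless of which plug-in was requested.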

Conclusion

Programming languages are becoming more and more domain-specific, and ultimately the right language should serve the right field. We believe that the vigorous development of big data and AI will inevitably call for a more tailored language. The emergence of languages such as SQLFlow and MLSQL follows this trend.