Weiming Guo's ARIA project: SQL Queries over Python Objects using an Embedded Column-Based Relational Database
Modern data science applications require support for a variety of statistical operations, such as linear algebra. As traditional database engines are not equipped to support such operational primitives, the most common approach among data scientists is to export data from the database into a specialized statistical system tailored to their data processing needs. This approach duplicates data that already lives in a centralized database repository and incurs related overheads, such as the cost of data transfer. The database community has been working to extend database capabilities so that in-database computation of data science workflows becomes an attractive option.
A very recent initiative in this direction is the AIDA data science framework, which allows data scientists to write interactive Python-based data analysis from their client workstations; the analysis is transparently executed in a remote relational database management system (RDBMS) that holds the source data. AIDA’s server uses an embedded Python interpreter that resides inside the RDBMS to manage its computations. While AIDA uses NumPy internally to perform statistical computations, it relies on the host RDBMS for relational (SQL) operations. Although the data structures used in statistical systems and an RDBMS are relatively similar, fundamental differences in the data storage and processing needs of these systems mean that they do not integrate seamlessly, requiring data format conversions between them. Existing research has shown that databases using columnar storage (each column stored as a separate array of its own) reduce the need for such conversions, as the underlying data structures are very similar to those of the statistical systems. This summer, my research focused on exploring how these benefits offered by column-based databases could be leveraged for AIDA’s implementations on row-based databases such as PostgreSQL, where all the data for a record is stored contiguously as a separate array and is therefore different from the storage format employed by the statistical systems.
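To make the storage difference concrete, here is a minimal sketch in plain Python. It is not AIDA's code, nor the actual on-disk format of PostgreSQL or MonetDB; the table contents, the column names, and the helper `rows_to_columns` are all hypothetical and chosen only for illustration. It shows why row-oriented data must be taken apart field by field before a column-wise statistical computation can run over it.

```python
from array import array

# Hypothetical three-record table with fields (id, price, quantity).
# Row-oriented layout: all the fields of one record sit together.
rows = [
    (1, 9.5, 3),
    (2, 4.0, 10),
    (3, 7.25, 2),
]


def rows_to_columns(rows, names, typecodes):
    """Convert row-oriented records into one typed array per column.

    Hypothetical helper: it only illustrates the conversion work that a
    row store implies before column-wise computation can start.
    """
    columns = {name: array(code) for name, code in zip(names, typecodes)}
    for record in rows:
        # Every record is split up and each field is copied individually.
        for name, value in zip(names, record):
            columns[name].append(value)
    return columns


# "q" = signed 64-bit integer, "d" = double-precision float.
columns = rows_to_columns(rows, ["id", "price", "quantity"], ["q", "d", "q"])

# Column-oriented layout: a column-wise operation reads one array only.
total_price = sum(columns["price"])
```

If the data were already held column by column, as in a columnar database, the `rows_to_columns` step would be unnecessary, which is the saving the paragraph above refers to.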
As a junior majoring in computer science with a particular interest in databases and data science, I was immediately drawn to the research project “SQL Queries over Python Objects Using an Embedded Column-Based Relational Database.” When I first saw the post on MyCourses, I was attracted by the design of the AIDA framework, which allows data scientists to write interactive Python-based data analysis from client workstations but transparently shifts the computation to a remote relational database management system (RDBMS), so that it is performed near the data. Moreover, I was excited by the goal of the research: to reduce the cost of data format conversions between statistical systems and the RDBMS, thus improving the efficiency of data processing. In addition, as a junior, I had never conducted research. I hoped that helping Dr. Joseph D’silva with his research would increase my interest in research and help me make decisions about my future education and career.
There were many highlights of my research this summer. At first, I was entirely new to Python, the primary language of the AIDA framework. Building on my previous experience with Java, I spent only two weeks learning Python and then only one week understanding the implementation of the AIDA framework, which was an outstanding achievement for me. Moreover, I learned many debugging methods and Linux commands with the help of Prof. D’silva. During the research, I was constantly motivated to acquire new skills and apply them in practice, which helped me make rapid progress with my academic studies.
The biggest challenge I faced this summer was failing to implement my initial approach, which was to embed a columnar database in the AIDA framework. The high-level idea was to use this embedded database for queries on data that had been materialized in the statistical system, as this would reduce the data conversion costs associated with passing that data to the host database. However, the embedded columnar database did not integrate well with PostgreSQL when I initialized it inside the AIDA framework. I therefore decided to try another approach: creating two AIDA servers, one for MonetDB and one for PostgreSQL, and using the MonetDB AIDA server for queries over materialized data. Eventually, this second approach was implemented successfully.
It is my great honour to be selected as one of the 2020 ARIA scholarship recipients at McGill University. I would like to thank Mr. Harry Samuel for his generosity and investment in my future. Thanks to his scholarship, I have had the opportunity to learn about the workflow of research and acquire many research skills, which has made me more convinced that I want to attend graduate school to broaden my knowledge and pursue further academic achievements.