Maximizing SQLAlchemy: Database Query Performance

Category: Database, Product engineering

A typical enterprise data-driven application has to do a lot of querying and read operations against the database. If you are making those queries from a Python application, SQLAlchemy is a great tool for the job. A good read on how to improve the performance of SQL code is given here. In this post, we will take those considerations, measure them against a real database, and try to find the best balance between doing the work in the database and doing it in application code. We will also learn ways to weigh the various options at our disposal.

Creating a database for the benchmarks

We will be running the measurements on a MySQL/MariaDB database. So, first of all, let's create a database called `testing` for our benchmarks.

 ➜ mysql -u root -p
 Enter password:
 Welcome to the MariaDB monitor. Commands end with ; or \g.
 Your MariaDB connection id is 5
 Server version: 10.1.24-MariaDB MariaDB Server
  
 Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.
  
 Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
  
 MariaDB [(none)]>
 MariaDB [(none)]> create database testing;
 Query OK, 1 row affected (0.00 sec)
  
 MariaDB [(none)]> use testing;
 Database changed
 MariaDB [testing]>
 MariaDB [testing]>

Next, we will create a transactions table where we will have some dummy columns. We will populate those columns with some dummy data.

 CREATE TABLE transactions (
     id int NOT NULL PRIMARY KEY,
     name varchar(255),
     description varchar(255),
     country_name varchar(255),
     city_name varchar(255),
     cost varchar(255),
     currency varchar(255),
     created_at DATE,
     billing_type varchar(255),
     language varchar(255),
     operating_system varchar(255)
 );

This will create the table with the required columns. We will now fill its rows with some dummy data. To generate the dummy data we are going to use Faker, which produces fake values that are representative of real datasets.
The code used for this purpose is shown below. First, we import all the dependencies in the script.

 from datetime import datetime
 import random

 from sqlalchemy import create_engine
 from sqlalchemy.ext.declarative import declarative_base
 from sqlalchemy import Column, Integer, String, DateTime
 from sqlalchemy.orm import sessionmaker
 from faker import Faker

Then we can define the base class that will be used to interface with the database.

 user = 'root'
 pwd = ''
 host = 'localhost'
 db = 'testing'
 mysql_engine = create_engine('mysql://{0}:{1}@{2}/{3}'.format(user, pwd, host, db))

 Base = declarative_base()


 class Transactions(Base):
     __tablename__ = 'transactions'

     id = Column(Integer, primary_key=True)
     name = Column(String)
     description = Column(String)
     country_name = Column(String)
     city_name = Column(String)
     cost = Column(String)
     currency = Column(String)
     created_at = Column(DateTime)
     billing_type = Column(String)
     language = Column(String)
     operating_system = Column(String)

     def __repr__(self):
         """How the class Transactions is shown.

         :returns: return the string with the id and the name.
         """
         return "Transaction(id=%s, name=%s)" % (self.id, self.name)

And then we will leverage Faker to create 50,000 rows of fake data.

 myFactory = Faker()

 Session = sessionmaker(bind=mysql_engine)
 session = Session()

 billing_type_list = ['cheque', 'cash', 'credit', 'debit', 'e-wallet']
 language = ['English', 'Bengali', 'Kannada']
 operating_system = 'linux'

 # insert 50,000 rows of fake transaction data
 for i in range(1, 50001):
     transaction = Transactions(
         id=int(i),
         name=myFactory.name(),
         description=myFactory.text(),
         country_name=myFactory.country(),
         city_name=myFactory.city(),
         cost=str(myFactory.random_digit_not_null()),
         currency=myFactory.currency_code(),
         created_at=myFactory.date_time_between(start_date="-30y", end_date="now", tzinfo=None),
         billing_type=random.choice(billing_type_list),
         language=random.choice(language),
         operating_system=operating_system
     )

     session.add(transaction)

 session.commit()

If you want the whole script, you can click on the link here. Running this code will insert 50,000 rows into the `transactions` table.
Below is a small sample of the data, reduced for readability.

https://gist.githubusercontent.com/infinite-Joy/04dde9149aa82f4188a71c821dcaec0a

Since we will be filtering on dates, let's also have the database show us some sample values of `created_at`.

 MariaDB [testing]> select name, billing_type, created_at from transactions limit 3;
 +------------------+--------------+------------+
 | name | billing_type | created_at |
 +------------------+--------------+------------+
 | Henry Smith | cash | 2015-05-11 |
 | Brandon Williams | debit | 2000-04-14 |
 | Natalie Smith | debit | 2007-09-02 |
 +------------------+--------------+------------+
 3 rows in set (0.00 sec)

Querying the database using SQLAlchemy

Now we will bring SQLAlchemy into the picture, query the DB, and filter by date. The code to do this is below. The `profiled` function is taken from the SQLAlchemy documentation, with some modifications for Python 3.

 import cProfile
 from io import StringIO
 import pstats
 import contextlib
 import time

 import sqlalchemy
 from sqlalchemy import create_engine
 from sqlalchemy.ext.declarative import declarative_base
 from sqlalchemy import Column, Integer, String, DateTime
 from sqlalchemy import Date, cast
 from sqlalchemy.orm import sessionmaker


 @contextlib.contextmanager
 def profiled():
     pr = cProfile.Profile()
     pr.enable()
     yield
     pr.disable()
     s = StringIO()
     ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
     ps.print_stats()
     # uncomment this to see who's calling what
     # ps.print_callers()
     print(s.getvalue())


 user = 'root'
 pwd = ''
 host = 'localhost'
 db = 'testing'

 Base = declarative_base()


 class Transactions(Base):
     __tablename__ = 'transactions'

     id = Column(Integer, primary_key=True)
     name = Column(String)
     description = Column(String)
     country_name = Column(String)
     city_name = Column(String)
     cost = Column(String)
     currency = Column(String)
     created_at = Column(DateTime)
     billing_type = Column(String)
     language = Column(String)
     operating_system = Column(String)

     def __repr__(self):
         """How the class Transactions is shown.

         :returns: return the string with the id and the name.
         """
         return "Transaction(id=%s, name=%s)" % (self.id, self.name)


 def transactions_filtered_by_querying():
     with profiled():
         mysql_engine = create_engine('mysql://{0}:{1}@{2}/{3}'.format(user, pwd, host, db))
         Session = sessionmaker(bind=mysql_engine)
         session = Session()

         transactions = (
             session.query(Transactions)
             .filter(
                 cast(Transactions.created_at, Date) == '2007-09-02'
             )
             .all()
         )
         return [(t.id, t.name, t.billing_type) for t in transactions]


 transactions_filtered_by_querying()

This basically passes the query to the database and lets the database do the filtering for us. Running the above code gives us the following result.

➜ python test_on_different_methods.py
8092 function calls (7813 primitive calls) in 0.078 seconds

So the query completes in 0.078 seconds. What if we try to do the same thing by querying everything and then filtering the result in the code? Common sense tells us that this should take more time, as databases are generally optimized for querying and filtering. Also, passing a lot of data over the network takes much longer than having the database do the operations and send back only the relevant rows. Let's take a look at the code needed to make this test.

 def transactions_filtered_in_python_code():
     with profiled():
         mysql_engine = create_engine('mysql://{0}:{1}@{2}/{3}'.format(user, pwd, host, db))
         Session = sessionmaker(bind=mysql_engine)
         session = Session()

         transactions = session.query(Transactions).all()
         # created_at comes back as a date object, so compare its string form
         return [(t.id, t.name, t.billing_type) for t in transactions if str(t.created_at) == '2015-05-11']


 transactions_filtered_in_python_code()

Executing the above code will give us the following result.

➜ python test_on_different_methods.py
1624078 function calls (1623417 primitive calls) in 1.880 seconds

As you can see, this took a lot more time, as expected (1.880 seconds compared to 0.078 seconds).
So far this matches expectations. But you might object that this was just a single call, so of course deferring the filtering to the database wins. What if we have to apply multiple filters to the same data? Surely it would then be better to make just one call, get all of the data, and do all the filtering in the code. Let's check that hypothesis with the help of the code below.

 def timing_function(some_function):
     """
     Outputs the time a function takes
     to execute.
     """

     def wrapper(*args, **kwargs):
         t1 = time.time()
         some_function(*args, **kwargs)
         t2 = time.time()
         print("Time it took to run the function: " + str((t2 - t1)) + "\n")
     return wrapper


 @timing_function
 def transactions_filtered_by_querying_multiple_calls(calls):
     for _ in range(calls):
         mysql_engine = create_engine('mysql://{0}:{1}@{2}/{3}'.format(user, pwd, host, db))
         Session = sessionmaker(bind=mysql_engine)
         session = Session()

         transactions = (
             session.query(Transactions)
             .filter(
                 cast(Transactions.created_at, Date) == '2007-09-02'
             )
             .all()
         )
         [(t.id, t.name, t.billing_type) for t in transactions]


 @timing_function
 def transactions_filtered_in_python_code_multiple_calls(calls):
     mysql_engine = create_engine('mysql://{0}:{1}@{2}/{3}'.format(user, pwd, host, db))
     Session = sessionmaker(bind=mysql_engine)
     session = Session()

     transactions = session.query(Transactions).all()

     for _ in range(calls):
         # again compare created_at's string form since it is a date object
         [(t.id, t.name, t.billing_type) for t in transactions if str(t.created_at) == '2015-05-11']

Running the above code with different numbers of calls gives us the following numbers. The numbers are absolute times in seconds.
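For reference, such numbers can be produced with a small driver like the one below, appended to the script above. The specific call counts shown here are an illustrative assumption, not necessarily the exact values used for the original measurements.

 # Hypothetical driver: time both approaches for increasing numbers of calls.
 for calls in (1, 10, 50, 100, 200, 400):
     print("Number of calls:", calls)
     transactions_filtered_by_querying_multiple_calls(calls)
     transactions_filtered_in_python_code_multiple_calls(calls)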

The whole code can be checked out here. We can see that the results in the two cases are quite close (except for the first result), and both grow roughly linearly with the number of calls. The last run of `transactions_filtered_by_querying_multiple_calls` failed with a connection error, which shows the upper bound on the number of connections we can open against the SQL database. So for a small number of queries, letting SQLAlchemy push the filtering to the database wins hands down, while for a large number of queries the two approaches are almost the same. Only when we approach something like 400 connections do we need to give this a second thought, and at that point it is probably the database itself that should change, for example to an in-memory store like Redis.
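Part of that connection pressure, though, is self-inflicted: `transactions_filtered_by_querying_multiple_calls` creates a new engine, and with it a new connection pool, on every iteration. A minimal sketch of reusing a single engine across calls is shown below; the names `shared_engine`, `SharedSession`, and `transactions_filtered_by_querying_shared_engine` are hypothetical additions, and this is an assumption about the failure mode rather than something measured in this post.

 # Sketch: create the engine once so repeated calls share one connection pool
 # instead of opening fresh connections every time.
 shared_engine = create_engine('mysql://{0}:{1}@{2}/{3}'.format(user, pwd, host, db))
 SharedSession = sessionmaker(bind=shared_engine)


 def transactions_filtered_by_querying_shared_engine():
     session = SharedSession()
     try:
         transactions = (
             session.query(Transactions)
             .filter(cast(Transactions.created_at, Date) == '2007-09-02')
             .all()
         )
         return [(t.id, t.name, t.billing_type) for t in transactions]
     finally:
         # return the connection to the pool
         session.close()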
So the conclusion is that we should make as few calls as possible and defer general processing such as filtering and sorting to the database. This should help us get a high level of performance out of the code.
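To make that conclusion concrete, here is a small sketch of what "defer the filtering and sorting to the database" looks like with SQLAlchemy, reusing the `Transactions` model from earlier. The filter, ordering, and row limit all execute inside MySQL/MariaDB, so only the relevant rows cross the network; the particular column, value, and limit are illustrative assumptions.

 # Sketch: push filtering, sorting and limiting down to the database.
 def recent_debit_transactions(session, limit=10):
     return (
         session.query(Transactions)
         .filter(Transactions.billing_type == 'debit')
         .order_by(Transactions.created_at.desc())
         .limit(limit)
         .all()
     )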
